# Introduction to Bayesian inference
<img src="images/chickens.jpg" alt="A picture of my chickens">
<p style="font-family: times, serif; font-size:.9em; font-style:italic"> These are the chickens from the data set used in this chapter. Actually, these are only the chickens that survived to fledge. Ouch.</p>
## Introduction {#intro-16}
In the second half of this decidedly worst stats text, we are going to introduce and apply Bayesian inference. In class, this shift aligns with the introduction of GLM about halfway through the semester for reasons that I hope will become obvious to the astute learner. If not, my rationale for introducing these concepts together is that 1) both of these tools rely heavily upon knowledge of sampling distributions that we began to build earlier, and 2) we are beginning to move into the realm where we will become more reliant upon methods of estimation other than ordinary least squares (OLS), which is what we used to estimate models during the first half of this text and class (ANOVA, linear regression, and ANCOVA).
In [Chapter 12](https://danstich.github.io/worst-r/12-Chapter12.html) and [Chapter 13](https://danstich.github.io/worst-r/13-Chapter13.html) we introduced maximum likelihood estimation and extended this to include restricted maximum likelihood estimation in [Chapter 15](https://danstich.github.io/worst-r/15-Chapter15.html). The fact is that most biologists will never understand the mechanical/mathematical underpinnings of those estimation methods either, but they won't hesitate to use them, so why not take a look at Bayesian estimation tools as well? Plus, as you will see, this approach will allow us to do a whole bunch of stuff we just can't do using MLE in most software packages.
During the next several chapters, I am hoping that we can dive into the basic underpinnings of the Bayesian framework for scientific inference, along with maximum likelihood estimation, as we move through more complex extensions of linear models. The Bayesian framework has been around for a long time, but only recently has it become broadly applicable to common analytical problems. There are some fundamental (and philosophical) differences between the use of maximum likelihood estimation (aka "frequentist") methods and the application of Bayes theorem for answering statistical questions. I am hoping we can touch on some of these during our discussions and show the real, practical strengths of maximum likelihood and Bayesian inference that might actually make you want to use one or the other for certain applications.
For better or worse, there is no way we can possibly do a comprehensive treatment of Bayesian statistics within the context of a survey-style, applied statistics course or textbook such as this. We can, however, set you up with some basic tools so you can apply Bayesian inference to commonly encountered situations (t-tests, regression, ANOVA, ANCOVA, GLM, LMM, and GLMM) that should allow you to explore these concepts on your own in the future. To achieve this, I would like for us to cover some Bayesian analogs to some of the frequentist tests that we have considered so far this semester. For this reason, we will explore maximum likelihood and Bayesian estimation methods side by side while learning new techniques during class.
Even though we are switching out our estimation method in this Part, we'll continue to work with the `tidyverse`. Be sure to load it whenever you are ready to get started.
```{r warning = FALSE, message=FALSE}
library(tidyverse)
```
## Intro to Bayes Theorem
Bayes Theorem provides the mathematical framework through which Bayesian inference is applied to quantitative questions. The theorem, in the most basic sense, helps us understand the probability of some event given conditions we suspect are related to that event. This elegant theorem was first derived by Reverend Thomas Bayes in the 1700s. The equation was later reworked by Pierre-Simon Laplace to yield the modern version that we use today:
$$P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)}$$
which is read "the probability of A given B is equal to the probability of B given A times the probability of A, divided by the probability of B".
A common example application of this theorem that may be of interest to biology students is the probability of having cancer at a given age (thanks Wikipedia!). The example goes something like this:
Suppose we want to know the probability that an individual of age 50 has cancer. We might only have information about the marginal probability that a person has cancer (let's say 1%), and the probability that a person is 50 years old given the population distribution of ages (let's just say 3% for the sake of demonstration). To calculate the conditional probability that an individual who is 50 years old has cancer, we would need one more piece of information: the probability that a person with cancer is 50 years old. The calculation is relatively straightforward if we know this number exactly, and we can derive an exact conditional probability. For this example, let's start by assuming the probability that people who have cancer are 50 years old is 2%. Now we can calculate the conditional probability that a person who is 50 years old has cancer:
Start with the theorem:
$$P(Cancer|Age 50) = \frac{P(Age 50|Cancer)\cdot P(Cancer)}{P(Age 50)}$$
Through substitution we get:
$$P(Cancer|Age 50) = \frac{(0.02\cdot 0.01)}{0.03}$$
And now we can solve for the conditional probability:
$$P(Cancer|Age 50) = 0.00667$$
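Because every quantity on the right-hand side is a known number in this example, we can check that arithmetic directly in R:
```{r}
# Exact solution: P(Age 50 | Cancer) * P(Cancer) / P(Age 50)
(0.02 * 0.01) / 0.03
```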
Hopefully, you can see from this example and our earlier work with the rules of probability why this is such an important theorem in statistical probability theory. In fact, this is one of the reasons for the recent resurgence of applied Bayesian inference in biological and ecological statistics during the past couple of decades. But, if it's so useful, then why has it only been heavily used recently?
It will quickly become obvious to you that the answer is "computers". To demonstrate this, let's consider a slightly more complex example.
Now, let's assume that we don't actually have exact information about the probability that individuals who have cancer are age 50. Let's instead assume that we only have a rough idea about that probability, and that we can put a loose distribution around it. Now, there is no longer an exact mathematical solution for Bayes Theorem but rather an infinite number of potential solutions. If we know the distribution, then we can discretize the distribution and find a finite number of solutions to the theorem that would allow us to describe the probability of our event of interest most of the time. (This should sound like calculus. Because it is. Don't worry, you don't need to do calc here.) However, there are cases for which this problem becomes intractable without the use of computers, even when the distribution is known. You can imagine that this becomes considerably more complex if the form of the distribution is not known with certainty.
For the sake of demonstration, let's examine how this procedure changes if we have some uncertainty in one of our probabilities on the right-hand side of the equation:
Looking back at our example, we had:
$$P(Cancer|Age 50) = \frac{P(Age 50|Cancer)\cdot P(Cancer)}{P(Age 50)}$$
And, by substitution:
$$P(Cancer|Age 50) = \frac{0.02\cdot 0.01}{0.03}$$
Let's now assume that the proportion of people with cancer who are also 50 years old is an unknown quantity that is drawn from a beta distribution described by parameters $\alpha$ = 200 and $\beta$ = 10,000:
$$P(Cancer|Age 50) = \frac{Beta(200, 10000) \cdot 0.01}{0.03}$$
Realistically, this is a *much* tighter distribution than we would use for a situation like this, but we'll discuss that later. The point is that even a little uncertainty makes the process more complicated than if exact values are *known*.
We can make this distribution in R:
```{r}
# Simulate 10000 random values for p(Age 50 | Cancer)
p_50_cancer <- rbeta(1e4, 200, 1e4)
```
We can also look at this distribution:
```{r}
# Make the object into a df with
# a single column automatically
# named p_50_cancer
p_50c <- data.frame(p_50_cancer)
# Plot it
ggplot(p_50c, aes(p_50_cancer)) +
geom_histogram(bins = 20) +
xlab("P(50 | Cancer)")
```
Now, we have a working probability distribution for the probability of being age 50 given that one has cancer. We can plug this into Bayes Theorem and construct a probability distribution to solve for the inverse: the probability of having cancer given that the patient is age 50 $(P(Cancer|Age 50))$. Let's do it in R.
First, let's define our other marginal probabilities as variables in R.
```{r}
p_cancer <- 0.01 # Marginal probability of having cancer
p_50 <- 0.03 # Marginal probability of being age 50
```
Now we can solve the theorem for our finite number of observations:
```{r}
p_cancer_50 <- (p_50_cancer * p_cancer) / p_50
```
We can calculate descriptive statistics for the conditional probability so that we can describe the distribution.
```{r}
# Using a mean and standard deviation
mean(p_cancer_50)
sd(p_cancer_50)
```
We can calculate quantiles (95% CRI). Note that in Bayesian inference, we are going to call these 'credible intervals' (CRI) or 'high density intervals' (HDI) depending on how they are computed (the two are essentially interchangeable when the posterior is symmetrical), but they are functionally the same thing as 'confidence intervals' as far as we are concerned (tell no one I said that eveR).
```{r}
quantile(p_cancer_50, probs = c(0.025, 0.50, 0.975))
```
We can also look at the actual distribution:
```{r}
# Make the object into a df with
# a single column automatically
# named p_cancer_50
p_c50 <- data.frame(p_cancer_50)
# Plot it
ggplot(p_c50, aes(p_cancer_50)) +
geom_histogram(bins = 20) +
xlab("P(Cancer | 50)")
```
Finally, let's say now that there is uncertainty in all of our probabilities of interest:
```{r}
# First, let's define each of our
# probabilities as variables in R
p_50_cancer <- rbeta(1e4, 20, 1e3)
p_cancer <- rbeta(1e4, 10, 1e3)
p_50 <- rbeta(1e4, 30, 1e3)
```
We can continue as before. We solve the theorem for our finite number of observations:
```{r}
p_cancer_50 <- (p_50_cancer * p_cancer) / p_50
```
We calculate descriptive statistics (a mean and standard deviation) to describe the conditional probability.
```{r}
mean(p_cancer_50)
sd(p_cancer_50)
```
Quantiles (95% CRI and median):
```{r}
quantile(p_cancer_50, probs = c(0.025, 0.50, 0.975))
```
And, have a look at our new distribution:
```{r}
# Make the object into a df with
# a single column automatically
# named p_cancer_50
p_c50 <- data.frame(p_cancer_50)
# Plot it
ggplot(p_c50, aes(p_cancer_50)) +
geom_histogram(bins = 20) +
xlab("P(Cancer | 50)")
```
Now you can see why computers are starting to matter, and we are not even doing Bayesian inference yet. Why is that?
Because we still haven't collected any data!!! All that we have done here is state some basic mathematical representations of our beliefs about certain conditional and marginal probabilities of specific events. Those beliefs may be useful representations, or they may be way off!
This mathematical formulation of our 'beliefs' is known as the prior distribution for the probability of the event of interest. "Want to know more about the prior distribution", you say? How convenient...
## The prior
The prior distribution, in simple terms, is the information that we have at our disposal *prior* to collecting any further data. That information might come in the form of hard numbers collected through a pilot study, or it might come from some logical process based on deductive reasoning. We will discuss the fact that the latter form of knowledge can be really useful for establishing book-ends.
One of the really attractive aspects of Bayesian inference is that we have the ability to incorporate information from prior experiences into our statistical models. The advantage of this is that we can start off with some information, and then collect new information to update our beliefs. Why would we want to do this? Glad you asked:
**1. Improved inference** The use of an informed prior allows us to improve the precision of our parameter estimates by narrowing the scope of credible values that "the machine" considers for our estimates. A strong prior can keep our estimates within a certain range of realistic values, for example, if we don't have a ton of data.
**2. Adaptive research** Incorporation of information from previous studies allows us to continually update our scientific beliefs in an iterative way. If we have data from a similar study, or a previous year of study, then we can use that to inform inference moving forward to obtain more accurate and precise estimates of the parameters of interest, either by adjusting our prior or by including additional data directly.
**3. Hypothesis testing** We can use specific formulations of the prior distribution to test specific hypotheses about the probability of the event of interest. For example, if we suspect that the probability of a patient surviving an operation is strongly related to the age (or some other pre-existing condition) of the patient, then we could test different formulations of the prior and see which one results in a better model fit to our data. These days, I tend to favor cross-validation criteria for Bayesian model selection to do this.
**4. Incorporation of uncertainty** If there is a lot of uncertainty in the event of interest, we can set a very "weak" or "diffuse" prior. When the prior is extremely diffuse (e.g. a uniform or "flat" prior), then Bayesian inference will yield results that are essentially identical to the results we expect to get from maximum likelihood estimation. The only noticeable difference may be increased precision or accuracy under Bayesian inference in some cases, depending on the estimator that we use (some maximum likelihood estimators don't do well in situations where Bayesian estimation does just fine). A minimal sketch of this flat-prior behavior follows below.
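To make point 4 concrete, here is a minimal sketch (with made-up numbers) using the conjugate beta-binomial case that we will meet again later in this chapter: with a flat Beta(1, 1) prior, the mode of the posterior is exactly the sample proportion, which is also the maximum likelihood estimate.
```{r}
# Hypothetical data: 18 of 25 patients survive
y <- 18
n <- 25
# A flat Beta(1, 1) prior updated with binomial data gives a
# Beta(1 + y, 1 + n - y) posterior (conjugacy)
a_post <- 1 + y
b_post <- 1 + (n - y)
# The posterior mode matches the maximum likelihood estimate, y/n
(a_post - 1) / (a_post + b_post - 2)
y / n
```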
So let's go through a couple examples of what a prior distribution actually looks like.
### The hospital example
For this example, let's assume that we are interested in the survival of a hospital patient. Survival will be denoted as a 'success', or 1, and mortality as a 'failure', or '0'. In this sense, we are dealing with a binomial outcome. But, remember, we can always represent binomial outcomes on the probability scale...right?
In this case, let's say that we are assuming *a priori* that survival might be due to random chance, or that it might be influenced by some factor of interest (we'll use "hospital" in the example below).
There are multiple approaches that we could take to formulating a prior distribution for this case.
```{r}
# A uniform distribution that indicates we
# have no knowledge about how survival
# varies between hospitals
flat <- runif(1e4, 0, 1)
# A diffuse prior that indicates we think
# survival is the same between hospitals but
# we don't want to make too strong a statement
diffuse <- rbeta(1e4, 5, 5)
# A peaked (strong) prior that indicates we
# are relatively certain ahead of time that
# survival is the same in both hospitals
strong <- rbeta(1e4, 500, 500)
# A strong prior that indicates we think
# survival is substantially different
# between hospitals
bimodal <- rbeta(1e4, .5, .5)
```
We can combine these into a dataframe for visualizing them, and then use the `pivot_longer()` function to stack the dataframe for plotting:
```{r}
priors <- data.frame(flat, diffuse, strong, bimodal) %>%
pivot_longer(cols = c(flat, diffuse, strong, bimodal))
```
We can look at these to compare them. Note that the x-axis is the same in all of the plots below, so changes in the location, shape, and spread are all controlled by the differences in the parameters used to specify each of these priors.
```{r}
ggplot(priors, aes(x = value, color = name, fill = name)) +
geom_histogram(bins = 20) +
facet_wrap(~name, scales = "free_y")
```
You can see how different each of these priors is from one another. Hopefully, you are also starting to think about the different kinds of hypotheses that we might test with these different priors. In this case, the issue that we are always trying to address is whether or not survival (or whatever) is due only to random chance. This could be likened to asking whether or not a coin that we toss is a fair coin, or if it has some bias (say for example that it is more likely to land heads up because it is heavier on one side).
Now that we have a prior distribution for our event of interest, we can go out into the world and collect some data about that event. We will then use those data to formulate a 'posterior' distribution that reflects some combination of our prior distribution and the data that we have collected. This process is commonly referred to as 'updating' our prior beliefs about the event of interest, and is the foundation that underlies Bayesian inference. How we get from the prior to a posterior is wholly dependent on the tools we use to obtain the solution to Bayes theorem, but most often this occurs through the use of Markov-chain Monte Carlo simulation. This approach allows us to work through Bayes theorem one set of values at a time to obtain a heuristic, simulation-based solution for conditional probabilities. We will discuss this in some (but not too much!) detail as we move forward.
Before moving on, it is important to note that our prior beliefs can potentially have a strong influence on the posterior distribution. This has been the subject of much controversy in the application of Bayesian inference to modern scientific study. In biological and ecological studies specifically, our goal in using prior information should not be to dominate the posterior with our prior beliefs. It should be to support improved inference through the inclusion of relevant information, which can be extremely helpful in situations where data are somewhat deficient. We want our data to dominate the form of the posterior distributions that result from our analyses. If this is not the case, then we need to be explicit about this and should almost always attempt to evaluate the "sensitivity" of our posterior distribution(s) to the prior(s) we have chosen. This is a field of ongoing development in specific disciplines, and I encourage you to seek out the relevant literature on the matter if you intend to use Bayesian inference in your own research.
## The posterior
Estimation of the posterior distribution is really the hallmark of Bayesian inference, and is the crux of any applied analysis that uses this framework to test hypotheses in biology and ecology. The posterior distribution (usually just called the 'posterior') is the estimated probability distribution of an unobserved quantity, such as a parameter or a future event, conditional on some set of observations related to that quantity.
The posterior distribution can be estimated as the product of our prior distribution and the corresponding likelihood by re-arranging Bayes theorem:
$$posterior \propto prior \cdot likelihood$$
You'll recall from our early adventures into probability distributions and the moments of those distributions that every probability distribution that we work with has a 'likelihood function'. So, if we had a binomial response and a beta prior distribution, then we would combine that prior with the binomial likelihood in the theorem above. For a given observation, we could calculate the value of the likelihood for that observation and solve the theorem exactly...sometimes...but not usually in practice. In order to do this, we need to know the form of the posterior ahead of time. There is a relatively limited set of conditions that allow us to know this ahead of time. Namely, we need to know that we are working with a 'conjugate' prior. Without getting too far afield, these are prior distributions for which the form of the posterior is defined and known because it is from the same family as the prior. In our example above, the beta distribution is a conjugate prior for the binomial likelihood, so the solution to Bayes theorem is, relatively speaking, trivial compared to other situations. Most of the time, however, we do not know the form of the posterior distribution ahead of time, and this is the primary reason that computers are needed to implement modern Bayesian inference: we use MCMC sampling to approximate the distribution numerically.
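As a small sketch of how this conjugacy works in practice, let's reuse the made-up numbers from the prior section, but now with a mildly informative Beta(5, 5) prior for a survival probability. The posterior is another beta distribution, and it lands between the prior and the data:
```{r}
# Prior belief about survival: Beta(5, 5), centered on 0.5
a_prior <- 5
b_prior <- 5
# Hypothetical data: 18 survivors out of 25 trials
y <- 18
n <- 25
# Conjugate update: the posterior is Beta(a_prior + y, b_prior + n - y)
a_post <- a_prior + y
b_post <- b_prior + (n - y)
# Prior mean, sample proportion, and posterior mean
a_prior / (a_prior + b_prior) # 0.50
y / n                         # 0.72
a_post / (a_post + b_post)    # about 0.66
# Plot the prior and posterior densities on the same axes
beta_curves <- data.frame(p = seq(0.001, 0.999, length.out = 200)) %>%
  mutate(
    prior = dbeta(p, a_prior, b_prior),
    posterior = dbeta(p, a_post, b_post)
  ) %>%
  pivot_longer(cols = c(prior, posterior),
               names_to = "distribution", values_to = "density")
ggplot(beta_curves, aes(x = p, y = density, color = distribution)) +
  geom_line() +
  xlab("Survival probability")
```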
In the simplest sense, the posterior distribution is a combination of our prior distribution and our data. So if you remember nothing else in the explanation, remember that.
## The farm data
In this section, we will apply Bayes theorem to update our prior beliefs about the probability of some event of interest in order to demonstrate how we can estimate a conditional probability for that event given some data. We will use the example to demonstrate how the prior and our data interact to form the posterior.
We start by reading in data.
```{r}
# Read in the data
birds <- read.csv("data/farmdata.csv")
# It's a short data set.
# Let's just print it to the console
birds
```
This data set contains information about chick survival from hatch to adulthood for each cohort of birds on my small-scale poultry farm through 2019. The file contains data on the starting number at hatch, the number of chicks fledged, and the species of bird.
We will use the data set to estimate chick survival from hatch to fledge. There are a number of ways we can do this. In practice, we will probably use MCMC sampling 99% of the time (haha, Bayesian joke). But we may run into situations in the real world in which we can estimate the posterior distribution by hand or in which we need an alternative algorithm for flexibility.
Let's start simple by estimating the mean expected survival of chicks on my farm across all cohorts and species.
For this book, we will introduce Bayesian estimation using a similar approach to what we learned for GLM. You need to **pause here** and **appreciate how ridiculously easy folks have made this for you**. I used to teach students how to code each of their models by hand in this class. We had to package the data in special ways, write out separate model files with pre-defined likelihoods, and manage the monstrous results lists by hand. Now, thanks to the functionality provided by packages such as [`RStan`](http://mc-stan.org/rstan/) and [`rstanarm`](https://cran.r-project.org/web/packages/rstanarm/index.html) you can do this without ever leaving the comfort of R. Do also note that the real power as a modeler still lies in the ability to formulate and specify these models explicitly. This opens up whole new possibilities for model and data structuring that can't be achieved in any single R package. But, that is for another course and a *much* better textbook (eventually I will add citations...or not...this is The Worst Stats Text eveR). We will discuss how the actual estimation works as we go along (and in a less diffuse fashion in class), but you'll need to spend a significant amount of time on your own if you are looking for a deep understanding of the mechanics. You'll probably want to build up to the [Stan User Manual](https://mc-stan.org/docs/2_25/reference-manual/index.html#overview) rather than starting there if this is your first exposure to Bayesian estimation methods.
The hardest part about getting started with Stan is installing the necessary R packages, but this is getting easier all the time.
We are going to install two R packages here: `RStan`, the interface to Stan software, and `rstanarm`, a package that provides Bayesian generalized linear models via Stan. This package allows us to fit everything from ANOVAs discussed in [Chapter 7](https://danstich.github.io/worst-r/7-1-anova.html) to generalized linear mixed models discussed in [Chapter 15](https://danstich.github.io/worst-r/15-Chapter15.html).
But, first we need to do a little work to get ready. Here's what we are going to do:
**Step 1**: Configure the C++ toolchain on your operating system following the instructions provided by the Stan Development Team on the [RStan wiki](https://github.com/stan-dev/rstan/wiki/Configuring-C---Toolchain-for-Mac)
**Step 2**: Install RStan following the next step on the same wiki [here](https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started#installation-of-rstan).
**Step 3**: Install the `rstanarm` package by running: `install.packages("rstanarm")`.
Once you've done that, don't forget to load it:
```{r, warning=FALSE, message=FALSE}
library(rstanarm)
```
## Running a Bayesian model with `rstanarm`
Any time we use a convenient wrapper in R, we sacrifice a little control, but for getting exposure to Bayesian methods I think this is a worthwhile sacrifice. Truth be told, I write all of my models out by hand because I often need to use non-standard models such as non-linear growth models and I need that control. The convenience and flexibility offered through the model-fitting functions in `rstanarm` alone will keep you more than busy until you reach a deeper level of understanding needed to write your own should you aspire to do so.
Let's go ahead and estimate a binomial logistic regression that estimates the probability that a `hatched` chick is also `fledged`.
For the sake of simplicity, we'll create a new column in the `birds` data first that contains the number of birds that `died` instead of `fledged`. We will also drop the geese from the data set with `filter()`.
```{r}
birds <- birds %>%
filter(species != "goose") %>%
mutate(died = hatched - fledged)
```
The last thing we'll do is tell R to set up some options for Stan to make things work faster. **We only need to do this once per R session.**
```{r}
rstan::rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```
Now, we can fit our first Bayesian model.
Note that we use the `cbind()` function to combine the columns for birds that successfully `fledged` and those that `died`. We also tell R that the sampling distribution for our response is `binomial` (number of successes out of the total number) so it needs to estimate the response as a proportion or probability using the `logit` link function.
```{r}
chick_model <- stan_glm(formula = cbind(fledged, died) ~ species,
family = binomial(link = logit),
data = birds
)
```
This model tests the null hypothesis that the mean probability of chick survival from `hatched` to `fledged` is the same between `species` of bird on my farm. That means that this is an ANOVA-style GLM, but is estimated using Bayesian methods.
We can have a look at a quick `summary()` of our model in the usual way:
```{r}
summary(chick_model)
```
If you look closely at the output above, you should be able to recognize that just as with the `glm()` function we get a brief summary of our coefficients in the `Estimates` portion of the output. You should also notice that these estimates are on the logit scale, just as is the case with `glm()`. How neat.
### Interpreting the model summary
As for the rest of this summary, for now we will focus on a few specific things:
**1.** Do our estimates make logical sense? Not are they right or wrong, but is our estimate crap or not? If the 95% CRI goes from zero to one on the real scale here, then we probably have some issues with estimation because this means that we learned nothing new from the data (crap).
**2.** We need to look at the value of `Rhat`($\hat{r}$, the potential scale-reduction factor). This statistic is a diagnostic that can help us determine whether or not the model has even converged on an estimate for our parameter(s) of interest. It assesses the degree of mixing (agreement) between the chains that we used to get our estimates. Here, our values are about 1, so we are satisfied. Larger values would indicate a problem. We will examine some graphical diagnostics of mixing below.
**3.** We need to pay attention to `n_eff`. This quantity is the 'number of effective samples' that we have taken from the posterior. As discussed in class, the draws we take from the posterior can be auto-correlated to varying degrees depending on the sampler used. `n_eff` tells us how many independent samples we can actually consider ourselves to have drawn from the posterior. To have some degree of confidence in our parameter estimates, we want this to be at least several hundred. More is better, but these models can take a long time to run as they build in complexity, so there is a balance to be struck. If we are running long chains and we still are not achieving large `n_eff`, it is a pretty good indication that we need to increase our thinning rate or (more likely) consider an alternative model parameterization. We are good to go according to the output from the `chick_model` above. A quick way to pull these diagnostics out of the fitted model is sketched below.
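Here is a short sketch of extracting those diagnostics with helper functions from the `bayesplot` package (which we load again in the next section). I am assuming here that `rhat()` and `neff_ratio()` accept `rstanarm` model objects; if they do not in your version, the same numbers are printed by `summary(chick_model)`.
```{r, warning = FALSE, message = FALSE}
library(bayesplot)
# Rhat and effective-sample-size ratios for each parameter
# (assumes these helpers accept rstanarm fits)
rhat(chick_model)
neff_ratio(chick_model)
# Graphical versions: we want Rhat values near 1 and
# n_eff ratios that are not tiny
mcmc_rhat(rhat(chick_model))
mcmc_neff(neff_ratio(chick_model))
```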
## More diagnostics
In this section, we will examine some diagnostics to assess the convergence of our parameter estimates and identify any unusual trends in the chains we use for estimation. But, it takes a little while to get there.
First, a brief foray into how these models are estimated. Basically all Bayesian estimation methods rely on some variant of Monte Carlo (random) sampling. Depending on the algorithm, we draw a parameter value, plug it into the model likelihood and evaluate the likelihood of the parameter value relative to the data collected, and then make some decision about whether or not to keep that estimate as part of the posterior distribution (our final parameter estimate). We do this thousands of times, and all of the parameter values that we keep become part of the posterior parameter estimate. We use computer algorithms to determine which parameter values are considered, how they are chosen, and whether they should be retained. These algorithms rely on pseudo-random or guided walks called "Markov chains". Collectively the approach is referred to as Markov Chain Monte Carlo estimation. For each **iteration** of the model, we retain one value of each parameter for each **chain**, and we run multiple chains (usually 3-4). We then assess model stability by examining how well the chains *mix* ($\hat{r}$) and whether there are pathological issues related to autocorrelation between samples (`n_eff`). Stan and other software programs output each value of each parameter for each chain and each run of a Bayesian model. Therefore, instead of just getting a mean and 95% confidence interval, we get thousands of individual estimates that we can use to calculate whatever descriptive statistics or derived quantities we choose (e.g. Difference = post(Group A) - post(Group B)).
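To make that propose-evaluate-accept/reject loop less abstract, here is a toy Metropolis sampler for a single survival probability, written from scratch with made-up numbers. Stan actually uses a fancier algorithm (Hamiltonian Monte Carlo and the No-U-Turn Sampler), but the basic logic of wandering through parameter space and keeping values in proportion to their posterior support is the same flavor.
```{r}
# Toy Metropolis sampler: one survival probability p, with
# hypothetical data (19 fledged out of 22 hatched) and a flat prior
set.seed(42)
y <- 19
n <- 22
# Log posterior (up to a constant): binomial likelihood x Beta(1, 1) prior
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  dbinom(y, n, p, log = TRUE) + dbeta(p, 1, 1, log = TRUE)
}
n_iter <- 5000
chain <- numeric(n_iter)
p_current <- 0.5 # starting value
for (i in 1:n_iter) {
  # Propose a new value near the current one (random-walk proposal)
  p_proposed <- p_current + rnorm(1, mean = 0, sd = 0.1)
  # Accept the proposal with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(p_proposed) - log_post(p_current)) {
    p_current <- p_proposed
  }
  chain[i] <- p_current
}
# Drop the first 1,000 draws as warm-up, then summarize the posterior
keep <- chain[-(1:1000)]
mean(keep)
quantile(keep, probs = c(0.025, 0.975))
```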
### Trace plots
In the plot below, the x-axes are model iteration number. Each point on a line shows the parameter value drawn by each chain (4 chains are run by default) at each post-warmup iteration (1,000 per chain by default, for 4,000 draws in total). We can see that there is a high degree of overlap between the four chains, which should inspire some confidence because it means that all of the chains converged on a similar space in each panel (parameter) below.
```{r, fig.height=3, fig.width=6}
library(bayesplot)
mcmc_trace(chick_model, facet_args = list(scales = "fixed"))
```
### Collinearity
Next, we can take a look at the **pairs plot** to see whether there are any obvious correlations of concern between our parameters. As anticipated, everything looks okay here.
```{r}
mcmc_pairs(
chick_model,
np = nuts_params(chick_model),
off_diag_args = list(alpha = 0.25)
)
```
### Divergence
Not like the movies, like "does not follow expectation". Divergent transitions are an indication that something has gone wrong in your model. You can read more about them here someday, once you understand the first three paragraphs in this section of the Stan Manual [Chapter 15.5](https://mc-stan.org/docs/2_25/reference-manual/divergent-transitions.html). For now, understand that divergence in this sense is a bad thing and it probably means you have a problem with the model that you've created. It doesn't mean you have to throw your data in the garbage. There are some tools in Stan to account for this. But, this is usually a good sign that there is at least something in your model you could change to make it easier to estimate. This could mean changing the likelihood, transforming the data, or re-casting it in another way. Even though that can be scary, we still need to take a look!
By default, models fit using RStan will throw a warning if there are divergent transitions. You can get a quick look at whether or not this is an issue like so:
```{r}
mcmc_nuts_divergence(nuts_params(chick_model), log_posterior(chick_model))
```
Again, no problems here as far as we are concerned. We will address any further issues related to divergence as they arise later in the textbook or class.
## Model selection
Let's say we now want to say something about statistical significance of our group effects. There are a couple of ways in which we could do this. First, we could use model selection just like we demonstrated in [Chapter 11.5](https://danstich.github.io/worst-r/11-5-a-priori.html)...sort of. Instead of using AIC (or DIC or BIC or WAIC) this time we will use a leave-one-out cross-validation information criterion, LOO-IC from the `loo` package [(see web vignette here)](https://cran.r-project.org/web/packages/loo/vignettes/loo2-example.html).
But first, we'll need another model to which we can compare `chick_model`. Just as we did for `lm()` and `glm()`, we can create a `null` model against which we can evaluate the performance of `chick_model` to see if including species improves our understanding of chick survival.
```{r}
null_model <- stan_glm(formula = cbind(fledged, died) ~ 1,
family = binomial(link = logit),
data = birds
)
```
Beautiful, now we can do some cross validation. **This will take a hot minute!** On the Windows computer I am using, I need to use a single core for this, but you can use more if your computer will let you. Note that you will get a message here telling you that the model is being refit because observation 3 is a bit of a stinker.
```{r, warning = FALSE, message = FALSE}
null_loo <- loo(null_model, k_threshold = 0.70, cores = 1)
chick_loo <- loo(chick_model, k_threshold = 0.70, cores = 1)
```
And, finally, we compare the models:
```{r}
loo_compare(null_loo, chick_loo)
```
Based on this comparison (`loo_compare()` lists the best model first and reports each other model's difference in expected log predictive density, `elpd_diff`, along with its standard error), we can see that the model incorporating species is better supported than the null model, so we can reject the null. The interpretation is similar in spirit to the rules of thumb we used with AIC.
## Summarizing results
But, what if we want to say something about statistical significance without doing model selection? We can look at the derived difference of survival probabilities for `chicken` and `duck` using our model, as well. To do this, we need to get the posterior survival probabilities for each species on the real scale. This is actually pretty painless for the current example.
I am going to save the posterior draws to a new object called `ests` for "estimates" by converting the fitted model to a matrix with `as.matrix()`. This object will have one column for each of the parameters estimated in the model: the intercept (`(Intercept)`) and `speciesduck`. This ought to look really similar to the notation used in the output of `glm()`!
```{r}
ests <- data.frame(as.matrix(chick_model))
names(ests) <- c("Chicken", "Duck")
# The "Duck" column (speciesduck) is a difference from the intercept on
# the logit scale, so add the intercept to get duck survival
ests$Duck <- ests$Chicken + ests$Duck
```
If we are interested in the difference in survival between `chicken` and `duck`, we can now compare the two directly using these columns, but first we need to convert them from the logit (link) scale to the probability scale using the `invlogit()` function from the `rstanarm` package (in this case):
```{r}
survival_post <- data.frame(apply(ests, 2, invlogit))
```
And now, we can do all the descriptive statistics we like.
We could estimate the posterior means like this:
```{r}
apply(survival_post, 2, mean)
```
Or we could get quantiles of interest, for example 95% credible (high-density) intervals:
```{r}
apply(survival_post, 2, quantile, probs = c(0.025, 0.975))
```
We can even plot the predicted probability of survival for each species across MCMC iterations to visualize the posterior distributions as boxplots, histograms, or violins.
Pivot the posterior estimates into long format so we can feed them to `ggplot()` like our usual data sets, and then we'll demonstrate.
```{r}
plotter <- survival_post %>%
pivot_longer(cols = c("Chicken", "Duck"),
names_to="Species",
values_to="Survival") %>%
group_by(Species)
```
Now, we can plot our posterior samples just like we did in earlier chapters.
```{r}
# Get group-specific means from raw data for plotting
birds %>%
group_by(species) %>%
summarize(means = mean(fledged/hatched), .groups = "keep")
# Save these to a quick df
raw_stats <- data.frame(
Species = c("Chicken", "Duck"),
means = c(0.864, 1)
)
# Make the plot of our predictions against raw data
ggplot(plotter, aes(x = Species, y = Survival,
color = Species, fill = Species)) +
geom_jitter(alpha = 0.05, width = 0.05) +
geom_violin(alpha = 0.05, draw_quantiles = 0.5) +
geom_point(mapping = aes(x = Species, y = means),
data = raw_stats, color = "black", size = 1) +
theme_bw()
```
As we can see from this, there is a lot of uncertainty in the estimated survival of ducks, due primarily to the small sample size, but also because all of our duck cohorts experienced 100% survival.
Is the difference between these two real? We can actually calculate it to find out. Let's store the difference between species as a new column in this data set.
```{r}
survival_post$difference <- survival_post$Chicken - survival_post$Duck
```
Now, we can go on to calculate the mean difference:
```{r}
mean(survival_post$difference)
```
And we can even compare the 95% high density interval (HDI) on the difference to see if it includes zero as a credible value for the true difference. If zero is contained within the 95% HDI, then we fail to reject the null hypothesis that there is no difference in survival between `Chicken` and `Duck`.
```{r}
quantile(survival_post$difference, c(0.025, 0.975))
```
A quick look at a histogram of the difference will show us that we have basically no evidence to reject the null based on overlap with zero:
```{r}
ggplot(survival_post, aes(difference)) + geom_histogram(bins = 20)
```
## Next steps {#next-16}
There you have it: your first Bayesian data analysis. Needless to say (perhaps?) this was a simple example. However, it demonstrates a basic application and some of the tangible advantages over some of the other methods we have been using so far. We will continue to layer complexity into our Bayesian models as we move forward.