# Between-study Heterogeneity {#heterogeneity}
![](schiffchen.jpg)
By now, we have shown you how to pool effect sizes in a meta-analysis. In meta-analytic pooling, we aim to **synthesize the effects of many different studies into one single effect**. However, this only makes sense if we aren't comparing **apples and oranges**. For example, it could be the case that while the overall effect we calculate in the meta-analysis is **small**, a few studies still report **very high** effect sizes. Such information is lost in the aggregate effect, but it is very important to know whether all studies, or interventions, yield small effect sizes, or whether there are exceptions.
It could also be the case that some very **extreme effect sizes** were included in the meta-analysis, so-called **outliers**. Such outliers might even have distorted our overall effect, and it is important to know how our overall effect would have looked without them.
```{block,type='rmdinfo'}
The extent to which effect sizes vary within a meta-analysis is called **heterogeneity**. It is very important to assess heterogeneity in meta-analyses, as high heterogeneity may be caused by between-study differences. For example, a continuous variable, like the proportion of male participants, might influence the effect - or there might be two or more **subgroups** of studies present in the data, each with a different true effect. Such information can be very valuable for **research**, because it might allow us to identify interventions or populations for which effects are lower or higher.
Very high heterogeneity could even mean that the studies have nothing in common, and that there is no **"real" true effect behind our data**, meaning that it makes no sense to report the pooled effect at all [@borenstein2011].
```
## Heterogeneity statistics
There are **three types of heterogeneity measures** which are commonly used to assess the degree of heterogeneity. In the following formulas, $k$ denotes an individual study, $K$ denotes the total number of studies in our meta-analysis, $\hat \theta_k$ is the estimated effect of study $k$ with a variance of $\hat \sigma^{2}_k$, and $w_k$ is the **weight** of the study (i.e., its *inverse variance*: $w_k = \frac{1}{\hat \sigma^{2}_k}$; see the infobox in [Chapter 5.1.1](#fixed) for more details).
**1. Cochran's *Q* **
Cochran's *Q*-statistic is computed by taking the **difference between each observed effect size and the fixed-effect model estimate** of the effect size, and then **squaring, weighting and summing** these differences (a weighted sum of squared deviations around the fixed-effect summary effect).
$$ Q = \sum\limits_{k=1}^K w_k (\hat\theta_k - \frac{\sum\limits_{k=1}^K w_k \hat\theta_k}{\sum\limits_{k=1}^K w_k})^{2}$$
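To make this formula concrete, here is a minimal sketch of how $Q$ could be computed by hand. The vectors `theta_k` (observed effect sizes) and `v_k` (their sampling variances) are made-up illustrative values, not data used elsewhere in this guide.
```{r}
# Hypothetical observed effect sizes and their sampling variances
theta_k <- c(0.10, 0.35, 0.52, 0.21, 0.45)
v_k <- c(0.030, 0.020, 0.015, 0.025, 0.010)
# Inverse-variance weights
w_k <- 1 / v_k
# Fixed-effect summary estimate (weighted mean of the effect sizes)
theta_fixed <- sum(w_k * theta_k) / sum(w_k)
# Cochran's Q: weighted sum of squared deviations from the fixed-effect estimate
Q <- sum(w_k * (theta_k - theta_fixed)^2)
Q
```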
**2. Higgins' & Thompson's *I*^2^ **
$I^{2}$ [@higgins2002quantifying] is the **percentage of variability** in the effect sizes which is not caused by sampling error. It is derived from $Q$:
$$I^{2} = \max \left\{0, \frac{Q-(K-1)}{Q} \right\}$$
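Continuing the sketch above, $I^2$ follows directly from $Q$ and the number of studies $K$:
```{r}
# Number of studies in the hypothetical example
K <- length(theta_k)
# I^2: share of variability not attributable to sampling error, truncated at 0
I2 <- max(0, (Q - (K - 1)) / Q)
I2 * 100 # expressed as a percentage
```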
**3. Tau-squared**
$\tau^{2}$ is the between-study variance in our meta-analysis. It is an estimate of the variance of the underlying distribution of true effect sizes. As we show in [Chapter 5.2.1](#tau2), there are various proposed ways to calculate $\tau^{2}$.
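As a concrete illustration, one of these estimators - the DerSimonian-Laird method - can also be computed from $Q$ and the study weights. Continuing the hypothetical sketch above:
```{r}
# DerSimonian-Laird estimator of the between-study variance tau^2
C <- sum(w_k) - sum(w_k^2) / sum(w_k)
tau2_DL <- max(0, (Q - (K - 1)) / C)
tau2_DL
```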
```{block, type='rmdachtung'}
**Which measure should I use?**
Generally, when we assess and report heterogeneity in a meta-analysis, we need a measure which is **robust, and not too easily influenced by statistical power**.
**Cochran's *Q* ** increases both when the **number of studies** ($K$) increases, and when the **precision** (i.e., the sample size $N$ of a study) increases. Therefore, whether $Q$ is **significant** highly depends on the size of your meta-analysis, and thus on its statistical power. We should therefore not rely on $Q$ alone when assessing heterogeneity.
**I^2^**, on the other hand, is not sensitive to changes in the number of studies in the analysis. $I^2$ is therefore used extensively in medical and psychological research, especially since there is a **"rule of thumb"** for interpreting it [@higgins2003measuring]:
* *I*^2^ = 25%: **low heterogeneity**
* *I*^2^ = 50%: **moderate heterogeneity**
* *I*^2^ = 75%: **substantial heterogeneity**
Despite its common use in the literature, $I^2$ is not always an adequate measure of heterogeneity either, because it still heavily depends on the **precision** of the included studies [@rucker2008undue; @borenstein2017basics]. As mentioned before, $I^{2}$ is simply the proportion of variability **not caused by sampling error**. If our studies become increasingly large, the sampling error tends towards **zero**, while at the same time $I^{2}$ tends towards 100%, simply because the individual studies have a greater $N$. Relying only on $I^2$ is therefore not a good option either.
**Tau-squared**, on the other hand, is **insensitive** to both the number of studies **and** their precision. Yet, it is often hard to judge how relevant our tau-squared is from a practical standpoint.
```
<br><br>
---
## Assessing the heterogeneity of your pooled effect size
Thankfully, once you've pooled your effects in a meta-analysis using the `rma()` function from the `metafor` package, it is very easy and straightforward to retrieve the **three most common heterogeneity measures** that we described before.
In [Chapter 5.2](#random), we already showed you how to conduct a **random-effects model meta-analysis**. In that example, we stored our **results** in the object `m_re`, which we will use again here.
```{r, echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(knitr)
library(metafor)    # provides the rma() function used for pooling
library(metaforest) # provides the 'curry' example dataset
df <- curry
m_re <- rma(yi = df$d, # The d-column of the df, which contains Cohen's d
vi = df$vi) # The vi-column of the df, which contains the variances
```
One way to get the heterogeneity measures for our meta-analysis is to **print** the meta-analysis output (in our case, `m_re`) again. Just running the name of an object calls the `print()` method for that object; notice that these two commands do the same thing:
```{r}
m_re
print(m_re)
```
We see that this output **already provides us with all three heterogeneity measures** (and even one more, *H*^2^, which we will not cover here).
* $\tau^{2}$, as we can see from the `tau^2` output, is **0.0570**.
* $I^{2}$ is printed next to `I^2`, and has the value **67.77%**.
* The value of $Q$ is displayed next to `Q` under `Test for heterogeneity:`.
As we can see, the value is **156.9109**. In our case, this is significant ($p < 0.001$; see `p-value`).
* We can also obtain a **prediction interval**, which we plotted in the forest plot, by running `predict(m_re)` (see the code sketch after this list). As we can see, the 95% prediction interval (`cr.lb` and `cr.ub`) ranges from **g=-0.2358** to **0.7143**. It is calculated as approximately $\hat{\mu} \pm 1.96 \times \hat{\tau}$: the estimated summary effect, with a 95% window based on the estimated standard deviation of the true effects (the exact calculation also incorporates the uncertainty of the summary effect estimate itself).
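If you prefer to work with these numbers directly (for example, when reporting results), they can also be extracted from the `rma` object instead of being read off the printed output. A minimal sketch, reusing the `m_re` object from above:
```{r}
# Extract the heterogeneity statistics stored in the rma object
m_re$tau2 # between-study variance tau^2
m_re$I2   # I^2, in percent
m_re$QE   # Cochran's Q
m_re$QEp  # p-value of the test for heterogeneity
# Prediction interval for the true effects
predict(m_re)
```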
How can we interpret the values of this example analysis? Well, all of our indicators suggest that **moderate to substantial heterogeneity is present in our data**. Given the **broad prediction interval**, which stretches well below zero, we also cannot be overly confident that the positive effect we found for our interventions is robust in every context. It may well be that the intervention does not yield positive effects in some future scenarios; even a small negative effect might be possible based on the evidence this meta-analysis gives us. Very high effect sizes, on the other hand, are possible too.
<br><br>
---
## Detecting outliers & influential cases
As mentioned before, **between-study heterogeneity** can also be caused by one or more studies with **extreme effect sizes** which don't quite **fit in**. Especially when the **quality of these studies is low**, or the **studies are very small**, this may **distort** our pooled effect estimate, and it's a good idea to have a look at the **pooled effect again once we remove such outliers from the analysis**.
On the other hand, we also want to know **whether the pooled effect estimate we found is robust**, meaning that the effect does not depend heavily on **one single study**. Therefore, we also want to know **whether there are studies which heavily push the effect of our analysis in a particular direction**. Such studies are called **influential cases**, and we'll devote some time to this topic in the [second part](#influenceanalyses) of this chapter.
```{block,type='rmdachtung'}
It should be noted that there are **many methods** to spot **outliers and influential cases**, and the methods described here are not comprehensive. If you want to read more about the underpinnings of this topic, we can recommend the paper by Wolfgang Viechtbauer and Mike Cheung [@viechtbauer2010outlier].
```
### Searching for extreme effect sizes (outliers) {#outliers}
A common method to detect outliers directly is to define a study as an outlier if the **study's confidence interval does not overlap with the confidence interval of the pooled effect**.
To detect such outliers in our dataset, the kind of logical filtering we introduced with the `filter` function from the `dplyr` package in [Chapter 3.3.3](#filter) comes in handy again.
Using this approach, we can search for all studies:
* for which the **upper bound of the 95% confidence interval of the study is lower than the lower bound of the pooled effect confidence interval** (i.e., extremely small effects)
* for which the **lower bound of the 95% confidence interval of the study is higher than the higher bound of the pooled effect confidence interval** (i.e., extremely large effects)
Here, I'll use the `m_re` meta-analysis output from [Chapter 5.2.2](#random) again. Let's see what the **upper and lower bound** of the pooled effect confidence interval are:
```{r}
m_re$ci.lb
m_re$ci.ub
```
The **pooled effect confidence interval** stretches from $g = 0.16$ to $g = 0.32$. We can use these values to filter out outliers now.
To filter out outliers, we will use a Boolean (`TRUE`/`FALSE`) filter variable. We first calculate the 95% CI for each study's effect size, using the standard error of the effect size (`sqrt(vi)`). Then, we create a new filter variable, `outlier`, with the value `TRUE` if the CI of the effect size lies outside the CI of the pooled effect. We use some logical operators here: `<`, `>`, and `|`. The first two mean "smaller than" and "bigger than", and are probably familiar. The third one, `|`, means "or". So the statement `df$upperci < m_re$ci.lb | df$lowerci > m_re$ci.ub` means: the upper CI bound of the effect is smaller than the lower CI bound of the pooled effect, OR the lower CI bound of the effect is bigger than the upper CI bound of the pooled effect.
```{r}
# Calculate CI for all observed effect sizes
df$upperci <- df$d + 1.96 * sqrt(df$vi)
df$lowerci <- df$d - 1.96 * sqrt(df$vi)
# Create filter variable
df$outlier <- df$upperci < m_re$ci.lb | df$lowerci > m_re$ci.ub
# Count number of outliers:
sum(df$outlier)
```
We see that there are **eight potential outliers**. Let's examine the effect sizes of these outliers. We can use the `df$outlier` variable to select only the flagged outliers, and we can request the effect size-related variables to get some idea of whether the outliers are mostly on the positive or negative side:
```{r}
# Look at effect sizes for potential outliers
df[df$outlier, c("d", "upperci", "lowerci")]
```
Based on this output, it's hard to determine whether the outliers might bias the estimate. Let's plot a histogram of the effect sizes, with different colours for the flagged outliers. For this, we will use the popular plotting package `ggplot2`, which builds plots in cumulative steps (layers):
```{r}
# Load ggplot
library(ggplot2)
# Make a basic plot, based on the data in df, and specify that the x-variable is
# the effect size, 'd', the colour and fill of the histogram bars are based on
# the value of 'outlier':
ggplot(data = df, aes(x = d, colour = outlier, fill = outlier)) +
# Add a histogram with transparent bars (alpha = .2)
geom_histogram(alpha = .2) +
# Add a vertical line at the pooled effect value (m_re$b[1])
geom_vline(xintercept = m_re$b[1]) +
# Apply a black and white theme
theme_bw()
```
It looks like the flagged potential outliers are spread fairly evenly across both sides of the pooled effect. Thus, there is no clear indication of bias.
Note that we can plot essentially anything using `ggplot`, so it's an extremely useful package for understanding your data visually.
### Sensitivity analysis
We can also do a sensitivity analysis, to check how much the pooled effect size changes if we omit all potential outliers:
```{r}
m_no_outliers <- rma(yi = df$d[!df$outlier],
vi = df$vi[!df$outlier])
# Print the rounded pooled effect and CI
cat("G = ", round(m_no_outliers$b[1], 2),
", 95% CI [", round(m_no_outliers$ci.lb, 2),
", ", round(m_no_outliers$ci.ub, 2), "] (no outliers)", sep = "")
cat("G = ", round(m_re$b[1], 2),
", 95% CI [", round(m_re$ci.lb, 2),
", ", round(m_re$ci.ub, 2), "]", sep = "")
```
Again, the conclusion is that these outliers hardly bias the pooled effect.
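Beyond the pooled effect, we might also want to check how much the heterogeneity estimates themselves change when the potential outliers are removed. A minimal sketch, reusing the `m_re` and `m_no_outliers` objects from the chunks above:
```{r}
# Compare heterogeneity estimates with and without the potential outliers
round(c(tau2_all = m_re$tau2, tau2_no_outliers = m_no_outliers$tau2), 4)
round(c(I2_all = m_re$I2, I2_no_outliers = m_no_outliers$I2), 2)
```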
<br><br>
---