06-inferential.Rmd

# Inferential statistics {#Chapter6}

<img src="images/peppers.jpg" alt=""><br>
<p style="font-family: times, serif; font-size:.9em; font-style:italic">
These are hot peppers. Like hot peppers, statistics can cause pain and heartburn if you are not accustomed to them. Ready or not, let's get munching!!</p>

This week we will begin conducting our first statistical tests! We are going to start small and simple, and we will build complexity during the remainder of the semester. We will also start to make more use of some of the programming techniques that you have been developing, and we will build a foundation for moving into regression models in coming weeks.

We'll start with some simple methods for testing hypotheses about sampling distributions this week. Although relatively limited in scope within the fields of biology and ecology, these tend to be fairly robust tests, and can be powerful tools if studies are designed thoughtfully. For this week, we will focus on implementation of one-sample t-tests, two-sample t-tests, Wilcox tests, and frequency analysis using a $\chi^2$ test. Within the context of the assumptions of these tests we will also discuss the F-test and the Shapiro-Wilk test of normality. In short, you probably will learn more statistical tests in this preliminary chapter about statistical inference than you have in your college career to this point. Take your time and soak in all the mathy goodness. We'll need it!

For this Chapter, we will continue working with packages from the `tidyverse`. You can go ahead and put this in the top of your code for the chapter if you want to load it all at once:
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
```

We will also need the grass carp data for this exercise, which we will load from `grasscarp.csv`. Remember that you can download all of the class data <a href = "">here</a> or you can get the individual `grasscarp.csv` file by clicking <a href = "https://raw.githubusercontent.com/danStich/worst-r/master/data/grasscarp.csv">here</a> and saving with `Ctrl + S` (Windows) or `Command + S` (Mac OS-X).

These data come from a long-term study of fish population responses to changes in their primary food source, the invasive hydrilla (*Hydrilla verticallata*). There are a whole bunch of columns in here! The important variables for this chapter are `Year` (year of fish collection), `Age` (the age of each fish), `Length` (total length of fish in mm), and `hydrilla` (hectares of hydrilla measured each `Year`).
    
## One-sample tests
Sometimes, we are interested in simply knowing whether or not the measurements we've obtained from an individual or a group are representative of a larger population. For example, we may have a 'control' group in an experiment and we want to know if the group is truly representative of the population average or some measurement we have collected from a different biological population. For these situations, we will rely on one-sample tests this week and we'll look at other (totally related) options moving forward.
    
### One sample t-test
We will examine parametric and non-parametric examples of one-sample tests here to demonstrate why and how we use them.

Let's start with a simple example of how we might do this, and what the results actually mean. We'll use some data from grass carp (*Ctenopharyngodon idella*) from Lake Gaston, Virginia and North Carolina, USA for this example. We will compare the size of grass carp at specific ages with their population density using a few different tools

Read in the data set:
```{r}
  grasscarp <- read.csv('data/grasscarp.csv')
```

Just for funsies, you could also read this in directly from the link to the raw data in the [GitHub repository for this book](https://github.com/danStich/worst-r/tree/master/data) if you have an internet connection:
```{r}
grasscarp <- read.csv('https://raw.githubusercontent.com/danStich/worst-r/master/data/grasscarp.csv')
```

Remember to check out the data set in your Environment tab so you understand how many observations there are and how many variables (as well as their types).

Let's start by asking a simple biological question: is the size of age-3 grass carp different from the average size of fish in this population?

First, let's create a sample that includes only age-3 fish. We will store this to a new vector called `age3_lengths`.
```{r}
age3_lengths <- grasscarp$Length[grasscarp$Age == 3]
```

Now, let's compare the `Length` of age-3 fish to the rest of the population using a one-sample t-test. To do this, we need to pass `age3_lengths` to the `t.test()` function as our observed `x`. We'll also specify that we want to compare the observed sample to the population mean (`mu`) that we specify on the fly. We will tell R to use a default confidence level (`conf.level`) of 95%. Finally, we will save the output of our test to a new object, creatively named `our_test`.
```{r}
# Run the test and save the output to an object
our_test = t.test(age3_lengths,
                  mu = mean(grasscarp$Length),
                  conf.level = 0.95
                  )

# Print the results of the object to the console
print(our_test)
```
    
Okay, so what does that mean???

> First, let's look at what we've done here.

We've conducted a one-sample t-test. 

The null hypothesis was that:

> (H~0~): the sample (age-3 fish) did not differ in `Length` from the mean of the population. 

This is because we stated no specific alternative hypothesis when we executed the t-test above. If we had used a different alternative hypothesis (i.e. `greater` or `less` in the argument `alternative`) then our null would be formalized as: "The length of age-3 fish is not significantly greater (or less) than the population mean". 

Finally, we specified the confidence level. Here, we are told R that we want to know the result with a confidence level of 95% (0.95). This corresponds to a Type-I error rate ($\alpha$) of 0.05. This means we are looking for _p_ < 0.05 to conclude that the sample is statistically different from the population mean. Since p < 0.001, we reject the H~0~ and conclude that age-3 fish are significantly shorter than the population mean.
    
#### Output

R returns the output of statistical tests as objects, and you can reference any part of those objects by name or index. The type of object that R returns, and how you access the parts depends on the type of test you ran and with what options.

Like so many other models objects, our one sample t-test is stored as a list:
```{r}  
str(our_test)
```      

We can just look at the names if there are specific pieces in which we are interested. For example, we might want to save the p-value (`p.value`):

```{r}    
# Shows us the names of the things inside the model list
  names(our_test) 

# This stores our p-value to an object for use
  p_out = our_test$p.value 
  
# And of course we can look at it
  print(p_out)    
  
```
    
Now we can go through the output as it is displayed by:

```{r}
# Print a summary of the test
  print(our_test)

```

The first line of the output gives us the actual data that with which we are working- nothing of interest here other than a quick sanity check until later on in the course.

The second line shows the 'statistics' that we are interested in: `t` is the calculated value of the test statistic for the t-test in this case. The `df`, or degrees of freedom, is the number of observations in the sample, minus the number of parameters that we are estimating (in this case, just one: the mean). Our `p-value` is the probability of observing data that are more extreme than what we observed if the null hypothesis is in fact true (i.e. the probability that rejection of the null is inappropriate). Again, because it is smaller than $\alpha$ we reject the null and accept the alternative hypothesis.

Our `alteranive hypothesis` (H~A~) was that the sample mean is not equal to population mean. We can specify other alternatives (and therefore nulls) in this and other models in R.

Finally, R reports the mean and the 95% confidence interval of `age3_lengths`. 
    
#### Assumptions
It's always important for us to think about the assumptions that we are making when (read *before*) conducting a statistical test. First, there are implicit assumptions that we make. For example, we assume that the data are representative of what we are trying to measure and were collected in a random manner with respect to other potentially confounding factors in this case. Then, there are explicit assumptions that we make for specific tests.

For the one-sample t-test, the assumption that we really care about is:

1. The data are normally distributed
  
The t-test is generally robust to violations of this assumption provided that sample sizes are large enough (Google "Central Limit Theorem", this is The Worst Stats Text eveR). But, it is always good to check. In particular, when we are working with small sample sizes like this example (n = ```r length(age3_lengths)```), we should really make sure that things look okay or find an alternative tool. 
    
#### Checking assumptions
<h5 id="multi"> Visual check for normality </h5>

One simple way to assess our assumption of normality is to look at a plot of the data. As you will see later, we are usually concerned with the *residuals*, but we can look at the actual data here because we have only one group and if it's normal so are its errors.

Have a quick look at these to see what we are working with using the histogram code from [Chapter 4](#Chapter4). I set the x-axis limits below using the maximum `Length` from the `grasscarp` data so we can see what part of the length range we've sampled here.

```{r, warning = FALSE, message= FALSE}
ggplot() + 
  geom_histogram(aes(age3_lengths), bins = 30) + 
  scale_x_continuous(limits=c(0, max(grasscarp$Length)), expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0)) + 
  xlab("Total length (mm)") +
  ylab("Count") +
  theme_classic() +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3),
    panel.grid = element_blank()
  )

``` 

Sure, looks totally normal to me? Wouldn't it be great if there were a statistical test for determining whether this sample is different from the normal? Great that you should ask.     
     
##### Tests of normality (Shapiro-Wilk)

The Shapiro-Wilk test is commonly used to test normality of a distribution as a check of assumptions. We can use this to test whether our data deviate from normal in the following manner:
 
```{r}
shapiro.test(age3_lengths)
```

First, note that the test statistic is the `W` statistic for this test.

Second, we have a p-value of ```r shapiro.test(age3_lengths)$p.value```. **Oh, no**! Wait, what does that mean? 

For this test, we actually don't want p < 0.05 if we are relying on assumptions of normality, so this is a "good" thing. But, it doesn't necessarily mean `age3_lengths` is normally distributed. It just means that we can't tell if the sample we have collected is different from normal (we fail to reject the null but can't "accept" it). I guess that is good enough, but that p-value is awfully close to 0.05 for my taste.

So, are we up that proverbial tributary without a paddle, or can we salvage the mess and move on with life? Don't worry, there's a statistical test for that, too.
    
   
### Wilcox test

You can think of the Wilcox test as a non-parametric analog of the t-test. In general, non-parametric tests tend to be slightly more "conservative" than the parametric alternatives because they require fewer assumptions. However, non-parametric tests can be useful where our data are not normal, or we don't feel we have sufficient data to say this with confidence (hmm...maybe don't conduct any tests in that case!).

For the Wilcox test, we are checking for shifts in the median (not the mean) of one or more samples. 

Why is this? The mean of a non-normal distribution is not always a useful descriptor of the probability mass under a distribution (it still describes 'central tendency' but does not necessarily describe the place where 'most of the data are'). But, the median always (as in always, always, always) describes central tendency of the data, so we can pretty much use it for describing any sample. This is because the median is defined as the "middle" value. That is, half of the data should fall on either side of the median if you lined up all of your data on the equator (wrong metaphor?).

...back to the Wilcox test.
 
```{r, eval=FALSE}
# First, do this and have a quick read:
  ?wilcox.test
```
    
We can use this test to see if the median length of age-3 fish is statistically different from the median value of length in the sample.

```{r}
    wilcox.test(
      age3_lengths,
      mu = median(grasscarp$Length),
      alternative = 'less',    # Note different alternative here!
      exact = FALSE            # Not much data, so not exact
    )
```
    
Interpreting the results is essentially the same as for the t-test, but without the degrees of freedom, so we won't belabor this. Importantly, the test, being robust to any distributional assumptions, should also (and does) tell us that the length of age-3 fish is significantly shorter than the population mean (or median - whichever you used).
    
    
## Two-sample tests 

Okay, with that out of the way, now we can do some tests that might be a little more meaningful to most people. We can use **two-sample** tests to determine whether two groups differ in some metric of interest. This lends itself naturally to use in controlled experiments that we conduct in laboratories, for example.
    
    
### The two-sample t-test

If you have been exposed to only one statistical test already it is probably the two-sample t-test. This is a test that is used to test for differences in a continuous dependent variable between two groups. The test statistic itself is pretty trivial to calculate. You can find a video of that [here](https://www.khanacademy.org/math/ap-statistics/two-sample-inference/two-sample-t-test-means/v/two-sample-t-test-for-difference-of-means). **Seriously, if you have never done a t-test, watch the 6-minute video now**. Otherwise, you may not understand what follows. I am not going to go into the math here because this is The Worst Stats Text eveR. The video will also help you understand how ANOVA and other tests work later. Understanding how these tests work will give you phenomenal cosmic powers when it comes to analyzing biological data. If you email asking me how a t-test works, I am going to send you this video.

Let's keep working with the `grasscarp` data for now for the sake of consistency. But, now we want to know if there is a difference in mean length of fish depending on whether their population density is high or low. To look at this, we'll need to make some groups in our `grasscarp` data that correspond to years of high and low density.

You can compare fish density between years quickly using the summary pipeline demonstrated in [Chapter 3](#Chapter3)
```{r, warning=FALSE, message=FALSE}
grasscarp %>%
  group_by(Year) %>%
  summarize(dens = unique(nha))
```

You can see that density was much higher in `2017` than in any of the preceding years. This is because hydrilla area was reduced by several hundred hectares (`ha`) between 2010 and 2014 (which was actually the reason we went out to collect more data in 2017). But, these are just means and we need to be able to account for the variability in these measurements to call it science.

So, let's build some groups based on high and low density years. First, we'll add a new categorical variable to `grasscarp` called "`density`", and we'll fill it all in with the word `"low"` because there is only one year when density was high.
```{r}
grasscarp$density <- "low"
```

Next, we'll change all of the observations for 2017 to `"high"` so we have low density and high density groupings in our `density` column. This way, we only have to change the variable for one year.
```{r}
grasscarp$density[grasscarp$Year == 2017] <- "high"
```

Then, we'll subset the data to look at a single age so our comparisons are fair between years. I picked `Age == 10` because 10 years is in the middle of the range of ages in the data set. You can try it with another age as long as there are enough data.
```{r}
mid_carps <- grasscarp %>% subset(Age == 10)
```

Now, we can conduct our two-sample t-test!

The syntax is pretty straightforward, and is similar to what we used above, except that now we have two groups so we will omit `mu` and specify the t-test as a formula with independent (x, `density`) and dependent (y, `Length`) variables. There is no pairing of our observations, so we specify `paired = FALSE`, and we tell R we don't want to assume that the variance of `Length` is equal between `density` groups.

```{r}
  t.test(Length ~ density,
         data = mid_carps,
         paired = FALSE,      # 2-sample test, not "paired"
         var.equal = FALSE,   # We make no variance assumption
         conf.level = 0.95    # Alpha = 0.05
    )
```

```{r, echo=FALSE, warning=FALSE, message=FALSE}
 test <- t.test(Length ~ density,
         data = mid_carps,
         paired = FALSE,      # 2-sample test, not "paired"
         var.equal = FALSE,   # We make no variance assumption
         conf.level = 0.95    # Alpha = 0.05
    )
  
```

The interpretation of the results is much the same as with the one-sample t-test, except that we are now testing the null hypothesis that there is no difference between groups.

We reject the null hypothesis, and we conclude that age-10 fish were significantly larger during periods of low population density than they were during years of high population density (*t* = ```r test$statistic```, df = ```r test$parameter```, *p* < 0.05). Makes perfect sense!

    
#### Assumptions
<h5 id="multi"> Equal variance <h5>

Now that we are using two samples, we should be cognizant that this test assumes equal variances in the independent variable between our two groups. If our variances are not equal, then we need to account for that (R actually assumes that the variances are unequal by default).  

Let's test to see if the variances were equal between age-10 fish in the `high` and `low` density groups. To do this, we will conduct an F-test on the ratio of the two variances. If the ratio of the variances is different than one, we reject the null that the variances are the same.
```{r}
var.test(Length~density, mid_carps)
```

Wow, this is way to easy. I hope that you are beginning to understand the __GLORY OF R__. This test could be a real pain in other software programs, and may not even be an option in many.

Back on topic...we fail to reject the null hypothesis that the variances were equal. In this case, we now feel validated in the use of a two-sample t-test regardless of what R uses as the default (yes, sarcasm intended).


##### Normality

Yes, we are still worried about this one because of the reasons given in the previous section. We can check this the same way as before. End of section.

### Two-sample Wilcox test

If we were in violation of normality, we would use the Wilcox test to test for differences in ranks. I will not go through the whole thing again here. As with the t-test, if you have not been exposed to doing a rank-sum test by hand you really should [**watch a video of how to do it**](https://www.youtube.com/watch?v=AM87jjnNt8U). It really is easy once you've seen it and the video demystify the test for you.

I will note that the syntax is very much the same to that of the t-test now. This will pretty much stay the same for the next 6 chapters. Thank R, not me. 

```{r, warning = FALSE, message = FALSE}
wilcox.test(Length~density, mid_carps)
```

As expected, this test also shows that the two samples differ significantly.

*Note: this is equivelent to the Mann-Whitney U-test you may have learned about elsewhere. Had these samples been paired, R would have defaulted to a signed-rank test, with which you may also be familiar.*


### Presenting your results

While it is important to report the test statistics, df, etc., it can be just as meaningful to give the sample means (reported in the `t.test`) and show a graph. **Remember**: don't make shitty graphs. Be proud of your results and show your readers what they mean.

In this case, a boxplot or a violin plot would work great. We haven't looked at violin plots yet, so let's give them a whirl! 

Violins are a lot like box plots except they give us a little better visual description of the shape of sampling distributions within groups. I added some `ggplot2` functions to control fill and color of the violins in the example below. You can check out [this blog post](https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/#predefined-ggplot-color-palettes) for some other cool examples with other `ggplot` geometries. Play around with the plotting code above to change what you like. Remember, all of the customization achieved using the `theme()` function is the same across plot types.

Here is a quick, ugly violin plot with some basic options. Pretty easy to make, but also kind of makes you want to puke.
```{r}
ggplot(mid_carps, aes(x = density, y = Length)) + 
  geom_violin(aes(group = density), trim = FALSE)
```

Here is a much better plot. Not that much more difficult to make, and doesn't make you want to puke even if the code does a little bit.
```{r}
mid_carps %>%
ggplot(aes(x = density, y = Length, fill = density, color = density)) + 
  geom_violin(aes(group = density), trim = FALSE, size = .75) +
  scale_x_discrete(breaks=c("high", "low"), labels = c("High", "Low")) +
  scale_fill_grey(start = 0.9, end = 0.4) +
  scale_color_grey(start = 0.8, end = 0.3) +
  xlab("Density") +
  ylab("Total length (mm) at age 10") +
  labs(fill = "Density", color = "Density") +
  theme_bw() +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3),
    panel.grid = element_blank()
  )
```

There, isn't that fancy? I think that gives you a much more detailed understanding of how `Length` varies between `high` and `low` population density than a simple p-value. But, maybe its just me...

## Frequency analysis

Now, what if we didn't collect very good data, or we binned our data into low-resolution categories for the sake of ease in our study design? Often, and for a variety of reasons other than crappy data collection, we want to compare frequencies of events between two (or more) groups. We may even design studies specifically to test these kinds of hypotheses when we think about rates, for example. This is very common in studies of population genetics [definitely citations available for that one - go Google them]

The simplest way to test for differences in the frequency of a categorical response between two groups is (some would argue) the $\chi^2$ test. The $\chi^2$ is another one of those that you should really work out by hand because it is used in a variety of settings "under the hood" of more complex routines. Here is your token [video link showing an example](https://www.youtube.com/watch?v=V4SRgabFbz0). Watch it. Please.

### Worked example

Let's say we want to know if the number of grass carp in a given age group (say age 10) varies between years. These fish are sterile hybrids, so we would expect that the number of fish in each age would change drastically with increasing time since the year of initial stocking (1995).

First, make a table showing the number of fish in each `Age` by `Year` with the `grasscarp` data.
```{r}
agexyear <- with(grasscarp, table(Age, Year))

print(agexyear)
```

Basically what we are going to do is analyze the proportion of total fish in each column by age.

You should see some pretty obvious patterns here. We have a couple of things to think about now. First, this is the kind of question you don't need statistics for. Second, we have a whole bunch of empty groups, and these are not random with respect to year. Some of these come from ages that were not yet available in years 2006 - 2010 and some from patterns in fish stocking. The large number of empty pairings and the fact that most age classes had fewer than five fish in any year prior to 2017 means we should probably break the data down a little further. This stinks because we lose resolution, but that is the cost.

For the sake of demonstration, let's summarize the data by `high` and `low` density again and we'll look at the number of fish collected in each age class during high and low density years.

```{r, warning=FALSE, message=FALSE}
freqs <- grasscarp %>% # Pass grass carp data frame to group_by()
    filter(Age %in% c(10:15)) # Select only shared age range

head(freqs)
```

We will test the null hypothesis that there is no difference in the number of age-10 fish between high and low densities.

```{r}
# Run the test
chi_test <- with(freqs, chisq.test(x = density, y = Age))

# Have a look
print(chi_test)
```

And, bam! We see that there is a difference in the frequency of fish collected in each age class in high and low density years. Shocker. 

Data visualization techniques for contingency table analyses like this seem to have generally lagged behind theory in terms of wide-spread implementation. There is a base R `mosaicplot` that plots relative freqencies. You can interpret the width of the bars as the proportion of total observations in each age class. Likewise, the height of the vertical segnmens corresponds to proportion of `high` or `low` `density` observations in each `Age`.

```{r}
mosaicplot(Age ~ density, data = freqs)
```

The need for improved graphical representation for these data types is recognized. There have been recent efforts to extend the philosophies used `ggplot()` to contingency analysis by r developers (see <a href="http://vita.had.co.nz/papers/prodplots.pdf">Wickham and Hofmann 2011</a>). It was even the topic of a recent master's thesis (<a href="https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=7144&context=etd">see Grant 2017</a>). But as far as I know the ideas from these works have not been integrated into `ggplot2` or the `tidyverse` yet. Sorry about the citations. I don't know what I was thinking. This is supposed to be the Worst Stats Text eveR.


## Next steps {#next6}
In this chapter, we introduced inferential statistics and walked through examples of a few simple statistical tests for comparing samples to one another using two-sample tests, or to a single value using a one-sample test. In [Chapter 7](#Chapter7) we will continue to build on these tools as we press on to linear models and the rest of statistics.