91-appendixA.Rmd

\cleardoublepage

# (APPENDIX) Appendix {-}

# Statistical Background {#appendixA}

## Basic statistical terms

### Mean

The mean, also known as (AKA) the average, is the most commonly reported measure of center.  It is commonly called the "average" though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have $n$ data points, the mean is given by: $$\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Note that you will often see shorthand notation for a sum of numbers using $\sum$ notation. For example, we could rewrite the formula for $\bar{x}$ as:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}=\frac{\sum_{i = 1}^n x_i}{n}$$

We've simply replaced the subscript on each $x$ with a generic index $i$, and use the $\sum$ notation to indicate we are summing $x$s with indices (i.e. subscripts) that go from 1 to $n$. When summing numbers in statistics, we're almost always dealing with indices that start with the value 1 and go up to a value equal to a sample size (e.g. $n$), so often you will see an even more shorthand version, where it's assumed you're summing from $i = 1$ to $i = n$:

$$\bar{x} = \frac{\sum x_i}{n}$$

### Median

The median is calculated by first sorting a variable's data from smallest to largest.  After sorting the data, the middle element in the list is the **median**. If the middle falls between two values, then the median is the mean of those two values.

### Standard deviation

We will next discuss the **standard deviation** of a sample dataset pertaining to one variable.  The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far to expect a given data value is from its mean:

$$Standard \, deviation = \sqrt{\frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i= 1}^n(x_i - \overline{x})^2 }{n - 1}}$$

### Five-number summary

The **five-number summary** consists of five values:  minimum, first quantile AKA 25^th^ percentile, second quantile AKA median AKA 50^th^ percentile, third quantile AKA 75^th^, and maximum.  The quantiles are calculated as

- first quantile ($Q_1$): the median of the first half of the sorted data
- third quantile ($Q_3$): the median of the second half of the sorted data

The _interquartile range_ is defined as $Q_3 - Q_1$ and is a measure of how spread out the middle 50% of values is. The five-number summary is not influenced by the presence of outliers in the ways that the mean and standard deviation are.  It is, thus, recommended for skewed datasets.

### Distribution

The **distribution** of a variable/dataset corresponds to generalizing patterns in the dataset.  It often shows how frequently elements in the dataset appear.  It shows how the data varies and gives some information about where a typical element in the data might fall.  Distributions are most easily seen through data visualization.

### Outliers

**Outliers** correspond to values in the dataset that fall far outside the range of "ordinary" values.  In regards to a boxplot (by default), they correspond to values below $Q_1 - (1.5 * IQR)$ or above $Q_3 + (1.5 * IQR)$.

Note that these terms (aside from **Distribution**) only apply to quantitative variables.


## Normal distribution discussion