\cleardoublepage
# (APPENDIX) Appendix {-}
# Statistical Background {#appendixA}
## Basic statistical terms
### Mean
The **mean** is the most commonly reported measure of center. It is commonly called the "average", though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by the number of elements. If we have $n$ data points, the mean is given by: $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Note that you will often see shorthand notation for a sum of numbers using $\sum$ notation. For example, we could rewrite the formula for $\bar{x}$ as:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}=\frac{\sum_{i = 1}^n x_i}{n}$$
We've simply replaced the subscript on each $x$ with a generic index $i$ and used the $\sum$ notation to indicate that we are summing the $x$ values with indices (i.e., subscripts) that go from 1 to $n$. When summing numbers in statistics, we're almost always dealing with indices that start at 1 and go up to the sample size (e.g., $n$), so you will often see an even shorter version, where it's assumed you're summing from $i = 1$ to $i = n$:
$$\bar{x} = \frac{\sum x_i}{n}$$
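As a minimal sketch of the formula above, the following chunk computes the mean directly from its definition and compares it with R's built-in `mean()`. The values in `x` are hypothetical and chosen only for illustration.

```{r}
# Hypothetical sample values, chosen only for illustration
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Mean from the definition: the sum of the elements divided by n
mean_by_hand <- sum(x) / n

# Built-in equivalent
mean_builtin <- mean(x)

mean_by_hand
mean_builtin
```

Both approaches should give the same value, since `mean()` applies the same definition.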
### Median
The median is calculated by first sorting a variable's data from smallest to largest. After sorting the data, the middle element in the list is the **median**. If the middle falls between two values, then the median is the mean of those two values.
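The sorting procedure described above can be sketched as a small helper. The function name `median_by_hand()` and the example vectors are hypothetical and used only for illustration; in practice R's built-in `median()` does the same job.

```{r}
# Hypothetical helper that follows the description above:
# sort the values, then take the middle element, or the mean
# of the two middle elements when the middle falls between them
median_by_hand <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]
  } else {
    mean(sorted[c(n / 2, n / 2 + 1)])
  }
}

odd_values <- c(7, 1, 5, 3, 9)   # sorted: 1 3 5 7 9
even_values <- c(7, 1, 5, 3)     # sorted: 1 3 5 7

median_by_hand(odd_values)
median_by_hand(even_values)
```

With an odd number of values the median is one of the data points. With an even number it is the mean of the two middle points, so it need not appear in the data at all.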
### Standard deviation
We will next discuss the **standard deviation** of a sample dataset pertaining to one variable. The formula can be a little intimidating at first, but it is important to remember that it is essentially a measure of how far we should expect a given data value to be from the mean:
$$\text{Standard deviation} = \sqrt{\frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i= 1}^n(x_i - \overline{x})^2 }{n - 1}}$$
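To make the formula less intimidating, here is a minimal sketch that builds it up one step at a time and compares the result with R's built-in `sd()`, which also divides by $n - 1$. The values in `x` are hypothetical.

```{r}
# Hypothetical sample values, chosen only for illustration
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Step 1: deviation of each value from the mean
deviations <- x - mean(x)

# Step 2: square the deviations and add them up
sum_of_squares <- sum(deviations^2)

# Step 3: divide by n - 1 and take the square root
sd_by_hand <- sqrt(sum_of_squares / (n - 1))

sd_by_hand
sd(x)
```

Squaring the deviations keeps positive and negative deviations from cancelling each other out, and the square root returns the result to the original units of the data.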
### Five-number summary
The **five-number summary** consists of five values: the minimum, the first quartile AKA 25^th^ percentile, the second quartile AKA median AKA 50^th^ percentile, the third quartile AKA 75^th^ percentile, and the maximum. The first and third quartiles are calculated as
- first quartile ($Q_1$): the median of the first half of the sorted data
- third quartile ($Q_3$): the median of the second half of the sorted data
The _interquartile range_ (IQR) is defined as $Q_3 - Q_1$ and is a measure of how spread out the middle 50% of values is. The median and quartiles in the five-number summary are not influenced by the presence of outliers in the way that the mean and standard deviation are. The five-number summary is, thus, recommended for skewed datasets.
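The "median of each half" rule can be sketched as follows. The helper `five_number_by_hand()` is hypothetical, and it rests on one assumption not stated above: when the number of values is odd, the overall median is left out of both halves. The data in `x` are also hypothetical.

```{r}
# Hypothetical helper implementing the "median of each half" rule.
# Assumptions: at least two values, and for an odd number of values
# the middle value is excluded from both halves.
five_number_by_hand <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  half <- n %/% 2
  lower_half <- sorted[seq_len(half)]
  upper_half <- sorted[(n - half + 1):n]
  c(
    min = sorted[1],
    Q1 = median(lower_half),
    median = median(sorted),
    Q3 = median(upper_half),
    max = sorted[n]
  )
}

# Hypothetical sample values, chosen only for illustration
x <- c(1, 3, 5, 7, 9, 11, 13, 15)

summary_by_hand <- five_number_by_hand(x)
summary_by_hand

# Interquartile range: Q3 - Q1
iqr_by_hand <- summary_by_hand[["Q3"]] - summary_by_hand[["Q1"]]
iqr_by_hand

# Base R's fivenum() for comparison
fivenum(x)
```

Several conventions exist for computing quartiles. For this `x`, `fivenum()` agrees with the halves rule, but `quantile()` interpolates between data points by default and can return slightly different quartiles.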
### Distribution
The **distribution** of a variable/dataset describes the general pattern of its values. It often shows how frequently each value in the dataset appears. It shows how the data vary and gives some information about where a typical element in the data might fall. Distributions are most easily seen through data visualization.
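As a rough illustration of "how frequently elements appear", the sketch below tabulates a small set of hypothetical values with base R's `table()` and converts the counts to proportions with `prop.table()`.

```{r}
# Hypothetical values, chosen only for illustration
values <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

# How many times each distinct value appears
counts <- table(values)
counts

# The same information as proportions of the whole dataset
proportions <- prop.table(counts)
proportions
```

Here the value 3 appears most often, which gives a sense of where a typical element falls. Calling `hist(values)` would draw the same counts as a histogram.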
### Outliers
**Outliers** correspond to values in the dataset that fall far outside the range of "ordinary" values. With regard to a boxplot (by default), they correspond to values below $Q_1 - (1.5 \times IQR)$ or above $Q_3 + (1.5 \times IQR)$, where $IQR = Q_3 - Q_1$ is the interquartile range.
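The boxplot rule can be sketched by computing the two cutoffs directly. This example assumes the quartiles are taken from base R's `fivenum()`; other quartile conventions can shift the cutoffs slightly. The values in `x` are hypothetical.

```{r}
# Hypothetical sample values with one unusually large value
x <- c(2, 4, 4, 5, 6, 7, 8, 30)

# Quartiles taken from fivenum() (an assumption; see the text above)
five <- fivenum(x)
q1 <- five[2]
q3 <- five[4]
iqr <- q3 - q1

# Cutoffs from the 1.5 * IQR rule
lower_cutoff <- q1 - 1.5 * iqr
upper_cutoff <- q3 + 1.5 * iqr

# Values flagged as outliers by the rule
outliers <- x[x < lower_cutoff | x > upper_cutoff]
outliers

# Base R's boxplot.stats() reports the points it would flag
boxplot.stats(x)$out
```

Only the value 30 falls outside the cutoffs here, so it is the single point a default boxplot of `x` would draw on its own.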
Note that these terms (aside from **Distribution**) only apply to quantitative variables.
## Normal distribution discussion