Commit

Version update
akuyper committed Feb 27, 2019
1 parent d979ae2 commit cc98886
Showing 786 changed files with 97,326 additions and 12,038 deletions.
40 changes: 20 additions & 20 deletions 03-visualization.Rmd
@@ -15,7 +15,7 @@ knitr::opts_chunk$set(
fig.height = 4,
fig.align='center',
warning = FALSE
-)
+)
options(scipen = 99, digits = 3)
@@ -29,7 +29,7 @@ set.seed(76)

We begin the development of your data science toolbox with data visualization. By visualizing our data, we gain valuable insights that we couldn't initially see from just looking at the raw data in spreadsheet form. We will use the `ggplot2` package as it provides an easy way to customize your plots. `ggplot2` is rooted in the data visualization theory known as _The Grammar of Graphics_ [@wilkinson2005].

-At the most basic level, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way for us to get a sense for how quantitative variables compare in terms of their center (where the values tend to be located) and their spread (how they vary around the center). Graphics should be designed to emphasise the findings and insight you want your audience to understand. This does however require a balancing act. On the one hand, you want to highlight as many meaningful relationships and interesting findings as possible; on the other you don't want to include so many as to overwhelm your audience.
+At the most basic level, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way for us to get a sense for how quantitative variables compare in terms of their center (where the values tend to be located) and their spread (how they vary around the center). Graphics should be designed to emphasize the findings and insight you want your audience to understand. This does however require a balancing act. On the one hand, you want to highlight as many meaningful relationships and interesting findings as possible; on the other you don't want to include so many as to overwhelm your audience.

As we will see, plots/graphics also help us to identify patterns and outliers in our data. We will see that a common extension of these ideas is to compare the *distribution* of one quantitative variable (i.e., what the spread of a variable looks like or how the variable is *distributed* in terms of its values) as we go across the levels of a different categorical variable.

@@ -54,13 +54,13 @@ library(readr)



----
+***



## The Grammar of Graphics {#grammarofgraphics}

-We begin with a discussion of a theoretical framework for data visualization known as "The Grammar of Graphics," which serves as the foundation for the `ggplot2` package. Think of how we construct sentences in english to form sentences by combining different elements, like nouns, verbs, particles, subjects, objects, etc. However, we can't just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, "The Grammar of Graphics" define a set of rules for contructing *statistical graphics* by combining different types of *layers*. This grammar was created by Leland Wilkinson [@wilkinson2005] and has been implemented in a variety of data visualization software including R.
+We begin with a discussion of a theoretical framework for data visualization known as "The Grammar of Graphics," which serves as the foundation for the `ggplot2` package. Think of how we construct sentences in English to form sentences by combining different elements, like nouns, verbs, particles, subjects, objects, etc. However, we can't just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, "The Grammar of Graphics" define a set of rules for constructing *statistical graphics* by combining different types of *layers*. This grammar was created by Leland Wilkinson [@wilkinson2005] and has been implemented in a variety of data visualization software including R.

### Components of the Grammar

Expand Down Expand Up @@ -165,7 +165,7 @@ There are other components of the Grammar of Graphics we can control as well. A
- `stat`istical transformations: this includes smoothing, binning values into a histogram, or no transformation at all (known as the `"identity"` transformation).
-->

-Other more complex components like `scales` and `coord`inate systems are left for a more advanced text such as [R for Data Science](http://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings) [@rds2016]. Generally speaking, the Grammar of Graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifiying them.
+Other more complex components like `scales` and `coord`inate systems are left for a more advanced text such as [R for Data Science](http://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings) [@rds2016]. Generally speaking, the Grammar of Graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them.

### ggplot2 package

@@ -180,7 +180,7 @@ Let's now put the theory of the Grammar of Graphics into practice.



----
+***



@@ -198,7 +198,7 @@ We will discuss some variations of these plots, but with this basic repertoire o



----
+***



@@ -367,7 +367,7 @@ With medium to large data sets, you may need to play around with the different m
-->


----
+***


## 5NG#2: Linegraphs {#linegraphs}
@@ -438,11 +438,11 @@ Much as with the `ggplot()` code that created the scatterplot of departure and a

### Summary

-Linegraphs, just like scatterplots, display the relationship between two numerical variables. However it is preferred to use lingraphs over scatterplots when the variable on the x-axis (i.e. the explanatory variable) has an inherent ordering, like some notion of time.
+Linegraphs, just like scatterplots, display the relationship between two numerical variables. However it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e. the explanatory variable) has an inherent ordering, like some notion of time.



----
+***



@@ -491,7 +491,7 @@ The remaining bins all have a similar interpretation.

### Histograms via geom_histogram {#geomhistogram}

-Let's now present the `ggplot()` code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in `aes()`: the single numerical variable `temp`. The y-aesthetic of a histogram gets computed for you automatically. Furthemore, the geometric object layer is now a `geom_histogram()`
+Let's now present the `ggplot()` code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in `aes()`: the single numerical variable `temp`. The y-aesthetic of a histogram gets computed for you automatically. Furthermore, the geometric object layer is now a `geom_histogram()`

```{r weather-histogram, warning=TRUE, fig.cap="Histogram of hourly temperatures at three NYC airports."}
ggplot(data = weather, mapping = aes(x = temp)) +
@@ -524,7 +524,7 @@ Observe in both Figure \@ref(fig:weather-histogram-2) and Figure \@ref(fig:weath

Using the first method, we have the power to specify how many bins we would like to cut the x-axis up in. As mentioned in the previous section, the default number of bins is 30. We can override this default, to say 40 bins, as follows:

```{r, warning=FALSE, message=FALSE, fig.cap= "Histogram with 60 bins."}
```{r, warning=FALSE, message=FALSE, fig.cap= "Histogram with 40 bins."}
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(bins = 40, color = "white")
```
@@ -558,13 +558,13 @@ Histograms, unlike scatterplots and linegraphs, present information on only a si



----
+***



## Facets {#facets}

-Before continuing the 5NG, let's briefly introduce a new concept called *faceting*. Faceting is used when we'd like to split a particular visualization of variables by another variable. This will create mutiple copies of the same type of plot with matching x and y axes, but whose content will differ.
+Before continuing the 5NG, let's briefly introduce a new concept called *faceting*. Faceting is used when we'd like to split a particular visualization of variables by another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.

For example, suppose we were interested in looking at how the histogram of hourly temperature recordings at the three NYC airports we saw in Section \@ref(histograms) differed by month. We would "split" this histogram by the 12 possible months in a given year, in other words plot histograms of `temp` for each `month`. We do this by adding `facet_wrap(~ month)` layer.

@@ -574,7 +574,7 @@ ggplot(data = weather, mapping = aes(x = temp)) +
facet_wrap(~ month)
```

-Note the use of the tilde `~` before `month` in `facet_wrap()`. The tilde is required and you'll receive the error `Error in as.quoted(facets) : object 'month' not found` if you don't include it before `month` here. We can also specify the number of rows and columns in the grid by using the `nrow` and `ncol` arguments inside of `facet_wrap()`. For example, say we would like our facetted plot to have 4 rows instead of 3. Add the `nrow = 4` argument to `facet_wrap(~ month)`
+Note the use of the tilde `~` before `month` in `facet_wrap()`. The tilde is required and you'll receive the error `Error in as.quoted(facets) : object 'month' not found` if you don't include it before `month` here. We can also specify the number of rows and columns in the grid by using the `nrow` and `ncol` arguments inside of `facet_wrap()`. For example, say we would like our faceted plot to have 4 rows instead of 3. Add the `nrow = 4` argument to `facet_wrap(~ month)`

```{r facethistogram2, fig.cap="Faceted histogram with 4 instead of 3 rows."}
ggplot(data = weather, mapping = aes(x = temp)) +
@@ -601,7 +601,7 @@ Observe in both Figure \@ref(fig:facethistogram) and Figure \@ref(fig:facethisto



----
+***



@@ -732,7 +732,7 @@ It is important to keep in mind that the definition of an outlier is somewhat ar

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which months have the highest variability in temperature? What reasons can you give for this?

-**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** We looked at the distribution of a numerical variable over a categorical variable here with this boxplot. Why can't we look at the distribution of one numerical variable over the distribution of another numerical variable? Say, temperature across pressure, for example?
+**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** We looked at the distribution of the numerical variable `temp` split by the numerical variable `month` that we converted to a categorical variable using the `factor()` function. Why would a boxplot of `temp` split by the numerical variable `pressure` similarly converted to a categorical variable using the `factor()` not be informative?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?

@@ -745,7 +745,7 @@ Side-by-side boxplots provide us with a way to compare and contrast the distribu



----
+***



@@ -985,7 +985,7 @@ Barplots are the preferred way of displaying the distribution of a categorical v



----
+***



@@ -1096,7 +1096,7 @@ ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) +
geom_line()
```

-These two code segments were a preview of Chapter \@ref(wrangling) on data wrangling where we'll delve further into the `dplyr` package. Data wrangling is the process of transforming and modifying existing data to with the intent of making it more appropriate for analysis purposes. For example, the two code segments used the `filter()` function to create new data frames (`alaska_flights` and `early_january_weather`) by choosing only a subset of rows of existing data frames (`flights` and `weather`). In this next chapter, we'll formally introduce the `filter()` and other data wrangling functions as well as the *pipe operator* `%>%` which allows you to combine multiple data wrangling actions into a single sequential *chain* of actions. On to Chapter \@ref(wrangling) on data wrangling!
+These two code segments were a preview of Chapter \@ref(wrangling) on data wrangling where we'll delve further into the `dplyr` package. Data wrangling is the process of transforming and modifying existing data with the intent of making it more appropriate for analysis purposes. For example, the two code segments used the `filter()` function to create new data frames (`alaska_flights` and `early_january_weather`) by choosing only a subset of rows of existing data frames (`flights` and `weather`). In this next chapter, we'll formally introduce the `filter()` and other data wrangling functions as well as the *pipe operator* `%>%` which allows you to combine multiple data wrangling actions into a single sequential *chain* of actions. On to Chapter \@ref(wrangling) on data wrangling!

```{r echo=FALSE, fig.cap="ModernDive flowchart", out.width='110%', fig.align='center'}
# knitr::include_graphics("images/flowcharts/flowchart/flowchart.004.png")