-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathIMD_D3_M3_Basic_GGPlot.Rmd
254 lines (180 loc) · 16 KB
/
IMD_D3_M3_Basic_GGPlot.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
---
title: "Day 3: Data Viz - Module 3"
author: "Ellen Cheng & Kate Miller"
output:
html_document
---
#### Building a Basic ggplot {.tabset}
```{r, echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE}
viz_dat <- readRDS("Data/viz_dat.RDS")
arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National park
```
<details open><summary class='drop'>Data Used in this Section</summary>
```{r, echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE}
# This section requires the `viz_dat` and `arch_dat` data frames created earlier in the module. If you don't have these data frames in the global environment, run the code below.
viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument
arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National park
```
</details>
<br>
<details open><summary class='drop'>Packages Used in this Section</summary>
```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Packages used in this section
library(tidyverse)
library(ggthemes)
```
</details>
<br>
<details open><summary class='drop'>Why `ggplot2`?</summary>
If you search the Internet for information on plotting in `R`, you will quickly learn that `ggplot2` is the most popular `R` package for plotting. It takes a little effort to learn how the pieces of a ggplot object fit together. But once you get the hang of it, you will be able to create a large variety of attractive plots with just a few lines of `R` code.
[Reference documents]("http://r-statistics.co/ggplot2-cheatsheet.html") and [cheatsheets](https://nyu-cdsc.github.io/learningr/assets/data-visualization-2.1.pdf") can be very helpful while you are learning to use the `ggplot2` package.
So let's get started! We will build our first ggplot object step-by-step to demonstrate how each component contributes to the final plot. Our first ggplot object will re-create the line graph of trends in visits to Arches National Park.
</details>
<br>
<details open><summary class='drop'>Step 1. Create a "Master" Template for the Plot</summary>
- identify the data (must be a data frame or tibble)
- set default aesthetic mappings
An aesthetic mapping describes how variables in the data frame should be represented in the plot. Any column of data that will be used to create the plot must be mapped to an aesthetic. Examples of aesthetics include x, y, color, fill, point size, point shape, and line type.
From the 'arch_dat' data we will map 'visitors' (in 1000's) to the y variable and 'year' to the x variable. This will set up default axes and axes titles for the plot.
<details open><summary class='drop2'>Master template with default aesthetics</summary>
```{r p1_step1, echo = TRUE, warning = FALSE, message = FALSE}
# NOTES:
# 1. The first argument provided to the `ggplot()` function is assumed to be the data frame, so it's not necessary to include the argument name `data =`.
# 2. We are assigning the plot to the name `p` so we can build on this `ggplot()` master template in the next step.
# 3. The parentheses around the line of code is a shortcut for `print`. Without it, the plot would assign to `p` but not print in the plots pane.
(p <- ggplot(data = arch_dat, aes(x = year, y = visitors/1000)))
summary(p) # summary of the information contained in the plot object
p$data # the plot object is self-contained. The underlying data frame is saved a list element in the plot object.
```
</details>
</details>
<details open><summary class='drop'>Step 2. Add Geometry Layer(s)</summary>
The geometry layer determines how the data will be represented in the plot. We can generally think of this as the plot type. For example, `geom_hist()` creates a histogram and `geom_line()` creates a line graph.
If we had set default data and aesthetic mappings, the geometry layer will use that information unless additional (overriding) data or aesthetics are set in the layer itself. If we had NOT set default data and aesthetics, they will need to be set within the layer. At a minimum we need to specify the data frame and which data frame columns hold the `x` and (for most plot types) `y` aesthetics. The ggplot object will use its own default values for any other aesthetics (e.g., color, size...) if not defined by the user.
Each geometry layer has its own set of required and accepted aesthetics. For example,`geom_point()` (point plot) REQUIRES `x` and `y` aesthetics and ACCEPTS aesthetics that modify point color and size. It does NOT accept the `linetype` aesthetic. We cover plot aesthetics in more detail in the section titled 'Customizing a ggplot Object'.
<details open><summary class='drop2'>Add a line graph geometry layer</summary>
```{r p1_step2, echo = TRUE, warning = FALSE, message = FALSE}
# NOTE: We combine ggplot object layers with `+`, not with `%>%`. With a ggplot object we are not piping sequential results but rather layering pieces. In this example we are building our ggplot object in a particular order, but we could actually reorder the pieces in any way (as long as the master template is set first). The only difference would be that if different layers have conflicting information, information in later layers overrides information in earlier layers.
p + geom_line()
```
</details>
If we had NOT set default data and `x` and `y` aesthetics in the plot's master template we could have set them within the geometry layer like this:
```{r p1_step2b, echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE}
# Example of setting aesthetic mappings within the layer only
ggplot() +
geom_line(data = arch_dat, aes(x = year, y = visitors/1000))
```
<details open><summary class='drop2'>Add multiple geometry layers</summary>
We can include multiple geometry layers in a single ggplot object. For example, to overlay a point plot on the line graph we would add a `geom_point()` layer. If we had set default data and aesthetic mappings, this additional geometry would use that information unless additional (overriding) information were set in the `geom_point()` layer itself.
```{r p1_step2c, echo = TRUE, warning = FALSE, message = FALSE}
# This works because we had defined default data and aesthetics in `p`, so don't need to provide additional information to the geometry layers
p + geom_line() + geom_point()
```
The code below will NOT add the points layer because `geom_point()` has no data or aesthetics defined, and we did not define default values in the ggplot object.
```{r, echo = TRUE, eval = FALSE}
ggplot() +
geom_line(data = arch_dat, aes(x = year, y = visitors/1000)) +
geom_point()
```
</details>
<details open><summary class='drop2'>Geometry layers are shortcuts for the `layer()` function</summary>
Geometry layers can alternatively be specified with the generic plot `layer()` function. Geometry layers are basically the `layer()` function with some pre-defined (default) arguments. For example, `geom_point()` is the same as `layer(geom = "point", stat = "identity", position = "identity")`. See `?layer` for more information.
```{r p1_step2e, echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE}
p + layer(geom = "line", stat = "identity", position = "identity") # this creates the same line graph we had created using `geom_line()`
```
</details>
</details>
<details open><summary class='drop'>Step 3. Use `scale_...()` Functions to Fine-Tune Aesthetics</summary>
All aesthetic mappings can be fine-tuned with a `scale_...()` function. For example, the aesthetic mapping for `x` can be modified with `scale_x_continuous()` if the data are continuous, `scale_x_discrete()` if the data are categorical, and `scale_x_date()` if the data class is data/time. `ggplot2` provides default scale arguments for all aesthetic mappings. If we don't explicitly define a scale for an aesthetic, `ggplot2` will use its default scale settings.
<details open><summary class='drop2'>Modify the y-axis with `scale_y_continous()`</summary>
```{r p3_step1, echo = TRUE, warning = FALSE, message = FALSE}
# NOTES:
# 1. Our plot code is getting long, so we will separate out the components to improve readability
# 2. We will assign this plot object to the name `p2` so we can just add to `p2` in the next step instead of re-typing all these lines
# 3. `ggplot2` also provides shortcuts for some scaling arguments. For example, if we just want to set y-axis limits we can add the layer `ylim (600, 1500)` instead of setting limits via `scale_y_continuous()`
# Check out `?scale_y_continuous` for ways to modify the y-axis
(p2 <- p +
geom_line() +
geom_point() +
scale_y_continuous(
name = "Number of visitors, in 1000's", # y-axis title
limits = c(600, 1500), # minimum and maximum values for y-axis
breaks = seq(600, 1500, by = 100)) # label the axis at 600, 700, 800... up to 1500
)
```
</details>
</details>
<details open><summary class='drop'>Step 4. Use `labs()` to Specify Plot Labels</summary>
This useful layer allows us to set various plot labels, such as the plot title and subtitle, x- and y-axis titles, and alternative text and plot tags. Some of these labels can be set in other ways rather than via `labs()`. For example, we had already set the y-axis title in the first argument of `scale_y_continous()`.
If we want to remove a default label from the plot, we can set the argument to NULL. For example, `labs(x = NULL)` removes the default x-axis title.
<details open><summary class='drop2'>Add plot labels</summary>
```{r p4_step1, echo = TRUE, warning = FALSE, message = FALSE}
# NOTE: See `?ggtitle`, `?xlab`, and `?ylab` for alternative ways to set plot labels
(p3 <- p2 +
labs(
title = "Visitor Counts at Arches National Park, 2000 - 2015", # plot title
subtitle = "Visits to Arches National Park have increased each year since 2004", # plot subtitle
x = "Year") # x-axis title. We had already set the y-axis title with `scale_y_continous()`.
)
```
</details>
</details>
<details open><summary class='drop'>Step 5. Set a Complete Theme and Plot Element Themes</summary>
Use `theme()` to customize all non-data components of the plot. Conveniently, the `ggplot2` package provides [complete themes](https://ggplot2.tidyverse.org/reference/ggtheme.html) that change several elements of the plot's appearance at once. The default complete theme is `theme_grey`. See `?theme` for all the ways we can modify the non-data plot elements. This help page also provides a hyperlink to a help page for complete themes (I'm sure there is a way to get there more directly, but I couldn't figure it out).
<details open><summary class='drop2'>Explore `ggplot2` complete themes</summary>
```{r p5_step1, echo = TRUE, warning = FALSE, message = FALSE}
p3 +
theme_minimal() # try out different themes! theme_classic(), theme_void(), theme_linedraw()...
```
</details>
<details open><summary class='drop2'>Create a FiveThirtyEight-inspired plot</summary>
For more fun, install and load the `ggthemes` [package](https://mran.microsoft.com/snapshot/2016-12-03/web/packages/ggthemes/vignettes/ggthemes.html), which allows us to apply plot styles inspired by the Economist and the popular FiveThirtyEight website, among other choices.
```{r p5_step2, echo = TRUE, warning = FALSE, message = FALSE}
p3 +
ggthemes::theme_fivethirtyeight()
```
</details>
<details open><summary class='drop2'>Modify theme elements</summary>
In addition to setting complete themes, we can individually modify element themes. If the element themes layer is set AFTER the complete themes layer, the element themes will override any conflicting default arguments in the complete theme. Some of these element these can alternatively be set in a `scale_...()` function. For example, to move the y-axis to the right side of the plot, use `scale_y_continuous(position = "right")` or theme()
It's a bit difficult to understand the help page for `?themes`. I usually resort to an Internet search to figure out how to set an element theme.
```{r p5_step3, echo = TRUE, warning = FALSE, message = FALSE}
p3 +
ggthemes::theme_fivethirtyeight() +
theme(
plot.subtitle = element_text(color = "red", size = 14, hjust=0.5)) # change plot subtitle font color and size, and center the subtitle
```
</details>
</details>
<details open><summary class='drop'>Step 6. Split a Plot into Many Subplots</summary>
We can use `facet_grid()` or `facet_wrap()` to split a ggplot object into multiple subplots, each representing a different group level. This is a nice way to simplify data-heavy plots with a single line of code.
With facet_grid we can specify different grouping variables for a row of plots versus a column of plots. For example, we can say that each row of subplots should represent a different region and each column of subplots should represent a different unit type. With facet_grid, facet column titles are horizontal at the top of each facet and facet row titles will be vertical to the right of each facet. With facet_wrap, subplots are "wrapped" in order from left-to-right and top-to-bottom, just like text is wrapped on a page. The facet titles are horizontal at the top of each facet.
<details open><summary class='drop2'>Create a faceted plot</summary>
```{r p6_step1, echo = TRUE, warning = FALSE, message = FALSE, fig.height =8.5}
# NOTES:
# 1. For this plot we need to use the `viz_dat` dataset so we can facet by park unit name
# 2. In the `facet_wrap` arguments, `ncol = 1` sets a single column so plots are stacked on top of each other and `scales = "free_y"` allows the y-axis range to differ across the subplots (this is useful if the range of y-axis values is very different across subplots and we are more interested in comparing trends than in comparing absolute numbers among the subplots)
subdat <- subset(viz_dat, unit_code %in% c("ARCH", "GRCA", "YOSE")) # plot data for Arches National Park (ARCH), Grand Canyon National Park (GRCA), and Yosemite National Park (YOSE)
ggplot(subdat, aes(x = year, y = visitors/1000)) +
geom_line() +
geom_point() +
labs(
title = "Annual Visitor Counts at Three National Parks, 2000 - 2015",
x = "Year",
y = "Number of visitors, in 1000's") +
theme_bw() +
theme(plot.title = element_text(hjust=0.5)) + # center the plot title
facet_wrap(vars(unit_name), ncol = 1, scales = "free_y")
```
We were wondering if 2015 was a very high year for visitor counts in parks other than Arches National Park. It looks like it was for Grand Canyon and Yosemite National Parks too!
</details>
</details>
<details open><summary class='drop'>Summary of Useful ggplot Object Components</summary>
These are not the ONLY ggplot components, but they are the ones most useful to know (IMHO)
- **Data** - a data frame (or tibble). Specified as the first `ggplot()` argument. Any aesthetics set at this level are part of the "master" template for the plot.
- **Aesthetic mapping** - the variables represented in the plot (and how), e.g., x, y, color, fill, size, shape, line type. These aesthetics, except x & y, can also be set to a constant value by specifying outside the aes(). Aesthetics can be defined in the master template or in a geometry layer.
- **Geometry layer** - the plot/mark layers, e.g., geom_point, geom_line, geom_boxplot. Can specify multiple such layers in a plot.
- **Scales** - used to fine-tune aesthetics. Takes the form 'scale_xxx_yyy', where 'xxx' refers to the aesthetic and 'yyy' how the aesthetic values will be specified, e.g., discrete, continuous, manual. The associated legend for the aesthetic can also be modified here.
- **Labs** - sets the plot labels, including plot title, subtitle, caption, and axis labels. Importantly, this is also where you would set alternative text and plot tags, for Section 508 Compliance.
- **Themes** - complete themes (e.g., theme_bw, theme_classic) can be used to control all non-data display. Themes can also be used to more specifically modify aspects of the panel background, legend elements, label text size and color, etc.
- **Facets** - subset the data by grouping variables and generate a separate subplot for each grouping variable, putting the graphs all on the same page.
</details>