Wrangling chapter review #235

Merged: 72 commits, Sep 27, 2021
Changes from 1 commit
Commits
5b1fc62
updating the wide/long sections as per reviewer E suggestions
leem44 Jun 27, 2021
fb7c142
adding convert argument to separate function
leem44 Jun 28, 2021
6b31e5c
updating mutate section to account for new convert in separate sectio…
leem44 Jun 28, 2021
c03342a
editing piping section to include when you might use temporary object…
leem44 Jun 28, 2021
bea4ae2
updating piping section as per Reviewer Cs suggestion
leem44 Jun 28, 2021
8c4e76c
adding why wide is bad
leem44 Jun 28, 2021
7b729c2
minor changes
leem44 Jun 28, 2021
bb989f6
changing column width in lang_long table so we can read all the rows
leem44 Jun 28, 2021
5f2d235
adding example for tibble
leem44 Jun 28, 2021
4a11419
adding summarize_if
leem44 Jun 28, 2021
31cdcdf
adding section about select helpers
leem44 Jun 29, 2021
a0f9749
added section on summarize +across
leem44 Jun 29, 2021
9b9f0c6
doing a pass through the chapter and editing grammar/spelling/logic
leem44 Jun 29, 2021
80df294
removing summarize_if since we have across
leem44 Jun 29, 2021
59dedd9
minor change
leem44 Jun 29, 2021
4a7b8a6
updating numbers in pivot longer table to match data frame
Jul 8, 2021
b94e2dc
merging with remote branch
Jul 8, 2021
fd487c5
adding image explanation for separate function
Jul 8, 2021
23a7442
moving section down to additional resources
Jul 29, 2021
c5b8533
wrapping text
Jul 29, 2021
c5dbe60
making corrections to the formatting
Jul 29, 2021
b74c232
fixing code box in wrong place
Aug 16, 2021
c74feab
updating numbers in pivot longer table to match data frame
Jul 8, 2021
48ad252
updating the wide/long sections as per reviewer E suggestions
leem44 Jun 27, 2021
c7206bb
adding convert argument to separate function
leem44 Jun 28, 2021
86e1b0d
updating mutate section to account for new convert in separate sectio…
leem44 Jun 28, 2021
3b6c87e
editing piping section to include when you might use temporary object…
leem44 Jun 28, 2021
0e969d5
updating piping section as per Reviewer Cs suggestion
leem44 Jun 28, 2021
6d5536a
adding why wide is bad
leem44 Jun 28, 2021
d7bdff8
minor changes
leem44 Jun 28, 2021
e23cc4d
changing column width in lang_long table so we can read all the rows
leem44 Jun 28, 2021
2bfdbcd
adding example for tibble
leem44 Jun 28, 2021
955d111
adding summarize_if
leem44 Jun 28, 2021
6a415d1
adding section about select helpers
leem44 Jun 29, 2021
316b1c0
added section on summarize +across
leem44 Jun 29, 2021
6f1d40d
doing a pass through the chapter and editing grammar/spelling/logic
leem44 Jun 29, 2021
d0be545
removing summarize_if since we have across
leem44 Jun 29, 2021
05da9ae
minor change
leem44 Jun 29, 2021
6547353
adding image explanation for separate function
Jul 8, 2021
ae49617
moving section down to additional resources
Jul 29, 2021
70d903c
wrapping text
Jul 29, 2021
e7f7172
making corrections to the formatting
Jul 29, 2021
81eb7e3
fixing code box in wrong place
Aug 16, 2021
16f55f7
renamed wrangling
Aug 16, 2021
54b40c5
editing map paragraph since we added summarize and across
Aug 16, 2021
84b50a4
adding section on rowwise
Aug 16, 2021
f2bdb3e
addressing reviewer Ds comments (adding table of summary wrangling fu…
Aug 16, 2021
2490eac
changing text up to 3.4.3: reordering vector, data frame, list sectio…
Aug 17, 2021
467066f
going through 3.4.3 - 3.5.1 and fixing the writing
Aug 17, 2021
2fe8478
fixing the writing for clarity, updating the explanation of pull
Aug 17, 2021
a217d0e
one last editing pass in the second half of the chapter
Aug 17, 2021
89d2ffd
split functions and operators across two bullets and alphabetized the…
ttimbers Sep 19, 2021
d3a29bc
combined row and observation figure into one, as the explanation seem…
ttimbers Sep 19, 2021
7af9dbb
swapped example to be character, not integer - because we call it an …
ttimbers Sep 19, 2021
ccb1d0e
fixed typo calling vector year that I changed to region
ttimbers Sep 19, 2021
919701c
a few more image changes to go with the character vector example
ttimbers Sep 19, 2021
0acd051
reviewed up until Tidy data
ttimbers Sep 19, 2021
2ecf1e0
changed some headers to title case to be consistent
ttimbers Sep 19, 2021
5551639
worked on tidying from wider to longer wording and removed loading da…
ttimbers Sep 20, 2021
0df1d18
fixed wrong tidy image
ttimbers Sep 22, 2021
3fbf22e
wording changes to tidy data section
ttimbers Sep 22, 2021
6858f74
wording changes up to the end of mutate
ttimbers Sep 22, 2021
a34e1ac
improved image size for fig 02-plot
ttimbers Sep 22, 2021
93d366e
simplified mutate as a new column example
ttimbers Sep 22, 2021
cd611ba
small plot changes related to simplifuing the mutate example
ttimbers Sep 22, 2021
3a651da
wording changes for the pipe section
ttimbers Sep 23, 2021
d03b1d0
edited wording in rowwise section
ttimbers Sep 23, 2021
15ec161
reorganized and simplied the summarize/purrr map/rowwise section. Sti…
ttimbers Sep 25, 2021
ec9a2af
added NA section for summarize + across
ttimbers Sep 26, 2021
2bc1b8f
Fixed images for pivoting and added images for aggregating
ttimbers Sep 26, 2021
ee93e23
tried to keep most text and all code to 80 characters
ttimbers Sep 26, 2021
5d2abc1
merging dev into wrangling
ttimbers Sep 27, 2021
worked on tidying from wider to longer wording and removed loading data from the canlang package as I don't think we want to add that complexity...
ttimbers committed Sep 20, 2021
commit 55516393525d3f41d91c7a3c2e7ef6b1904a4422
36 changes: 36 additions & 0 deletions data/region_data.csv
@@ -0,0 +1,36 @@
+region,households,area,population,dwellings
+Belleville,43002,1354.65121,103472,45050
+Lethbridge,45696,3046.69699,117394,48317
+Thunder Bay,52545,2618.26318,121621,57146
+Peterborough,50533,1636.98336,121721,55662
+Saint John,52872,3793.42158,126202,58398
+Brantford,52530,1086.27106,134203,54419
+Moncton,61769,2625.1211,144810,66699
+Guelph,59280,604.00365,151984,63324
+Trois-Rivières,72502,1052.80206,156042,77734
+Saguenay,72479,3078.79919,160980,77968
+Kingston,67915,2142.32855,161175,77173
+Greater Sudbury,70445,4372.1229,164689,76619
+Abbotsford - Mission,62631,651.99511,180518,65967
+Kelowna,81383,3144.90019,194882,88374
+Barrie,72534,967.67675,197059,76336
+St. John's,85015,850.46041,205955,92353
+Sherbrooke,95577,1506.36002,212105,106082
+Regina,94955,4408.86418,236481,101719
+Saskatoon,115283,6218.50503,295095,124766
+Windsor,132912,1032.38176,329144,140408
+Victoria,162716,704.4339,367770,172559
+Oshawa,138962,908.06142,379848,142462
+Halifax,173459,5963.13705,403390,187478
+St. Catharines - Niagara,168485,1425.34399,406074,180606
+London,206448,2677.86088,494069,220452
+Kitchener - Cambridge - Waterloo,200495,1106.65072,523894,210896
+Hamilton,293345,1404.6567,747545,306034
+Winnipeg,306550,5410.82907,778489,321484
+Québec,361891,3475.38576,800296,382308
+Edmonton,502143,9857.77908,1321426,537634
+Ottawa - Gatineau,535499,7168.96442,1323783,571146
+Calgary,519693,5241.70103,1392609,544870
+Vancouver,960894,3040.41532,2463431,1027613
+Montréal,1727310,4638.24059,4098927,1823281
+Toronto,2135909,6269.93132,5928040,2235145
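
For orientation, here is a minimal sketch of how this file is meant to be used in the `%in%` section changed further down in this PR. It is not code from the chapter: the three rows are copied from the CSV above, the object names `five_cities` and `selected` are made up for illustration, and the city list is the one named in the chapter text (Toronto, Montréal, Vancouver, Calgary and Edmonton).

```r
library(dplyr)

# Three rows copied from data/region_data.csv; the real file has 35 rows
region_data <- tibble(
  region = c("Toronto", "Guelph", "Calgary"),
  households = c(2135909, 59280, 519693),
  area = c(6269.93132, 604.00365, 5241.70103),
  population = c(5928040, 151984, 1392609),
  dwellings = c(2235145, 63324, 544870)
)

# The five cities named in the chapter text (hypothetical object name)
five_cities <- c("Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton")

# Keep only the rows whose region appears in the vector of cities
selected <- filter(region_data, region %in% five_cities)
selected
```

With the full file, the same `filter` call would presumably be applied to the result of `read_csv("data/region_data.csv")`, as the chapter diff does.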
Binary file modified docs/_main_files/figure-html/02-plot-1.png
Binary file added docs/_main_files/figure-html/test-1.png
Binary file added docs/img/dataframe/dataframe.001.jpeg
Binary file added docs/img/obs_and_var/obs_and_var.001.jpeg
Binary file modified docs/img/pivot_longer_with_table.jpeg
Binary file added docs/img/separate_function.jpeg
Binary file added docs/img/vec_vs_list/vec_vs_list.001.jpeg
Binary file added docs/img/vector/vector.001.jpeg
11 changes: 11 additions & 0 deletions docs/reference-keys.txt
@@ -432,3 +432,14 @@ r-and-the-irkernel
 r-packages
 latex
 moving-files-to-your-computer
+fig:img-separate
+tab:summary-functions-table
+what-is-a-list
+separate
+using-select-helpers-to-extract-columns
+filter-and
+aggregating-data-with-group_by-summarize
+iterating-over-columns-of-a-data-frame
+using-summarize-and-across-to-iterate
+iterating-over-rows-in-a-data-frame-with-rowwise
+going-from-wide-to-long-using-pivot_longer
2 changes: 1 addition & 1 deletion docs/search_index.json

Large diffs are not rendered by default.

1,881 changes: 1,303 additions & 578 deletions docs/wrangling.html

Large diffs are not rendered by default.

79 changes: 52 additions & 27 deletions wrangling.Rmd
@@ -105,7 +105,7 @@ You can create vectors in R using the concatenate `c()` function. To create the
 vector `region` as shown in Figure \@ref(fig:02-vector) we write:

 ``` {r}
-year <- c("Toronto", "Montreak", "Vancouver", "Calgary", "Ottawa")
+year <- c("Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa")
 year
 ```

@@ -221,7 +221,7 @@ Tidy data satisfy the following three criteria [@wickham2014tidy]:
 - each row is a single observation,
 - each column is a single variable, and
 - each value is a single cell (i.e., its row and column position in the data
-frame is not shared with another value)
+frame) is not shared with another value.

 In Figure \@ref(fig:02-tidy-image), we have a tidy data set that satisfies these
 three criteria.
@@ -242,44 +242,65 @@ upfront. Luckily there are many well-designed `tidyverse` data
 cleaning/wrangling tools to help you easily tidy your data. Let's explore them
 below!

-### Going from wide to long (or tidy!) using `pivot_longer`
+### Going from wide to long using `pivot_longer`

-One common step to get data into a tidy format is to combine columns that are
-stored in separate columns but are really part of the same variable.
-Data is often stored this way because this format is usually more intuitive for
+One task that is commonly performed to get data into a tidy format
+is to combine values that are stored in separate columns,
+but are really part of the same variable, into one.
+Data is often stored this way because this format is sometimes more intuitive for
 human readability and understanding, and humans create data sets.
+In Figure \@ref(fig:02-wide-to-long),
+the table on the left is in an untidy, "wide" format because the year values
+(2006, 2011, 2016) are stored as column names.
+And as a consequence,
+the values for population for the various cities
+over these years are also split across several columns.
+For humans, this table is easy to read, which is why you will often
+find data stored in this wide format.
+However, this format is difficult to work with
+when performing data visualization
+or statistical analysis using R.
+
+For example, if we wanted to
+find the latest year it would be challenging because the
+year values are stored as column names instead of as values in a single column.
+So before we could apply a function to find the latest year
+(for example, by using `max`),
+we would have to first first extract the column names to get them as a vector
+and then apply a function to extract the latest year.
+The problem only gets worse if you would like to find the value for the
+population for a given region for the latest year.
+Both of these tasks are greatly simplified once the data is tidied.

-For example, in Figure \@ref(fig:02-wide-to-long), the table on the left is in an
-untidy, "wide" format because the year values (2006, 2011, 2016) are listed as
-the column headers. For humans, this table is easy to read, which is why you will often
-find data stored in this wide format. However, for R, to do any visualization or
-analysis this format is difficult to work with. For example, if we wanted to
-find the maximum year it's hard to do when the year values are not in their own
-column (since R often applies functions, such as `max` column-wise).
 Another problem with data in this format is that we don't know what the
 numbers under each year actually represent. Do those numbers represent
-population size? Land area? It's not clear. We can reshape this data set to a
-"long" format by creating a column called "year" and a column called
+population size? Land area? It's not clear.
+To solve both of these problems,
+we can reshape this data set to a tidy data format
+by creating a column called "year" and a column called
 "population," which is the table on the right of Figure \@ref(fig:02-wide-to-long).
+Note that this transformation makes the data "longer".

 ``` {r 02-wide-to-long, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Going from wide to long data", fig.retina = 2, out.width = "1150"}
 knitr::include_graphics("img/wide_to_long.jpeg")
 ```

-The function `pivot_longer` combines columns, and often makes the data frame longer
-and narrower. To learn how to use `pivot_longer`, we will work with the
+The function `pivot_longer` combines columns,
+and is usually used during tidying data
+when we need to make the data frame longer and narrower.
+To learn how to use `pivot_longer`, we will work through an example with the
 `region_lang_top5_cities_wide.csv` data set. This data set contains contains the
 counts of how many Canadians cited each language as their mother tongue for five
 major Canadian cities (Toronto, Montréal, Vancouver, Calgary and Edmonton) from
-the 2016 Canadian census. We will load the `tidyverse` package so we can use our
-wrangling functions and the `canlang` package since it contains the
-`region_lang` and `region_data` data sets that we will use later in the chapter.
+the 2016 Canadian census.
+To get started,
+we will load the `tidyverse` package so we can access our data reading
+and wrangling functions in R.

 Our data set is stored in an untidy format, as shown below:

 ``` {r 02-tidyverse, warning=FALSE, message=FALSE}
 library(tidyverse)
-library(canlang)
 lang_wide <- read_csv("data/region_lang_top5_cities_wide.csv")
 lang_wide
 ```
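
The "latest year" argument in the new wording can be checked with a small sketch. This is an illustration only, not code from the chapter: the table `pop_wide` and its population values are invented, and only the year labels (2006, 2011, 2016) come from the text. It contrasts the column-name workaround needed in the wide format with the same task after `pivot_longer`.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: years are column names, population values are invented
pop_wide <- tibble(
  region = c("Toronto", "Vancouver"),
  `2006` = c(100, 50),
  `2011` = c(110, 55),
  `2016` = c(120, 60)
)

# Wide format: the years have to be recovered from the column names first
latest_wide <- max(as.integer(setdiff(names(pop_wide), "region")))

# Tidy the data: one row per region-year pair, with year stored as an integer
pop_long <- pivot_longer(
  pop_wide,
  cols = -region,
  names_to = "year",
  values_to = "population",
  names_transform = list(year = as.integer)
)

# Long format: year is an ordinary column, so max() applies directly
latest_long <- max(pop_long$year)

# Population of each region in the latest year
latest_pop <- filter(pop_long, year == max(year))
latest_pop
```

The `names_transform` argument is one way to get numeric years; without it `year` would be a character column, and `max` would compare the labels as strings.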
@@ -647,13 +668,17 @@ filter(official_langs, region == "Calgary" | region == "Edmonton")

 ### Using `filter` to extract rows with `%in%`

-Suppose we want to see the populations of our five cities. The `region_data`
-data set from the `canlang` package contains statistics for number of
-households, land area, population and number of dwellings for different regions
-according to the 2016 Canadian census.
+Suppose we want to see the populations of our five cities. Let's read in the
+`region_data.csv` file that comes from the 2016 Canadian census,
+as it contains statistics for number of households, land area, population
+and number of dwellings for different regions.

-``` {r}
-region_data
+```{r, include = FALSE}
+write_csv(canlang::region_data, "data/region_data.csv")
 ```
+
+``` {r message = FALSE}
+region_data <- read_csv("data/region_data.csv")
+```

 To get the population of our five cities we can filter the data set using the