Merge pull request #559 from UBC-DSCI/db-create-column

Updates to DB section to match Python
UBC-DSCI · Nov 12, 2023 · 22baaf4
2 parents c9b277a + b4df504
Showing 1 changed file with 20 additions and 8 deletions.
diff --git a/source/reading.Rmd b/source/reading.Rmd
@@ -616,26 +616,38 @@ response for us. So `dbplyr` does all the hard work of translating from R to SQL
 we can just stick with R! 
 
 With our `lang_db` table reference for the 2016 Canadian Census data in hand, we 
-can mostly continue onward as if it were a regular data frame. For example, 
-we can use the `filter` function
-to obtain only certain rows. Below we filter the data to include only Aboriginal languages.
+can mostly continue onward as if it were a regular data frame. For example, let's do the same exercise
+from Chapter \@ref(intro): we will obtain only those rows corresponding to Aboriginal languages, and keep only
+the `language` and `mother_tongue` columns.
+We start with the `filter` function, which keeps only the rows we ask for; below we filter the data to include only Aboriginal languages.
 
 ```{r}
 aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages")
 aboriginal_lang_db 
 ```
 
 Above you can again see the hints that this data is not actually stored in R yet:
-the source is a `lazy query [?? x 6]` and the output says `... with more rows` at the end
+the source is `SQL [?? x 6]` and the output says `... more rows` at the end
 (both indicating that R does not know how many rows there are in total!),
-and a database type `sqlite 3.36.0` is listed.
+and a database type `sqlite` is listed.
+We didn't use the `collect` function because we are not ready to bring the data into R yet. \index{collect}
+We can still use the database to do some work to obtain *only* the small amount of data we want to work with locally
+in R. Let's add the second part of our database query: selecting only the `language` and `mother_tongue` columns
+using the `select` function.
+
+```{r}
+aboriginal_lang_selected_db <- select(aboriginal_lang_db, language, mother_tongue)
+aboriginal_lang_selected_db
+```
+
+Now you can see that the database will return only the two columns we asked for with the `select` function.
 In order to actually retrieve this data in R as a data frame,
 we use the `collect` function. \index{filter}
 Below you will see that after running `collect`, R knows that the retrieved
 data has 67 rows, and there is no database listed any more.
 
 ```{r}
-aboriginal_lang_data <- collect(aboriginal_lang_db)
+aboriginal_lang_data <- collect(aboriginal_lang_selected_db)
 aboriginal_lang_data
 ```
 
@@ -649,14 +661,14 @@ For example, look what happens when we try to use `nrow` to count rows
 in a data frame: \index{nrow}
 
 ```{r}
-nrow(aboriginal_lang_db)
+nrow(aboriginal_lang_selected_db)
 ```
 
 or `tail` to preview the last six rows of a data frame:
 \index{tail}
 
 ```{r, eval = FALSE}
-tail(aboriginal_lang_db)
+tail(aboriginal_lang_selected_db)
 ```
 
 ```