\backmatter
# (PART) Appendices {-}
# Appendix A: GitHub {.unnumbered}
\index{GitHub}
GitHub is a very useful platform for organizing, storing, and sharing the code of your analytics projects, even though it is primarily known as a platform for collaborative software development. If you are unfamiliar with Git or GitHub, the steps below will help you get started.
## Initiate a new repository {.unnumbered}
1. Log in to your GitHub account and click on the plus sign in the upper right corner. From the drop-down menu select `New repository`.
2. Give your repository a name, for example, `bigdatastat`. Then, click on the big green button, `Create repository`. You have just created a new repository.
3. Open RStudio, and navigate to a place on your hard disk where you want to have the local copy of your repository.
4. Then create the local repository as suggested by GitHub (see the page shown right after you clicked on `Create repository`: "…or create a new repository on the command line"). To do so, switch to the Terminal window in RStudio and type (or copy and paste) the commands given by GitHub. This should look similar to the following code chunk:
```{bash eval = FALSE}
echo "# bigdatastat" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin \
https://github.com/YOUR-GITHUB-ACCOUNTNAME/bigdatastat.git
git push -u origin master
```
Remember to replace `YOUR-GITHUB-ACCOUNTNAME` with your GitHub account name before running the code above.
5. Refresh the page of your newly created GitHub repository. You should now see the result of your first commit.
6. Open `README.md` in RStudio, and add a few words describing what this repository is all about.
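To publish this change on GitHub, you can stage, commit, and push it from the Terminal, analogous to the first commit above (a minimal sketch; the commit message is just an example):
```{bash eval=FALSE}
git add README.md
git commit -m "add repository description"
git push
```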
## Clone this book's repository {.unnumbered}
1. In RStudio, navigate to a folder on your hard-disk where you want to have a local copy of this book's GitHub repository.
2. Open a new browser window, and go to https://github.com/umatter/BigData.
3. Click on `Clone or download` and copy the link.
4. In RStudio, switch to the Terminal, and type the following command (pasting the copied link).
```{bash eval=FALSE}
git clone https://github.com/umatter/BigData.git
```
You now have a local copy of the repository, which is linked to the one on GitHub. You can verify this by changing into the newly created directory containing the local copy of the repository:
```{bash eval=FALSE}
cd BigData
```
Whenever there are some updates to the book's repository on GitHub, you can update your local copy with:
```{bash eval=FALSE}
git pull
```
(Make sure you are in the `BigData` folder when running `git pull`.)
## Fork this book's repository {.unnumbered}
1. Go to https://github.com/umatter/BigData, and click on the 'Fork' button in the upper-right corner (follow the instructions).
2. Clone the forked repository (see the cloning of a repository above for details). Assuming you called your forked repository `BigData-forked`, you run the following command in the terminal (replacing `<yourgithubusername>`):
```
git clone https://github.com/<yourgithubusername>/BigData-forked.git
```
3. Switch into the newly created directory:
```
cd BigData-forked
```
4. Set a remote connection to the *original* repository:
```
git remote add upstream https://github.com/umatter/BigData.git
```
You can verify the remotes of your local clone of your forked repository as follows:
```
git remote -v
```
You should see something like
```
origin https://github.com/<yourgithubusername>/BigData-forked.git (fetch)
origin https://github.com/<yourgithubusername>/BigData-forked.git (push)
upstream https://github.com/umatter/BigData.git (fetch)
upstream https://github.com/umatter/BigData.git (push)
```
5. Fetch changes from the original repository. Suppose new material has been added to the original book repository, and you want to merge it into your forked repository. To do so, you first fetch the changes from the original repository:
```
git fetch upstream
```
6. Make sure you are on the master branch of your local repository:
```
git checkout master
```
7. Merge the changes fetched from the original repo with the master of your (local clone of the) forked repository:
```
git merge upstream/master
```
8. Push the changes to your forked repository on GitHub:
```
git push
```
Now your forked repo on GitHub also contains the commits (changes) made in the original repository. If you make changes to the files in your forked repo, you can add, commit, and push them as in any repository. For example: open `README.md` in a text editor (e.g., RStudio), add `# HELLO WORLD` as a new last line of `README.md`, and save the changes. Then:
```
git add README.md
git commit -m "hello world"
git push
```
# Appendix B: R Basics {.unnumbered}
This appendix provides an overview of key properties of R, including data types and data structures.
## Data types and memory/storage {.unnumbered}
Data loaded into RAM can be interpreted differently by R depending on the data *type*. Some operators or functions in R only accept data of a specific type as arguments. For example, we can store the numeric values `1.5` and `3` in the variables `a` and `b`, respectively.
```{r}
a <- 1.5
b <- 3
a + b
```
R interprets this data as type `double` (class 'numeric'):
```{r}
typeof(a)
class(a)
object.size(a)
```
If, however, we define `a` and `b` as follows, R will interpret the values stored in `a` and `b` as text (`character`).
```{r error=TRUE}
a <- "1.5"
b <- "3"
a + b
```
```{r}
typeof(a)
class(a)
object.size(a)
```
Note that the symbols `1.5` take up more or less memory depending on the data type they are stored as. This is directly linked to how the data/information is represented in binary code, which in turn determines how much memory is used to store these symbols in an object, as well as what we can do with the object.
### Example: Data types and information storage {.unnumbered}
Given the fact that computers only understand `0`s and `1`s, different approaches are taken to map these digital values to other symbols or images (text, decimal numbers, pictures, etc.) that we humans can more easily make sense of. Regarding text and numbers, these mappings involve *character encodings* (in which combinations of `0`s and `1`s represent a character in a specific alphabet) and *data types*.
Let's illustrate the main concepts with the simple numerical example from above. When we see the decimal number `139` written somewhere, we know that it means 'one-hundred-and-thirty-nine'. The fact that our computer is able to print `139` on the screen means that our computer can somehow map a sequence of `0`s and `1`s to the symbols `1`, `3`, and `9`. Depending on what we want to do with the data value `139` on our computer, there are different ways in which the computer can represent this value internally. Among other options, we could load it into RAM as a *string* ('text'/'character'), as an *integer* ('natural number'), or as a *double* (numeric, floating point number). All of them can be printed on screen, but only the latter two can be used for arithmetic computations. This concept can easily be illustrated in R.
We initiate a new variable with the value `139`. With this syntax, R by default initiates the variable as an object of type `double`. We can then use this variable in arithmetic operations.
```{r}
my_number <- 139
# check the type
typeof(my_number)
# arithmetic
my_number*2
```
When we change the *data type* to `character` (string), such operations are not possible.
```{r error=TRUE}
# change and check type/class
my_number_string <- as.character(my_number)
typeof(my_number_string)
# try to multiply
my_number_string*2
```
If we change the variable to type `integer`, we can still use math operators.
```{r}
# change and check type/class
my_number_int <- as.integer(my_number)
typeof(my_number_int)
# arithmetic
my_number_int*2
```
Having all variables stored as the correct data type is important for data analytics regardless of the sample size.
However, because different data types must be represented differently internally, they may take up more or less memory, which affects performance when dealing with massive amounts of data.
We can illustrate this point with `object.size()`:
```{r}
object.size("139")
object.size(139)
```
## Data structures {.unnumbered}
So far, we have only looked at individual values. A single dataset can contain gigabytes of data and both text and numeric values. R has several classes of objects that provide different data structures. Both the data types and the data structures used to store data can affect how much memory is needed to hold a dataset in RAM.
### Vectors vs. Factors in R {.unnumbered}
Vectors are collections of values of the same type. They can contain either all numeric values or all character values.
<!-- ```{r numvec, echo=FALSE, out.width = "5%", fig.align='center', fig.cap= "(ref:capnumvec)", purl=FALSE} -->
<!-- include_graphics("img/02_numvec.png") -->
<!-- ``` -->
<!-- (ref:capnumvec) Illustration of a numeric vector (symbolic). Figure by @murrell_2009 (licensed under CC BY-NC-SA 3.0 NZ). -->
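As a brief illustration of the numeric case (the object name and values below are made up purely for illustration), a numeric vector is initiated just like any other vector:
```{r}
# initiate a numeric vector
heights <- c(172.5, 168.0, 180.2)
heights
object.size(heights)
```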
For example, we can initiate a character vector containing information on the hometowns of persons participating in a survey.
```{r}
hometown <- c("St.Gallen", "Basel", "St.Gallen")
hometown
object.size(hometown)
```
Unlike in the data types example above, converting these values to type `numeric` to save memory is not an option here; R cannot meaningfully convert these strings into floating point numbers. Alternatively, we could consider a correspondence table in which each unique town name in the dataset is assigned a numeric (id) code. We would save memory this way, but it would require more effort to work with the data. Fortunately, the data structure `factor` in base R already implements this idea in a user-friendly manner.
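To see what such a correspondence table involves when built by hand, here is a minimal sketch using base R's `unique()` and `match()`; the object names `lookup` and `codes` are chosen purely for illustration.
```{r}
# build a simple correspondence table by hand:
# one integer code per unique town name
lookup <- unique(hometown)
codes <- match(hometown, lookup)
codes
# map the integer codes back to the town names
lookup[codes]
```
Factors bundle exactly this kind of mapping (integer codes plus a table of category labels) into a single object.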
Factors are sets of categories. Thus, the values are drawn from a fixed set of possible values.
<!-- ```{r factor, echo=FALSE, out.width = "5%", fig.align='center', fig.cap= "(ref:capfactor)", purl=FALSE} -->
<!-- include_graphics("img/02_factor.png") -->
<!-- ``` -->
<!-- (ref:capfactor) Illustration of a factor (symbolic). Figure by @murrell_2009 (licensed under CC BY-NC-SA 3.0 NZ). -->
Considering the same example as above, we can store the same information in an object of class `factor`.
```{r}
hometown_f <- factor(c("St.Gallen", "Basel", "St.Gallen"))
hometown_f
object.size(hometown_f)
```
At first glance, the fact that `hometown_f` consumes more memory than its character-vector sibling appears strange.
But we've seen this kind of 'paradox' before. Once again, the more sophisticated approach has an overhead (here not in terms of computing time but in terms of structure encoded in an object). `hometown_f` has more structure (i.e., a number-to-'factor level'/category label mapping).
This additional structure is also data that must be saved somewhere. This disadvantage, as in previous examples of overhead costs, diminishes with larger datasets:
```{r}
# create a large character vector
hometown_large <- rep(hometown, times = 1000)
# and the same content as factor
hometown_large_f <- factor(hometown_large)
# compare size
object.size(hometown_large)
object.size(hometown_large_f)
```
### Matrices/Arrays {.unnumbered}
Matrices are two-dimensional collections of values of the same type; arrays are higher-dimensional collections of values of the same type.
<!-- ```{r matrix, echo=FALSE, out.width = "10%", fig.align='center', fig.cap= "(ref:capmatrix)", purl=FALSE} -->
<!-- include_graphics("img/02_matrix.png") -->
<!-- ``` -->
<!-- (ref:capmatrix) Illustration of a numeric matrix (symbolic). Figure by @murrell_2009 (licensed under CC BY-NC-SA 3.0 NZ). -->
For example, we can initiate a three-row/two-column numeric matrix as follows.
```{r}
my_matrix <- matrix(c(1,2,3,4,5,6), nrow = 3)
my_matrix
```
And a three-dimensional numeric array as follows.
```{r}
my_array <- array(c(1,2,3,4,5,6), dim = c(1,2,3)) # a 1 x 2 x 3 array
my_array
```
### Data frames, tibbles, and data tables {.unnumbered}
Remember that in R, data frames are the most common way to represent a (table-like) dataset. Each column can contain a vector of a specific data type (or a factor), but all columns must be the same length. In the context of data analysis, each row of a data frame contains an observation, and each column contains a characteristic of that observation.
<!-- ```{r df, echo=FALSE, out.width = "10%", fig.align='center', fig.cap= "(ref:capdf)", purl=FALSE} -->
<!-- include_graphics("img/02_df.png") -->
<!-- ``` -->
<!-- (ref:capdf) Illustration of a data frame (symbolic). Figure by @murrell_2009 (licensed under CC BY-NC-SA 3.0 NZ). -->
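To make this concrete, here is a small example data frame (the toy data are made up and mirror the `data.table` example further below):
```{r}
# initiate a data.frame
df <- data.frame(person = c("Alice", "Ben"),
                 age = c(50, 30),
                 gender = c("f", "m"))
df
```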
The previous implementation of data frames in R made it difficult to work with large datasets.^[This was not an issue in the early days of R because datasets that were rather large by today's standards (in the gigabytes) could not have been handled properly by normal computers anyway (due to a lack of RAM).] Several newer R implementations of the data-frame concept were introduced with the aim of speeding up data processing. One is known as `tibble`, and it is implemented and used in the `tidyverse` packages. The other is known as `data.table`, and it is implemented in the `data.table` package. Most of the shortcomings of the original `data.frame` implementation, however, have been addressed in subsequent R versions, making traditional `data.frame`s, `tibble`s, and `data.table`s similarly suitable for working with large datasets (for in-memory processing).
Here is how we define a `data.table` in R:
```{r}
# load package
library(data.table)
# initiate a data.table
dt <- data.table(person = c("Alice", "Ben"),
                 age = c(50, 30),
                 gender = c("f", "m"))
dt
```
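For comparison, the same table can be initiated as a tibble; this is a minimal sketch and assumes the `tibble` package (part of the `tidyverse`) is installed.
```{r}
# load package
library(tibble)
# initiate a tibble
tb <- tibble(person = c("Alice", "Ben"),
             age = c(50, 30),
             gender = c("f", "m"))
tb
```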
### Lists {.unnumbered}
Similar to data frames and data tables, lists can contain different types of data in each element. For example, a list could contain several other lists, data frames, and vectors with differing numbers of elements.
<!-- ```{r list, echo=FALSE, out.width = "10%", fig.align='center', fig.cap= "(ref:caplist)", purl=FALSE} -->
<!-- include_graphics("img/02_list.png") -->
<!-- ``` -->
<!-- (ref:caplist) Illustration of a data frame (symbolic). Figure by @murrell_2009 (licensed under CC BY-NC-SA 3.0 NZ). -->
This flexibility can easily be demonstrated by combining some of the data structures created in the examples above:
```{r}
my_list <- list(my_array, my_matrix, dt)
my_list
```
## R-tools to investigate structures and types {.unnumbered}
package | function | purpose
-------- | ---------- | ---------------------------------------------
`utils` | `str()` | Compactly display the structure of an arbitrary R object.
`base` | `class()` | Prints the class(es) of an R object.
`base` | `typeof()` | Determines the (R-internal) type or storage mode of an object.
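As a quick illustration, we can apply these functions to the `data.table` `dt` created above:
```{r}
# compactly display the structure of dt
str(dt)
# the class(es) of dt
class(dt)
# the R-internal storage mode (a data.table is internally a list)
typeof(dt)
```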
# Appendix C: Install Hadoop {.unnumbered}
You might wish to install Hadoop locally on your computer in order to run the Hadoop example in Chapter 6.
The following steps help you set everything up. Please be aware that Hadoop may have undergone some changes between the time I wrote this book and the time you read it. Consult https://hadoop.apache.org/ for further details on releases and to install the most recent version.
Nevertheless, the installation instructions below should remain largely applicable. Note that the steps assume you are using *Ubuntu Linux*. See the README file at https://github.com/umatter/bigdata for additional hints regarding the installation of software used in this book.
```{bash eval=FALSE}
# download binary
wget https://dlcdn.apache.org/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
# download checksum
wget \
https://dlcdn.apache.org/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz.sha512
# run the verification
shasum -a 512 hadoop-2.10.1.tar.gz
# compare with the value in the .sha512 file
cat hadoop-2.10.1.tar.gz.sha512
# if all is fine, unpack
tar -xzvf hadoop-2.10.1.tar.gz
# move to proper place
sudo mv hadoop-2.10.1 /usr/local/hadoop
# then point Hadoop to your Java installation:
# open the file /usr/local/hadoop/etc/hadoop/hadoop-env.sh
```
```{bash eval=FALSE}
# in a text editor and set the JAVA_HOME line to the following
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
# clean up
rm hadoop-2.10.1.tar.gz
rm hadoop-2.10.1.tar.gz.sha512
```
After running all of the steps above, run the following line in the terminal to check the installation:
```{bash eval=FALSE}
# check installation
/usr/local/hadoop/bin/hadoop
```
```{r include=FALSE}
system("rsync -r ~/Dropbox/Teaching/HSG/BigData/BigData/_bookdown_files/bigdata_files ~/Dropbox/Teaching/HSG/BigData/BigData/docs/bigdata_files")
```
# (PART) References and Index {.unnumbered}
`r if (knitr::is_html_output()) '
# References {-}
'`