-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path02-data-structures.Rmd
524 lines (385 loc) · 18.5 KB
/
02-data-structures.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
# Data structures {#Chapter2}
<img src="images/matrixSmolt.jpg" alt="The best fish"><br>
<p style="font-family: times, serif; font-size:.9em; font-style:italic">
Contrast how you see a fish and how computers see fish. Our job is to bridge the gap. No problem...</p>
<br>
In this chapter, we will introduce basic data structures and how to work with them in R. One of our challenges is to understand how R sees our data.
R is what is known as a "high-level" or "interpreted" programming language, in addition to being "functional" and "object-oriented". This means the pieces that make it up are a little more intuitive to the average user than most low-level languages like C or C++. The back-end of R is, in fact, a collection of low-level code that builds up the functionality that we need. This means that R has a broad range of uses, from data management to math, and even GIS and data visualization tools, all of which are conveniently wrapped in an "intuitive", "user-friendly" language.
Part of this flexibility comes from the fact that R is also a "vectorized" language. Holy cow, R is so many things. But, why do you care about this? This will help you wrap your head around how objects are created and stored in R, which will help you understand how to make, access, modify, and combine the data that you will need for any approach to data analysis. It is maybe easiest to see by taking a look at some of the data structures that we'll work with.
We will work exclusively with objects and functions created in base R for this Chapter, so you do not need any of the class data sets to play along.
## Vectors {#vectors}
The vector is the basic unit of information in R. Pretty much everything else we'll concern ourselves with is made of vectors and can be contained within one. Wow, what an existential paradox *that* is.
Let's take a look at how this works and why it matters. Here, we have defined `a` as a variable with the value of `1`.
```{r}
a <- 1
```
...or have we?
```{r}
print(a)
```
What is the square bracket in the output here? It's an index. The index is telling us that the first element of `a` is `1`. This means that `a` is actually a "vector", not a "scalar" or singular value as you may have been thinking about it. You can think of a vector as a column in an Excel spreadsheet or an analogous data table. By treating every object (loosely) as a vector, or an element thereof, the language becomes much more general.
So, even if we define something with a single value, it is still just a vector with one element. For us, this is important because of the way that it lets us do math. It makes vector operations so easy that we don't even need to think about them when we start to make statistical models. It makes working through the math a zillion times easier than on paper! In terms of programming, it can make a lot of things easier, too.
An **atomic vector** is a vector that can hold one and only one kind of data. These can include:
+ Character
+ Numeric
+ Integer
+ Logical
+ Factor
+ Date/time
And some others, but none with which we'll concern ourselves here.
If you are ever curious about what kind of object you are working with, you can find out by exposing the data structure with `str()`:
Let's go play with some!
```{r}
str(a)
```
Examples of atomic vectors follow. Run the code to see what it does.
### Integers and numerics {-#nums}
First, we demonstrate one way to make a vector in R. The `c()` function ("combine") is our friend here for the quick-and-dirty approach.
In this case, we are making an object that contains a sequence of whole numbers, or integers.
```{r, eval=FALSE}
# Make a vector of integers 1-5
a <- c(1, 2, 3, 4, 5)
# One way to look at our vector
print(a)
```
Here is another way to make the same vector, but we need to pay attention to how R sees the data type. A closer look shows that these methods produce a **numeric** vector (`num`) instead of an **integer** vector (`int`). For the most part, this one won't make a huge difference, but it can become important when writing or debugging statistical models.
```{r}
# Define the same vector using a sequence
a <- seq(from = 1, to = 5, by = 1)
str(a)
```
We can change this by explicitly telling R how to build our vector:
```{r, eval = FALSE}
a <- as.vector(x = seq(1, 5, 1), mode = "numeric")
```
Notice that I did not include the argument names in the call to `seq()` because these are commonly used default arguments. But, you can find out what they are by running `?seq`.
### Characters and factors {-#strings}
**Characters** are anything that is represented as text strings. If I want to make a vector of character strings, I need to close the elements in quotes like I do below. Otherwise, R will go look for objects with these names.
```{r}
b <- c("a", "b", "c", "d", "e") # Make a character vector
b # Print it to the console
str(b) # Now it's a character vector
```
They are readily converted (sometimes automatically) to **factors**:
```{r}
b <- as.factor(b) # But we can change if we want
b
str(b) # Look at the data structure
```
**Factors** are a special kind of data type in R that we may run across from time to time. They have **levels** that can be ordered numerically. By default, R assigns factor levels (1, 2, 3 ...) in alpha numeric order, not by the order in which levels first appear in the data. This is not important except that it becomes useful for coding variables used in statistical models. But even then R does most of this behind the scenes and we won't have to worry about it for the most part. In fact, in a lot of cases we will want to change factors to numerics or characters so they are easier to manipulate.
This is what it looks like when we code a factor as number:
```{r, eval=FALSE}
as.numeric(b)
```
```{r, eval = FALSE}
# What did that do?
?as.numeric
```
> Aside: we can ask R what functions mean by adding a question mark as we do above in a couple of instances. And not just functions: we can ask it about pretty much any built-in object. The help pages take a little getting used to, but once you get the hang of it... In the mean time, the internet is your friend and you will find a multitude of online groups and forums with a quick search.
### Logical vectors {-#logicals}
Most of the `logical` vectors we deal with are yes/no or comparisons to determine whether a given piece of information matches a condition. Here, we use a logical check to see if the object `a` we created earlier is the same as object `b`. If we store the results of this check to a new object `c`, we get a new logical vector filled with `TRUE` and `FALSE`, one for each element in `a` and `b`.
```{r, message=FALSE, warning=FALSE}
# The "==" compares the numeric vector to the factor one
c <- a == b
c
str(c)
```
We now have a logical vector. For the sake of demonstration, we could perform any number of logical checks on a vector using built-in R functions (it does not need to be a logical like `c` above).
We can check for missing values.
```{r}
is.na(a)
```
We can make sure that all values are finite.
```{r}
is.finite(a)
```
The exclamation `!` point means "not" in to computers.
```{r}
!is.na(a)
```
We can see if specific elements meet a criterion.
```{r}
a == 3
```
We can just look at unique values.
```{r}
unique(b)
```
The examples above are all simple vector operations. These form the basis for data manipulation and analysis in R.
## Vector operations {#operations}
A lot of data manipulation in R is based on logical checks like the ones shown above. We can take these one step further to actually perform what one might think of as a "query" to select certain elements of a vector that satisfy some condition.
For example, we can reference specific elements of vectors directly. Here, we specify that we want to print the third element of `a`.
```{r}
# This one just prints it
a[3]
```
We might want to store that value to a new object `f` that is easier to read and type out.
```{r}
# This one stores it in a new object
# f is way easier to type than a
f <- a[3]
```
<br>
> **Important**
If it is not yet obvious, we have to assign the output of functions to new objects for the values to be usable in the future. In the example above, `a` is never actually *changed*. This is a common source of confusion early on.
Going further, we could select vector elements based on some condition. On the first line of code below, we tell R to show us the indices of the elements in vector `b` that match the character string `c`. Out loud, we would say, "`b` where the value of `b` is equal to `c`" in the first example. We can also use built-in R functions to just store the indices for all elements of `b` where `b` is equal to the character string `"c"`.
```{r}
b[b == "c"]
which(b == "c")
```
Perhaps more practically speaking, we can do elementwise operations on vectors easily in R. Here are a bunch of different things that you might be interested in doing with the objects that we've created so far. Give a few of these a try.
```{r, eval=FALSE}
a * .5 # Multiplication
a + 100 # Addition
a - 3 # Subtraction
a / 2 # Division
a^2 # Exponentiation
exp(a) # Same as "e to the a"
log(a) # Natural logarithm
log10(a) # Log base 10
```
If we change b to `character`, we can do string manipulation, too!
```{r}
# Convert b to character
b <- as.character(b)
```
We can append text. Remember, the examples below will just print the result. We would have to overwrite `b` or save it to a new object if we wanted to be able to use the result somewhere else later.
```{r}
# Paste an arbitrary string on to b
paste(b, "AAAA", sep = "")
# We can do it the other way
paste("AAAA", b, sep = "")
# Add symbols to separate
paste("AAAA", b, sep = "--")
# We can replace text
gsub(pattern = "c", replacement = "AAAA", b)
# Make a new object
e <- paste("AAAA", b, sep = "")
# Print to console
e
# We can strip text
# (or dates, or times, etc.)
substr(e, start = 5, stop = 5)
```
We can check how many elements are in a vector.
```{r}
# A has a length of 5,
# try it and check it
length(a)
# Yup, looks about right
a
```
And we can do lots of other nifty things. We can also bind multiple vectors together into a rectangular `matrix`. Say what?
## Matrices {#matrices}
Matrices are rectangular objects that we can think of as being made up of vectors.
We can make matrices by binding vectors that already exist.
```{r}
cbind(a, e)
```
Or we can make an empty one to fill.
```{r}
matrix(0, nrow = 3, ncol = 4)
```
Or we can make one from scratch.
```{r}
mat <- matrix(seq(1, 12), ncol = 3, nrow = 4)
```
We can do all of the things we did with vectors to matrices, but now we have more than one column, and a second dimension in the form of "rows" that we can also use to these ends:
```{r, eval=FALSE}
ncol(mat) # Number of columns
nrow(mat) # Number of rows
length(mat) # Total number of entries
mat[2, 3] # Value of row 2, column 3
str(mat)
```
See how number of rows and columns is defined in data structure? With rows and columns, we can assign column names and row names.
```{r}
colnames(mat) <- c("first", "second", "third")
rownames(mat) <- c("This", "is", "a", "matrix")
# Take a look
mat
```
We can also do math on matrices just like vectors, because matrices are just vectors smooshed into two dimensions (it's totally a word).
```{r}
mat * 2
```
All the same operations we did on [vectors](#operations) above...one example.
More on matrices as we need them. We won't use these a lot in this module, but R relies heavily on matrices to do linear algebra behind the scenes in the models that we will be working with.
## Dataframes {#dataframes}
Dataframes are like matrices, only not. They have a row/column structure like matrices and are also rectangular in nature. But, they can hold more than one data type!
Dataframes are made up of [atomic vectors](#atomics).
This is probably the data structure that we will use most in this book, along with atomic vectors.
Let's make a dataframe to see how it works.
```{r}
# Make a new object 'a' from a sequence
a <- seq(from = .5, to = 10, by = .5)
# Vector math: raise each 'a' to power of 2
b <- a^2
# Replicates values in object a # of times
c <- rep(c("a", "b", "c", "d"), 5)
# Note, we don't use quotes for objects,
# but we do for character variables
d <- data.frame(a, b, c)
```
Now we can look at it:
```{r}
print(d)
```
Notice that R assigns names to dataframes on the fly based on object names that you used to create them unless you specify elements of a data frame like this. They are not `colnames` as with [matrices](#matrices), they are `names`. You can set them when you make the dataframe like this:
```{r}
d <- data.frame(a = a, b = b, c = c)
```
Now can look at the names.
```{r}
# All of the names
names(d)
# One at a time: note indexing, names(d) is a vector!!
names(d)[2]
```
We can change the names.
```{r, eval=FALSE}
# All at once- note quotes
names(d) <- c("Increment", "Squared", "Class")
# Print it to see what this does
names(d)
# Or, change one at a time..
names(d)[3] <- "Letter"
# Print it again to see what changed
names(d)
```
We can also rename the entire dataframe.
```{r}
e <- d
```
Have a look:
```{r}
# Head shows first six
# rows by default
head(e)
```
```{r}
# Or, we can look at any
# other number that we want
head(e, 10)
```
We can make new columns in data frames like this!
```{r, eval=FALSE}
# Make a new column with the
# square root of our increment
# column
e$Sqrt <- sqrt(e$Increment)
e
```
Looking at specific elements of a dataframe is similar to a matrix, with some added capabilities. We'll do this with a real data set so it's more fun. There are a whole bunch of built-in data sets that we can use for examples. Let's start by looking at the `iris` data.
```{r}
# This is how you load built-in
# data sets
data("iris")
```
Play with the functions below to explore how this data set is stored in the environment, and how R sees it. This is a good practice to get into in general.
```{r, eval=FALSE}
# We can use ls() to see
# what is in our environment
ls()
# Look at the first six rows
# of data in the object
head(iris)
# How many rows does it have?
nrow(iris)
# How many columns?
ncol(iris)
# What are the column names?
names(iris)
# Have a look at the data structure-
# tells us all of the above
str(iris)
# Summarize the variables
# in the dataframe
summary(iris)
```
Now let's look at some specific things.
```{r}
# What is the value in 12th row
# of the 4th column of iris?
iris[12, 4]
# What is the mean sepal length
# across all species in iris?
mean(iris$Sepal.Length)
```
What about the `mean` of `Sepal.Length` just for `setosa`?
A couple of new things going on here:
1. We can refer to the columns as atomic vectors within the dataframe if we want to. Sometimes we have to do this...
2. Note the logical check for species
What we are saying here is, "Hey R, show me the mean of the column `Sepal.Length` in the dataframe `iris` where the species name is `setosa`"
```{r}
mean(iris$Sepal.Length[iris$Species == "setosa"])
```
We can write this out longhand to make sure it's correct (it is).
```{r}
logicalCheck <- iris$Species == "setosa"
lengthCheck <- iris$Sepal.Length[iris$Species == "setosa"]
```
We can also look at the whole data frame just for `setosa`. We will quickly switch over to using syntax that is a litter easier to understand for this, but this approach is at the core of pretty much all of those.
```{r}
# Note that the structure of species
# is preserved as a factor with three
# levels even though setosa is the
# only species name in the new df
setosaData <- iris[iris$Species == "setosa", ]
str(setosaData)
```
Finally, once we are working with dataframes, plotting becomes much easier to understand, and we can ease into some rudimentary, clunky R plots.
```{r}
# Some quick plotting code
# Once we have a nice dataframe like
# these ones, we can actually step into
# The world of exploratory analyses.
# Make a histogram of sepal lengths
hist(setosaData$Sepal.Length)
# Bi-plot
plot(setosaData$Sepal.Width, setosaData$Sepal.Length)
# Boxplots
boxplot(Sepal.Width ~ Species, data = iris)
```
Much, **MUCH** more of this to come as we continue.
## Lists
**Lists** are the ultimate data type in R. They are actually a [vector](vectors) that can hold different kinds of data, like a [dataframe](#dataframes). In fact, a dataframe is just a spectacularly rectangular list. Each element of a list can be any kind of object (an atomic vector, a matrix, a dataframe, or even another list!!).
Much of the real, filthy R programming relies heavily on lists. We will have to work with them at some point in this class, but we won't take a ton of time on them here. Lists are something we can ease back into later once the world stops spinning so fast from all the R.
Let's make a list - just to see how they work. Notice how our index operator has changed from `[ ]` to `[[ ]]` below? And, at the highest level of organization, we have only one dimension in our list, but any given element `myList[[i]]` could hold any number of dimensions.
```{r}
# Create an empty list with four elements
myList <- vector(mode = "list", length = 4)
# Assign some of our previously
# created objects to the elements
myList[[1]] <- a
myList[[2]] <- c
myList[[3]] <- mat
myList[[4]] <- d
```
Have a look at the list:
```{r}
# Print it
# Cool, huh?
myList
```
You can assign names when you create the list like we did for dataframes, too. You can do this manually, or R will do it on the fly for you. You can also reassign names to a list that you've already created.
```{r, eval=FALSE}
# No names by default
names(myList)
# Give it names like we did with
# a dataframe
names(myList) <- c("a", "c", "mat", "d")
# See how the names work now?
myList
# We reference these differently [[]]
myList[[1]]
# But we can still get into each object
# Play around with the numbers to see what they do!
myList[[2]][5]
# Can also reference it this way!
myList$c[1]
```
Very commonly, model objects and output are stored as lists. In fact, most objects that require a large amount of diverse information in R pack it all together in one place using lists, that way we always know where to find it and how as long as the objects are documented. Conceptually, every object in R, from your workspace on down the line, is a list **AND** an element of a list. It seems like a lot to take in now, but will be very useful in the future.
## Next steps {#next2}
For more practice with the data structures and R functions we covered here, you can check out this <a href="https://www.youtube.com/watch?v=h_Nruq9-NQw">walk-through of basic R commands</a> from the How To R YouTube Channel.
In the Chapter 3(#Chapter3), we will begin using functions from external R packages to read and work with real data.