-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathIMD_D4_M3_Control_Structures.Rmd
676 lines (570 loc) · 37.8 KB
/
IMD_D4_M3_Control_Structures.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
#### Control Structures
What are control structures? The simplest way to think about this is that these are programming elements that control the flow of data in your code depending on the value of logical conditions. There are six basic control structures and you can find them with `help("Control")`.
There are two control structures we are going to focus on: `for` loops and `if/else` statements. `for` loops are a control structure that allow you to repeat or iterate a section of code multiple times (*without* copy-pasting it over and over!). `if/else` statements control whether or not a section of code is run based on certain conditions. If we have time we are going to touch on `while` at the end, mostly for fun.
#### if/else
<details open><summary class='drop'>if/else Statement Overview</summary>
Most of you are probably familiar with this. It has 3 pieces (in order):
1. A conditional statement ("if") to evaluate as a (TRUE / FALSE), e.g. `(x>1)` 2. An action to perform if it is `TRUE`
3. An action to perform if it is `FALSE`
On any individual data point or set of data, only two of these code pieces will be applied. The conditional statement (1) will always be evaluated. Then only the appropriate `T` or `F` action (2 or 3) will be applied - the other will be skipped. There are a two different ways to invoke this in R and they do different things:
This first method `if(){}else{}` is a true control structure. This is useful for allowing your code to do different things for different data situations. However, it only evaluates a single TRUE/FALSE value.
`if(...logical statement...){...TRUE...}else{...FALSE...}` (hereafter `if/else`)
You will also see `ifelse` show up often.
`ifelse()` is the 'vectorized' version - this is most useful when you need to change the value in a vector i.e. (recoding). However you can make it work on single logical values. This is not recommended.
`ifelse(test= , yes= , no= )` (hereafter `ifelse`)
</details>
<br>
<details open><summary class='drop'>if/else</summary>
<details open><summary class='drop2'>if/else Examples</summary>
First thing to know about the `if/else` construction in R that it is very sensitive to the bracket placement, else must always be between a set of open brackets like this `}else{`. If it isn't, you will end up with errors that are hard to track down.
R will not comprehend this:
```{r pseudocode.1, echo=T, eval=F}
if(...){
"thing to do if TRUE"}
else{
"thing to do if FALSE"
}
```
R will only comprehend this (for now):
```{r pseudocode.2, echo=T, eval=F}
if (...) {
"thing to do if TRUE"
} else {
"thing to do if FALSE"
}
```
This statement expects a single boolean as the result of `if()` statement. Sometimes I like to use this to create switches or bypasses when I am writing loops/iteration. For example: If I have a dataframe that is non-empty, then do the analysis. If the data frame is empty do something else.
```{r Day4Flagging.1, echo=T, eval=T}
d <- c(1:4, 1:4 * 100, NA)
if (all(is.finite(d))) {
"no NA, NaN, or Inf present"
} else {
"NA present"
}
```
</details>
<details open><summary class='drop2'>if/else Nesting</summary>
This also nests nicely and we will get back to that later, but here is what that formally looks like:
```{r Day4Flagging.1b, echo=T, eval=T}
d <- c(1:4, 1:4 * 100)
if (all(is.na(d))) {
"NA present"
} else if (all(d < 100)) {
"all items less than 100"
} else if (any(d > 100)) {
paste(sum(d > 100), "items more than 100")
} else {
"I'm stumped"
}
```
This also works:
```{r Day4Flagging.1b2, echo=T, eval=T}
d <- c(1:4, 1:4 * 100)
if (all(is.na(d))) {
"NA present"
}
if (all(d < 100)) {
"all items less than 100"
}
if (any(d > 100)) {
paste(sum(d > 100), "items more than 100")
} else {
"I'm stumped"
}
```
Sarah - "*You do not need to say `else` each time, you can leave that off and it is fine. In fact you see that a lot in the underlying code, just execute `dplyr::bind_row` in the console and see for yourself.*"
Thomas - {hyperventilating} "*Iknow, but I can't handle that - it makes it hard to read for me.*"
</details>
</details>
<br>
<details open><summary class='drop'>ifelse()</summary>
<details open><summary class='drop2'>ifelse() Examples</summary>
Let's see how `ifelse` works to recode some data. Let me build this up starting from the data:
```{r Day4Recode.1, echo=T, eval=T}
d <- c(1:4, 100*1:4)
d
```
A simple vector of numbers.
Now lets evaluate that dataframe as a logical expression:
```{r Day4Recode.1a, echo=T, eval=T}
q_logic<- d < (mean(d) / 10)
q_logic
```
This will evaluate the expression and will return `TRUE` for all values of `d` that are less than the mean of `d` divided by 10. Do you remember PEMDAS? Well in R it is PEMDAS**L**. Logicals are evaluated after the math expression.
Under the hood, this vector of logicals is what `ifelse` is *actually* evaluating. In fact, we can just drop that into the ifelse directly if we want.
```{r Day4Recode.1b, echo=T, eval=T}
ifelse(q_logic, yes="verify", no="OK") #yes is the true condition no is the false condition
```
It goes through and replaces every `TRUE` value in that vector of logicals with "verify" and all of the `FALSE` values with "OK".
In reality, we seldom develop this kind of statement step by step like that. We would normally just write it on a single line.
```{r Day4Recode.1c, echo=T, eval=T}
q<-ifelse(d < (mean(d) / 10), "verify", "OK")
```
</details>
<br>
<details open><summary class='drop2'>ifelse() Missing Data</summary>
Now, let's see what happens if we have some missing data.
```{r Day4Recode.2, echo=T, eval=T}
d <- c(1:4, 100*1:4, NA)
ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
```
That `NA`, gets passed through. Maybe we want that, maybe we don't. In `mean()` and many other functions we were able to set `na.rm=TRUE` and omit the `NA` value, but reading through `help(ifelse)`, we don't really have that option. One option is that we can try to have two nested `ifelse` conditions.
</details>
<br>
<details open><summary class='drop2'>ifelse() Nesting</summary>
'Nesting' refers to the situation where you use a function, and then you use it again within itself. This frequently shows up in iteration tasks.
```{r Day4Recode.3, echo=T, eval=T}
d <- c(1:4, 100*1:4, NA)
ifelse(is.na(d) == T, "Missing",
ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
)
```
`is.na()` returns a logical vector with `F` for all finite elements and `T` for all `NA` and `NaN` (but not `Inf`). In this nested structure, the `F` cases are futher evaluated and labeled with the second statement. Nesting is useful, but chances are if you find yourself nesting more than 2 or 3 conditions deep, there is probably a better way to do things.
</details>
<br>
<details open><summary class='drop2'>case_when</summary>
That better way is `dplyr::case_when()`. If you look at the above code we were using to demonstrate if/else constructions we flagrantly violated 'DRY' principles. You might also remark that it could get a little confusing if we added anymore `else`s or cases to those statements. You would be correct on all counts. *When your only application of an if/else construct is to recode/reclassify data and you have more than 1 set of criteria for recoding,* you should really be using `case_when()`. It behaves like an if/else statement, but under the hood it is a pair of clever `for` loops, which we will talk about shortly.
```{r case_when, echo=T, eval=T}
d <- c(1:4, 100*1:4, NA)
dplyr::case_when(
is.na(d) ~ "Missing",
d < mean(d, na.rm = T) / 10 ~ "Verify",
d > mean(d, na.rm = T) / 10 ~ "OK"
)
```
There are a few things to note here. First this is a clear statement of what is happening and there are few {} or () to track. Also, there is no 'else' option. Because there is no `else` option we have to specify all of the cases of interest. If we do not fully specify the cases the resultant vector will add an `NA` for those cases that are not specified. It will evaluate each logical statement in order. Finally, we don't show it here, but it passes NA values into the output if they are present. These are not bad things, sometimes these are very useful behaviors. You just need to know what they are doing so you can decide when to take advantage of them. We will come back to this function in a `for` loop below if we have time.
<br>
</details>
</details>
<details open><summary class='drop2'>Wrap up</summary>
Wrapping up, I most commonly use the `if/else` constructions to create a switch in the code for special handling of undesirable data. However, I usually only go to this if there isn't a more direct solution. Also, getting the `{}` placement right can be infuriating. In my mind, `ifelse` gives a false sense of security. What all is in that 'else'?
<br>
</details>
</details>
#### Iteration
<details open><summary class='drop'>Iteration Overview</summary>
This brings us to iteration. Iteration is a control structure that tells code to repeat itself until some condition is satisfied. It is really good at doing the same thing over and over again on datasets that are pretty similar to one another. In other words, you can get a lot of work done fast. For today, we will focus on the `for` loop as our main approach to iteration.
Iteration consists of a few pieces:
1. The data you need to do something to
2. The method of iteration (`for`)
3. The task you need to do
4. The result of your task
*What are some kinds of repetitive tasks you find yourself doing that you are wanting to do more easily in R?*
<ul>
<li>...</li>
<li>...</li>
<li>...</li>
</ul>
</details>
<br>
<details open><summary class='drop'>for() Loops</summary>
The first iterator you may be familiar with is the `for` loop. Before we dive in, I want to pause and note that you will hear us tell you that there are better ways to iterate than a `for` loop. However, `for` loops are a fundamental skill for several reasons.
<ul>
<li> They reinforce a fundamental understanding of iterative processing.</li> <li> Everyone (probably) falls back on a for loop (eventually). </li>
<li> They are the precursor to writing your own functions. </li>
</ul>
How does it work? You give it a vector of numbers or characters, tell it what you want it to do, and it will do that thing for each item in the vector. In simplest terms that looks like:
```{r pseudocode.3, echo=T, eval=F}
for(i in vec){
do thing 1
do thing 2
...
append result to results of previous iterations
}
```
A few things to pick apart here:
`i` - i is an arbitrary variable, you could call it q. It will take on the value for the first item in the vector on the first step of iteration. q is then available in the loop as a variable and it takes on each successive value in vec with each step of the iteration.
`vec` - vector is that vector of items aka the conditions for iteration. These could be numeric or they could be characters or strings. Numeric vectors are common is you need to run through all of the cases. Character vectors are great if you have groups of data you want to work through. All 'vector` does it gives it some instructions on what it should work on and effectively when to stop.
Let's make sure we understand those two things
```{r ForLoop.0, echo=T, eval=T}
vec<-1:9 #vector over which we will iterate
for (i in vec) { # for the number 1 through 9
i
}
```
Nothing appears to have happened in the console. Why? Because I did not actually tell it to return something from that loop.
Also, that isn't entirely true... *something* happened:
```{r ForLoop.0b, echo=T, eval=T}
i
```
`i` was created in the environment and it has a value of 9 -- the last value of `vec`. Logically, that means the `for` loop must have made it to the end of `vec`.
An important thing to learn here is that `for` will not return information after each step on its own accord. You have to tell it to do that.
</details>
<br>
<details open><summary class='drop'>Getting information out of a for loop</summary>
Getting information out of a `for` loop is harder than it sounds and doing it right can make all of the difference in your code execution. First let's just get that information into the console so we can see it.
```{r ForLoop.0a, echo=T, eval=T}
for (i in vec) { # for the number 1 through 9
print(i+1)
}
```
We can see it clearly walking though that loop and adding 1 to each value. But, what we really want is a vector of values out of it.
```{r ForLoop.1a, echo=T, eval=T}
#alternatively predefine a container to receive the data
OutDat <- NULL # define an empty variable.
for (i in 1:9) {
OutDat[i] <- i + 1 # store that in the ith position of the variable OutDat
}
OutDat
```
This has now put the sum of `i+1` into the ith position of the vector `OutDat`
</details>
<details open><summary class='drop'>for loops for real IMD data tasks</summary>
<details open><summary class='drop2'>Preparing the data and the loop</summary>
So let's make a useful `for` loop. We are going to use raw hobo air temperature files. Just a little bit of background here, there are ~5 columns in most of these files.
<ul>
<li>Col. 1 is a sequential index column</li>
<li>Col. 2 is a date and time string</li>
<li>Col. 3 is the air temperature in degrees F</li>
<li>Col. 4 may be light measured as lux or lumens per ft2 (if it was measured)</li>
<li>Col. >5 are comment columns</li>
</ul>
We will want to do 4 things with those data. These should sound familiar:
<ul>
<li>Look at the data</li>
<li>We want to calculate some summary statistics.</li>
<li>We want to do a rudimentary QAQC check and flag potentially bad data</li>
<li>We want to save it to a file so we can do some further visual assessment.</li>
</ul>
```{r ForLoop.2a, echo=TRUE, eval=T}
library(magrittr)
library(ggplot2)
#create a list of file names that we will read in
fPaths <- paste0(
"https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/",
c(
"APIS01_20548905_2021_temp.csv",
"APIS02_20549198_2021_temp.csv",
"APIS03_20557246_2021_temp.csv",
"APIS04_20597702_2021_temp.csv",
"APIS05_20597703_2021_temp.csv"
)
)
fPaths
```
Note, `paste0()` is a special version of `paste()` that assumes no spaces between arguments.
Let's get some idea of what we are dealing with for the data so we can plan the code. It is almost impossible to read in data from files blindly so let's preview it.
```{r ForLoop.2b, echo=TRUE, eval=T}
#read in and inspect the first element (for training purposes)
temp<-read.csv(fPaths[1], skip = 1, header = T) #hobo data has an extra line at the top we need to skip
head(temp)
```
Gross, but we can fix that. Because I know what is in there and what the order is, let's redo this with headers set to `F`. (Note, I am changing `skip=2` otherwise this will treat the headers as data.)
```{r ForLoop.2c, echo=TRUE, eval=T}
temp<-read.csv(fPaths[1], skip = 2, header = F) #Set header to F so read.csv doesn't treat data as a header.
names(temp)[1:3]<-c("idx", "DateTime", "T_F") #let's just assign some column names really quick
head(temp)
str(temp) #look at data types
rm(temp) #do a little cleanup
```
We are only interested in the first 3 columns today.
So, things we are aware of for coding so far:
<ul>
-Three columns of data is all we want right now<br>
-Two columns are numeric/integer<br>
-Lastly, because I know these data, one of the files is empty except for the headers.<br>
</ul>
</details>
<br>
<details closed><summary class='drop2'>A Loop to Calculate Summary Statistics</summary>
What do we want to do?
<ul>
We are going to create a simple summary statistics dataframe.<br>
The loop will need to handle the file with no data somehow.<br>
We will need to initialize a empty variable so you can get data out of the for loop. We will use a list for reasons discussed later on.
</ul>
```{r ForLoop.2f, echo=TRUE, eval=T}
len<-length(fPaths) #calculate the number of iterations
SummaryData <- list() #Generate an empty list that we will put output into. Always rerun prior to loop so that it is empty.
for (i in 1:len) {
# 1 Read in and generate the data to use
# 1a extract the filename from a filepath and parse the metadata
fName <- basename(fPaths[i])
Site <- stringr::str_sub(fName, 1, 6)
Year <- as.numeric(stringr::str_sub(fName, -13, -10))
# 1b read in the data, change headers,
d <- read.csv(fPaths[i], skip = 1, header = T)[1:3] %>%
setNames(., c("idx", "DateTime", "T_F"))
# 2 summarize and 3 output the data
SummaryData[[i]] <- data.frame( #output the data
# 2 what to summarize for all files
Site = Site,
Year = Year,
FileName = fName,
# Variable output
# 2a What to summarize so long as there is 1 non-NA record
if (any(is.numeric(d[, 3]))) { # check that there is at least 1 non-NA record
list(
T_F_mean = mean(d$T_F, na.rm = T),
T_F_min = min(d$T_F, na.rm = T),
T_F_max = max(d$T_F, na.rm = T),
nObs = nrow(d),
Flag = NA
)
# 2b what to summarize if the data are missing.
} else {
list(
T_F_mean = NA,
T_F_min = NA,
T_F_max = NA,
nObs = nrow(d),
nDays = NA,
Flag = "Hobo File Present But No Data"
)
}
)
}
dplyr::bind_rows(SummaryData)
```
Lets take some time to pick this apart and understand what it is doing.
1. **The read and generate data step.** Is a pretty standard part of a loop where you define the data you want to use. This might be some variables that you are going to reuse late or actual data you are reading in. Here, we are just parsing some metadata from the filename.
2. **The getting stuff done step.** This step is setting up where the data goes -- "list element i" -- and what data to put there. The nifty thing is that we are calling an if/else statement within the construction of the dataframe. There are dataframe elements that will be fixed and dataframe elements that will be variable.
3. **The output step.** Here, the output step is integrated into the creation of the output dataframe.
</details>
<details closed><summary class='drop2'>A QA/QC Data Flagging Loop</summary>
The next thing we want to do is some data QA/QC. Here we want to apply some rules from our SOP for flagging data that requires follow-up verification. We don't know if the data are good or bad, but we do know we don't want to waste time trying to flag it in an Excel spreadsheet.
```{r ForLoop.2d, echo=TRUE, eval=F}
OutPath <- paste0(getwd(), "/hobo_ouputs/") # or put your preferred path here (we will use this later)
# set up some QA thresholds
TempUpper <- 40
TempLower <- (-40)
for (i in fPaths) {
# 1 Read in/generate data
fName <- basename(i)
d <- read.csv(i, skip = 1, header = T)[1:3] %>%
setNames(., c("idx", "DateTime", "T_F"))
# 2 Generate your output in this case, some QA/QC flagging
d_com <- d %>%
dplyr::mutate(
# checks for too big of a temp change
TChangeFlag = ifelse(c(abs(diff(T_F, lag = 1)) > 10, FALSE), "D10", NA),
# flag if a temp is too high or too low
Flag = dplyr::case_when(
is.na(T_F) ~ "MISSING",
T_F > TempUpper ~ "High",
T_F < TempLower ~ "Low"
)
) %>%
# merge so you can have multiple flags applied
tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "_", remove = TRUE, na.rm = TRUE)
# 3 output the data
dir.create(OutPath, showWarnings = F)
write.csv(d_com[, c("DateTime", "T_F", "Flag")],
paste0(OutPath, gsub(".csv", "_QC.csv", fName)),
row.names = F, quote = F
)
}
```
We used some new functions there, but let's start at a higher level and work our way down to the lower level functions.
What changed with the `for` loop?
A for loop can accept any vector as input. It could be a vector of names or a could be a vector of numbers. In the first loop we did, I wanted that vector of numbers so I had an automatic index to use. In this case, I just need the name, I don't need that index.
1. This is mostly the same as before.
2. There is new stuff here but this boils down to: "I want to create 2 variables and then merge them. `mutate()` is creating the variables `Flag` and `TChangeFlag`. `Flag` is using a dplyr version of ifelse called `case_when()` that allows you to specify what to return if certain conditions are true. `TChangeFlag` is the first derivative of the temperature data - rate of change of the data. If the rate of chagne is too high, then we flag it. This could have been done with `case_when()`, but we are talking about control structures today.
3. I want to point out the 'on the fly' file name generation. This takes some practice in `paste()`. This is just string manipulation. All this is doing is slightly modifying the old string, and attaching a new string to create the desired filepath.
</details>
<details open><summary class='drop2'>A Plotting Loop</summary>
Let's do our last common for loop. That is the 'barf out a bunch of plots' loop. I am visual and can quickly diagnose data by looking at it. I like to do this so I have some PDFs on hand.
```{r ForLoop.2g, echo=TRUE, eval=F}
OutPath <- paste0(getwd(), "/hobo_ouputs/") # or put your preferred path here (we will use this
for (i in fPaths) {
#1 Read in/generate data
fName <- basename(i)
d <- read.csv(i, skip = 1, header = T)[1:3] %>%
setNames(., c("idx", "DateTime", "T_F")) %>%
dplyr::mutate(., DateTime2 = as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz = "UCT"))
#2 Do the thing you want to do, in this case make and save plots
p <- ggplot(d, aes(x = DateTime2, y = T_F)) +
geom_point(size = 0.5) +
ggtitle(fName)
#3 Output the results outside of the for loop
dir.create(OutPath, showWarnings = F)
ggsave(
filename = gsub(".csv", "_plot.pdf", fName),
path = OutPath, device = "pdf",
plot = p, width = 5, height = 5, units = "in"
)
}
```
Let's break this down because practice makes perfect - we will get into the nitty gritty details in class, but this is the basic outline of what is happeneing.
0. We are iterating based on the vector `fPaths`. There will be as many iterations as there are items in `fPaths`. In our case `fPaths` is a vector of file paths. This means that on each iteration, `i` will become a different file path.
1. Read in the data and do a little data wrangling.
2. Create a simple plot so we can look at the data.
3. Use the `ggsave()` function to save a pdf. What I want to point out here is that we don't need to build the full file path outside of ggsave. We specify the file name as discussed before and feed it a path separately. ggsave glues them together by calling some base R functions.
</details>
</details>
<br>
<details open><summary class='drop'>Recap and final thoughts on for loops and control structures</summary>
What did we do?
<ul>
<li>Learned the basic structure of most for loops. Probably the most important thing.</li>
<li>Used some for loops to do some actual tasks on multiple datasets.</li>
<li>Read files into and wrote files out of a for loop.</li>
<li>Wrote data calculated in a for loop to a variable in the environment.</li>
<li>Used `ifelse()` logic to flag some data for checking.</li>
<li>Use `if...else` to create a switch for NA handling in a dataframe in a for loop.</li>
<li>Maybe learned a few new functions.</li>
<li>...?</li>
</ul>
At this stage you may be thinking. Ok, but you have read the data in multiple times and repeated a lot of code. Couldn't we just wrap this in one giant `for` loop?
+ We could have written the loop to read in the data, store it in memory, and then loop only on the stored data for flagging, statistics, and plotting. (We are actually going to do that next.) But, loading stuff in R takes up memory. Depending on the size of your datasets, you may not want unused information floating around hogging memory. Continuous data files can become large objects. So, it depends.
+ There are only rare instances where you want to wrap stuff up in a complicated `for` loop. The more complex the loop the harder it is to debug. Complex `for` loops are also hard to build flexibility into (e.g. handling `NA` values). In general you should shoot for one task or task family per loop.
</details>
#### Preallocation
<details open><summary class='drop'>Preallocating Output Objects</summary>
Before we move on let's take time to discuss a more advanced R topic: preallocating the for loop output. This primarily refers to dataframe-like loop output variables and it makes a huge difference in how quickly your code executes. There are many ways to initialize this, I have shown two ways -- there are more.
In the first method we defined `OutDat<-NULL` and in the body of the loop proceeded to assign each iteration's output to a specific location in `OutDat` using indexing `OutDat[i]<-i+1`. We did something similar in the above example where we created a list `SummaryData <- list()` and then added an new list element for each iteration's output `SummaryData[[i]] <- data.frame(...)`. These are both acceptable ways to capture output.
But, most of us think in dataframe most of the time so we expect a dataframe out. This means we need an extra step to ensure we have dataframe output. This leads some people to say 'hey, why don't we just put it into a dataframe to begin with?' The most natural thing in the world is to do:
```{r Preallocation, echo=T, eval=F}
#alternatively predefine a container to receive the data
nIter=10000
OutDat <- NULL # define an empty variable.
system.time(for (i in 1:nIter) {
d <- data.frame(x=i + 1,y=i)
OutDat<-rbind(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.47 seconds on my system
#similarly you could try append() - it works
nIter=10000
OutDat <- NULL # define an empty variable.
system.time(for (i in 1:nIter) {
d <- data.frame(x=i + 1,y=i)
OutDat<-append(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.93 seconds on my system
```
Instant dataframe and everyone is happy. Or are they? They aren't. This is called `growing` an object. Patrick Burns also refers to this as the "Second Circle of R Hell" in his book *The R Inferno* (Dante's *Inferno*... the R community was a wonderful sense of humor). Look it up, it is great (no really, it is). To do this 10,000 times requires 4.96 seconds, 1,000,000 time took... well I quit after 13 minutes. If this were a strictly linear process it should have taken around 4 minutes to do 1,000,000 iterations - clearly it did not.
Why? When you use `rbind()` or `append()` in this manner it copies the dataframe each time and tells R to go look for/request another giant block of memory instead of the one that is already there. This slows your processing way down. So.... don't do that. Instead you should preallocate your dataframe, especially if it is going to be big, it even makes a difference on smaller dataframes.
What does preallocate mean? `x<-data.frame()` just tells R you need an object that is a dataframe. What you really want to do is tell R the exact dimensions of the object you will need. If you want dataframe output, just generate a blank dataframe the same dimensions as your output. This is sometimes simple. The number of rows is probably the number of iterations `vec` and you already are defining columns so you could just start out with `x<-data.frame(Mean=rep(NA,vec), Min=rep(NA,vec))` and use the iteration indexing to tell R which row to drop stuff in.
Our example above, becomes:
```{r Preallocation.1, echo=T, eval=F}
nIter=10000 # number of iterations
OutDat <- data.frame(x=rep(NA,nIter),y=rep(NA,nIter)) # define an preallocated dataframe
system.time(for (i in 1:nIter) {
OutDat[i,] <- data.frame(x=i + 1, y=i)
}) #2.42 seconds on my system
```
</details>
In this simple example preallocating the dataframe speeds up the loop execution by 50%. In this example a 2.5 second gain seems small. However, across larger or multiple datasets this can become a significant time cost if you don't preallocate.
<details open><summary class='drop'>Preallocation Wrapup</summary>
<ul>
*When do we preallocate?* Pretty much just when we use `for` loops to create data objects. It is important to note: `apply` family and `purrr` will take care of that preallocation for you.
*Should you always preallocate?* If you are using a `for` loop to output a data object, probably. <br>
*Do you really need to?* Not until your code execution takes longer than you are willing to wait.<br>
*Why not install more memory?* That will appear to solve the problem. But datasets will just get bigger over time.<br>
*So I shouldn't use a for loop?* No, you can when you need to, but there is an art to it. This course is all about equipping you with tools to help avoid *some* frustration.<br>
</ul>
</details>
#### apply Family
<details open><summary class='drop'>apply Family AKA better ways to iterate in R</summary>
Now that you have the basics down, we are going to forge ahead with better and better ways to iterate. One family of functions that often replaces `for()` is the `apply` family of functions. We will get you started here and then dive deeper during the avanced R course on "Functional Programming" next week. For now:
`apply` - operates on things resembling a dataframe. Okay, really it operates on a matrix where all values are the same type (all number or all letters). It can operate by row or by column and the output will be a
`mapply` - Is a way to vectorize a function that isn't vectorized. I don't think I have used this in the past year, but it is in the apply family and you might encounter it.
`lapply` - Apparently, this is a special case of mapply and it operates on lists of things and stores the output in a list of things. sapply and vapply are variations on this theme.
`sapply` - A 'wrapper' for lapply that will coerce the list output into a matrix or array.
`tapply` - apply by groups. This lets you apply functions by a group. So, quick summary statistics, but you will quickly outgrow this and there are better ways to do this.
</details>
<br>
<details open><summary class='drop2'>mtcars</summary>
To explore using simple functions in `apply`, let's look at `mtcars`. This is a wonderful little built in datast that I like to use for developing and demoing code (see `help(mtcars)` for description). It has continuous and categorical variables that cover a good bit of what I end up needing to do. Another reason to know about `mtcars` is that a lot of answers on stackoverflow will call this dataset.
```{r ApplyFamily.0, echo=T, eval=T}
str(mtcars)
head(mtcars)
```
</details>
<br>
<details open><summary class='drop'>apply()</summary>
First let's calculate the mean across the rows. It doesn't make a whole lot of sense as to why we would do this because the columns contain values with different units, but we can if we set `margin=1`.
```{r ApplyFamily.1, echo=T, eval=T}
apply(mtcars,MARGIN=1, FUN=mean) #calculate mean across the row
```
Note: `apply()` will operate on all the data it is provided. If your data had a non-numeric element in the first column, for the function `mean()` to work you will have to exclude that column from the data provided to `apply`.
Apply is preserving/using the row name. That is nice in this case. The make/model of the vehicle was stored as row names so is not technically data being passed to `mean` for analysis. We get out a vector with the (meaningless) average value for each make and model of vehicle.
It might make more sense to calculate the mean value for each of the attributes of a car. We can do that by setting `margin=2`.
```{r ApplyFamily.2, echo=T, eval=T}
apply(mtcars, MARGIN=2, FUN=mean)
```
That returns the mean of each variable. What is happending is the apply is disassembling the data frame and passing it line by line (`margin=1`) or row by row (`margin=2`) into `mean()` and collecting the output into a named vector.
</details>
<br>
<details open><summary class='drop'>lapply()</summary>
<details open><summary class='drop2'>lapply() overview</summary>
Let's change gears a little bit and return to those hobo data to take a look at `lapply()`. First, a quick review of what is a list. A list is an R object that can hold a bunch of different elements. A list is ok if one element is a string and the next element is a character or if one element has 100 columns and 10 rows while the next element has 10 columns and 100 rows. It even can hold a dataframe as one element and a matrix as another element without any trouble. *A list says you're ok just the way you are.*
We are going to use our list to read in data that are mostly the same. If we wanted to store them in a single dataframe: 1) it would get unwieldy and 2) you would have to write code that 'fixed' inconsistencies between the dataframes. By storing them as independent list elements we don't necessarily need to write code that 'fixes' the inconsistencies.
We discussed this earlier, in the previous `for` loop we operated on each of our hobo csvs one at a time. What we are going to do this time is open them all into a single list object with 5 different elmenets (1 for each csv file).
</details>
<br>
<details open><summary class='drop2'>lapply() to read in data</summary>
You should still have fPaths hanging out somewhere. This is that list of hobos.
```{r ApplyFamily.3, echo=T, eval=F}
fPaths
```
What we will do next is iterate through that list of names and store it in a list. We do that with lapply. What does lapply do? It takes in a list or vector of data and returns list of data of the same length. `lapply(X, FUN,...)` has 3 arguments.
+ `X` the data or the vector to FUN
+ `FUN` what to do
+ `...` elipsis - this is a bit of an odd one. This is where you put whatever other arguments you want passed to FUN. Looking at the examples below will help.
Get the data
```{r ApplyFamily.4, echo=T, eval=F}
HoboList<-lapply(fPaths,read.csv, skip=1, header=T)
HoboList
```
This is passing, sequentially, file paths into `read.csv()`. Just like our for loop more or less. It is also passing `skip` and `header` values to `read.csv()`. It returned a list of all the csvs with everything in them.
Simplify the Data
```{r ApplyFamily.5, echo=T, eval=F}
HoboList<-lapply(HoboList,"[",,1:3)
HoboList
```
This then takes that list as input and passes each of the dataframes contained within it to the indexing function. That is right, indexing `[]` is a function and we can pass it into `FUN` by calling `"["`. What fun! Not the empty commas there `"[",,1:3` are intentional and will be interpreted the same as saying `x[,1:3]` or... go get columns 1 through 3.
Clean those headers
```{r ApplyFamily.6, echo=T, eval=F}
HoboList<-lapply(HoboList,setNames, c("idx","DateTime","T_F"))
HoboList
```
Then we simply just rename the headers because lapply passes each dataframe to the list.
All together now with the piping!
```{r ApplyFamily.7, echo=T, eval=F}
HoboList<-lapply(fPaths,read.csv, skip=1, header=T)%>% #1. read hobo data into a list
lapply(.,"[",,1:3)%>% #2. Grab only first 3 columns
lapply(.,setNames, c("idx","DateTime","T_F")) #3. fix column names
```
Now all of those hobo files are available in the `HoboList` object that we could summarize, do some flagging, or plot from. How do we do that? We could do for loops again, or we could keep going with `lapply`.
</details>
<br>
<details open><summary class='drop2'>lapply() to summarize data</summary>
Now, let's summarize that object and .
```{r ApplyFamily.8, echo=TRUE, eval=FALSE}
HoboSummary<-lapply(HoboList[-3], FUN=function(x){ #-3 drop that blank file for now
d<-data.frame(
Mean = mean(x$T_F, na.rm=T),
Min = min(x$T_F, na.rm=T),
Max = max(x$T_F, na.rm=T),
nObs = nrow(x)
)
})
do.call(rbind, HoboSummary) #equivalent to dplyr::bind_rows(HoboSummary)
```
This is a stripped down version of the for loop we did before. We then use `do.call()`, the base equivalent to `dplyr::bind_rows()`, to generate a dataframe. `do.call` takes a function (here `rbind`) and passes it a list of arguments (here list elements). So, it passes all of the individual elements of HoboSummary to `rbind` as individual arguments to glue together.
We could then do the flagging step and the plotting steps. However, we are going to stop here and pick those up in the advanced R session because we really need to talk about how to make a `function()` before we go a whole lot further and that is a topic for next time.
</details>
</details>
<br>
#### while()
Another type of loop you should be aware of is the `while` loop. It probably won't come up much when you're working with data in R, but as a programmer you should have it in your toolbox. While the `apply` family of functions is specific to R, `for` loops and `while` loops are present in most programming languages.
In a `for` loop, you decide ahead of time how many times the code inside your loop should execute. The `for` loop iterates through a vector of a certain length and stops when it reaches the end.
In a `while` loop, on the other hand, your loop executes *while* some condition is true. When that condition is no longer true, the loop stops. `while` loops are good for situations when you can't predict ahead of time how many times the loop needs to execute.
As an example, we'll make a simple guessing game. The code below selects a random integer between 1 and 10. It prompts the user to type a guess at the console. If they guess correctly within three tries, they win.
```{r While, eval=FALSE}
guess <- 0 # Initialize the guess variable
number <- sample(1:10, 1) # Get our random number
guess_count <- 0 # Number of guesses so far
while (guess != number && guess_count < 3) { # Keep looping if the guess is wrong and we haven't hit our 3 guesses yet
guess_count <- guess_count + 1
message <- paste0("Guess ", guess_count, ": what is the random number? ")
guess <- as.integer(readline(prompt = message))
if (guess == number) {
print("Woohoo! You won the game!")
} else if (guess_count < 3) {
print("Nope, try again.")
} else {
print("You lost :(")
}
}
```
When you are writing a `while` loop, make sure that it will eventually terminate! If the test expression (in this case `guess != number && guess_count < 3`) never becomes false, your `while` loop will go on forever! An infinite `while` loop will ungracefully crash your code, so be sure to test your loop thoroughly, especially if it's more complex.