---
title: 'Bioinformatics for Big Omics Data: Advanced data manipulation'
author: "Raphael Gottardo"
date: "December 8, 2014"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: 1
keep_md: yes
smaller: yes
---
## Setting up some options
Let's first turn on the cache for increased performance.
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=80), cache=TRUE)
```
## Motivation
- R has pass-by-value semantics, which minimizes accidental side effects. However, this can become a major bottleneck when dealing with large datasets both in terms of memory and speed.
- Working with `data.frame`s in R can also be a painful process in terms of writing code.
- Fortunately, R provides some solutions to these problems.
## Overview
Here we will review three R packages that can be used to provide efficient data manipulation:
- `data.table`: A package for efficient data storage and manipulation
- `RSQLite`: Database Interface R driver for SQLite
- `sqldf`: An R package for running SQL statements on R data frames, optimized for convenience
<small>Thanks to Kevin Ushey (@kevin_ushey) for the `data.table` notes and to Matt Dowle and Arunkumar Srinivasan for helpful comments.</small>
## What is data.table?
`data.table` is an R package that extends `R` `data.frame`s.
Under the hood, they are just `data.frame`s, with some extra 'stuff' added on.
So, they're collections of equal-length vectors. Each vector can be of different type.
```{r}
library(data.table)
dt <- data.table(x=1:3, y=c(4, 5, 6), z=letters[1:3])
dt
class(dt)
```
The extra functionality offered by `data.table` allows us to modify, reshape, and merge `data.table`s much quicker than `data.frame`s. **See that `data.table` inherits from `data.frame`!**
## Installing data.table
- stable CRAN release
```{r eval=FALSE}
# Only install if not already installed.
require(data.table) || install.packages("data.table")
```
- latest bug-fixes + enhancements (No need to do that, we will use the stable release)
```{r eval=FALSE}
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
```
## What's different?
Most of your interactions with `data.table`s will be through the subset (`[`)
operator, which behaves quite differently for `data.table`s. We'll examine
a few of the common cases.
Visit [this stackoverflow question](http://stackoverflow.com/questions/13618488/what-you-can-do-with-data-frame-that-you-cant-in-data-table) for a summary of the differences between `data.frame`s and `data.table`s.
## Single element subsetting
```{r}
library(data.table)
DF <- data.frame(x=1:3, y=4:6, z=7:9)
DT <- data.table(x=1:3, y=4:6, z=7:9)
DF[c(2,3)]
DT[c(2,3)]
```
By default, single-element subsetting in `data.table`s refers to rows, rather
than columns.
## Row subsetting
```{r}
library(data.table)
DF <- data.frame(x=1:3, y=4:6, z=7:9)
DT <- data.table(x=1:3, y=4:6, z=7:9)
DF[c(2,3), ]
DT[c(2,3), ]
```
Notice: row names are lost with `data.table`s. Otherwise, output is identical.
## Column subsetting
```{r}
library(data.table)
DF <- data.frame(x=1:3, y=4:6, z=7:9)
DT <- data.table(x=1:3, y=4:6, z=7:9)
DF[, c(2,3)]
DT[, c(2,3)]
```
`DT[, c(2,3)]` just returns `c(2, 3)`. Why on earth is that?
## The j expression
The subset operator is really a function, and `data.table` modifies it to behave
differently.
We refer to the arguments we pass as `i` and `j`, as in `DT[i, j]` or `DT[i]`.
The second argument to `[` is called the `j expression`, so-called because it's
interpreted as an `R` expression. This is where most of the `data.table`
magic happens.
`j` is an expression evaluated within the frame of the `data.table`, so
it sees the column names of `DT`. Similarly for `i`.
First, let's remind ourselves what an `R` expression is.
## Expressions
An `expression` is a collection of statements, enclosed in a block generated by
braces `{}`.
```{r}
## an expression with two statements
{
  x <- 1
  y <- 2
}
## the last statement in an expression is returned
k <- { print(10); 5 }
print(k)
```
## The j expression
So, `data.table` does something special with the `expression` that you pass as
`j`, the second argument, to the subsetting (`[`) operator.
The return type of the final statement in our expression determines the type
of operation that will be performed.
In general, the output should either be a `list` of symbols, or a statement
using `:=`.
We'll start by looking at the `list` of symbols as an output.
## An example
When we simply provide a `list` to the `j expression`, we generate a new
`data.table` as output, with operations as performed within the `list` call.
```{r}
library(data.table)
DT <- data.table(x=1:5, y=1:5)
DT[, list(mean_x = mean(x), sum_y = sum(y), sumsq=sum(x^2+y^2))]
```
Notice how the symbols `x` and `y` are looked up within the `data.table` `DT`.
No more writing `DT$` everywhere!
## Using :=
Using the `:=` operator tells us we should assign columns by reference into
the `data.table` `DT`:
```{r}
library(data.table)
DT <- data.table(x=1:5)
DT[, y := x^2]
```
## Using :=
By default, `data.table`s are not copied on a direct assignment `<-`:
```{r}
library(data.table)
DT <- data.table(x=1)
DT2 <- DT
DT[, y := 2]
DT2
```
Notice that `DT2` has changed. This is something to be mindful of; if you want
to explicitly copy a `data.table`, do so with `DT2 <- copy(DT)`.
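A minimal sketch of the difference (toy data, for illustration only):
```{r}
library(data.table)
DT <- data.table(x = 1)
DT3 <- copy(DT)   # explicit deep copy
DT[, y := 2]      # modifies DT (and any plain alias), but not the copy
DT3               # unchanged: no y column
```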
## A slightly more complicated example
```{r}
library(data.table)
DT <- data.table(x=1:5, y=6:10, z=11:15)
DT[, m := log2( (x+1) / (y+1) )]
```
## Using an expression in j
Note that the right-hand side of a `:=` call can be an expression.
```{r}
library(data.table)
DT <- data.table(x=1:5, y=6:10, z=11:15)
DT[, m := { tmp <- (x + 1) / (y + 1); log2(tmp) }]
```
## Multiple returns in j
The left hand side of a `:=` call can also be a character vector of names,
for which the corresponding final statement in the `j expression` should be
a list of the same length.
```{r}
library(data.table)
DT <- data.table(x=1:5, y=6:10, z=11:15)
DT[, c('m', 'n') := { tmp <- (x + 1) / (y + 1); list( log2(tmp), log10(tmp) ) }]
DT[, `:=`(a=x^2, b=y^2)]
DT[, c("c","d"):=list(x^2, y^2)]
```
## The j expression revisited
So, we typically call `j` the `j expression`, but really, it's either:
1. An expression, or
2. A call to the function `:=`, for which the first argument is a set of
names (vectors to update), and the second argument is an expression, with
the final statement typically being a list of results to assign within
the `data.table`.
Note that `a := b` is parsed by `R` as `":="(a, b)`, which is why `:=`
looks somewhat like an operator even though it is really a function.
```{r}
quote(a := b)
```
## Why does it matter?
Whenever you sub-assign a `data.frame`, `R` is forced to copy the entire
`data.frame`.
That is, whenever you write `DF$x <- 1`, `DF["x"] <- 1`, `DF[["x"]] <- 1`...
... R will make a copy of `DF` before assignment.
This is done in order to ensure any other symbols pointing at the same object
do not get modified. This is a good thing for when we need to reason about
the code we write, since, in general, we expect `R` to operate without side
effects.
Unfortunately, this copying is prohibitively slow for large objects, which is why
`:=` can be very useful.
## Why does it matter?
```{r}
library(data.table); library(microbenchmark)
big_df <- data.frame(x=rnorm(1E6), y=sample(letters, 1E6, TRUE))
big_dt <- data.table(big_df)
microbenchmark( big_df$z <- 1, big_dt[, z := 1] )
```
Once again, notice that `:=` is actually a function, and `z := 1`
is parsed as `":="(z, 1)`.
## Using by
We can also perform grouping-like operations through the
`by` argument:
```{r}
library(data.table)
DT <- data.table(x=1:5, y=6:10, gp=c('a', 'a', 'a', 'b', 'b'))
DT[, z := mean(x+y), by=gp]
```
Notice that since `mean(x+y)` returns a scalar (numeric vector of length 1), it is recycled to fill within each group.
## Generating a new data.table
What if, rather than modifying the current `data.table`, we wanted to generate
a new one?
```{r}
library(data.table)
DT <- data.table(x=1:5, y=6:10, gp=c('a', 'a', 'a', 'b', 'b'))
DT[, list(z=mean(x + y)), by=gp]
```
Notice that we retain one row for each unique group specified in the `by`
argument, and only the `by` variables alongside our `z` variable are returned.
## The j expression
- A `list`
... returns a new `data.table`, potentially aggregated over the groups in your `by`.
- A `:=` call
... modifies the `data.table` in place, hence saving memory. Output is
recycled within each group if the `by` argument is used.
#### In general, our `j expression` is either:
1. an expression, with the final (or only) statement being a `list` of (named)
arguments, or
2. a call to the `:=` function, with
    * the first argument being names, and
    * the second argument being an expression, for which the last statement is
    a list of the same length as the first argument.
Both forms are contrasted in the short sketch below.
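Here is that sketch, using a toy `data.table` (names are illustrative only):
```{r}
library(data.table)
DT <- data.table(x = 1:5, gp = c("a", "a", "b", "b", "b"))
# Form 1: a list -> returns a new data.table (one row per group)
DT[, list(mean_x = mean(x)), by = gp]
# Form 2: a := call -> adds a column to DT by reference (recycled within groups)
DT[, mean_x := mean(x), by = gp]
```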
## Special variables
There are a number of special variables defined only within `j`, that allow us to do some neat things...
```{r}
library(data.table)
data.table()[, ls(all=TRUE)]
```
These variables allow us to infer a bit more about what's going on within the
`data.table` calls, and also allow us to write more complicated `j expression`s.
## Special variables
### `.SD`
A `data.table` containing the subset of data for each group, excluding
columns used in `by`.
### `.BY`
A `list` containing a length 1 vector for each item in `by`.
## Special variables
### `.N`
An integer, length 1, containing the number of rows in `.SD`.
### `.I`
A vector of indices, holding the row locations in the parent `DT` from which
`.SD` was pulled.
### `.GRP`
A counter telling you which group you're working with (1st, 2nd, 3rd...)
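A quick sketch showing several of these special variables at once (toy data, for illustration only):
```{r}
library(data.table)
DT <- data.table(x = c("a", "a", "b"), y = 1:3)
# .N: rows per group; .GRP: group counter; .I: row indices of .SD within DT
DT[, list(n = .N, grp = .GRP, first_row = .I[1]), by = x]
```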
## Example usage of .N - Counts
Compute the counts, by group, using `data.table`...
```{r}
set.seed(123); library(data.table); library(microbenchmark)
DT <- data.table(x=sample(letters[1:3], 1E5, TRUE))
DT[, .N, by=x]
table(DT$x)
```
## Example usage of .N - Counts
```{r}
library(data.table)
library(microbenchmark)
DT <- data.table(x=factor(sample(letters[1:3], 1E5, TRUE)))
microbenchmark( tbl=table(DT$x), DT=DT[, .N, by=x] )
```
## Example usage of .SD - lapply-type calls
```{r}
library(data.table)
DT <- data.table(x=rnorm(10), y=rnorm(10), z=rnorm(10), id=letters[1:10])
DT[, lapply(.SD, mean), .SDcols=c('x', 'y', 'z')]
lapply(DT[,1:3, with=FALSE], mean)
```
## Example usage of .SD - lapply-type calls
```{r, cache=TRUE}
library(data.table); library(microbenchmark)
DT <- data.table(x=rnorm(1E5), y=rnorm(1E5), z=sample(letters,1E5,replace=TRUE))
DT2 <- copy(DT)
setkey(DT2, "z")
microbenchmark(
DT=DT[, lapply(.SD, mean), .SDcols=c('x', 'y'), by='z'],
DT2=DT2[, lapply(.SD, mean), .SDcols=c('x', 'y'), by='z']
)
```
Setting a key can lead to faster grouping operations.
## Keys
`data.table`s can be keyed, allowing for faster indexing and subsetting. Keys
are also used for `join`s, as we'll see later.
```{r}
library(data.table)
DT <- data.table(x=c('a', 'a', 'b', 'c', 'a'), y=rnorm(5))
setkey(DT, x)
DT['a'] ## grabs rows corresponding to 'a'
```
Note that this does a `binary search` rather than a `vector scan`, which is
much faster!
## Key performance
```{r, cache=TRUE}
library(data.table); library(microbenchmark)
DF <- data.frame(key=sample(letters, 1E6, TRUE), x=rnorm(1E6))
DT <- data.table(DF)
setkey(DT, key)
identical( DT['a']$x, DF[ DF$key == 'a', ]$x )
microbenchmark( DT=DT['a'], DF=DF[ DF$key == 'a', ], times=5 )
```
Further reading: you can set multiple keys with `setkeyv` as well.
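For instance, a minimal sketch of a multi-column key set with `setkeyv` (toy data, for illustration only):
```{r}
library(data.table)
DT <- data.table(x = c("a", "a", "b"), y = c(1L, 2L, 1L), z = rnorm(3))
setkeyv(DT, c("x", "y"))  # key on both x and y
DT[J("a", 2L)]            # binary search on the two key columns
```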
## Joins
`data.table` supports many kinds of joins, implemented through the
`merge.data.table` function and also through the `[` syntax. We'll
focus on using `merge`.
```{r}
library(data.table); library(microbenchmark)
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2)
```
## Overview of joins
Here is a quick summary of SQL joins, applicable to `data.table` too.
(Source: http://www.codeproject.com)
<img src="http://www.codeproject.com/KB/database/Visual_SQL_Joins/Visual_SQL_JOINS_orig.jpg" width="600">
## A left join
```{r}
library(data.table); library(microbenchmark)
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all.x=TRUE)
```
## A right join
```{r}
library(data.table); library(microbenchmark)
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all.y=TRUE)
```
## An outer join
```{r}
library(data.table); library(microbenchmark)
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all=TRUE) ## outer join
```
## An inner join
```{r}
library(data.table); library(microbenchmark)
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2, all=FALSE) ## inner join
```
## Speed example
```{r, cache=TRUE}
library(data.table); library(microbenchmark)
DT1 <- data.table(
x=do.call(paste, expand.grid(letters, letters, letters, letters)),
y=rnorm(26^4)
)
DT2 <- DT1[ sample(1:nrow(DT1), 1E5), ]
setnames(DT2, c('x', 'z'))
DF1 <- as.data.frame(DT1)
DF2 <- as.data.frame(DT2)
setkey(DT1, x); setkey(DT2, x)
microbenchmark( DT=merge(DT1, DT2), DF=merge.data.frame(DF1, DF2), times=5)
```
## Subset joins
We can also perform joins of two keyed `data.table`s using the `[` operator.
These subset joins behave like right joins, so that e.g.
- `DT1[DT2]`
is a right join of `DT1` with `DT2` (all rows of `DT2` are kept). These joins are typically a bit faster. Do
note that the order of columns post-merge can differ, though.
## Subset joins
```{r}
library(data.table); library(microbenchmark)
DT1 <- data.table(
x=do.call(paste, expand.grid(letters, letters, letters, letters)),
y=rnorm(26^4)
)
DT2 <- DT1[ sample(1:nrow(DT1), 1E5), ]
setnames(DT2, c('x', 'z'))
setkey(DT1, x); setkey(DT2, x)
tmp1 <- DT2[DT1]
setcolorder(tmp1, c('x', 'y', 'z'))
tmp2 <- merge(DT1, DT2, all.x=TRUE)
setcolorder(tmp2, c('x', 'y', 'z'))
identical(tmp1, tmp2)
```
## Subset joins can be faster
```{r}
library(data.table); library(microbenchmark)
DT1 <- data.table(
x=do.call(paste, expand.grid(letters, letters, letters, letters)),
y=rnorm(26^4)
)
DT2 <- DT1[ sample(1:nrow(DT1), 1E5), ]
setnames(DT2, c('x', 'z'))
setkey(DT1, x); setkey(DT2, x)
microbenchmark( bracket=DT1[DT2], merge=merge(DT1, DT2, all.y=TRUE), times=5 )
```
## Subset joins and the j-expression
More importantly, they can be combined with the j-expression, which can be very convenient.
```{r}
DT1 <- data.table(x=1:5, y=6:10, z=11:15, key="x")
DT2 <- data.table(x=2L, y=7, w=1L, key="x")
# 1) subset only essential/necessary cols
DT1[DT2, list(z)]
# 2) create a new col; i.y refers to DT2's y col
DT1[DT2, list(newcol = y > i.y)]
# 3) also assign by reference with `:=`
DT1[DT2, newcol := z-w]
```
## data.table and SQL
We can understand the usage of `[` as SQL statements.
From [FAQ 2.16](http://datatable.r-forge.r-project.org/datatable-faq.pdf):
data.table Argument | SQL Statement
---|---
i | WHERE
j | SELECT
:= | UPDATE
by | GROUP BY
i | ORDER BY (in compound syntax)
i | HAVING (in compound syntax)
Compound syntax refers to multiple subsetting calls, and generally isn't
necessary until you really feel like a `data.table` expert:
`DT[where, select|update, group by][having][order by][ ]...[ ]`
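A small sketch of the compound syntax in action (toy data, for illustration only):
```{r}
library(data.table)
DT <- data.table(x = 1:6, y = rnorm(6), gp = rep(c("a", "b"), 3))
# WHERE x > 1, SELECT sum(y) AS total GROUP BY gp, then ORDER BY total (descending)
DT[x > 1, list(total = sum(y)), by = gp][order(-total)]
```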
## data.table and SQL - Joins
Here is a quick summary table of joins in `data.table`.
SQL | data.table
---|---
LEFT JOIN | x[y]
RIGHT JOIN | y[x]
INNER JOIN | x[y, nomatch=0]
OUTER JOIN | merge(x,y,all=TRUE)
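
For example, the inner-join row of this table can be checked with a pair of toy tables (`x` and `y` below are hypothetical):
```{r}
library(data.table)
x <- data.table(k = c("a", "b", "c"), v1 = 1:3, key = "k")
y <- data.table(k = c("b", "c", "d"), v2 = 4:6, key = "k")
x[y, nomatch = 0]         # inner join via the subset syntax
merge(x, y, all = FALSE)  # the same rows via merge()
```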
## data.table and SQL
It's worth noting that I really mean it when I say that `data.table` is like
an in-memory database. It will even perform some basic query optimization!
```{r}
library(data.table)
options(datatable.verbose=TRUE)
DT <- data.table(x=1:5, y=1:5, z=1:5, a=c('a', 'a', 'b', 'b', 'c'))
DT[, lapply(.SD, mean), by=a]
options(datatable.verbose=FALSE)
```
## Some thoughts
The primary use of `data.table` is for efficient and **elegant** data manipulation including aggregation and joins.
```{r}
library(data.table); library(microbenchmark)
DT <- data.table(gp1=sample(letters, 1E6, TRUE), gp2=sample(LETTERS, 1E6, TRUE), y=rnorm(1E6))
microbenchmark( times=5,
DT=DT[, mean(y), by=list(gp1, gp2)],
DF=with(DT, tapply(y, paste(gp1, gp2), mean)))
```
Unlike "split-apply-combine" approaches such as `plyr`, data is never split in `data.table`! `data.table` applies the function to each subset recursively (in C for speed). This keeps the memory footprint low - which is very essential for "big data".
## Other interesting convenience functions
- `like`
```{r like}
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
# Use regular expressions
DT[Name %like% "^Mar"]
```
- `set*` functions
`set`, `setattr`, `setnames`, `setcolorder`, `setkey`, `setkeyv`
```{r set}
setcolorder(DT, c("Salary", "Name"))
DT
```
- `DT[, (myvar):=NULL]` removes a column (here `myvar` is a variable holding the column name)
```{r NULL}
DT[,Name:=NULL]
```
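As a small sketch of `set()`, which updates cells by reference without the `[.data.table` overhead inside loops (toy table, for illustration only):
```{r}
DT2 <- data.table(a = 1:3, b = 4:6)
for (j in seq_along(DT2)) set(DT2, i = 1L, j = j, value = 0L)
DT2  # the first row is now all zeros
```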
## Listing all tables
With `data.table` you can always list the tables you've created; `tables()` also returns basic information on these tables, including size, number of rows, keys, etc.
```{r}
tables()
```
## Bonuses: fread
`data.table` also comes with `fread`, a file reader that is much, much faster than
`read.table` or `read.csv`:
```{r}
library(data.table); library(microbenchmark)
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
microbenchmark( fread=fread(file), r.t=read.table(file, header=TRUE, sep="\t"), times=1 )
unlink(file)
```
## Bonuses: rbindlist
Use this function to `rbind` a list of `data.frame`s, `data.table`s or `list`s:
```{r}
library(data.table); library(microbenchmark)
dfs <- replicate(100, data.frame(x=rnorm(1E4), y=rnorm(1E4)), simplify=FALSE)
all.equal( rbindlist(dfs), data.table(do.call(rbind, dfs)) )
microbenchmark( DT=rbindlist(dfs), DF=do.call(rbind, dfs), times=5 )
```
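One option worth knowing about is `fill=TRUE`, which pads columns missing from some of the tables; a minimal sketch (toy tables, for illustration only):
```{r}
library(data.table)
l <- list(data.table(x = 1, y = 2), data.table(x = 3, z = 4))
rbindlist(l, fill = TRUE)  # missing columns are filled with NA
```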
## Summary
To quote Matt Dowle:
`data.table` builds on base R functionality to reduce 2 types of time:
1. programming time (easier to write, read, debug and maintain)
2. compute time
It has always been that way around, 1 before 2. The *main* benefit is the syntax: combining where, select|update and 'by' into one query without having to string along a sequence of isolated function calls. **Reduced function calls. Reduced variable name repetition. Easier to understand.**
## Learning More
- Read some of the `[data.table]` tagged questions on
[StackOverflow](http://stackoverflow.com/questions/tagged/data.table)
- Read through the [data.table FAQ](http://datatable.r-forge.r-project.org/datatable-faq.pdf),
which is surprisingly well-written and comprehensive.
- [data.table cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf)
- Experiment!
## Databases and the Structured Query Language (SQL)
- A database is an organized collection of datasets (tables).
- A database management system (DBMS) is a software system designed to allow the definition, creation, querying, update, and administration of databases.
- Well-known DBMSs include MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Oracle, etc.
- Relational DBMSs (RDBMSs) store data in a set of related tables
- Most RDBMSs use some form of the Structured Query Language (SQL)
**Why do we even need databases?**
## The Structured Query Language (SQL)
Although SQL is an ANSI (American National Standards Institute) standard, there are different flavors of the SQL language.
The data in an RDBMS is stored in database objects called tables.
A table is a collection of related data entries and consists of columns and rows.
Here we will use SQLite, which is a self-contained relational database management system. In contrast to client-server database management systems (e.g. MySQL, PostgreSQL), SQLite is not a separate process that is accessed from the client application.
## Using RSQLite
Here we will make use of the [Bioconductor](http://www.bioconductor.org) project to load and use an SQLite database.
```{r eval=FALSE}
# You only need to run this once. Install if require() fails.
source("http://bioconductor.org/biocLite.R")
require(org.Hs.eg.db) || biocLite("org.Hs.eg.db")
```
```{r load-Hs-library, cache=FALSE}
# Now we can use the org.Hs.eg.db to load a database
library(org.Hs.eg.db)
# Create a connection
Hs_con <- org.Hs.eg_dbconn()
```
## Using RSQLite
```{r}
# List tables
head(dbListTables(Hs_con))
# Or using an SQLite command (NOTE: This is specific to SQLite)
head(dbGetQuery(Hs_con, "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"))
```
## Using RSQLite
```{r fields, cache=FALSE}
# What columns are available?
dbListFields(Hs_con, "gene_info")
dbListFields(Hs_con, "alias")
# Or using SQLite
# dbGetQuery(Hs_con, "PRAGMA table_info('gene_info');")
```
## Using RSQLite
```{r select-all, cache=FALSE}
gc()
alias <- dbGetQuery(Hs_con, "SELECT * FROM alias;")
gc()
gene_info <- dbGetQuery(Hs_con, "SELECT * FROM gene_info;")
chromosomes <- dbGetQuery(Hs_con, "SELECT * FROM chromosomes;")
```
## Using RSQLite
```{r join, cache=FALSE}
CD154_df <- dbGetQuery(Hs_con, "SELECT * FROM alias a JOIN gene_info g ON g._id = a._id WHERE a.alias_symbol LIKE 'CD154';")
gc()
CD40LG_alias_df <- dbGetQuery(Hs_con, "SELECT * FROM alias a JOIN gene_info g ON g._id = a._id WHERE g.symbol LIKE 'CD40LG';")
gc()
```
## Some SQL Commands
### SELECT
The SELECT statement is used to query the database and retrieve the data that match the criteria you specify:

    SELECT column1 [, column2, ...]
    FROM tablename
    WHERE condition

### ORDER BY
The ORDER BY clause sorts the result by one or more columns, in either ascending (ASC) or descending (DESC) order.
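A hedged sketch against the `org.Hs.eg.db` connection opened earlier, using the `symbol` and `gene_name` columns that appear in the join examples:
```{r, eval=FALSE}
# SELECT with WHERE and ORDER BY on the gene_info table
dbGetQuery(Hs_con,
           "SELECT symbol, gene_name FROM gene_info
            WHERE symbol LIKE 'CD4%' ORDER BY symbol ASC LIMIT 5;")
```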
## Some SQL Commands
### JOIN
There are times when we need to collate data from two or more tables.
As with `data.table`s, we can use LEFT/RIGHT/INNER JOINs.
### GROUP BY
The GROUP BY clause was added to SQL so that aggregate functions could return results grouped by column values.

    SELECT col_name, aggregate_function(col_name) FROM table_name GROUP BY col_name
## A "GROUP BY" example
```{r join-chromosome, cache=FALSE}
dbGetQuery(Hs_con, "SELECT c.chromosome, COUNT(g.gene_name) AS count FROM chromosomes c JOIN gene_info g ON g._id = c._id WHERE c.chromosome IN (1,2,3,4,'X') GROUP BY c.chromosome ORDER BY count;")
```
## Some more SQL commands
Some other SQL statements that might be of use to you:
### CREATE TABLE
The CREATE TABLE statement is used to create a new table.
### DELETE
The DELETE command can be used to remove a record(s) from a table.
### DROP
To remove an entire table from the database use the DROP command.
### CREATE VIEW
A view is a virtual table that is a result of SQL SELECT statement. A view contains fields from one or more real tables in the database. This virtual table can then be queried as if it were a real table.
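A minimal sketch tying several of these statements together on a throwaway in-memory database (`tmp_con` below is hypothetical, created only for illustration):
```{r, eval=FALSE}
tmp_con <- dbConnect(SQLite(), ":memory:")
dbGetQuery(tmp_con, "CREATE TABLE genes (symbol TEXT, chromosome TEXT);")
dbGetQuery(tmp_con, "CREATE VIEW chrX AS SELECT * FROM genes WHERE chromosome = 'X';")
dbGetQuery(tmp_con, "DROP VIEW chrX;")
dbGetQuery(tmp_con, "DROP TABLE genes;")
dbDisconnect(tmp_con)
```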
## Creating your own SQLite database in R
```{r sqlite-ex, cache=FALSE}
db <- dbConnect(SQLite(), dbname="./Data/SDY61/SDY61.sqlite")
dbWriteTable(conn = db, name = "hai", value = "./Data/SDY61/hai_result.txt", row.names = FALSE, header = TRUE, sep="\t", overwrite=TRUE)
dbWriteTable(conn = db, name = "cohort", value = "./Data/SDY61/arm_or_cohort.txt", row.names = FALSE, header = TRUE, sep="\t", overwrite=TRUE)
```
## Creating your own SQLite database in R
```{r some-queries, cache=FALSE}
dbListFields(db, "hai")
dbListFields(db, "cohort")
```
## Creating your own SQLite database in R
```{r}
res <- dbGetQuery(db, "SELECT STUDY_TIME_COLLECTED, cohort.DESCRIPTION, MAX(VALUE_REPORTED) AS max_value FROM hai JOIN cohort ON hai.ARM_ACCESSION = cohort.ARM_ACCESSION WHERE cohort.DESCRIPTION LIKE '%TIV%' GROUP BY BIOSAMPLE_ACCESSION;")
head(res)
```
## Using data.table to perform the same operations
```{r a-la-dt}
# Read the tables using fread
hai <- fread("./Data/SDY61/hai_result.txt")
cohort <- fread("./Data/SDY61/arm_or_cohort.txt")
## Set keys before joining
setkey(hai, "ARM_ACCESSION")
setkey(cohort, "ARM_ACCESSION")
## Inner join
dt_joined <- cohort[hai, nomatch=0]
## Summarize values
head(dt_joined[DESCRIPTION %like% "TIV", list(max_value=max(VALUE_REPORTED)), by="BIOSAMPLE_ACCESSION,STUDY_TIME_COLLECTED"])
```
## The sqldf package
Sometimes it can be convenient to use SQL statements on data frames. This is exactly what the `sqldf` package does.
```{r sqldf, eval=FALSE}
library(sqldf)
data(iris)
sqldf("select * from iris limit 5")
sqldf("select count(*) from iris")
sqldf("select Species, count(*) from iris group by Species")
```
The `sqldf` package can even provide increased speed over pure R operations.
## dplyr
[dplyr](https://github.com/hadley/dplyr) is a new package which provides a set of tools for efficiently manipulating datasets in R.
- Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.
- Provide blazing fast performance for in-memory data by writing key pieces in C++.
- Use the same interface to work with data no matter where it's stored, whether in a data frame, a data table or database.
## dplyr is verbose!
`dplyr` implements the following verbs useful for data manipulation:
- `select()`: focus on a subset of variables
- `filter()`: focus on a subset of rows
- `mutate()`: add new columns
- `summarise()`: reduce each group to a smaller number of summary statistics
- `arrange()`: re-order the rows (see the short sketch of `mutate()` and `arrange()` below)
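Since `mutate()` and `arrange()` are not used with the SDY61 data below, here is a minimal sketch of both on the built-in `iris` data:
```{r}
suppressMessages(library(dplyr))
iris_tbl <- mutate(iris, Petal.Ratio = Petal.Length / Petal.Width)  # add a new column
head(arrange(iris_tbl, desc(Petal.Ratio)))                          # re-order the rows
```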
## dplyr with SQLite
```{r}
suppressMessages(library(dplyr))
my_db <- src_sqlite("./Data/SDY61/SDY61.sqlite", create = T)
hai_sql <- tbl(my_db, "hai")
```
## Select
```{r}
select(hai_sql, ARM_ACCESSION, BIOSAMPLE_ACCESSION)
```
## Filter
```{r}
filter(hai_sql, value_reported>10)
```
## Summarize
```{r}
summarize(group_by(hai_sql,SUBJECT_ACCESSION, STUDY_TIME_COLLECTED), mean(VALUE_REPORTED))
```
## dplyr on data.tables
```{r}
summarize(group_by(hai,SUBJECT_ACCESSION, STUDY_TIME_COLLECTED), mean(VALUE_REPORTED))
```
## dplyr and magrittr
dplyr can use piping operations via the [magrittr](https://github.com/smbache/magrittr) package, as follows:
```{r}
library(magrittr)
hai %>% group_by(SUBJECT_ACCESSION, STUDY_TIME_COLLECTED) %>% summarize(hai_mean=mean(VALUE_REPORTED)) %>% filter(STUDY_TIME_COLLECTED==28)
```
## Summary
- R base `data.frame`s are convenient but often not adapted to large dataset manipulation (e.g. genomics).
- Thankfully, there are good alternatives. My recommendation is:
- Use `data.table` for your day-to-day operations
- When you have many tables and a complex schema, use `sqlite`.
**Note:** There are many other R packages for "big data", such as the `bigmemory` suite, `biglm`, `ff`, `RNetCDF`, `rhdf5`, etc.