---
title: "rforest"
author: "Devon O'Rourke"
date: "5/10/2019"
output: html_document
---
## Motivation
One common goal among amplicon analyses is to determine which sequence variants associate with a metadata factor. This notebook uses Random Forest (RF) classification to identify ASVs that help discriminate among the levels of a given factor. An RF classifier can be run in several ways, and several variations are shown below.
Outputs vary depending on the type of model being executed, but in general you can expect:
1) A confusion matrix that demonstrates the precision of the model
2) A list of the ASVs important to the RF model; their overall importance to the model is provided in the k-fold cross-validated example (the last one)
3) A plot showing the importance of these ASVs relative to their metadata group
Code is adapted from a tutorial available at this address: https://github.com/LangilleLab/microbiome_helper/wiki/Random-Forest-Tutorial
## Inputs
We create a phyloseq object using:
1) A table of per-sample, per-ASV abundances (an "OTU table")
2) A metadata file
3) A taxonomy file
We use a previously generated dataset, assembled with the following code:
```{r}
require(tidyverse)
require(phyloseq) ## provides otu_table(), tax_table(), sample_data(), and phyloseq()
## ASV abundance ("OTU") table
taxatable_raw <- read_csv(file = "https://github.com/devonorourke/sus19mb/raw/master/bact_alldata_taxatable_wTax.csv")
otu <- as.data.frame(select(taxatable_raw, -X1, -taxonomy))
row.names(otu) <- taxatable_raw$X1
otu <- otu_table(otu, taxa_are_rows = TRUE)
## taxonomy table: split the semicolon-delimited string into one column per rank
taxonomy <- data.frame(taxonomy = taxatable_raw$taxonomy)
taxonomy <- separate(taxonomy, col = taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = ";")
taxonomy <- apply(taxonomy, 2, function(x) gsub("^$|^ $", NA, x)) ## empty ranks become NA
row.names(taxonomy) <- taxatable_raw$X1 ## taxa names must match the OTU table
taxonomy <- tax_table(as.matrix(taxonomy))
## metadata: convert the tibble to a data.frame first, because tibbles don't support row names
map <- read_csv(file = "https://github.com/devonorourke/sus19mb/raw/master/bact_alldata_mapfile.csv")
names(map)[1] <- 'Sample'
map2 <- data.frame(map)
row.names(map2) <- map2$Sample
map2[7:9] <- lapply(map2[7:9], factor)
meta <- sample_data(map2)
## import as phyloseq object
ps <- phyloseq(otu, taxonomy, meta)
```
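As a quick sanity check on the assembled object, we can print its dimensions with phyloseq's accessor functions:
```{r}
ps ## prints a summary of the phyloseq object
nsamples(ps) ## how many samples were imported?
ntaxa(ps) ## how many ASVs were imported?
sample_variables(ps) ## which metadata columns are available for classification?
```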
## Load libraries
```{r}
## runonce: install.packages("randomForest")
## runonce: install.packages("rfUtilities")
## runonce: install.packages("caret")
## runonce: install.packages("e1071")
## runonce: install.packages("klaR")
library("randomForest")
library("rfUtilities") # to test model significance
library("caret") # for k-fold cross validation
library(e1071) # caret package dependency
library(klaR) # caret package dependency
```
## Filtering and normalizing considerations
The tutorial's authors document multiple approaches to both taxa filtering and read-abundance normalization. Note that to use those functions you would operate on the `otu` and `meta` objects in this workflow, not on the `ps` phyloseq object. Because we imported our data as a phyloseq object, we'll use that package's built-in functions to filter and transform the data.
For simplicity we'll apply a single prevalence and abundance filter, retaining everything that passes, and we do not normalize these data in any fashion in this example. Note that this is a very harsh filter! It isn't meant to promote any specific filtering function; rather, it reduces the number of ASVs in the dataset, which lowers the computational burden for our classifier in this example.
```{r}
## remove taxa not seen more than 3 times in at least 20% of the samples
ps2 <- filter_taxa(ps, function(x) sum(x > 3) > (0.2 * length(x)), TRUE)
## sanity check: did we drop a lot of samples and/or taxa?
nsamples(ps2) ## nope
ntaxa(ps2)    ## yep.
## filter the original `otu` object to retain only those taxa that pass the filter
otu2 <- otu[taxa_names(ps2), ] ## this is the object we'll use for RF classifying
```
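Although we skip normalization entirely here, one common phyloseq-based option is converting counts to per-sample relative abundances. A minimal sketch follows; it is not run in this analysis, and the `ps2_relabund` name is just illustrative:
```{r}
## optional sketch: convert raw counts to relative abundances per sample
## we do NOT use this object downstream; the example proceeds with raw counts
ps2_relabund <- transform_sample_counts(ps2, function(x) x / sum(x))
```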
## Running the model - no permutations
Next we combine our ASV table and metadata into a single object. First, transpose the `otu` table, then append whichever metadata column you want the classifier to predict. In this example, we'll use the metadata column `$Vertposition`, the vertical position at which the samples were obtained:
```{r}
t_otu <- data.frame(t(otu2)) ## update the ASV table of interest given the filtering considerations above
t_otu$Vertposition <- map2[rownames(t_otu), "Vertposition"]
```
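Before training, it's worth checking how many samples fall into each class, since a strongly imbalanced response can bias the forest toward the majority class:
```{r}
table(t_otu$Vertposition) ## samples per class; large imbalances can bias predictions
```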
Set seed, then train the classifier:
```{r}
require(randomForest)
set.seed(151)
RF_state_classify <- randomForest(x = t_otu[, 1:(ncol(t_otu) - 1)],
                                  y = t_otu[, ncol(t_otu)],
                                  ntree = 501, importance = TRUE, proximity = TRUE)
```
Summarize the classifier. Two main points to note in the summary:
1) the OOB (out-of-bag) estimate is an indication of the prediction error of the model
2) the Confusion matrix displays where predictions were and were not correct
```{r}
RF_state_classify
```
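As a quick check that 501 trees were sufficient, we can plot the OOB error rate as trees are added, using `randomForest`'s built-in plot method; the curves should flatten well before the final tree:
```{r}
plot(RF_state_classify) ## OOB and per-class error rates versus number of trees
```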
We can plot this confusion matrix as a heatmap too:
```{r}
require(tidyverse)
cfuzmat <- data.frame(RF_state_classify$confusion)
cfuzmat$class.error <- NULL ## keep only the count columns
tmp <- sweep(cfuzmat, 1, rowSums(cfuzmat), '/') ## convert counts to row-wise fractions
## in randomForest's confusion matrix, rows are observed classes and columns are predicted classes
tmp$Observed <- row.names(tmp)
cfuzplot <- gather(data = tmp, key = Predicted, value = value, X1:X5)
cfuzplot$Predicted <- gsub("^X", "", cfuzplot$Predicted) ## drop the X that data.frame() prepends to numeric column names
ggplot(cfuzplot, aes(x = Predicted, y = Observed)) +
  geom_tile(aes(fill = value), color = "grey50") +
  scale_fill_viridis_c() +
  labs(fill = "fraction of\nsamples assigned")
```
### Which features are important?
We can see from the confusion matrix above that the classifier assigns samples to the correct group for some positions but not for others. We might want to determine which ASVs were important in building this model in the first place:
```{r}
RF_state_classify_imp <- as.data.frame( RF_state_classify$importance )
RF_state_classify_imp$features <- rownames( RF_state_classify_imp )
```
The values in the first few columns represent the decrease in the model's prediction accuracy when that particular ASV is removed from the classifier. The columns named for each level of the response variable (`$Vertposition`) used to train the classifier show this accuracy decrease per class, so a given ASV may matter much more for predicting some classes than others.
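For example, a minimal sketch of ranking ASVs by their overall contribution, using the `MeanDecreaseAccuracy` column that `randomForest` reports for classification models (the `imp_sorted` name is just illustrative):
```{r}
## order ASVs by overall mean decrease in accuracy; the top rows matter most to the model
imp_sorted <- RF_state_classify_imp[order(RF_state_classify_imp$MeanDecreaseAccuracy, decreasing = TRUE), ]
head(imp_sorted[, c("features", "MeanDecreaseAccuracy")], 10)
```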
## Running the model with permutations
We can add significance information to the RF model with permutations:
```{r}
require(rfUtilities)
RF_state_classify_sig <- rf.significance(x = RF_state_classify,
                                         xdata = t_otu[, 1:(ncol(t_otu) - 1)],
                                         nperm = 100, ntree = 501)
RF_state_classify_sig
```
Note that the model OOB error rate remains around the same 36.7% we observed in a single run.
## Running the model with "leave out" and permutations
```{r}
require(caret)
fit_control <- trainControl(method = "LOOCV") ## method is "Leave One Out Cross Validation"
RF_state_classify_loo <- train(t_otu[, 1:(ncol(t_otu) - 1)],
                               y = t_otu[, ncol(t_otu)],
                               method = "rf", ntree = 501,
                               tuneGrid = data.frame(mtry = 25), ## mtry is fixed here rather than tuned
                               trControl = fit_control)
```
To assess results:
```{r}
RF_state_classify_loo$results
```
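The fixed `mtry=25` above is arbitrary. If you'd rather let caret choose, here is a sketch of tuning over a small grid; the candidate values are illustrative assumptions, and note that LOOCV over a grid is slow on larger datasets:
```{r}
## illustrative: compare several candidate mtry values and keep the best by accuracy
tune_grid <- expand.grid(mtry = c(10, 25, 50, 100))
RF_loo_tuned <- train(t_otu[, 1:(ncol(t_otu) - 1)],
                      y = t_otu[, ncol(t_otu)],
                      method = "rf", ntree = 501,
                      tuneGrid = tune_grid,
                      trControl = fit_control)
RF_loo_tuned$bestTune ## the mtry value with the best cross-validated accuracy
```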
## Running the model with repeated k-fold cross-validation
This method splits the dataset into k folds: in each iteration, one fold is held out to test the classifier while the remaining folds train the model, so every sample is used for testing exactly once per repeat. With `repeatedcv` the entire k-fold procedure is then repeated several times, reshuffling the data each time so that the train/test splits differ between repeats.
```{r}
require(caret)
# define the training control: 5-fold cross-validation, repeated 10 times
train_control <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
# train the model using the Vertposition response column
model <- train(Vertposition ~ ., data = t_otu, trControl = train_control, method = "rf", importance = TRUE)
# summarize results
print(model)
```
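caret also stores the held-out performance of each individual resample in `model$resample`, which shows how stable accuracy is across the 50 resamples (5 folds x 10 repeats):
```{r}
head(model$resample) ## accuracy and kappa for each individual fold/repeat
summary(model$resample$Accuracy) ## spread of held-out accuracy across all resamples
```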
Obtain the "importance" of each ASV to the model:
```{r}
model_imp <- data.frame(varImp(model)$importance)
model_imp$taxa <- row.names(model_imp)
mi_plotdat <- gather(data = model_imp,
                     key = Vertposition, ## the per-class importance columns
                     value = Importance,
                     X1:X5)
mi_plotdat$Vertposition <- gsub("^X", "", mi_plotdat$Vertposition) ## drop the X prefix added by data.frame()
```
Plot the Importance values associated with each ASV per metadata group.
```{r}
ggplot(mi_plotdat, aes(x = Vertposition, y = Importance, label = taxa)) +
  geom_point() +
  theme_bw()
```
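If you want to see which points correspond to which ASVs, one option is to label only the highest-scoring ones. Here's a sketch assuming the `ggrepel` package is available; the importance cutoff of 50 is an arbitrary illustration:
```{r}
## runonce: install.packages("ggrepel")
library(ggrepel)
top_plotdat <- subset(mi_plotdat, Importance > 50) ## arbitrary cutoff, for illustration only
ggplot(mi_plotdat, aes(x = Vertposition, y = Importance)) +
  geom_point() +
  geom_text_repel(data = top_plotdat, aes(label = taxa), size = 2) +
  theme_bw()
```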
## Misc
More resources for classifiers:
http://topepo.github.io/caret/pre-processing.html
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
https://machinelearningmastery.com/k-fold-cross-validation/
https://cran.r-project.org/web/packages/caret/caret.pdf
https://github.com/LangilleLab/microbiome_helper/wiki/Random-Forest-Tutorial