-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathSTAMPS_Intro2R_OptionA.R
427 lines (365 loc) · 19.2 KB
/
STAMPS_Intro2R_OptionA.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
##########################################
#
# Introduction to R
# Option A: for relatively new users
# A tutorial for STAMPS 2017
#
#########################################
#
# This tutorial is written by Amy Willis, July 2017
# It is based off an original tutorial by Sue Huse, 2011-16
# Mike Lee (AstrobioMike) contributed some fantastic explanations!
#
#########################################
#
# If you downloaded this from the STAMPS website,
# be sure to rename as "STAMPS_Intro2R_OptionA.R",
# R and R Studio will not interpret is as an R script with ".txt" suffix
###############################################
#
# Set your working directory
#
###############################################
# get the directory where you are currently reading and saving files
getwd()
# Set your working directory to wherever you want
# replace "MyProjectDirectory" with a valid location
setwd("~/Documents/MyProjectDirectory")
getwd() # check to see where it ended up, just in case
# "getwd" is a function, which we know because it is followed by
# some parentheses. We can also see this by typing in
getwd
# R tells us that this is an internally defined function
###############################################
#
# Read in some data and practice making selections
#
###############################################
#
# Read in a tab-delimited text file
#
# The following command reads in the specified file – which as entered here needs
# to be located in our current working directory – and stores it in the variable 'covariates'.
# With additional arguments we tell R that the first line is a header with
# names for each column (header=TRUE), that the columns are separated by
# tabs (sep="\t"), and that we don't want it to change text to factors (as.is=TRUE).
# Don't worry if you aren't sure what a 'factor' is just yet!
covariates <- read.table("FWS_covariates.txt", header=TRUE, sep="\t", as.is=TRUE)
# Typing "a <- 5" assigns value 5 to the variable 'a', just as the
# above command assigns the table read from the filename to the variable 'covariates'.
# Everytime you import data, look at it before you use it.
# does it look like it came in correctly?
covariates
# If it is too large to view all at once, you can use the head() function to view
# just the first few rows:
head(covariates)
# R stores things in objects called, well, 'objects'. An object
# is just a data structure that has certain properties that are defined by
# what 'class' the object is.
# Different object classes have different properties.
# The class() function returns the name of the object's class
class(covariates) # what type of object is it?
colnames(covariates) # what are the names of the columns / fields?
#
# We are going to use the data frame covariates for a bit while we practice selection.
# Let's look at the variable Month
Month
# That doesn't work! That's because Month only exists in the context of
# covariates. Here is how we access Month
covariates$Month
# If we want to make Month available outside of covariates, we need the following command
attach(covariates)
Month
# Beware! You can only use attach on one dataframe at a time.
# You should know of the existence of attach(), but I (Amy) recommend you avoid using it.
# It complicates working with multiple datatypes, and if you change
# a value in the named variable it does not change the data frame.
# However, this is my opinion, so other educators will tell you differently
###############################################
#
# Experiment with retrieving data based on its indices
#
###############################################
# Single index selection
# Selecting a single row or column of a dataframe or matrix returns a vector.
SampleName # so now I only have to type the column name to return the data, because we 'attached' the dataframe
SampleName[1] # the first index value in the vector
SampleName[4:9] # all indices from 4:9
SampleName[c(2,4,6)] # a set of non-consecutive indices – the 'c()' function 'combines' the provided arguments
SampleName[c(6,4,2)] # returns the values in the order selected
SampleName[-5] # all but the fifth value
# Dual index selection works the same for dataframes and matrices as with vectors like above,
# but you need two indices: rows and columns.
# In R, the row index is specified before the comma, and the column index is specified after ['row','column'].
# A blank selection of rows or columns selects all.
# Let's look at the covariates data frame again and then use dual indexing to select from it.
covariates
covariates[1:3, c(1,3,5)] # taking rows 1 through 3, and columns 1, 3, and 5
covariates[ ,c(1,3,5)] # since this leaves the 'rows' index blank, it selects all rows
covariates[ , c("Month", "Season")] # you can select columns by selecting their names
# Often we will want to select certain rows based on their value in a specific column. R's indexing
# is very useful for such things! Say we wanted all the rows that were from a specific location:
covariates[ Location == "MBL", ] # this pulls all rows that have "MBL" in their "Location" column
# note the "==" above, this is a logical operator that asks "does this thing equal this other thing?"
covariates[ Type != "Reservoir", ] # "!=" is the logical operator that asks "does this thing *not* equal this other thing?"
# common logical operators: ==, !=, >, >=, <, <=
covariates[ Month %in% c("Jan", "Feb"), ] # %in% specifies a vector of possibilities
# Get the set of unique items from a vector
unique(Season) # what seasons are represented in the covariate data
# Explore how many of each covariate type you have
table(Location)
table(Type)
table(Season)
###############################################
#
# Read in a taxa abundance matrix and get it ready to go
#
###############################################
# !! Remember, if you are running locally and the data are in your
# !! working directory, run this without the directory preceding the file name.
#abundances <- read.table("/class/stamps-shared/R_tutorial/FWS_OTUs.txt", header=TRUE, row.names = 1, sep="\t", as.is=TRUE)
abundances <- read.table("FWS_OTUs.txt", header=TRUE, row.names = 1, sep="\t", as.is=TRUE)
# This time we added an additional argument to read.table(). The "row.names=1" argument
# tells R that the first column contains the names of the rows.
# !! Always LOOK at your imported data before you use it.
class(abundances)
rownames(abundances) # check the names of the rows, what would happen here if we didn't set "row.names=1" when reading in the table?
colnames(abundances) # check the names of the columns
head(abundances) # quick check -- does it look okay?
# That was hard to look at. If not all columns can be displayed,
# it will repeat the rownames (taxa) for each set of columns.
# Look at the sample names in the covariate data file (remember "attach")
SampleName # Notice that the sample names in the covariates file are in the same order as the column names!
SampleName == colnames(abundances) # they are the same!
# comparing two vectors with "==" like this pits each element from the first vector
# against the corresponding element in the second vector; which is why we get back
# "TRUE" 12 times.
# Lets look at the abundances from the first sample ("JPA_Jan")
head(sort(abundances[,1])) # what kind of abundances do we see?
# All zeros?
# what if I want to sort decreasing and see the abundant taxa?
# Get help on the syntax also help(sort)
?sort
# by default, sort() puts things in increasing order, we can change that by adding an argument:
sort(abundances[,1], decreasing=TRUE) # Rank abundance for the data, but it includes 0's
sort(abundances[abundances[,1]>0,1], decreasing=TRUE) # Proper rank abundance without 0's
# Spend a moment deciphering how this command was put together. If you're not familiar with R,
# it may seem a bit confusing at first.
# Do you understand all of the pieces? If you're unsure, call over a friendly TA to chat
# about it! It's important to grasp as this is part of what makes indexing in R so wonderful :)
# Exercise: How many times was the most abundant OTU
# observed in the second sample?
# STOP HERE and test yourself in finding this.
# Answer: 3150
# Save the rank abundance for the first sample
myRankAbund <- sort(abundances[abundances[,1]>0,1], decreasing=TRUE)
# note that nothing was printed out here, because variable assignment
# is done silently.
head(myRankAbund) # always good to check that this worked as intended!
# Species abundances instead of rank abundances are important in many problems.
# table() provides the frequencies of each value (count)
table(abundances[abundances[,1] > 0 ,1]) # here is excluding those with 0 again
mySpeciesAbund <- table(abundances[,1]) # but we'll store it counting 0s so we know how many of them there are
mySpeciesAbund
#
# Create a relative abundance matrix from the abundance matrix
#
# To scale the values in each column by dividing the values by the column total,
# we can use the scale() function:
taxaRel <- scale(abundances, center=FALSE, scale=colSums(abundances))
# Check that the columns now all sum to 1 (=100%)
colSums(taxaRel) # colSums finds the sum of each column
#
# Lets add a new column, Total, that sums the abundances for each genus
#
# what is the dimension (size) of the table, how many rows and columns?
dim(taxaRel)
# Can we add a new column for the total abundance of each row,
# by assigning data to the next column index?
taxaRel[,13] <- rowSums(taxaRel)
#!! Why did that just fail?!?!
class(taxaRel) # it's a matrix now?
# R handles matrices and data frames differently, and the scale() function we used output a matrix.
# If something you think should work isn't working, checking the class of what you're working with is a good idea.
taxaRel <- as.data.frame(taxaRel) # convert it to a data frame, less efficient, but easier to use in tutorials
class(taxaRel) # double check
# Lets try that again
taxaRel[,13] <- rowSums(taxaRel) # Add a new column that is the total abundance of each row
colnames(taxaRel) # Check your column names
colnames(taxaRel)[13] <- "Total" # Fix the new column name
colnames(taxaRel) # Check again
head(taxaRel$Total)
# Which are the abundant taxa across all samples?
# select the names of rows whose total is > or = to 1%
rownames(taxaRel[taxaRel$Total >= 0.01, ])
#
# Let's practice sorting, then use only the 10 most abundant taxa
# because it makes the demo easier
#
# Sort based on decreasing Total, and return the values
sort(taxaRel$Total, decreasing=TRUE)
# Sort based on decreasing Total, but returns the indices
order(taxaRel$Total, decreasing=TRUE)
# Use the ordered indices to sort the data.frame,
# and populate a new data frame with those values:
sortedTaxaRel <- taxaRel[ order(taxaRel$Total, decreasing=TRUE) , ]
# Again, take a second to break the command down and make sure it's clear to you what
# it's doing and why.
# Take the first 10 rows (most abundant taxa) and all of the columns (samples)
topTen <- sortedTaxaRel[1:10,]
topTen
#
# Labeling with the full taxonomic names is rough, let's just get the genus names
#
# We'll substitute everything up to the last semicolon with nothing, leaving only the genus name.
# And we'll do this using the sub() function and a "regular expression", don't worry about it if
# you don't completely understand it right now:
# The function is called in this manner:
# sub(old_pattern, new_pattern, text, use perl regular expressions)
# We want to replace everything up to the last semicolon with nothing.
# Our 'old_pattern' will be: "^.*;"
# and our 'new_pattern will be: ""
# There are 3 special characters in the 'old_pattern':
# ^ starts the string at the beginning of the line,
# . means any character,
# * means whatever character precedes it any number of times
topGenera <- sub("^.*;", "", rownames(topTen), perl=TRUE)
topGenera # review the names --- oops we have a few "genus_NA", not helpful
which(topGenera == "genus_NA") # this gives us the indices of the genus_NA's
rownames(topTen)[7] # what are the full taxa names for these rows?
# seems our taxonomy annotation for this OTU wasn't confident beyond the Kingdom level
topGenera[7] <- "Bacteria" # fix the genus_NA, by replacing with what we can say about it
topGenera # Just checking...
####################################
#
# PLOTTING -- Practice a few types of graphs
#
####################################
# simple bar chart of the most abundant taxon:
barplot(topTen[1, ], main=topGenera[1], xlab="Sample", ylab="Relative Abundance")
# STAY CALM!
# Data frames can include characters as well as numbers,
# so many functions require matrix input to be sure.
# What is topTen again?
class(topTen)
# Okay it is a data frame and we need it to be a matrix.
topTen <- as.matrix(topTen)
class(topTen)
# Let's try again
barplot(topTen[1, ], main=topGenera[1], xlab="Sample", ylab="Relative Abundance")
# phooey, sample names don't fit, we better rotate them
barplot(topTen[1, ], main=topGenera[1], ylab="Relative Abundance", las=2)
# the various plotting parameters (main, ylab, las) are explained in
# par -- graphical PARameters
help(par)
# Remove the "Total" column, it really doesn't belong in our data, now that we have used it to sort
# dim() returns c(row_count, column_count),
dim(topTen)
topTen <- topTen[ , 1:dim(topTen)[2] - 1 ] # include all rows, and all but the last column = column count - 1
# Remember from the indexing examples above that we can specify a range of numbers by putting a colon between them.
# That is all we're doing here for choosing which columns to include except we're doing so using the dim() function, and
# then indexing the second element of what that returns, and then subtracting one. All this to give us the number 12 in this case.
# This may seem a bit convoluted, but it is actually more efficient in the long run. If the size of this table changes,
# but the format is the same, we can use this code still without having to change anything.
# rerun the plot without the Totals column
barplot(topTen[1, ], main=topGenera[1], ylab="Relative Abundance", las=2)
# # Practice an xy and a line plot, default type = "point"
# plot(topTen[2, ], main=topGenera[2], xlab="Sample", ylab="Relative Abundance")
# plot(topTen[2, ], main=topGenera[2], xlab="Sample", ylab="Relative Abundance")
plot(myRankAbund, main="Rank Abundance", xlab="Rank", ylab="Taxon Abundance")
# Yep, sure looks like the standard microbial distribution curve.
#
# Box plots use the quartiles and outliers, so we want the taxa to be the x-axis and the samples to be the spread
boxplot(topTen, cex.axis=0.75, main="Ten Most Common Taxa", xlab=NULL, ylab="Relative Abundance", las=2)
# That wasn't right, we wanted the taxa on the X axis, not the samples
# we will have to rotate the data and try again
# t() will transpose (rotate) the matrix for us
boxplot(t(topTen), names=topGenera, cex.axis=0.75, main="Ten Most Common Taxa", xlab=NULL, ylab="Relative Abundance", las=2)
#
# Heatmaps give a sense of how the data behave across samples and taxa
#
heatmap(topTen, main="Common Taxa", labRow=topGenera)
heatmap(topTen, main="Common Taxa", labRow=topGenera, labCol=Location)
heatmap(topTen, main="Common Taxa", labRow=topGenera, labCol=Season)
heatmap(topTen, main="Common Taxa", labRow=topGenera, labCol=Season, col=terrain.colors(128))
heatmap(topTen, main="Common Taxa", labRow=topGenera, labCol=Season, col=topo.colors(255))
pdf("heatmap_top10.pdf") # Send all subsequent plot output to a pdf file
heatmap(topTen, main="Common Taxa", labRow=topGenera)
dev.off() # Close the pdf file, return to using the plot window.
# If you try to open the pdf outside of R and it fails,
# come back and be sure you closed it with dev.off()
####################################
#
# Look for patterns by season
#
####################################
# Lets look at that heat map again, any interesting patterns?
heatmap(topTen, main="Common Taxa", labRow=topGenera, labCol=Seasons)
# Oops! 'Seasons' not found? Why is that?
colnames(covariates)
# Oh, we misspelled it.
heatmap(topTen, main="Common Taxa", labRow=topGenera, labCol=Season)
# Burkholderia, Curvibacter, Undibacterium, Pelomonas look by winter, sorta
topGenera # What is the order of the genera again? what is the index for Burkholderia
t.test(topTen[2,Season == "Winter"], topTen[2,Season=="Summer"]) #Burkholderia appear different by season, p-value <<0.05
t.test(topTen[1,Season == "Winter"], topTen[1,Season=="Summer"]) #Ralstonia appear different by season, p-value <<0.05
t.test(topTen[9,Season == "Winter"], topTen[9,Season=="Summer"]) #Not so with Midichloria, p>>0.05
# There is one other one that is significant, if you want to test that one too.
# Extra credit, perform a one-tailed t-test of Burkholderia, using the help
# Disclaimer: Amy does not endorse t-tests performed in this manner,
# because we did not set out in advance to investigate these taxa
####################################
#
# Statistical summaries
#
####################################
# We just did some sketchy hypothesis tests. Instead, let's calculate some
# sample means and standard deviations of the
# Ralstonia abundances
topTen[1,Season == "Winter"] # the abundances themselves
mean(topTen[1,Season == "Winter"]) # the sample mean of the abundances
sd(topTen[1,Season == "Winter"]) # the sample standard deviation of the abundances
var(topTen[1,Season == "Winter"]) # the sample variance of the abundances
# Let's investigate a basic model for Ralstonia abundances
# First create the abundance vector
ral_abundances <- topTen[1, ]
ral_model <- lm(ral_abundances ~ Season)
ral_model
# We just fit a linear model to abundances based on season
# The fitted model is
# Ralstonia abundance = 0.0683 in Summer
# Ralstonia abundance = 0.0683 + 0.2606 in Winter
# To get more details about the standard errors of these estimates:
summary(ral_model)$coef
####################################
#
# Save your work!
#
####################################
# this saves all of the objects (functions and variables)
# that you created as a file in your current working directory
save.image("optionA-objects.Rdata")
# next time, when you want to load these objects:
load("optionA-objects.Rdata")
# you can see how this works by first removing an object
rm(abundances) # remove this matrix
abundances # it no longer exists
load("optionA-objects.Rdata") # reload the saved data
abundances # look! it exists again
####################################
#
# Great work! You just saw some key concepts in R!
# Here are some exercises to practice with before you move on to the next tutorial
#
####################################
# 1. Pick your favourite Genus out of topGenera (this exercise is not a R exercise)
# 2. Construct a vector giving the abundances of this genus in the samples
# 3. Pick your favourite covariate
# 4. Make an effective plot to investigate if the abundance of this genus changes with the covariate
# 5. Calculate the mean and standard deviation of the abundance across the covariate gradient
# 6. Construct a linear model for the abundance of this genus with your covariate
# If you can complete these exercises you've achieved all 5 of the learning objectives
# that we set out to do!
# (reading in data, creating summaries, fitting basic models,
# using data types, plotting)
# Congratulate yourself and take a break before moving on to Lab B