Raphael Gottardo
February 2, 2014
Let's first turn on the cache for increased performance and improved styling
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
Before we start, you should read the following papers:
- Leek, J. T. & Storey, J. D. Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLoS Genet 3, e161 (2007).
- Leek, J. T. & Storey, J. D. A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. U.S.A. 105, 18718–18723 (2008).
- Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733-739 (2010).
- Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118-127 (2007).
- Gagnon-Bartsch, J. A. & Speed, T. P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539-552 (2012).
Batch effects are technical sources of variation that have been added to the samples during handling. Example of batch variables: lot number, technician, instrument settings, etc.
If not adjusted for, these batch variables can have a substantial effects on downstream analysis
Unfortunately, normalization will not always correct for batch effects. Technical variation due to batch effects might only affect a subset of the genes.
"For a published bladder cancer microarray data set obtained using an Affymetrix platform, we obtained the raw data for only the normal samples. Here, green and orange represent two different processing dates."
Two scenarios:
- You have information about the batch variable
- Use your batch effect as a covariate in your analysis (e.g. limma)
- You suspect a batch effect, but you don't know where it is coming from
- The batch effect needs to be estimated first and then corrected for, by adding the estimated variables as co-variates
Let
- U is the matrix of left singular vectors (eigenassays)
- V is the matrix of right singular vectors (eigengenes)
- S is the matrix of singular values (eigenvalues)
What happens if the rows of X are scaled?
Let
Suppose there are
and if factor
In practice it is impossible to estimate
- Detect unmodeled factors
- Construct surrogate variables
The main idea is as follows:
- Compute the residual matrix
$r_{ij} = x_{ij}- \hat{\mu}_i - \hat{f}_i(y_j)$ - Perform the SVD of
$R=(r_{ij})$ - Permute the rows of the matrix
$R$ to obtain $R^$. Regress $r^{ij} = x{ij}- \hat{\mu}_i - \hat{f}_i(y_j)$ to get residual matrix$R_0$ , and perform the SVD of$R_0$ . Repeat this many times to generate a null distribution for the residuals, given that$y$ is accounted for. - Compare the observed eigenvalues to those generated from the null distribution to obtain significance p-values
- Record the
$K$ significant variables
- Compute the residual matrix
$r_{ij} = x_{ij} - \hat{\mu}_i - \hat{f}_i(y_j)$ - Perform the SVD of
$R=(r_{ij})$
Let
- Regress
$e_k$ on the$x_i$ to access the significance of the$k$ -th factor on gene$i$ - Use the selected genes to form a reduced expression matrix and repeat 1. The estimated factor will form the basis for the surrogate variables
- In any subsequent analysis include these factors in your model
We're going to look at the dataset used in:
Nakaya, H. I., Wrammert, J., Lee, E. K., Racioppi, L., Marie-Kunze, S., Haining, W. N., et al. (2011). Systems biology of vaccination for seasonal influenza in humans. Nature Immunology, 12(8), 786795. doi:10.1038/ni.2067
library(GEOquery)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist
##
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Setting options('download.file.method.GEOquery'='auto')
# Download the mapping information and processed data
gds <- getGEO("GSE29619", destdir = "Data/GEO/")
## ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE29nnn/GSE29619/matrix/
## Found 3 file(s)
## GSE29619-GPL13158_series_matrix.txt.gz
## Using locally cached version: Data/GEO//GSE29619-GPL13158_series_matrix.txt.gz
## Using locally cached version of GPL13158 found here:
## Data/GEO//GPL13158.soft
## GSE29619-GPL3921_series_matrix.txt.gz
## Using locally cached version: Data/GEO//GSE29619-GPL3921_series_matrix.txt.gz
## Using locally cached version of GPL3921 found here:
## Data/GEO//GPL3921.soft
## GSE29619-GPL570_series_matrix.txt.gz
## Using locally cached version: Data/GEO//GSE29619-GPL570_series_matrix.txt.gz
## Using locally cached version of GPL570 found here:
## Data/GEO//GPL570.soft
# main serie #gds[[1]] = LAIV/TIV 0809, gds[[2]] = FACS,
# gds[[3]] = TIV 0708
but before we can use this, we need to clean up the pData a bit (see code in .Rpres file).
Let us estimate the (surrogate) factors as follows:
library(sva)
## Loading required package: corpcor
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-4. For overview type 'help("mgcv-package")'.
TIV_08 <- gds_new[[1]][, grepl("2008-TIV", pData(gds_new[[1]])$description)]
mm_TIV_08 <- model.matrix(~ptid + time, TIV_08)
mm0_TIV_08 <- model.matrix(~ptid, TIV_08)
# Estimate the surrogate variables
sv_TIV_08 <- sva(exprs(TIV_08), mm_TIV_08, mod0 = mm0_TIV_08)
## Number of significant surrogate variables is: 12
## Iteration (out of 5 ):1 2 3 4 5
Then we can use these variables in limma, as follows:
library(limma)
##
## Attaching package: 'limma'
##
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
# Add the surrogate variables to the design matrix
modSv <- cbind(mm_TIV_08, sv_TIV_08$sv)
# Use the new design matrix
fit_TIV_08 <- lmFit(TIV_08, modSv)
ebay_TIV_08 <- eBayes(fit_TIV_08)
topT7_sv <- topTable(ebay_TIV_08, coef = "timeD7", number = Inf)
# Compare to the old analysis
fit_TIV_08 <- lmFit(TIV_08, mm_TIV_08)
ebay_TIV_08 <- eBayes(fit_TIV_08)
topT7 <- topTable(ebay_TIV_08, coef = "timeD7", number = Inf)
The result from the adjusted analysis:
topT7_sv[1:10, c("ID", "Gene.Symbol")]
## ID
## 211430_PM_s_at 211430_PM_s_at
## 232991_PM_at 232991_PM_at
## 219276_PM_x_at 219276_PM_x_at
## 239401_PM_at 239401_PM_at
## 239637_PM_at 239637_PM_at
## 227999_PM_at 227999_PM_at
## 226481_PM_at 226481_PM_at
## 216801_PM_at 216801_PM_at
## 241402_PM_at 241402_PM_at
## 235180_PM_at 235180_PM_at
## Gene.Symbol
## 211430_PM_s_at IGH@ /// IGHG1 /// IGHM /// IGHV4-31 /// LOC100290146
## 232991_PM_at
## 219276_PM_x_at C9orf82
## 239401_PM_at
## 239637_PM_at
## 227999_PM_at PWWP2B
## 226481_PM_at VPRBP
## 216801_PM_at
## 241402_PM_at TSEN54
## 235180_PM_at STYX
The result from the un-adjusted analysis:
topT7[1:10, c("ID", "Gene.Symbol")]
## ID
## 211430_PM_s_at 211430_PM_s_at
## 241824_PM_at 241824_PM_at
## 216576_PM_x_at 216576_PM_x_at
## 1559018_PM_at 1559018_PM_at
## 216207_PM_x_at 216207_PM_x_at
## 214669_PM_x_at 214669_PM_x_at
## 1568768_PM_s_at 1568768_PM_s_at
## 215946_PM_x_at 215946_PM_x_at
## 211645_PM_x_at 211645_PM_x_at
## 217157_PM_x_at 217157_PM_x_at
## Gene.Symbol
## 211430_PM_s_at IGH@ /// IGHG1 /// IGHM /// IGHV4-31 /// LOC100290146
## 241824_PM_at
## 216576_PM_x_at IGK@ /// IGKC /// LOC652493 /// LOC652694
## 1559018_PM_at PTPRE
## 216207_PM_x_at IGKV1D-13
## 214669_PM_x_at IGK@ /// IGKC /// IGKV3-20 /// LOC100291682
## 1568768_PM_s_at LOC100302650
## 215946_PM_x_at IGLL1 /// IGLL3 /// LOC91316
## 211645_PM_x_at
## 217157_PM_x_at IGK@ /// IGKC /// LOC652493
What do you think?