-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAdditonal_File_2.Rmd
384 lines (307 loc) · 13.5 KB
/
Additonal_File_2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
---
title: >
Using GeDi on the T cell dataset (GSE130842)
author:
- name: Annekathrin Silvia Nedwed
affiliation:
- &id1 Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz<br>
email: [email protected]
- name: Arsenij Ustanzew
affiliation:
- *id1
- name: Najla Abassi
affiliation:
- *id1
- name: Shuqing Xu
affiliation:
- &id3 Institute for Organismic and Molecular Evolution (iomE), Mainz<br>
- name: Federico Marini
affiliation:
- *id1
- &id2 Center for Thrombosis and Hemostasis (CTH), Mainz<br>
- name: Konstantin Strauch
affiliation:
- *id1
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('GeDi')`"
output:
bookdown::html_document2:
toc: true
toc_float: true
theme: cosmo
code_folding: show
code_download: true
editor_options:
chunk_output_type: console
link-citations: true
bibliography: "gedi_supplement.bib"
---
```{r setup, include=FALSE, cache=FALSE, eval = TRUE, echo = FALSE}
library("knitr")
opts_chunk$set(
fig.align = "center",
fig.show = "asis",
eval = TRUE,
fig.width = 10,
fig.height = 7,
tidy = FALSE,
message = FALSE,
warning = FALSE,
size = "small",
comment = "##",
echo = TRUE,
results = "markup"
)
options(replace.assign = TRUE, width = 80)
```
# About the data
The data illustrated in this document is an RNA-seq dataset, available at the Gene Expression Omnibus under the accession code [GSE130842](https://www.ncbi.xyz/geo/query/acc.cgi?acc=GSE130842).
The data represents a mouse data set of 32 different samples across different tissues and conditions. The data is part of a manuscript to analyse different tissue regulatory T cell populations [@Delacher2020] - the manuscript is available on [Pubmed] (https://pubmed.ncbi.nlm.nih.gov/31924477/).
# Loading required packages
We load the packages required to perform all the analytic steps presented in this document.
```{r loadLibraries, results='hide'}
library("DESeq2")
library("topGO")
library("pcaExplorer")
library("ideal")
library("GeneTonic")
library("apeglm")
library("dplyr")
library("org.Mm.eg.db")
library("GeDi")
library("visNetwork")
```
# Data processing
In this example we analyse data available on the Gene Expression Omnibus under
accession number [GSE130842](https://www.ncbi.xyz/geo/query/acc.cgi?acc=GSE130842).
From the available data, we downloaded the
GSE130842_Count_table_Delacher_et_al_2019.xlsx Excel file, which is also
available in the [Github](https://github.com/AnnekathrinSilvia/manuscript_GeDi)
repository to this document.
We will preprocess the data for its use in `r BiocStyle::Biocpkg("Gedi")`
according to the workflow described in the package vignette of
`r BiocStyle::Biocpkg("Gedi")`. The package vignette is available through using
the command `browseVignette(GeDi)` after successfully installing the package.
In the first step of the preprocessing, we will generate a `DESeqDataset`
[@Love2014] from the count table available in the Excel file. We will further
preprocess this object, called `dds_tregs`, to prepare the data for its use in
`GeDi`. We will also read in some metadata that we have set up for the dataset.
This file will also be available in the [repository](https://github.com/AnnekathrinSilvia/manuscript_GeDi)
as well as the final generated `dds_tregs`.
```{r create_dds}
# We read in the data form the Excel file
count_df <- readxl::read_excel("GSE130842_Count_table_Delacher_et_al_2019.xlsx")
# We transform the data into a matrix and set the gene ids as the rownames of
# the final matrix
count_matrix <- as.matrix(count_df[, -1])
rownames(count_matrix) <- count_df$ID
# We read in the metadata
coldata <- readxl::read_excel("GSE130842_metadata.xlsx")
coldata
# We build up the DESeqDataset object from the count data and the metadata
# As design we will choose the group of the data which indicates the tissue
# of origin as well as t5he type of Tcells in the sample
dds_tregs <- DESeqDataSetFromMatrix(
countData = count_matrix,
colData = coldata,
design = ~group
)
# We transform the columnnames of the dds_tregs to include the replicate number
# of each sample
colnames(dds_tregs) <- paste0(coldata$group, "_r", coldata$rep_nr)
# Lastly we can have a look at our DESeqDataset object
dds_tregs
```
Now we can see that we have set up a `DESeqDataset` object on all the available
samples. However, in this analysis, we will only focus on a subset of samples as
shown in Figure 2 of the original manuscript [@Delacher2020]. Hence, we will
subset our `dds_tregs` object to the groups used in the figure.
```{r subsetDDS}
dds_tregs <- dds_tregs[, dds_tregs$group %in%
c("KLRGminusNFIL3minusTreg" ,
"KLRGminusNFIL3plusTreg",
"KLRGplusNFIL3plusTreg",
"tisTregST2_BM",
"tisTregST2_Fat",
"tisTregST2_Liver",
"tisTregST2_Lung",
"tisTregST2_Skin"
)]
dds_tregs$group <- droplevels(dds_tregs$group)
design(dds_tregs) <- ~group
```
## Exploratory data analysis
In a first analysis step, we do an exploratory data analysis as described in
[@Ludt2022] using the `r BiocStyle::Biocpkg("pcaExplorer")` package [@Marini2019].
We will first apply a variance-stabilizing transformation before we plot a PCA
and a sample-to-sample distance heatmap.
```{r vst_tranformation}
# Apply the vst transformation
vst_tregs <- vst(dds_tregs)
# Plot the PCA
pcaExplorer::pcaplot(vst_tregs,
intgroup = "group",
ntop = 1000,
title = "PCA plot - top 1000 most variable genes",
ellipse = FALSE,
text_labels = FALSE
)
# Plot a sampl-to-sample distance heatmap
pheatmap::pheatmap(as.matrix(dist(t(assay(vst_tregs)))))
```
## Differential expression analysis
After we have done some exploratory data analysis, we can proceed with the
differential expression analysis. We use the `r BiocStyle::Biocpkg("DESeq2")`
package [@Love2014] for this. The False Discovery Rate is set to 5%.
Before, we run the `r BiocStyle::Biocpkg("DESeq2")` analysis, we first match the
gene ids to gene names using the `r BiocStyle::Biocpkg("pcaExplorer")` package
[@Marini2019]. With this we can add the gene names to the results, which are
usually better known than gene ids. The gene names will also be later used
as "Genes" column in the input for `r BiocStyle::Biocpkg("GeDi")`.
```{r de-tregs1, cache=TRUE}
# Create an annodation data frame mapping gene ids to gene names.
anno_df <- pcaExplorer::get_annotation_orgdb(dds_tregs,"org.Mm.eg.db","ENSEMBL")
# Assign a new column SWYMBOL to the dds_tregs object, which will be later used
# as "Genes" column in the input for GeDi
rowData(dds_tregs)$SYMBOL <- anno_df$gene_name[match(rownames(dds_tregs),
anno_df$gene_id)]
# Set the false discovery rate to 5%
FDR <- 0.05
# Perform the differential gene expression analysis
dds_tregs <- DESeq(dds_tregs)
```
After we performed the analysis, we extract the results for the condition
`KLRGplusNFIL3plusTreg vs KLRGminusNFIL3minusTreg` as we only want to compare
these two groups. Afterwards we print a summary overview of the previously
extracted results.
We also use the `r BiocStyle::Biocpkg("ideal")` package [@Marini2020] to plot an
MA-plot of the results.
In a last step, we add the gene symbols to the resulting `DataFrame` which will
later serve as our "Genes" column in the input data to `GeDi`.
```{r de-tregs2}
# Extract differentially expressed genes
# Perform contrast analysis comparing "KLRGplusNFIL3plusTreg" group to "KLRGminusNFIL3minusTreg" group
# Set a log2 fold change threshold of 1 and a significance level (alpha) of 0.05
res_tregs <- results(dds_tregs,
contrast = c("group", "KLRGplusNFIL3plusTreg", "KLRGminusNFIL3minusTreg"),
lfcThreshold = 1, alpha = 0.05
)
# Print a summary overview of the results
summary(res_tregs)
# Plot an MA-plot of the results
ideal::plot_ma(res_tregs,
ylim = c(-5, 5),
title = "MAplot - KLRGplusNFIL3plusTreg vs KLRGminusNFIL3minusTreg")
# Add gene symbols to the results in a column "SYMBOL"
res_tregs$SYMBOL <- rowData(dds_tregs)$SYMBOL
```
## Functional enrichment analysis
Following the differential expression analysis, we perform a functional enrichment
analysis using the `r BiocStyle::Biocpkg("topGO")` package [@topGO]. Before the analysis,
we first determine the set of background genes to be used, which in our case will
be the set of expressed genes in the data. We also transform the results of our
DE analysis to fit the format expectation of the `topGOtable` function from the
`r BiocStyle::Biocpkg("pcaExplorer")` package [@Marini2019].
```{r enrich-macro, cache=TRUE}
# Determine the set of background genes as all genes expressed in the dataset
geneUniverseExpr <- rowData(dds_tregs)$SYMBOL[rowSums(counts(dds_tregs)) > 0]
# Extract gene symbols from the DESeq2 results object where FDR is below 0.05
# The function deseqresult2df is used to convert the DESeq2 results to a
# dataframe format
# FDR is set to 0.05 to filter significant results
de_symbols <- deseqresult2df(res_tregs, FDR = 0.05)$SYMBOL
# Perform Gene Ontology enrichment analysis using the topGOtable function from
# the "pcaExplorer" package
topGO_tregs <- topGOtable(
DEgenes = de_symbols,
BGgenes = geneUniverseExpr,
ontology = "BP",
geneID = "symbol",
addGeneToTerms = TRUE,
mapping = "org.Mm.eg.db",
topTablerows = 500
)
```
## Preparing the data for GeDi
After we have performed the functional enrichment analysis, the data is almost
ready to be used with `r BiocStyle::Biocpkg("GeDi")`. However, in its current
state the data is not yet in the correct format expected by `r BiocStyle::Biocpkg("GeDi")`.
`r BiocStyle::Biocpkg("GeDi")` expects the data to have at least two columns,
one named "Genesets" containing some form of geneset identifiers and one named
"Genes" containing a list of genes belonging to the genesets. While this is not
strictly necessary to use `r BiocStyle::Biocpkg("GeDi")` on the data, it
facilitates the use of the app as the app can be used straight away instead of
having to wait for the data to be reformated. The correct data format is however
necessary, if you want to use the data as a parameter as in `GeDi(topGO_tregs)`.
Nevertheless, we want to show you here, how to adapt the data from the
`r BiocStyle::Biocpkg("topGO")` analysis to fit the data format requirements
of `r BiocStyle::Biocpkg("GeDi")`. For this we simply have to rename the
'GO.ID' and the 'genes' column of the results as these two columns already
contain the input data in the correct format.
```{r renamecolumns, eval=TRUE}
# Rename columns in the topGO_tregs dataframe
# Change the column name "GO.ID" to "Genesets"
names(topGO_tregs)[names(topGO_tregs) == "GO.ID"] <- "Genesets"
# Change the column name "genes" to "Genes"
names(topGO_tregs)[names(topGO_tregs) == "genes"] <- "Genes"
```
After we have renamed the columns, the data is now ready to be used in
`r BiocStyle::Biocpkg("GeDi")`.
# Running `GeDi` on the dataset
Now we can start to explore our data using `r BiocStyle::Biocpkg("GeDi")`. For
this you can either follow the chunks in this document to prepare the data or
we can load the prepared object provided with the
[repository](https://github.com/AnnekathrinSilvia/manuscript_GeDi) of this
document.
```{r loadPreprocessedData}
tregs_example <- readRDS("usecase_tregs_example.RDS")
```
Once we have loaded the data, we can start the app and interactively explore
the data set using:
```{r runGeDi, eval=FALSE}
GeDi(genesets = tregs_example)
```
Once the app is started, you can have interactive guidance of the user
interface and its features by usind the introductory tours of each panel of
the app.
## Using `GeDi`'s functions in analysis reports
The functionality of `r BiocStyle::Biocpkg("GeDi")` can be used also as
standalone functions, to be called for example in existing analysis reports
in RMarkdown, or R scripts.
In the following chunks, we show how it is possible to call some of the functions
on the dataset presented in this document.
```{r distance_Scores}
# First we extract a representation of all genes in the data
genes <- GeDi::getGenes(topGO_tregs)
# Next we calculate one distance score matrix for the data
mm_score <- GeDi::getMeetMinMatrix(genesets = genes)
rownames(mm_score) <- colnames(mm_score) <- topGO_tregs$Genesets
# We can use several plotting functions of GeDi to plot the data
GeDi::distanceHeatmap(mm_score)
GeDi::distanceDendro(mm_score)
adj <- GeDi::getAdjacencyMatrix(mm_score, 0.3)
rownames(adj) <- colnames(adj) <- topGO_tregs$Genesets
g <- GeDi::buildGraph(adj, topGO_tregs, topGO_tregs$Genesets)
visNetwork::visIgraph(g)
```
After we have calculated distance scores, we can cluster the data and visualize
the results in several different ways.
```{r clustering}
# First we are clustering the data
clustering <- GeDi::clustering(mm_score, threshold = 0.3, cluster_method = "markov")
# Next we can visualize the results of the clustering
g <- GeDi::buildClusterGraph(clustering,
topGO_tregs,
topGO_tregs$Genesets,
color_by = "Annotated")
visNetwork::visIgraph(g)
g <- GeDi::getBipartiteGraph(clustering, topGO_tregs$Genesets, genes)
visNetwork::visIgraph(g) %>%
visIgraphLayout(layout = "layout_as_bipartite")
GeDi::enrichmentWordcloud(topGO_tregs)
```
# Session information {-}
```{r}
sessionInfo()
```
# Bibliography {-}