diff --git a/docs/usage/DEanalysis/de_rstudio.md b/docs/usage/DEanalysis/de_rstudio.md
index 761e63dc0..f3adcc30b 100644
--- a/docs/usage/DEanalysis/de_rstudio.md
+++ b/docs/usage/DEanalysis/de_rstudio.md
@@ -2,12 +2,10 @@
order: 4
---
-
# Differential Analysis with DESeq2
In this section of the tutorial, we will guide you through the practical steps necessary to set up the RStudio environment, load the required libraries and data, and execute the DESeq2 analysis. By the end of this section, you will have a fully functional DESeq2 analysis pipeline set up in RStudio, ready to uncover the differentially expressed genes in your dataset.
-
## Launching the RStudio environment
Once the nf-core/rnaseq pipeline is terminated, the resulting data are stored in the folder `results_star_salmon`. Now, we can analyse the results by running DESeq2 on RStudio. First of all, we need to launch it:
@@ -27,20 +25,19 @@ Password: pass
Using `sleep` will keep the Gitpod session active, preventing disconnection and providing enough time to complete our analysis without interruptions
:::
-
## Differential Expression Analysis
As in all analysis, firstly we need to create a new project:
-1) Go to the **File** menu and select **New Project**;
+1. Go to the **File** menu and select **New Project**;
-2) Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;
+2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;
![r_project](./img/project_R.png){ width="400" }
-3) The new project will be automatically opened in RStudio.
+3. The new project will be automatically opened in RStudio.
We can check whether we are in the correct working directory with `getwd()`. The path `/workspace/gitpod/training/DE_analysis/` should appear on your console.
To store our results in an organized way, we will create a folder named **de_results** using the **New Folder** button in the bottom right panel. We will save all our resulting tables and plots in this folder. Next, go to the **File menu**, select **New File** and then **R Script** to create a script editor in which we will save all commands required for the analysis. In the editor type:
@@ -97,10 +94,8 @@ In DESEq2, the `dds` object is a central data structure that contains the follow
- `countData`: a matrix of raw count data, where each row represents a gene and each column represents a sample;
-
- `colData`: a data frame containing information about the samples, such as the experimental design, treatment and other relevant metadata;
-
- `design`: a formula specifying the experimental design utilised to estimate the dispersion and the log2 fold change.
All these components can be checked with specific commands:
@@ -201,10 +196,8 @@ The `DESeq()` function is a high-level wrapper that simplifies the process of di
- `estimateSizeFactors`: to normalise the count data;
-
- `estimateDispersion`: to estimate the dispersion;
-
- `nbinomWaldTest`: to perform differential expression test.
The individual functions can be carried out also singularly as shown below:
@@ -236,7 +229,6 @@ The `rlog` and the `vst` transformations have an argument, **blind** that can be
- **TRUE** (default): useful for QC analysis because it re-estimates the dispersion, allowing for comparison of samples in an unbiased manner with respect to experimental conditions;
-
- **FALSE**: the function utilizes the already estimated dispersion, generally applied when differences in counts are expected to be due to the experimental design.
Next, we perform Principal Component Analysis (PCA) to explore the data. DESeq2 provides a built-in function, `plotPCA()`, which uses [ggplot2](https://ggplot2.tidyverse.org) for visualisation, taking the `rld` (or the `vst`) object as input.
@@ -325,19 +317,14 @@ The `results()` function in DESeq2 is used to extract the results of the DE anal
- **baseMean**: the average expression level of the gene across all samples;
-
- **log2FoldChange**: the log2 fold change of the gene between the condition of interest and the reference level;
-
- **lfcSE**: the standard error of the log2 fold change;
-
- **stat**: the Wald statistic, which is used to calculate the p-value;
-
- **pvalue**: the p-value from the Wald test indicates the probability of observing the measured difference in gene expression (log2 fold change) by chance, assuming no true difference exists (null hypothesis). A low p-value suggests that the observed expression change between samples is unlikely due to random chance, so we can reject the null hypothesis --> the gene is differentially expressed;
-
- **padj**: the adjusted p-value, which takes into account multiple testing corrections, (Benjamini-Hochberg method default) to control the false discovery rate;
The `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast (specific comparison between two or more levels).
@@ -386,7 +373,7 @@ res_viz <- as_tibble(res_viz) %>%
write.csv(res_viz, file = "de_results/de_result_table.csv")
```
-In the *Experimental Design* section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change of at least 1 or -1 as differentially expressed.
+In the _Experimental Design_ section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change of at least 1 or -1 as differentially expressed.
```r
#### Extract significant DE genes from the results ####
@@ -413,7 +400,6 @@ resSig
write.csv(resSig, file = "de_results/sig_de_genes.csv")
```
-
## Plot the results
Now that we have obtained the results of the differential expression analysis, it's time to visualise the data to gain a deeper understanding of the biological processes that are affected by the experimental conditions. Visualisation is a crucial step in RNA-seq analysis, as it allows us to identify patterns and trends in the data that may not be immediately apparent from the numerical results.
@@ -533,7 +519,6 @@ volcano_plot <- ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj),
ggsave("de_results/volcano_plot.png", plot = volcano_plot, width = 6, height = 5, dpi = 300)
```
-
## Functional analysis
The output of the differential expression analysis is a list of significant DE genes. To uncover the underlying biological mechanisms, various downstream analyses can be performed, such as functional enrichment analysis (identify overrepresented biological processes, molecular functions, cellular components or pathways), and network analysis (group genes based on similar expression patterns to identify novel interactions). To facilitate the interpretation of the resulting list of DE genes, a range of freely available web- and R-based tools can be employed.
@@ -542,10 +527,8 @@ In this tutorial, we will explore an enrichment analysis technique known as Over
- **Universe**: the background list of genes (for example the genes annotated in a genome);
-
- **GeneSet**: a collection of genes annotated by a reference database (such as Gene Ontology), and known to be involved in a particular biological pathway or process;
-
- **Gene List**: the differentially expressed genes.
The hypergeometric test calculates the probability of observing a certain number of genes from the gene set (pathway or process) within the gene list (DE genes) by chance.
diff --git a/docs/usage/DEanalysis/index.md b/docs/usage/DEanalysis/index.md
index 68e35bb99..56d9d5e24 100644
--- a/docs/usage/DEanalysis/index.md
+++ b/docs/usage/DEanalysis/index.md
@@ -2,7 +2,6 @@
order: 1
---
-
# Welcome
These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/rnaseq](https://nf-co.re/rnaseq).
@@ -20,7 +19,6 @@ By the end of this workshop, you will be able to:
Let's get started!
-
## Running with Gitpod
In order to run this using GitPod, please make sure:
@@ -30,17 +28,13 @@ In order to run this using GitPod, please make sure:
Now you're all set and can use the following button to launch the service:
-
[![Open in GitPod](https://img.shields.io/badge/Gitpod-%20Open%20in%20Gitpod-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/lescai-teaching/rnaseq-tutorial)
-
-
## Additional documentation
- You can find detailed documentation on **Nextflow** [here](https://www.nextflow.io/docs/latest/)
- You can find additional training on [these pages](https://training.nextflow.io)
-
## Credits & Copyright
This training material has been written and completed by [Lorenzo Sola](https://github.com/LorenzoS96), [Francesco Lescai](https://github.com/lescai), and [Mariangela Santorsola](https://github.com/msantorsola) during the [nf-core](https://nf-co.re) Hackathon in Barcellona, 2024. Thank you to [Victoria Cepeda](https://github.com/vcepeda) for her contributions to the tutorial's revision. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.
diff --git a/docs/usage/DEanalysis/interpretation.md b/docs/usage/DEanalysis/interpretation.md
index 20a8a7279..476f627cd 100644
--- a/docs/usage/DEanalysis/interpretation.md
+++ b/docs/usage/DEanalysis/interpretation.md
@@ -2,7 +2,6 @@
order: 5
---
-
# Interpretation
Once DE genes have been identified, the next crucial step is to interpret the results. This involves the inspection of tables and plots generated during the analysis to understand the biological significance of the data. In this part of the tutorial, we will explore the results by discussing the significant DE genes and we will examine various plots generated during the analysis.
@@ -11,7 +10,6 @@ Once DE genes have been identified, the next crucial step is to interpret the re
The results illustrated in this section might show slight variations compared to your runs due to randomness in the STAR algorithm. This randomness arises from using variable seed values and parallel processing, leading to minor differences in results between runs on the same data. These small discrepancies are not biologically significant and may affect counts and subsequent plots (such as PCA and count plots). However, the overall patterns and main findings should remain consistent. While exact reproducibility is ideal, minor variations are acceptable in practice, as long as they do not impact the main conclusions of the study.
:::
-
## Quality control plots
The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: `sample`, `condition`, and `replica`. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age, and developmental stage.
@@ -32,7 +30,6 @@ Remember that to create this plot, we utilized the `dist()` function, so in the
Overall, the integration of these plots suggests that we are working with high-quality data and we can confidently proceed to the differential expression analysis.
-
## Differential expression results
In this part of the tutorial, we will examine plots that are generated after the differential expression analysis. These plots are not quality control plots, but rather plots that help us to interpret the results.
@@ -65,12 +62,11 @@ The treatment induced differential expression in five genes: one downregulated a
After the identification of DE genes, it's informative to visualise the expression of specific genes of interest. The `plotCounts()` function applied directly on the `dds` object allows us to examine individual gene expression profiles without accessing the full `res` object.
-
![counts](./img/plotCounts.png){ width="400" }
-In our example, post-treatment, we observe a significant increase in the expression of the *ENSG00000142192* gene, highlighting its responsiveness to the experimental conditions.
+In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions.
Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression.
@@ -80,7 +76,6 @@ Finally, we can create a heatmap using the normalised expression counts of DE ge
By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualisation allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics.
-
## Over Representation Analysis (ORA)
Finally, we can attempt to assign biological significance to our differentially expressed genes through **Over Representation Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes.
@@ -91,7 +86,6 @@ Finally, we can attempt to assign biological significance to our differentially
The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study.
-
## Conclusions
In this tutorial, we have walked through the steps of the RNA-seq analysis, from launching the nfcore/rnaseq pipeline to interpreting differential expression results. You learned how the data are generated, identified differentially expressed genes, and conducted enrichment analysis. By following this tutorial, you should now be able to use the nfcore/rnaseq pipeline and perform differential expression analysis with DESeq2, interpreting the results within the context of your research.
diff --git a/docs/usage/DEanalysis/rnaseq.md b/docs/usage/DEanalysis/rnaseq.md
index b6a1fa5b8..9fd0e123e 100644
--- a/docs/usage/DEanalysis/rnaseq.md
+++ b/docs/usage/DEanalysis/rnaseq.md
@@ -2,12 +2,10 @@
order: 3
---
-
# The nf-core/rnaseq pipeline
In order to carry out a RNA-Seq analysis we will use the nf-core pipeline [rnaseq](https://nf-co.re/rnaseq/3.14.0).
-
## Overview
The pipeline is organised following the diffent blocks shown below: pre-processing, traditional alignment (or lightweight alignment) and quantification, post-processing and final QC.
@@ -20,10 +18,8 @@ In each process, the users can choose among a range of different options. Import
- traditional alignment and quantification (stage 2);
-
- lightweight alignment and quantification (stage 3).
-
## Experimental Design
The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of a RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens, without providing significant improvements. Instead, it is often more beneficial to prioritise the number of biological replicates, as it allows to capture the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression.
@@ -32,15 +28,12 @@ The number of reads and the number of biological replicates are two critical fac
This concept must not be confused with technical replicates that asses the technical variability of the sequencing platform by sequencing the same RNA sample multiple time.
:::
-
To obtain optimal results, it is crucial to balance the number of biological replicates and the sequencing depth. While increasing the depth of sequencing enhances the ability to detect genes with low expression levels, there is a plateau beyond which no further benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. For instance, this approach helps to establish a suitable log2 fold change threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results, and increasing the reliability of the findings.
-
## Library design
RNA-seq library design involves important decisions, particularly the choice between paired-end and single-end sequencing. Paired-end sequencing offers insights into structural variations and transcript isoforms, significantly improving mapping accuracy for longer transcripts and repetitive regions. In contrast, single-end sequencing—where only one end of the fragment is sequenced—can be a more cost-effective option while still delivering high-quality data for gene expression analysis. The choice between these two methods ultimately depends on the research question and experimental objectives. Paired-end sequencing is ideal for identifying novel transcripts or characterizing isoforms, whereas single-end sequencing is often sufficient for quantifying gene expression. Factors such as the type of RNA (e.g., mRNA or total RNA), read length, budget, and available computational resources also influence this decision.
-
## Reference genome
nf-core pipelines make use of the Illumina iGenomes collection as [reference genomes](https://nf-co.re/docs/usage/reference_genomes).
@@ -53,7 +46,6 @@ In this tutorial we will edit the config file, since the data we will be using h
The two files are `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa` and `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa.fai`, respectively.
-
## Reference annoation
The reference annotation plays a crucial role in the RNA-seq analysis. Without a high-quality reference annotation, RNA-seq analysis would result in inaccurate or incomplete results. The reference annotation provides a precise guide for aligning sequencing reads to specific genomic regions, allowing to identify genes, transcripts, and regulatory elements, as well as novel transcripts and alternative splicing events.
@@ -65,7 +57,6 @@ Similarly to the approach utilised for the genome, in this tutorial we will edit
The two files are `/workspace/gitpod/training/data/refs/gencode_v29_chr21.gff` and `/workspace/gitpod/training/data/refs/gencode_v29_transcripts_chr21.fa`, respectively.
-
## Input files
The input data should be provided in a CSV file, according to a format that is largely common for nf-core pipelines.
@@ -73,23 +64,18 @@ The format is described in the [rnaseq usage page](https://nf-co.re/rnaseq/3.14.
The input file is `/workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv`.
-
## Running nf-core/rnaseq
In the following sections we will:
- prepare our references;
-
- set our computational resources in order to be able to run the pipeline on a gitpod VM;
-
- edit the optional settings;
-
- run the pipeline.
-
## Reference and annotation files
Following the considerations above, we will first of all edit the `nextflow.config` file in our working directory to add a new genome.
@@ -111,7 +97,6 @@ genomes {
To speed up the analysis we will include the `star_index` and the `salmon_index` in the config. These files have already been created locally.
-
## Computing resources
Based on the choices we made when starting up the gitpod environment, we recommend to use the following additional parameters.
diff --git a/docs/usage/DEanalysis/theory.md b/docs/usage/DEanalysis/theory.md
index 0d663fe61..dd932fbb6 100644
--- a/docs/usage/DEanalysis/theory.md
+++ b/docs/usage/DEanalysis/theory.md
@@ -2,12 +2,10 @@
order: 2
---
-
# RNAseq Analysis
Before we dive into the nf-core pipeline for analysing RNA-sequencing data, it's worth looking at some theoretical aspects of RNA-seq.
-
## Overview
Given the central role of RNA in a wide range of molecular functions, RNA-seq has emerged as a powerful tool for measuring the presence and levels of RNA species in biological samples. The technique is based on next-generation sequencing (NGS) technologies and is now considered the gold standard in the field of transcriptomics.
@@ -22,20 +20,16 @@ In the scheme, we can identify three key phases in the workflow:
- **data pre-processing**: raw reads are handled to remove adapters or contaminants enhancing their quality;
-
- **traditional or lightweight alignment and quantification**: reads are mapped to a reference genome, and gene abundance is estimated. The workflow can also follow an alternative route based on lightweight alignment and quantification, reducing the time required for the analysis.
-
- **differential expression analysis**: differentially expressed genes are identified using statistical tests, annotated, and visualised.
Depending on the user's needs, the workflow can include additional downstream analyses such as functional enrichment analysis, co-expression analysis, and integration with other omics data.
-
## Pre-processing
The pre-processing of sequencing reads from RNA-seq data is a critical step to ensure the quality and accuracy of downstream analysis. The raw reads obtained from the sequencer are stored as [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files, which contain both the sequence data and quality scores. The initial processing step involves evaluating the quality of the reads to identify potential issues such as adapter contamination, sequencing errors, or low-quality bases. The presence of adapters (short DNA sequences ligated to the ends of DNA fragments during library preparation) is firstly detected through comparison with known adapter sequences or by using algorithms that identify adapter-like sequences. These sequences are then removed in a process known as **read trimming**. Next, reads containing contaminants (genomic DNA and/or rRNA), and those with low-quality bases are filtered out. Finally, the quality of the filtered reads is checked to ensure their suitability for downstream processing.
-
## Alignment (or lightweight alignment) and quantification
In the RNA-seq workflow, the alignment step involves mapping sequencing reads to a reference genome or transcriptome to determine the position and orientation of each read relative to the reference sequence.
@@ -46,7 +40,6 @@ The alignment and quantification steps can follow two different approaches depen
- traditional alignment and quantification;
-
- lightweight alignment and quantification.
In RNA-seq analysis, traditional alignment is often performed with **splice-aware aligners**, which are designed to recognise the splicing process. These aligners can align reads across known splice junctions and detect novel splice sites or alternative splicing events. Popular splice-aware aligners include [STAR](https://github.com/alexdobin/STAR), and [HISAT2](https://github.com/DaehwanKimLab/hisat2). Once alignment is complete, the following step is quantification, which involves estimating the abundance (number of reads) associated with each gene. Tools like [featureCounts](https://subread.sourceforge.net/featureCounts.html), [HTSeq](https://htseq.readthedocs.io), [Salmon](http://combine-lab.github.io/salmon), and [RSEM](http://deweylab.github.io/RSEM) are commonly used for this purpose.
@@ -57,14 +50,12 @@ Traditional alignment tools are time-consuming and require substantial computati
Salmon can be run in two different modes: alignment-based mode or quasi-mapping mode. In the first mode, Salmon estimates abudance starting from `bam` file obtained with the chosen aligner. In the second mode, Salmon requires a reference transcriptome and raw reads to perform both mapping and quantification.
:::
-
## Differential expression (DE) analysis with DESeq2
Differential expression analysis is a statistical method for comparing gene expression levels between different experimental conditions, such as disease vs healthy (e.g., tumour tissue vs healthy tissue), treatment vs control (e.g., a sample treated with a specific stimulus, drug, or compound vs an untreated sample), and tissue vs tissue (e.g., brain vs heart). Differential expression analysis aims to assess, for each gene, whether the observed differences in expression between groups are statistically significant, accounting for the variation observed within groups (replicates). This part of the analysis is typically performed in R using various packages, such as [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html), [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), and [limma](https://bioconductor.org/packages/release/bioc/html/limma.html). This tutorial focuses on **DESeq2**, a popular R package known for its robustness. For more detailed information, see the [DESeq2 vignette](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
The analysis begins with the input data, which generally consist of a count matrix obtained during the alignment and quantification step, summarising the expression levels of the different genes in each sample of the dataset. The rows of the matrix typically correspond to genes, while the columns represent the samples. Another essential prerequisite is a metadata table describing the samples.
-
### Quality Control
To ensure robust differential expression results, it is common to start by exploring the sources of variation in the data. DESeq2 provides quality control tools, including Principal Component Analysis (PCA) and hierarchical clustering. PCA is used to reduce the dimensionality of the data, allowing the visualisation of the samples in a lower-dimensional space. In contrast, hierarchical clustering shows the correlation of gene expression for all pairwise combinations of samples in the dataset. These methods can identify groups of samples with similar behaviour, as well as potential outliers or unexpected patterns.
@@ -77,7 +68,6 @@ Finally, it is often beneficial to filter out genes that are unlikely to show di
The `vst` or `rlog` transformations are applied to the normalised counts stored in the `dds` object, which is generated by running either the `DESeq()` or `estimateSizeFactors()` function. Since the estimation of size factors is an early step in the `DESeq()` function, the transformations are generally applied immediately afterwards.
:::
-
### Design Formula
The design formula specifies the sources of variation that the statistical model needs to account for. It defines how the counts will be related to the experimental conditions, allowing the software to model the relationship between gene expression and the factors of interest, such as treatment, time points or batch effects. It is important to specify the main factor of interest as well as additional covariates in the design formula.
@@ -100,7 +90,6 @@ In R, the tilde (`~`) is used in formula syntax to specify relationships between
The results will not be affected by the order of variables but the common practice is to specify the main source of variation in the last position of the design formula.
-
### Differential Expression Analysis
RNA-seq data typically contain a large number of genes with low expression counts, indicating that many genes are expressed at very low levels across samples. At the same time, RNA-seq data exhibit a skewed distribution with a long right tail due to the absence of an upper limit for gene expression levels. This means that while most genes have low to moderate expression levels, a small number are expressed at high levels. Accurate statistical modelling must therefore account for this distribution to avoid misleading conclusions.
@@ -121,26 +110,23 @@ While `DESeq()` combines these steps, a user could choose to perform each functi
The different steps are explained in detail below:
-1) **Normalisation**: since DESeq2 compares counts between sample groups for the same gene, it does not need to adjust for gene length. However, it is essential to account for variations in sequencing depth and RNA composition among samples. To normalise the data, DESeq2 utilises **size factors**.
+1. **Normalisation**: since DESeq2 compares counts between sample groups for the same gene, it does not need to adjust for gene length. However, it is essential to account for variations in sequencing depth and RNA composition among samples. To normalise the data, DESeq2 utilises **size factors**.
The size factors are calculated using the **median ratio** method:
- **Calculate the geometric mean**: for each gene, compute the geometric mean of its counts across all samples. This gives a row-wise geometric mean for each gene;
-
- **Calculate ratios**: divide each gene's count by its geometric mean to obtain a ratio for each sample;
-
- **Determine size factors**: for each sample, take the median of these ratios (column-wise) to derive the size factors;
-
- **Normalise counts**: divide each raw count by the corresponding sample size factor to generate normalised counts.
:::warning
While normalised counts are useful for downstream visualisation of results, they should not be used as input for DESeq2. Instead, DESeq2 requires raw count data in the form of a matrix of integer values.
:::
-2) **Estimate dispersion and gene-wise dispersion**: the dispersion is a measure of how much the variance deviates from the mean. The dispersion estimates indicate the variance in gene expression at a specific mean expression level. Importantly, RNA-seq data are characterised by **overdispersion**, where the variance in gene expression levels often exceeds the mean (variance > mean).
+2. **Estimate dispersion and gene-wise dispersion**: the dispersion is a measure of how much the variance deviates from the mean. The dispersion estimates indicate the variance in gene expression at a specific mean expression level. Importantly, RNA-seq data are characterised by **overdispersion**, where the variance in gene expression levels often exceeds the mean (variance > mean).
![overdispersion](./img/overdispersion.png){ width="400"}
@@ -155,9 +141,9 @@ $$
A key feature of DESeq2's dispersion estimates is their negative correlation with the mean, and positive correlation with the variance. Genes with low expression tend to have higher dispersion values, while genes with high expression tend to have lower dispersion.
-3) **Mean-dispersion relationship**: this process, known as dispersion fitting, models the relationship between the mean expression level of a gene, and its dispersion. DESeq2 assumes that genes with similar expression profiles share similar dispersion patterns, and leverages this information to refine the estimates identifying a trend in the dispersion estimates across genes. The fitted curve, typically a smooth curve, describes how dispersion changes as a function of the mean expression level.
+3. **Mean-dispersion relationship**: this process, known as dispersion fitting, models the relationship between the mean expression level of a gene, and its dispersion. DESeq2 assumes that genes with similar expression profiles share similar dispersion patterns, and leverages this information to refine the estimates identifying a trend in the dispersion estimates across genes. The fitted curve, typically a smooth curve, describes how dispersion changes as a function of the mean expression level.
-4) **Final dispersion estimates**: DESeq2 refines the gene-wise dispersion by shrinking it towards the fitted curve. The "shrinkage" helps control for overfitting, and makes the dispersion estimates more reliable. The strength of the shrinkage depends on the sample size (more samples = less shrinkage), and how close the initial estimates are to the fitted curve.
+4. **Final dispersion estimates**: DESeq2 refines the gene-wise dispersion by shrinking it towards the fitted curve. The "shrinkage" helps control for overfitting, and makes the dispersion estimates more reliable. The strength of the shrinkage depends on the sample size (more samples = less shrinkage), and how close the initial estimates are to the fitted curve.
![dispersion](./img/dispersion_estimates.png){ width="400"}
@@ -165,9 +151,9 @@ A key feature of DESeq2's dispersion estimates is their negative correlation wit
The initial estimates (black dots) are shrunk toward the fitted curve (red line) to obtain the final estimates (blue dots). However, genes with exceptionally high dispersion values are not shrunk, as they likely deviate from the model assumptions exhibiting elevated variability due to biological or technical factors. Shrinking these values could lead to false positives.
-5) **Fitting model and testing**: the initial step in hypothesis testing involves defining a **null hypothesis** for each gene. In DESeq2, The null hypothesis states that there is no difference in expression between the sample groups (log2 fold change == 0). Next, DESeq applies a statistical test to assess whether the null hypothesis is true.
-DESeq2 fits a **generalised linear model (GLM)** to the normalised counts using the calculated size factors and final dispersion estimates. A GLM is a flexible extension of linear regression that models the relationship between a response variable (normalised counts), and predictors (e.g., condition).
-By using a **negative binomial distribution**, DESeq2's GLM can handle the additional variability in gene expression counts that often occurs in RNA-seq data. Once the model is fit, coefficients are estimated for each sample group along with their standard errors. These coefficients represent the estimated log2 fold changes between groups, and serve as the basis for hypothesis testing, using either a **Wald test** or a **Likelihood Ratio Test (LRT)**, depending on the experimental design:
+5. **Fitting model and testing**: the initial step in hypothesis testing involves defining a **null hypothesis** for each gene. In DESeq2, The null hypothesis states that there is no difference in expression between the sample groups (log2 fold change == 0). Next, DESeq applies a statistical test to assess whether the null hypothesis is true.
+ DESeq2 fits a **generalised linear model (GLM)** to the normalised counts using the calculated size factors and final dispersion estimates. A GLM is a flexible extension of linear regression that models the relationship between a response variable (normalised counts), and predictors (e.g., condition).
+ By using a **negative binomial distribution**, DESeq2's GLM can handle the additional variability in gene expression counts that often occurs in RNA-seq data. Once the model is fit, coefficients are estimated for each sample group along with their standard errors. These coefficients represent the estimated log2 fold changes between groups, and serve as the basis for hypothesis testing, using either a **Wald test** or a **Likelihood Ratio Test (LRT)**, depending on the experimental design:
- **Wald Test**: The Wald test is ideal for simpler experimental designs, such as comparing two conditions (e.g., treated vs untreated). In DESeq2, the Wald test is implemented by calculating the log2 fold change and dividing it by its standard error to obtain a z-statistic. This z-statistic is then evaluated against a standard normal distribution, leading to the computation of a p-value that indicates the likelihood of observing a z-statistic as extreme as the calculated value under the null hypothesis. When the p-value is small, we reject the null hypothesis, concluding that the gene is differentially expressed.