Skip to content

Commit

Permalink
prettier check
Browse files Browse the repository at this point in the history
  • Loading branch information
LorenzoS96 committed Nov 7, 2024
1 parent 08efda6 commit 6be4d82
Show file tree
Hide file tree
Showing 5 changed files with 12 additions and 70 deletions.
25 changes: 4 additions & 21 deletions docs/usage/DEanalysis/de_rstudio.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,10 @@
order: 4
---


# Differential Analysis with DESeq2

In this section of the tutorial, we will guide you through the practical steps necessary to set up the RStudio environment, load the required libraries and data, and execute the DESeq2 analysis. By the end of this section, you will have a fully functional DESeq2 analysis pipeline set up in RStudio, ready to uncover the differentially expressed genes in your dataset.


## Launching the RStudio environment

Once the nf-core/rnaseq pipeline is terminated, the resulting data are stored in the folder `results_star_salmon`. Now, we can analyse the results by running DESeq2 on RStudio. First of all, we need to launch it:
Expand All @@ -27,20 +25,19 @@ Password: pass
Using `sleep` will keep the Gitpod session active, preventing disconnection and providing enough time to complete our analysis without interruptions
:::


## Differential Expression Analysis

As in all analysis, firstly we need to create a new project:

1) Go to the **File** menu and select **New Project**;
1. Go to the **File** menu and select **New Project**;

2) Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;
2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;

<figure markdown="span">
![r_project](./img/project_R.png){ width="400" }
</figure>

3) The new project will be automatically opened in RStudio.
3. The new project will be automatically opened in RStudio.

We can check whether we are in the correct working directory with `getwd()`. The path `/workspace/gitpod/training/DE_analysis/` should appear on your console.
To store our results in an organized way, we will create a folder named **de_results** using the **New Folder** button in the bottom right panel. We will save all our resulting tables and plots in this folder. Next, go to the **File menu**, select **New File** and then **R Script** to create a script editor in which we will save all commands required for the analysis. In the editor type:
Expand Down Expand Up @@ -97,10 +94,8 @@ In DESEq2, the `dds` object is a central data structure that contains the follow

- `countData`: a matrix of raw count data, where each row represents a gene and each column represents a sample;


- `colData`: a data frame containing information about the samples, such as the experimental design, treatment and other relevant metadata;


- `design`: a formula specifying the experimental design utilised to estimate the dispersion and the log2 fold change.

All these components can be checked with specific commands:
Expand Down Expand Up @@ -201,10 +196,8 @@ The `DESeq()` function is a high-level wrapper that simplifies the process of di

- `estimateSizeFactors`: to normalise the count data;


- `estimateDispersion`: to estimate the dispersion;


- `nbinomWaldTest`: to perform differential expression test.

The individual functions can be carried out also singularly as shown below:
Expand Down Expand Up @@ -236,7 +229,6 @@ The `rlog` and the `vst` transformations have an argument, **blind** that can be

- **TRUE** (default): useful for QC analysis because it re-estimates the dispersion, allowing for comparison of samples in an unbiased manner with respect to experimental conditions;


- **FALSE**: the function utilizes the already estimated dispersion, generally applied when differences in counts are expected to be due to the experimental design.

Next, we perform Principal Component Analysis (PCA) to explore the data. DESeq2 provides a built-in function, `plotPCA()`, which uses [ggplot2](https://ggplot2.tidyverse.org) for visualisation, taking the `rld` (or the `vst`) object as input.
Expand Down Expand Up @@ -325,19 +317,14 @@ The `results()` function in DESeq2 is used to extract the results of the DE anal

- **baseMean**: the average expression level of the gene across all samples;


- **log2FoldChange**: the log2 fold change of the gene between the condition of interest and the reference level;


- **lfcSE**: the standard error of the log2 fold change;


- **stat**: the Wald statistic, which is used to calculate the p-value;


- **pvalue**: the p-value from the Wald test indicates the probability of observing the measured difference in gene expression (log2 fold change) by chance, assuming no true difference exists (null hypothesis). A low p-value suggests that the observed expression change between samples is unlikely due to random chance, so we can reject the null hypothesis --> the gene is differentially expressed;


- **padj**: the adjusted p-value, which takes into account multiple testing corrections, (Benjamini-Hochberg method default) to control the false discovery rate;

The `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast (specific comparison between two or more levels).
Expand Down Expand Up @@ -386,7 +373,7 @@ res_viz <- as_tibble(res_viz) %>%
write.csv(res_viz, file = "de_results/de_result_table.csv")
```

In the *Experimental Design* section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change of at least 1 or -1 as differentially expressed.
In the _Experimental Design_ section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change of at least 1 or -1 as differentially expressed.

```r
#### Extract significant DE genes from the results ####
Expand All @@ -413,7 +400,6 @@ resSig
write.csv(resSig, file = "de_results/sig_de_genes.csv")
```


## Plot the results

Now that we have obtained the results of the differential expression analysis, it's time to visualise the data to gain a deeper understanding of the biological processes that are affected by the experimental conditions. Visualisation is a crucial step in RNA-seq analysis, as it allows us to identify patterns and trends in the data that may not be immediately apparent from the numerical results.
Expand Down Expand Up @@ -533,7 +519,6 @@ volcano_plot <- ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj),
ggsave("de_results/volcano_plot.png", plot = volcano_plot, width = 6, height = 5, dpi = 300)
```


## Functional analysis

The output of the differential expression analysis is a list of significant DE genes. To uncover the underlying biological mechanisms, various downstream analyses can be performed, such as functional enrichment analysis (identify overrepresented biological processes, molecular functions, cellular components or pathways), and network analysis (group genes based on similar expression patterns to identify novel interactions). To facilitate the interpretation of the resulting list of DE genes, a range of freely available web- and R-based tools can be employed.
Expand All @@ -542,10 +527,8 @@ In this tutorial, we will explore an enrichment analysis technique known as Over

- **Universe**: the background list of genes (for example the genes annotated in a genome);


- **GeneSet**: a collection of genes annotated by a reference database (such as Gene Ontology), and known to be involved in a particular biological pathway or process;


- **Gene List**: the differentially expressed genes.

The hypergeometric test calculates the probability of observing a certain number of genes from the gene set (pathway or process) within the gene list (DE genes) by chance.
Expand Down
6 changes: 0 additions & 6 deletions docs/usage/DEanalysis/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@
order: 1
---


# Welcome

These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/rnaseq](https://nf-co.re/rnaseq).
Expand All @@ -20,7 +19,6 @@ By the end of this workshop, you will be able to:

Let's get started!


## Running with Gitpod

In order to run this using GitPod, please make sure:
Expand All @@ -30,17 +28,13 @@ In order to run this using GitPod, please make sure:

Now you're all set and can use the following button to launch the service:


[![Open in GitPod](https://img.shields.io/badge/Gitpod-%20Open%20in%20Gitpod-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/lescai-teaching/rnaseq-tutorial)



## Additional documentation

- You can find detailed documentation on **Nextflow** [here](https://www.nextflow.io/docs/latest/)
- You can find additional training on [these pages](https://training.nextflow.io)


## Credits & Copyright

This training material has been written and completed by [Lorenzo Sola](https://github.com/LorenzoS96), [Francesco Lescai](https://github.com/lescai), and [Mariangela Santorsola](https://github.com/msantorsola) during the [nf-core](https://nf-co.re) Hackathon in Barcellona, 2024. Thank you to [Victoria Cepeda](https://github.com/vcepeda) for her contributions to the tutorial's revision. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.
Expand Down
8 changes: 1 addition & 7 deletions docs/usage/DEanalysis/interpretation.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@
order: 5
---


# Interpretation

Once DE genes have been identified, the next crucial step is to interpret the results. This involves the inspection of tables and plots generated during the analysis to understand the biological significance of the data. In this part of the tutorial, we will explore the results by discussing the significant DE genes and we will examine various plots generated during the analysis.
Expand All @@ -11,7 +10,6 @@ Once DE genes have been identified, the next crucial step is to interpret the re
The results illustrated in this section might show slight variations compared to your runs due to randomness in the STAR algorithm. This randomness arises from using variable seed values and parallel processing, leading to minor differences in results between runs on the same data. These small discrepancies are not biologically significant and may affect counts and subsequent plots (such as PCA and count plots). However, the overall patterns and main findings should remain consistent. While exact reproducibility is ideal, minor variations are acceptable in practice, as long as they do not impact the main conclusions of the study.
:::


## Quality control plots

The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: `sample`, `condition`, and `replica`. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age, and developmental stage.
Expand All @@ -32,7 +30,6 @@ Remember that to create this plot, we utilized the `dist()` function, so in the

Overall, the integration of these plots suggests that we are working with high-quality data and we can confidently proceed to the differential expression analysis.


## Differential expression results

In this part of the tutorial, we will examine plots that are generated after the differential expression analysis. These plots are not quality control plots, but rather plots that help us to interpret the results.
Expand Down Expand Up @@ -65,12 +62,11 @@ The treatment induced differential expression in five genes: one downregulated a

After the identification of DE genes, it's informative to visualise the expression of specific genes of interest. The `plotCounts()` function applied directly on the `dds` object allows us to examine individual gene expression profiles without accessing the full `res` object.


<figure markdown="span">
![counts](./img/plotCounts.png){ width="400" }
</figure>

In our example, post-treatment, we observe a significant increase in the expression of the *ENSG00000142192* gene, highlighting its responsiveness to the experimental conditions.
In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions.

Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression.

Expand All @@ -80,7 +76,6 @@ Finally, we can create a heatmap using the normalised expression counts of DE ge

By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualisation allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics.


## Over Representation Analysis (ORA)

Finally, we can attempt to assign biological significance to our differentially expressed genes through **Over Representation Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes.
Expand All @@ -91,7 +86,6 @@ Finally, we can attempt to assign biological significance to our differentially

The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study.


## Conclusions

In this tutorial, we have walked through the steps of the RNA-seq analysis, from launching the nfcore/rnaseq pipeline to interpreting differential expression results. You learned how the data are generated, identified differentially expressed genes, and conducted enrichment analysis. By following this tutorial, you should now be able to use the nfcore/rnaseq pipeline and perform differential expression analysis with DESeq2, interpreting the results within the context of your research.
Loading

0 comments on commit 6be4d82

Please sign in to comment.