Merge pull request #203 from nf-core/transcript_lengths

Transcript lengths for DESeq2
nf-core · Nov 22, 2023 · 0b9aeeb · 0b9aeeb
2 parents 4587680 + 9bf5396
commit 0b9aeeb
Show file tree

Hide file tree

Showing 11 changed files with 113 additions and 40 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### `Added`
 
+- [[#203](https://github.com/nf-core/differentialabundance/pull/203)] - Transcript lengths for DESeq2 ([@pinin4fjords](https://github.com/pinin4fjords), review by [@maxulysse](https://github.com/maxulysse))
 - [[#193](https://github.com/nf-core/differentialabundance/pull/193)] - Add DESeq2 text to report ([@WackerO](https://github.com/WackerO), review by [@pinin4fjords](https://github.com/pinin4fjords))
 - [[#192](https://github.com/nf-core/differentialabundance/pull/192)] - Add scree plot in report ([@WackerO](https://github.com/WackerO), review by [@pinin4fjords](https://github.com/pinin4fjords))
 - [[#189](https://github.com/nf-core/differentialabundance/pull/189)] - Add DE models to report ([@WackerO](https://github.com/WackerO), review by [@pinin4fjords](https://github.com/pinin4fjords))

diff --git a/README.md b/README.md
@@ -52,7 +52,12 @@ RNA-seq:
 ```
 
 :::note
-If you are using the outputs of the nf-core rnaseq workflow as input here, please use the **gene_counts_length_scaled.tsv** or **gene_counts_scaled.tsv** matrices. See the [usage documentation](https://nf-co.re/differentialabundance/usage) for more information.
+If you are using the outputs of the nf-core rnaseq workflow as input here **either**:
+
+- supply the raw count matrices (file names like **gene_counts.tsv**) alongide the transcript length matrix via `--transcript_length_matrix` (rnaseq versions >=3.12.0, preferred)
+- **or** supply the **gene_counts_length_scaled.tsv** or **gene_counts_scaled.tsv** matrices.
+
+See the [usage documentation](https://nf-co.re/differentialabundance/usage) for more information.
 :::
 
 Affymetrix microarray:

diff --git a/conf/test.config b/conf/test.config
@@ -25,8 +25,9 @@ params {
     // Input data
 
     input = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.samplesheet.csv'
-    matrix =  'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.salmon.merged.gene_counts.top1000cov.tsv'
-    contrasts =  'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.contrasts.csv'
+    matrix = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.salmon.merged.gene_counts.top1000cov.tsv'
+    transcript_length_matrix = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.spoofed_lengths.tsv'
+    contrasts = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/mus_musculus/rnaseq_expression/SRP254919.contrasts.csv'
 
     // To do: replace this with a cut-down mouse GTF matching the matrix for testing
     gtf = 'https://ftp.ensembl.org/pub/release-81/gtf/mus_musculus/Mus_musculus.GRCm38.81.gtf.gz'

diff --git a/docs/usage.md b/docs/usage.md
@@ -69,17 +69,26 @@ This is a numeric square matrix file, comma or tab-separated, with a column for
 
 #### Outputs from nf-core/rnaseq and other tximport-processed results
 
-The nf-core rnaseq workflow uses [tximport](https://bioconductor.org/packages/release/bioc/html/tximport.html) to generate its quantification matrices. It does not currently output sufficient information to allow modelling of transcript length biases in differential analysis by this workflow, so we must use matrices per the second recommended approach in the [documentation](https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor):
+The nf-core RNAseq workflow incorporates [tximport](https://bioconductor.org/packages/release/bioc/html/tximport.html) for producing quantification matrices. From [version 3.12.2](https://github.com/nf-core/rnaseq/releases/tag/3.13.2), it additionally provides transcript length matrices which can be directly consumed by DESeq2 to model length bias across samples.
 
-> "The second method is to use the tximport argument countsFromAbundance="lengthScaledTPM" or "scaledTPM", and then to use the gene-level count matrix txi$counts directly as you would a regular count matrix with these software. Let’s call this method “bias corrected counts without an offset”"
+To use this approach, include the transcript lengths file with the **raw counts**:
 
-This corresponds to the **gene_counts_length_scaled.tsv** or **gene_counts_scaled.tsv** matrices, respectively, from the rnaseq workflow.
+```bash
+--matrix 'salmon.merged.gene_counts.tsv' \
+--transcript_length_matrix 'salmon.merged.gene_lengths.tsv'
+```
+
+Without the transcript lengths, for instance in earlier rnaseq workflow versions, follow the second recommendation in the [tximport documentation](https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor):
+
+> "Use the tximport argument countsFromAbundance='lengthScaledTPM' or 'scaledTPM', then employ the gene-level count matrix txi$counts directly in downstream software, a method we call 'bias corrected counts without an offset'"
+
+This aligns with the **gene_counts_length_scaled.tsv** or **gene_counts_scaled.tsv** matrices in the rnaseq workflow.
 
-Note that those documents also say:
+Important to note, the documentation advises:
 
-> "Note: Do not manually pass the original gene-level counts to downstream methods without an offset."
+> "Do not manually pass the original gene-level counts to downstream methods without an offset."
 
-This corresponds to the 'gene_counts.tsv' matrix, so we do not recomend this matrix is used as input for this workflow.
+So we **do not recommend** raw counts files such as `salmon.merged.gene_counts.tsv` as input for this workflow **except** where the transcript lengths are also provided.
 
 ### MaxQuant intensities
 

diff --git a/modules.json b/modules.json
@@ -37,7 +37,7 @@
                     },
                     "deseq2/differential": {
                         "branch": "master",
-                        "git_sha": "25047aa940979f64b31d19b94c483a9254263892",
+                        "git_sha": "c992bd93b22a97feda5f04755511366c45423626",
                         "installed_by": ["modules"]
                     },
                     "geoquery/getgeo": {

diff --git a/modules/nf-core/deseq2/differential/main.nf b/modules/nf-core/deseq2/differential/main.nf
diff --git a/modules/nf-core/deseq2/differential/meta.yml b/modules/nf-core/deseq2/differential/meta.yml
diff --git a/modules/nf-core/deseq2/differential/templates/deseq_de.R b/modules/nf-core/deseq2/differential/templates/deseq_de.R
diff --git a/nextflow.config b/nextflow.config
@@ -17,6 +17,7 @@ params {
     contrasts                  = null
     querygse                   = null
     matrix                     = null
+    transcript_length_matrix   = null
     control_features           = null
     sizefactors_from_controls  = false
 

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -76,6 +76,12 @@
                     "pattern": "^\\S+\\.(tsv|csv|txt)$",
                     "fa_icon": "fas fa-border-all"
                 },
+                "transcript_length_matrix": {
+                    "type": "string",
+                    "fa_icon": "fas fa-border-all",
+                    "description": "(RNA-seq only): optional transcript length matrix with sample samples and genes as the abundance matrix",
+                    "help_text": "if provided, this file willl be used to provide transcript lengths to DESeq2 to model lengh bias across samples"
+                },
                 "affy_cel_files_archive": {
                     "type": "string",
                     "default": "None",
@@ -294,7 +300,7 @@
                 },
                 "proteus_round_digits": {
                     "type": "number",
-                    "default": -1,
+                    "default": -1.0,
                     "description": "Number of decimals to round the MaxQuant intensities to; default: -1 (will not round)."
                 }
             }
@@ -307,13 +313,13 @@
             "properties": {
                 "filtering_min_abundance": {
                     "type": "number",
-                    "default": 1,
+                    "default": 1.0,
                     "description": "Minimum abundance value",
                     "fa_icon": "fas fa-compress-alt"
                 },
                 "filtering_min_samples": {
                     "type": "number",
-                    "default": 1,
+                    "default": 1.0,
                     "description": "Minimum observations that must pass the threshold to retain the row/ feature (e.g. gene).",
                     "fa_icon": "fas fa-compress-alt"
                 },
@@ -445,13 +451,13 @@
                 },
                 "differential_min_fold_change": {
                     "type": "number",
-                    "default": 2,
+                    "default": 2.0,
                     "description": "Minimum fold change used to calculate differential feature numbers",
                     "fa_icon": "fas fa-angle-double-down"
                 },
                 "differential_max_pval": {
                     "type": "number",
-                    "default": 1,
+                    "default": 1.0,
                     "description": "Maximum p value used to calculate differential feature numbers",
                     "fa_icon": "fas fa-angle-double-up"
                 },
@@ -709,7 +715,7 @@
                 },
                 "limma_p_value": {
                     "type": "number",
-                    "default": 1,
+                    "default": 1.0,
                     "description": "cutoff value for adjusted p-values. Only genes with lower p-values are listed.",
                     "fa_icon": "fas fa-angle-double-up"
                 }
@@ -955,7 +961,7 @@
                 },
                 "report_scree": {
                     "type": "boolean",
-                    "default": "true",
+                    "default": true,
                     "description": "Whether to generate a scree plot in the report"
                 }
             },

diff --git a/workflows/differentialabundance.nf b/workflows/differentialabundance.nf
@@ -64,6 +64,7 @@ if (params.study_type == 'affy_array'){
 }
 
 // Check optional parameters
+if (params.transcript_length_matrix) { ch_transcript_lengths = Channel.of([ exp_meta, file(params.transcript_length_matrix, checkIfExists: true)]).first() } else { ch_transcript_lengths = [[],[]] }
 if (params.control_features) { ch_control_features = Channel.of([ exp_meta, file(params.control_features, checkIfExists: true)]).first() } else { ch_control_features = [[],[]] }
 if (params.gsea_run) {
     if (params.gsea_gene_sets){
@@ -356,7 +357,8 @@ workflow DIFFERENTIALABUNDANCE {
         DESEQ2_NORM (
             ch_contrasts.first(),
             ch_samples_and_matrix,
-            ch_control_features
+            ch_control_features,
+            ch_transcript_lengths
         )
 
         // Run the DESeq differential module, which doesn't take the feature
@@ -365,7 +367,8 @@ workflow DIFFERENTIALABUNDANCE {
         DESEQ2_DIFFERENTIAL (
             ch_contrasts,
             ch_samples_and_matrix,
-            ch_control_features
+            ch_control_features,
+            ch_transcript_lengths
         )
 
         // Let's make the simplifying assumption that the processed matrices from