diff --git a/README.md b/README.md index 728f3528..57e79d02 100644 --- a/README.md +++ b/README.md @@ -16,31 +16,29 @@ ## Introduction -**nf-core/oncoanalyser** is a Nextflow implementation of the comprehensive cancer DNA and RNA analysis and reporting -workflow from the Hartwig Medical Foundation. For detailed information on each component of the Hartwig Medical -Foundation workflow, please refer to [hartwigmedical/hmftools](https://github.com/hartwigmedical/hmftools/). - -The oncoanalyser pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across -multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation -trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) -implementation of this pipeline uses one container per process which makes it much easier to maintain and update -software dependencies. Where possible, these processes have been submitted to and installed from -[nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to -everyone within the Nextflow community! - -On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud -infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on -real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other -analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core -website](https://nf-co.re/oncoanalyser/results). +**nf-core/oncoanalyser** is a Nextflow implementation of the comprehensive cancer DNA/RNA analysis and reporting +workflow from the Hartwig Medical Foundation. Both the Hartwig WGS/WTS workflow and targeted sequencing workflow are +available in oncoanalyser. 
The targeted sequencing workflow has built-in support for the TSO500 panel and can also run +custom panels with externally-generated normalisation data. + +The key analysis results for each sample are summarised and presented in an ORANGE report (summary page excerpt shown +below from *[COLO829_wgts.orange_report.pdf](https://pub-29f2e5b2b7384811bdbbcba44f8b5083.r2.dev/oncoanalyser/other/example_report/COLO829_wgts.orange_report.pdf)*): + +

+ +For detailed information on each component of the Hartwig workflow, please refer to +[hartwigmedical/hmftools](https://github.com/hartwigmedical/hmftools/). ## Pipeline summary The following processes and tools can be run with oncoanalyser: -* SNV and MNV calling (`SAGE`, `PAVE`) -* SV calling (`SV Prep`, `GRIDSS`, `GRIPSS`, `PURPLE`, `LINX`) +* Simple DNA/RNA alignment (`bwa-mem2`, `STAR`) +* Post-alignment processing (`MarkDups`) +* SNV, MNV, INDEL calling (`SAGE`, `PAVE`) * CNV calling (`AMBER`, `COBALT`, `PURPLE`) +* SV calling (`SvPrep`, `GRIDSS`, `GRIPSS`) +* SV event interpretation (`LINX`) * Transcript analysis (`Isofox`) * Oncoviral detection (`VIRUSBreakend`, `Virus Interpreter`) * HLA calling (`LILAC`) @@ -51,25 +49,25 @@ The following processes and tools can be run with oncoanalyser: ## Quick Start -Create a samplesheet containing your inputs: +Create a samplesheet with your inputs (WGS/WTS FASTQs in this example): -```text -group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath -P1__wgts,P1,SA,tumor,dna,bam,/path/to/SA.tumor.dna.wgs.bam -P1__wgts,P1,SB,tumor,rna,bam,/path/to/SB.tumor.rna.wts.bam -P1__wgts,P1,SC,normal,dna,bam,/path/to/SC.normal.dna.wgs.bam +```csv +group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath +P1__wgts,P1,SA,normal,dna,fastq,library_id:SA_library;lane:001,/path/to/SA.normal.dna.wgs.001.R1.fastq.gz;/path/to/SA.normal.dna.wgs.001.R2.fastq.gz +P1__wgts,P1,SB,tumor,dna,fastq,library_id:SB_library;lane:001,/path/to/SB.tumor.dna.wgs.001.R1.fastq.gz;/path/to/SB.tumor.dna.wgs.001.R2.fastq.gz +P1__wgts,P1,SC,tumor,rna,fastq,library_id:SC_library;lane:001,/path/to/SC.tumor.rna.wts.001.R1.fastq.gz;/path/to/SC.tumor.rna.wts.001.R2.fastq.gz ``` Launch oncoanalyser: ```bash nextflow run nf-core/oncoanalyser \ - -revision v0.3.1 \ - -profile docker \ - --mode wgts \ - --genome GRCh38_hmf \ - --input samplesheet.csv \ - --outdir output/ + -revision v0.3.1 \ + -profile docker \ + --mode wgts \ + 
--genome GRCh38_hmf \ + --input samplesheet.csv \ + --outdir output/ ``` ## Documentation @@ -78,16 +76,33 @@ The nf-core/oncoanalyser pipeline comes with documentation about the pipeline [usage](https://nf-co.re/oncoanalyser/usage), [parameters](https://nf-co.re/oncoanalyser/parameters) and [output](https://nf-co.re/oncoanalyser/output). -## Version support +## Version information -As oncoanalyser is used in clinical settings and is subject to accreditation standards in some instances, there is a -need for long-term stability and reliability for feature releases in order to meet operational requirements. This is +### Extended support + +As oncoanalyser is used in clinical settings and is subject to accreditation standards in some instances, there is a need +for long-term stability and reliability for feature releases in order to meet operational requirements. This is accomplished through long-term support of several nominated feature releases, which all receive bug fixes and security fixes during the period of extended support. Each release that is given extended support is allocated a separate long-lived git branch with the 'stable' prefix, e.g. `stable/1.2.x`, `stable/1.5.x`. Feature development otherwise occurs on the `main` branch. +Versions nominated to have current long-term support: + +* TBD + +### Release parity + +Versioning naturally differs between oncoanalyser and hmftools; however, it is often necessary to relate the functional +equivalence of these two pieces of software. The functional/feature parity with regard to version releases is detailed +in the table below. 
+ +| oncoanalyser | hmftools | +| --- | --- | +| 0.1.0 through 0.2.7 | 5.33 | +| 0.3.0 through 0.3.1 | 5.34 | + ## Credits The oncoanalyser pipeline was written by Stephen Watts while in the [Genomics Platform diff --git a/REFERENCE_DATA_STAGING.md b/REFERENCE_DATA_STAGING.md new file mode 100644 index 00000000..822ad3c0 --- /dev/null +++ b/REFERENCE_DATA_STAGING.md @@ -0,0 +1,93 @@ +# Reference data staging + +Download and unpack + +> All reference data is retrieved here, exclude unused files as desired; using GRCh38_hmf below + +```bash +base_url=https://pub-29f2e5b2b7384811bdbbcba44f8b5083.r2.dev/genomes + +fps=' +genomes/GRCh37_hmf/Homo_sapiens.GRCh37.GATK.illumina.fasta +genomes/GRCh37_hmf/bwa_index/0.7.17-r1188.tar.gz +genomes/GRCh37_hmf/bwa_index/2.2.1/Homo_sapiens.GRCh37.GATK.illumina.fasta.0123 +genomes/GRCh37_hmf/bwa_index/2.2.1/Homo_sapiens.GRCh37.GATK.illumina.fasta.bwt.2bit.64 +genomes/GRCh37_hmf/bwa_index_image/0.7.17-r1188/Homo_sapiens.GRCh37.GATK.illumina.fasta.img +genomes/GRCh37_hmf/gridss_index/2.13.2/Homo_sapiens.GRCh37.GATK.illumina.fasta.gridsscache +genomes/GRCh37_hmf/samtools_index/1.16/Homo_sapiens.GRCh37.GATK.illumina.fasta.dict +genomes/GRCh37_hmf/samtools_index/1.16/Homo_sapiens.GRCh37.GATK.illumina.fasta.fai +genomes/GRCh37_hmf/star_index/gencode_19/2.7.3a.tar.gz +genomes/GRCh38_hmf/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna +genomes/GRCh38_hmf/bwa_index/0.7.17-r1188.tar.gz +genomes/GRCh38_hmf/bwa_index/2.2.1/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.0123 +genomes/GRCh38_hmf/bwa_index/2.2.1/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bwt.2bit.64 +genomes/GRCh38_hmf/bwa_index_image/0.7.17-r1188/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.img +genomes/GRCh38_hmf/gridss_index/2.13.2/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gridsscache +genomes/GRCh38_hmf/samtools_index/1.16/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.dict 
+genomes/GRCh38_hmf/samtools_index/1.16/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai +genomes/GRCh38_hmf/star_index/gencode_38/2.7.3a.tar.gz +hmf_reference_data/hmftools/5.34_37--2.tar.gz +hmf_reference_data/hmftools/5.34_38--2.tar.gz +hmf_reference_data/panels/tso500_5.34_37--1.tar.gz +hmf_reference_data/panels/tso500_5.34_38--1.tar.gz +virusbreakend/virusbreakenddb_20210401.tar.gz +' + +parallel -j4 wget -c -x -nH -P reference_data/ ${base_url}/{} ::: ${fps} +find reference_data/ -name '*.tar.gz' | parallel -j0 'cd {//} && tar -xzvf {/}' +``` + +Create Nextflow config file for local reference data + +```bash +cat <<EOF > refdata.local.config +params { + genomes { + 'GRCh37_hmf' { + fasta = "$(pwd)/genomes/GRCh37_hmf/Homo_sapiens.GRCh37.GATK.illumina.fasta" + fai = "$(pwd)/genomes/GRCh37_hmf/samtools_index/1.16/Homo_sapiens.GRCh37.GATK.illumina.fasta.fai" + dict = "$(pwd)/genomes/GRCh37_hmf/samtools_index/1.16/Homo_sapiens.GRCh37.GATK.illumina.fasta.dict" + bwa_index = "$(pwd)/genomes/GRCh37_hmf/bwa_index/0.7.17-r1188.tar.gz" + bwa_index_bseq = "$(pwd)/genomes/GRCh37_hmf/bwa_index/2.2.1/Homo_sapiens.GRCh37.GATK.illumina.fasta.0123" + bwa_index_biidx = "$(pwd)/genomes/GRCh37_hmf/bwa_index/2.2.1/Homo_sapiens.GRCh37.GATK.illumina.fasta.bwt.2bit.64" + bwa_index_image = "$(pwd)/genomes/GRCh37_hmf/bwa_index_image/0.7.17-r1188/Homo_sapiens.GRCh37.GATK.illumina.fasta.img" + gridss_index = "$(pwd)/genomes/GRCh37_hmf/gridss_index/2.13.2/Homo_sapiens.GRCh37.GATK.illumina.fasta.gridsscache" + star_index = "$(pwd)/genomes/GRCh37_hmf/star_index/gencode_19/2.7.3a.tar.gz" + } + 'GRCh38_hmf' { + fasta = "$(pwd)/reference_data/genomes/GRCh38_hmf/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" + fai = "$(pwd)/reference_data/genomes/GRCh38_hmf/samtools_index/1.16/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai" + dict = "$(pwd)/reference_data/genomes/GRCh38_hmf/samtools_index/1.16/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.dict" + bwa_index = 
"$(pwd)/reference_data/genomes/GRCh38_hmf/bwa_index/0.7.17-r1188/" + bwa_index_bseq = "$(pwd)/reference_data/genomes/GRCh38_hmf/bwa_index/2.2.1/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.0123" + bwa_index_biidx = "$(pwd)/reference_data/genomes/GRCh38_hmf/bwa_index/2.2.1/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bwt.2bit.64" + bwa_index_image = "$(pwd)/reference_data/genomes/GRCh38_hmf/bwa_index_image/0.7.17-r1188/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.img" + gridss_index = "$(pwd)/reference_data/genomes/GRCh38_hmf/gridss_index/2.13.2/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gridsscache" + star_index = "$(pwd)/reference_data/genomes/GRCh38_hmf/star_index/gencode_38/2.7.3a/" + } + } + + ref_data_hmf_data_path = "$(pwd)/reference_data/hmf_reference_data/hmftools/5.34_38--2/" + ref_data_panel_data_path = "$(pwd)/reference_data/hmf_reference_data/panels/tso500_5.34_38--1/" + ref_data_virusbreakenddb_path = "$(pwd)/reference_data/virusbreakend/virusbreakenddb_20210401/" +} +EOF +``` + +Run oncoanalyser with local reference data + +> Assumes existing samplesheet at `samplesheet.csv` + +```bash +nextflow run oncoanalyser/main.nf \ + \ + -config refdata.local.config \ + -profile docker \ + \ + --mode targeted \ + --panel tso500 \ + --genome GRCh38_hmf \ + \ + --input samplesheet.csv \ + --outdir output/ +``` diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv deleted file mode 100644 index 84d41336..00000000 --- a/assets/samplesheet.csv +++ /dev/null @@ -1,12 +0,0 @@ -group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath -subject_one__to__dna,subject_one,sample_a,tumor,bam,dna,/Users/stephen/repos/oncoanalyser/subject_one/sample_a.bam - -subject_one__tn__dna,subject_one,sample_a,tumor,bam,dna,/Users/stephen/repos/oncoanalyser/subject_one/sample_a.bam -subject_one__tn__dna,subject_one,sample_b,normal,bam,dna,/Users/stephen/repos/oncoanalyser/subject_one/sample_b.bam - 
-subject_one__tn__dna_rna,subject_one,sample_a,tumor,bam,dna,/Users/stephen/repos/oncoanalyser/subject_one/sample_a.bam -subject_one__tn__dna_rna,subject_one,sample_b,normal,bam,dna,/Users/stephen/repos/oncoanalyser/subject_one/sample_b.bam -subject_one__tn__dna_rna,subject_one,sample_c,tumor,bam,rna,/Users/stephen/repos/oncoanalyser/subject_one/sample_c.bam - -subject_one__to__dna_rna,subject_one,sample_a,tumor,bam,dna,/Users/stephen/repos/oncoanalyser/subject_one/sample_a.bam -subject_one__to__dna_rna,subject_one,sample_c,tumor,bam,rna,/Users/stephen/repos/oncoanalyser/subject_one/sample_c.bam diff --git a/docs/images/COLO829_wgts.orange_report.summary_section.png b/docs/images/COLO829_wgts.orange_report.summary_section.png new file mode 100644 index 00000000..a5b9e6fa Binary files /dev/null and b/docs/images/COLO829_wgts.orange_report.summary_section.png differ diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png deleted file mode 100755 index 361d0e47..00000000 Binary files a/docs/images/mqc_fastqc_adapter.png and /dev/null differ diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png deleted file mode 100755 index cb39ebb8..00000000 Binary files a/docs/images/mqc_fastqc_counts.png and /dev/null differ diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png deleted file mode 100755 index a4b89bf5..00000000 Binary files a/docs/images/mqc_fastqc_quality.png and /dev/null differ diff --git a/docs/output.md b/docs/output.md index fd696f58..6f718daa 100644 --- a/docs/output.md +++ b/docs/output.md @@ -2,56 +2,499 @@ ## Introduction -This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. +This document describes the output produced by the pipeline. The directories listed below will be created in the results +directory after the pipeline has finished. 
All paths are relative to the top-level results directory. -The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. - - +```text +output/ +│   +├── subject_1/ +│   ├── alignments/ +│   ├── amber/ +│   ├── bamtools/ +│   ├── chord/ +│   ├── cobalt/ +│   ├── cuppa/ +│   ├── flagstats/ +│   ├── gridss/ +│   ├── gripss/ +│   ├── isofox/ +│   ├── lilac/ +│   ├── linx/ +│   ├── orange/ +│   ├── pave/ +│   ├── purple/ +│   ├── sage/ +│   ├── sigs/ +│   ├── virusbreakend/ +│   └── virusinterpreter/ +│   +├── subject_2/ +│   └── ... +│   +... +│   +└── pipeline_info/ +``` ## Pipeline overview -The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: +- [Simple DNA/RNA alignment](#simple-dnarna-alignment) + - [bwa-mem2](#bwa-mem2) - DNA alignment + - [STAR](#star) - RNA alignment +- [Alignment post-processing](#alignment-post-processing) + - [MarkDups](#markdups) - General alignment processing + - [Picard Markduplicates](#picard-markduplicates) - Duplicate read marking +- [SNV, MNV, INDEL calling](#snv-mnv-indel-calling) + - [SAGE](#sage) - SNV, MNV, INDEL calling + - [PAVE](#pave) - Small variant annotation (transcript/coding effects) +- [SV calling](#sv-calling) + - [SvPrep](#svprep) - Read filtering for SV calling + - [GRIDSS](#gridss) - SV calling + - [GRIPSS](#gripss) - SV filtering and post-processing +- [CNV calling](#cnv-calling) + - [AMBER](#amber) - β-allele frequencies + - [COBALT](#cobalt) - Read depth ratios + - [PURPLE](#purple) - Purity/ploid estimation, variant annotation +- [SV event interpretation](#sv-event-interpretation) + - [LINX](#linx) - SV event clustering and annotation +- [Transcript analysis](#transcript-analysis) + - [Isofox](#isofox) - transcript counts, novel splicing and fusion calling +- [Oncoviral detection](#oncoviral-detection) + - [VIRUSBreakend](#virusbreakend) - viral content and 
integration calling + - [Virus Interpreter](#virus-interpreter) - oncoviral calling post-processing +- [HLA calling](#hla-calling) + - [LILAC](#lilac) - HLA calling +- [HRD status prediction](#hrd-status-prediction) + - [CHORD](#chord) - HRD status prediction +- [Mutational signature fitting](#mutational-signature-fitting) + - [Sigs](#sigs) - Mutational signature fitting +- [Tissue of origin prediction](#tissue-of-origin-prediction) + - [CUPPA](#cuppa) - Tissue of origin prediction +- [Report generation](#report-generation) + - [ORANGE](#orange) - Key results summary + - [linxreport](#linxreport) - Interactive LINX report +- [Pipeline information](#pipeline-information) - Workflow execution metrics + +### Simple DNA/RNA alignment + +Alignment functionality in oncoanalyser is simple and rigid, and exists only to meet the exact requirements of the +hmftools. + +#### bwa-mem2 + +[bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) is a short-read mapping tool used to align reads to large reference +sequences. In oncoanalyser, bwa-mem2 is used to align DNA reads to the human genome. + +*No outputs are published directly from bwa-mem2, see [MarkDups](#markdups) for the fully processed alignment outputs* + +#### STAR + +[STAR](https://github.com/alexdobin/STAR) is a specialised mapping tool used to align RNA reads to a reference +transcriptome. + +*No outputs are published directly from STAR, see [Picard MarkDuplicates](#picard-markduplicates) for the fully processed alignment outputs* + +### Alignment post-processing + +#### MarkDups + +
+Output files + +- `/alignments/dna/` + - `.duplicate_freq.tsv`: Normal DNA sample read duplicate frequencies. + - `.markdups.bam`: Normal DNA sample output read alignments. + - `.markdups.bam.bai`: Normal DNA sample output read alignments index. + - `.duplicate_freq.tsv`: Tumor DNA sample read duplicate frequencies. + - `.markdups.bam`: Tumor DNA sample output read alignments. + - `.markdups.bam.bai`: Tumor DNA sample output read alignments index. + +
+ +[MarkDups](https://github.com/hartwigmedical/hmftools/tree/master/mark-dups) applies various alignment post-processing +routines such as duplicate marking and unmapping of problematic regions. It can also handle UMIs when configured to do +so. + +*MarkDups is only run on DNA alignments* + +#### Picard MarkDuplicates + +
+Output files + +- `/alignments/rna/` + - `.md.bam`: Tumor RNA sample read alignments. + - `.md.bam.bai`: Tumor RNA sample read alignments index. + - `.md.metrics`: Tumor RNA sample read duplicate marking metrics. + +
+ +[Picard MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard) is used to +mark duplicate reads following alignment. + +*Picard MarkDuplicates is only run on RNA alignments* + +### SNV, MNV, INDEL calling + +#### SAGE + +
+Output files + +- `/sage/append/` + - `.sage.append.vcf.gz`: Tumor DNA sample small variant VCF with RNA data appended. + - `.sage.append.vcf.gz`: Normal DNA sample small variant VCF with RNA data appended. + +- `/sage/somatic/` + - `.sage.bqr.png`: Normal DNA sample base quality recalibration metrics plot. + - `.sage.bqr.tsv`: Normal DNA sample base quality recalibration metrics. + - `.sage.bqr.png`: Tumor DNA sample base quality recalibration metrics plot. + - `.sage.bqr.tsv`: Tumor DNA sample base quality recalibration metrics. + - `.sage.exon.medians.tsv`: Tumor DNA sample exon median depths. + - `.sage.gene.coverage.tsv`: Tumor DNA sample gene coverages. + - `.sage.somatic.filtered.vcf.gz.tbi`: Tumor DNA sample filtered small variant calls index. + - `.sage.somatic.filtered.vcf.gz`: Tumor DNA sample filtered small variant calls. + - `.sage.somatic.vcf.gz.tbi`: Tumor DNA sample small variant calls index. + - `.sage.somatic.vcf.gz`: Tumor DNA sample small variant calls. + +- `/sage/germline/` + - `.sage.bqr.png`: Tumor DNA sample base quality recalibration metrics plot. + - `.sage.bqr.tsv`: Tumor DNA sample base quality recalibration metrics. + - `.sage.exon.medians.tsv`: Normal DNA sample exon median depths. + - `.sage.gene.coverage.tsv`: Normal DNA sample gene coverages. + - `.sage.bqr.png`: Normal DNA sample base quality recalibration metrics plot. + - `.sage.bqr.tsv`: Normal DNA sample base quality recalibration metrics. + - `.sage.germline.filtered.vcf.gz.tbi`: Normal DNA sample filtered small variant calls index. + - `.sage.germline.filtered.vcf.gz`: Normal DNA sample filtered small variant calls. + - `.sage.germline.vcf.gz.tbi`: Normal DNA sample small variant calls index. + - `.sage.germline.vcf.gz`: Normal DNA sample small variant calls. + +
+ +[SAGE](https://github.com/hartwigmedical/hmftools/tree/master/sage) is a SNV, MNV, and INDEL caller optimised for 100x +tumor and 40x normal. + +#### PAVE + +
+Output files + +- `/pave/` + - `.sage.germline.filtered.pave.vcf.gz.tbi`: Annotated SAGE germline small variants index. + - `.sage.germline.filtered.pave.vcf.gz`: Annotated SAGE germline small variants. + - `.sage.somatic.filtered.pave.vcf.gz.tbi`: Annotated SAGE somatic small variants index. + - `.sage.somatic.filtered.pave.vcf.gz`: Annotated SAGE somatic small variants. + +
+ +[PAVE](https://github.com/hartwigmedical/hmftools/tree/master/pave) annotates variants called by SAGE with impact +information with regard to transcript and coding effects. + +### SV calling + +#### SvPrep + +[SvPrep](https://github.com/hartwigmedical/hmftools/tree/master/sv-prep) runs prior to SV calling to reduce runtime by +rapidly identifying reads that are likely to be involved in an SV event. + +*No outputs are published directly from SvPrep, see [GRIPSS](#gripss) for the fully processed SV calling outputs* + +#### GRIDSS + +
+Output files + +- `/gridss/` + - `.gridss.vcf.gz`: GRIDSS structural variants. + - `.gridss.vcf.gz.tbi`: GRIDSS structural variants index. + +
+ +[GRIDSS](https://github.com/PapenfussLab/gridss) is an SV caller that uses both read support and local +breakend/breakpoint assemblies to call variants. + +#### GRIPSS + +
+Output files + +- `/gripss/germline/` + - `.gripss.filtered.germline.vcf.gz`: Filtered GRIDSS germline structural variants. + - `.gripss.filtered.germline.vcf.gz.tbi`: Filtered GRIDSS germline structural variants index. + - `.gripss.germline.vcf.gz`: GRIDSS structural variants (GRIPSS filters set but not applied). + - `.gripss.germline.vcf.gz.tbi`: GRIDSS structural variants index (GRIPSS filters set but not applied). + +- `/gripss/somatic/` + - `.gripss.filtered.somatic.vcf.gz`: Filtered GRIDSS somatic structural variants. + - `.gripss.filtered.somatic.vcf.gz.tbi`: Filtered GRIDSS somatic structural variants index. + - `.gripss.somatic.vcf.gz`: GRIDSS structural variants (GRIPSS filters set but not applied). + - `.gripss.somatic.vcf.gz.tbi`: GRIDSS structural variants index (GRIPSS filters set but not applied). + +
+ +[GRIPSS](https://github.com/hartwigmedical/hmftools/tree/master/gripss) applies filtering and post-processing to SV calls. + +### CNV calling + +#### AMBER + +
+Output files + +- `/amber/` + - `amber.version`: AMBER version file. + - `.amber.baf.pcf`: Tumor DNA sample piecewise constant fit. + - `.amber.baf.tsv.gz`: Tumor DNA sample β-allele frequencies. + - `.amber.contamination.tsv`: Tumor DNA sample contamination TSV. + - `.amber.contamination.vcf.gz`: Tumor DNA sample contamination sites. + - `.amber.contamination.vcf.gz.tbi`: Tumor DNA sample contamination sites index. + - `.amber.qc`: AMBER QC file. + - `.amber.homozygousregion.tsv`: Normal DNA sample regions of homozygosity. + +
+ +[AMBER](https://github.com/hartwigmedical/hmftools/tree/master/amber) generates β-allele frequencies in tumor samples +for CNV calling in PURPLE. + +#### COBALT + +
+Output files + +- `/cobalt/` + - `cobalt.version`: COBALT version file. + - `.cobalt.gc.median.tsv`: Tumor DNA sample GC median read depths. + - `.cobalt.ratio.pcf`: Tumor DNA sample piecewise constant fit. + - `.cobalt.ratio.tsv.gz`: Tumor DNA sample read counts and ratios (with reference or supposed diploid + regions). + - `.cobalt.gc.median.tsv`: Normal DNA sample GC median read depths. + - `.cobalt.ratio.median.tsv`: Normal DNA sample chromosome median ratios. + - `.cobalt.ratio.pcf`: Normal DNA sample piecewise constant fit. + +
-- [FastQC](#fastqc) - Raw read QC -- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline -- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution +[COBALT](https://github.com/hartwigmedical/hmftools/tree/master/cobalt) generates read depth ratios (or an estimation +for tumor-only) for CNV calling in PURPLE. -### FastQC +#### PURPLE
Output files -- `fastqc/` - - `*_fastqc.html`: FastQC report containing quality metrics. - - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images. +- `/purple/` + - `circos/`: Circos plot data. + - `.purple.cnv.gene.tsv`: Somatic gene copy number. + - `.purple.cnv.somatic.tsv`: Copy number variant segments. + - `.purple.driver.catalog.germline.tsv`: Normal DNA sample driver catalogue. + - `.purple.driver.catalog.somatic.tsv`: Tumor DNA sample driver catalogue. + - `.purple.germline.deletion.tsv`: Normal DNA deletions. + - `.purple.germline.vcf.gz`: Normal DNA SAGE small variants with PURPLE annotations. + - `.purple.germline.vcf.gz.tbi`: Normal DNA SAGE small variants with PURPLE annotations index. + - `.purple.purity.range.tsv`: Purity/ploidy model fit scores across a range of purity values. + - `.purple.purity.tsv`: Purity/ploidy summary. + - `.purple.qc`: PURPLE QC file. + - `.purple.segment.tsv`: Genomic copy number segments. + - `.purple.somatic.clonality.tsv`: Clonality peak model data. + - `.purple.somatic.hist.tsv`: Somatic variants histogram data. + - `.purple.somatic.vcf.gz`: Tumor DNA sample small variants with PURPLE annotations. + - `.purple.somatic.vcf.gz.tbi`: Tumor DNA sample small variants with PURPLE annotations index. + - `.purple.sv.germline.vcf.gz`: Germline structural variants with PURPLE annotations. + - `.purple.sv.germline.vcf.gz.tbi`: Germline structural variants with PURPLE annotations index. + - `.purple.sv.vcf.gz`: Somatic structural variants with PURPLE annotations. + - `.purple.sv.vcf.gz.tbi`: Somatic structural variants with PURPLE annotations index. + - `plot/`: PURPLE plots. + - `purple.version`: PURPLE version file.
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). +[PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) is a CNV caller that also infers tumor +purity/ploidy and annotates both small and structural variant calls with copy-number information. -![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png) +### SV event interpretation -![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png) +#### LINX + +
+Output files + +- `/linx/germline_annotations/` + - `linx.version`: LINX version file. + - `.linx.germline.breakend.tsv`: Normal DNA sample breakend data. + - `.linx.germline.clusters.tsv`: Normal DNA sample clustered events. + - `.linx.germline.disruption.tsv`: Normal DNA sample breakend data. + - `.linx.germline.driver.catalog.tsv`: Normal DNA sample driver catalogue. + - `.linx.germline.links.tsv`: Normal DNA sample cluster links. + - `.linx.germline.svs.tsv`: Normal DNA sample structural variants. + +- `/linx/somatic_annotations/` + - `linx.version`: LINX version file. + - `.linx.breakend.tsv`: Tumor DNA sample breakend data. + - `.linx.clusters.tsv`: Tumor DNA sample clustered events. + - `.linx.driver.catalog.tsv`: Tumor DNA sample driver catalogue. + - `.linx.drivers.tsv`: Tumor DNA sample LINX driver drivers. + - `.linx.fusion.tsv`: Tumor DNA sample fusions. + - `.linx.links.tsv`: Tumor DNA sample cluster links. + - `.linx.svs.tsv`: Tumor DNA sample structural variants. + - `.linx.vis_*`: Tumor DNA sample visualisation data. + +- `/linx/somatic_plots/` + - `all/*png`: All available tumor DNA sample cluster plots. + - `reportable/*png`: Driver-only tumor DNA sample cluster plots. + +
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png) +[LINX](https://github.com/hartwigmedical/hmftools/tree/master/linx) clusters PURPLE-annotated SVs into high-order events +and classifies these events within a biological context. Following clustering and interpretation, events are visualised +as LINX plots. -> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. +### Transcript analysis -### MultiQC +#### Isofox
Output files -- `multiqc/` - - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser. - - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline. - - `multiqc_plots/`: directory containing static images from the report in various formats. +- `/isofox/` + - `.isf.alt_splice_junc.csv`: Tumor RNA sample alternative splice junctions. + - `.isf.fusions.csv`: Tumor RNA sample fusions, unfiltered. + - `.isf.gene_collection.csv`: Tumor RNA sample gene-collection fragment counts. + - `.isf.gene_data.csv`: Tumor RNA sample gene fragment counts. + - `.isf.pass_fusions.csv`: Tumor RNA sample fusions, filtered. + - `.isf.retained_intron.csv`: Tumor RNA sample retained introns. + - `.isf.summary.csv`: Tumor RNA sample analysis summary file. + - `.isf.transcript_data.csv`: Tumor RNA sample transcript fragment counts.
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. +[Isofox](https://github.com/hartwigmedical/hmftools/tree/master/isofox) analyses RNA alignment data to quantify +transcripts, identify novel splice junctions, and call fusions. + +### Oncoviral detection + +#### VIRUSBreakend + +
+Output files + +- `/virusbreakend/` + - `.virusbreakend.vcf`: Tumor DNA sample viral integration sites. + - `.virusbreakend.vcf.summary.tsv`: Tumor DNA sample analysis summary file. + +
+ +[VIRUSBreakend](VIRUSBreakend) detects the presence of oncoviruses and integration sites in tumor samples. + +#### Virus Interpreter -Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see . +
+Output files + +- `/virusinterpreter/` + - `.virus.annotated.tsv`: Processed oncoviral call/annotation data. + +
+ +[Virus Interpreter](https://github.com/hartwigmedical/hmftools/tree/master/virus-interpreter) post-processes +VIRUSBreakend calls to provide a higher-level interpretation of the data. + +### HLA calling + +#### LILAC + +
+Output files + +- `/lilac/` + - `.lilac.candidates.coverage.tsv`: Coverage of high-scoring candidates. + - `.lilac.qc.tsv`: LILAC QC file. + - `.lilac.tsv`: Analysis summary. + +
+ +[LILAC](https://github.com/hartwigmedical/hmftools/tree/master/lilac) calls HLA Class I and characterises allelic status +(copy-number alterations, somatic mutations) in the tumor sample. Analysis can also incorporate RNA data as an +indirect measurement of allele expression. + +### HRD status prediction + +#### CHORD + +
+Output files + +- `/chord/` + - `_chord_prediction.txt`: Tumor DNA sample analysis summary file. + - `_chord_signatures.txt`: Tumor DNA sample variant counts contributing to signatures. + +
+ +[CHORD](https://github.com/UMCUGenetics/CHORD) predicts the HRD status of a tumor using statistical inference on the +basis of relative somatic mutation counts. + +### Mutational signature fitting + +#### Sigs + +
+Output files + +- `/sigs/` + - `.sig.allocation.tsv`: Tumor DNA sample signature allocations. + - `.sig.snv_counts.csv`: Tumor DNA sample variant counts contributing to signatures. + +
+ +[Sigs](https://github.com/hartwigmedical/hmftools/tree/master/sigs) fits defined COSMIC trinucleotide mutational +signatures to tumor sample data. + +### Tissue of origin prediction + +#### CUPPA + +
+Output files + +- `/cuppa/` + - `_cup_report.pdf`: Combined figure of summary and feature plot. + - `.cup.data.csv`: Model feature scores. + - `.cup.report.features.png`: Feature plot. + - `.cup.report.summary.png`: Summary plot. + - `.cuppa.chart.png`: CUPPA chart plot. + - `.cuppa.conclusion.txt`: Prediction conclusion file. + +
+ +[CUPPA](https://github.com/hartwigmedical/hmftools/tree/master/cuppa) predicts tissue of origin for a given tumor sample +using DNA and/or RNA features generated by upstream hmftools components. + +### Report generation + +#### ORANGE + +
+Output files + +- `/orange/` + - `.orange.json`: Aggregated report data. + - `.orange.pdf`: Static report PDF. + +
+ +[ORANGE](https://github.com/hartwigmedical/hmftools/tree/master/orange) summarises and integrates key results from +hmftools components into a single static PDF report. + +#### linxreport + +
+Output files + +- `/linx/` + - `_linx.html`: Interactive HTML report. + +
+ +[linxreport](https://github.com/umccr/linxreport) generates an interactive report containing LINX annotations and plots. ### Pipeline information @@ -64,5 +507,3 @@ Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQ - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`. - -[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. diff --git a/docs/usage.md b/docs/usage.md index 8d291b27..3d5dbe10 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -6,62 +6,141 @@ ## Introduction - +The oncoanalyser pipeline typically runs from FASTQs or BAMs and supports two modes: (1) whole genome and/or +transcriptome, and (2) targeted panel. Launching an analysis requires only the creation of a samplesheet that describes +details of each input such as the sample type (tumor or normal), sequence type (DNA or RNA), and filepath. -## Samplesheet input +Various aspects of an oncoanalyser analysis can be configured to fit a range of needs, and many of these are considered +[advanced usage](#advanced-usage) of the pipeline. The most useful include: -You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. +- precise process selection +- starting from existing data +- granular control over resource data -```bash ---input '[path to samplesheet file]' -``` +These features enable oncoanalyser to be run in a highly flexible way. 
For example, an analysis can be run with existing +PURPLE data as the starting point and skip variant calling processes. Additionally, resource/reference data can be staged +locally to optimise execution or modified to create user-defined driver gene panels. -### Multiple runs of the same sample +> [!WARNING] +> There are important requirements when using BAMs as input instead of FASTQs: +> - STAR must have been run with [specific +> parameters](https://github.com/hartwigmedical/hmftools/tree/master/isofox#a-note-on-alignment-and-multi-mapping), +> this is critical for WTS data, and +> - reads are expected to have been aligned to one of the Hartwig-distributed reference genomes (user-defined genomes may be used but are not recommended) -The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes: -```console -sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz -CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz +## Supported analyses + +A variety of analyses are accessible in oncoanalyser and are implicitly run according to the data available in the +samplesheet. The supported analysis types for each workflow are listed below. + +| Input sequence data | WGS/WTS workflow | Targeted sequencing workflow\* | +| --- | :-: | :-: | +| • Tumor/normal DNA
• Tumor RNA | :white_check_mark: | - | +| • Tumor only DNA
• Tumor RNA | :white_check_mark: | :white_check_mark: | +| • Tumor/normal DNA | :white_check_mark: | - | +| • Tumor only DNA | :white_check_mark: | :white_check_mark: | +| • Tumor only RNA | :white_check_mark: | - | + +\* Supported analyses relate to the TSO500 panel only + +## Samplesheet + +A samplesheet in CSV format that contains information about each input is needed to run oncoanalyser. The required input +details and columns are [described below](#column-descriptions). + +The oncoanalyser pipeline also recognises several input filetypes, including intermediate output files generated during +execution such as the PURPLE output directory. The full list of recognised input filetypes is available +[here](https://github.com/nf-core/oncoanalyser/blob/v0.3.1/lib/Constants.groovy#L56-L86). + +### Simple example + +#### FASTQ + +> [!NOTE] +> Currently only non-interleaved paired-end reads are accepted as FASTQ input + +```csv title="samplesheet.csv" +group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath +P1__wgts,P1,SA,normal,dna,fastq,library_id:SA_library;lane:001,/path/to/P1.SA.normal.dna.wgs.001.R1.fastq.gz;/path/to/P1.SA.normal.dna.wgs.001.R2.fastq.gz +P1__wgts,P1,SB,tumor,dna,fastq,library_id:SB_library;lane:001,/path/to/P1.SB.tumor.dna.wgs.001.R1.fastq.gz;/path/to/P1.SB.tumor.dna.wgs.001.R2.fastq.gz +P1__wgts,P1,SC,tumor,rna,fastq,library_id:SC_library;lane:001,/path/to/P1.SC.tumor.rna.wts.001.R1.fastq.gz;/path/to/P1.SC.tumor.rna.wts.001.R2.fastq.gz ``` -### Full samplesheet +#### BAM -The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. +> [!NOTE] +> Inputs with the `bam` filetype will be processed by MarkDups as required by hmftools.
Where an input BAM has already +> been processed by MarkDups, you can avoid needless reprocessing by setting `bam_markdups` as the filetype instead. +> +> Please note there are important requirements around the use of BAMs, see the warning above in the +> [Introduction](#introduction). + +```csv title="samplesheet.csv" +group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath +P1__wgts,P1,SA,normal,dna,bam,/path/to/P1.SA.normal.dna.wgs.bam +P1__wgts,P1,SB,tumor,dna,bam,/path/to/P1.SB.tumor.dna.wgs.bam +P1__wgts,P1,SC,tumor,rna,bam,/path/to/P1.SC.tumor.rna.wts.bam +``` -A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice. +### Multiple lanes -```console -sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz -CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz -TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, -TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, +```csv title="samplesheet.csv" +group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath +P1__wgts,P1,SA,normal,dna,fastq,library_id:SA_library;lane:001,/path/to/P1.SA.normal.dna.wgs.001.R1.fastq.gz;/path/to/P1.SA.normal.dna.wgs.001.R2.fastq.gz +P1__wgts,P1,SA,normal,dna,fastq,library_id:SA_library;lane:002,/path/to/P1.SA.normal.dna.wgs.002.R1.fastq.gz;/path/to/P1.SA.normal.dna.wgs.002.R2.fastq.gz +P1__wgts,P1,SB,tumor,dna,fastq,library_id:SB_library;lane:001,/path/to/P1.SB.tumor.dna.wgs.001.R1.fastq.gz;/path/to/P1.SB.tumor.dna.wgs.001.R2.fastq.gz 
+P1__wgts,P1,SB,tumor,dna,fastq,library_id:SB_library;lane:002,/path/to/P1.SB.tumor.dna.wgs.002.R1.fastq.gz;/path/to/P1.SB.tumor.dna.wgs.002.R2.fastq.gz +P1__wgts,P1,SC,tumor,rna,fastq,library_id:SC_library;lane:001,/path/to/P1.SC.tumor.rna.wts.001.R1.fastq.gz;/path/to/P1.SC.tumor.rna.wts.001.R2.fastq.gz +``` + +### Multiple patients + +```csv title="samplesheet.csv" +group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath +P1__wgts,P1,SA,normal,dna,fastq,library_id:SA_library;lane:001,/path/to/P1.SA.normal.dna.wgs.001.R1.fastq.gz;/path/to/P1.SA.normal.dna.wgs.001.R2.fastq.gz +P1__wgts,P1,SB,tumor,dna,fastq,library_id:SB_library;lane:001,/path/to/P1.SB.tumor.dna.wgs.001.R1.fastq.gz;/path/to/P1.SB.tumor.dna.wgs.001.R2.fastq.gz +P2__wgts,P2,SA,normal,dna,fastq,library_id:SA_library;lane:001,/path/to/P2.SA.normal.dna.wgs.001.R1.fastq.gz;/path/to/P2.SA.normal.dna.wgs.001.R2.fastq.gz +P2__wgts,P2,SB,tumor,dna,fastq,library_id:SB_library;lane:001,/path/to/P2.SB.tumor.dna.wgs.001.R1.fastq.gz;/path/to/P2.SB.tumor.dna.wgs.001.R2.fastq.gz ``` -| Column | Description | -| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | -| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | -| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | +### Column descriptions -An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. 
+| Column | Description | +| --- | --- | +| group_id | Group ID for a set of samples and inputs | +| subject_id | Subject/patient ID | +| sample_id | Sample ID | +| sample_type | Sample type: `tumor`, `normal` | +| sequence_type | Sequence type: `dna`, `rna` | +| filetype | File type: e.g. `fastq`, `bam`, `bai` | +| filepath | Absolute filepath to input file (can be local filepath, URL, S3 URI) | + +The identifiers provided in the samplesheet are used to set output file paths: + +* `group_id`: top-level output directory for analysis files e.g. `output/COLO829_example/` +* tumor `sample_id`: output prefix for most filenames e.g. `COLO829T.purple.sv.vcf.gz` +* normal `sample_id`: output prefix for some filenames e.g. `COLO829R.cobalt.ratio.pcf` ## Running the pipeline The typical command for running the pipeline is as follows: ```bash -nextflow run nf-core/oncoanalyser --input samplesheet.csv --outdir --genome GRCh37 -profile docker +nextflow run nf-core/oncoanalyser \ + -profile docker \ + --mode \ + --genome \ + --input samplesheet.csv \ + --outdir ``` This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. +> [!NOTE] +> When oncoanalyser is run, it will retrieve all reference data it requires to perform the requested analysis. When +> running oncoanalyser more than once, it is strongly recommended to pre-stage reference data locally to avoid it being +> retrieved multiple times by oncoanalyser. See [Staging reference data](#staging-reference-data). + Note that the pipeline will create the following files in your working directory: ```bash @@ -87,6 +166,125 @@ First, go to the [nf-core/oncoanalyser releases page](https://github.com/nf-core This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. 
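The column rules described above can be checked before launching a run. The following is a minimal sketch only: `check_samplesheet` is a hypothetical helper, not part of oncoanalyser, and it assumes a plain comma-separated file with a header row and no quoted fields. It checks that the documented columns are present and that `sample_type` and `sequence_type` hold one of the documented values; it does not validate filetypes or file paths.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (assumption: not provided by oncoanalyser itself).
# Returns non-zero if a documented column is missing or if sample_type /
# sequence_type contain a value other than those listed in the column descriptions.
check_samplesheet() {
  awk -F',' '
    BEGIN { bad = 0 }
    NR == 1 {
      # Map each header name to its column index so optional columns (e.g. info) are tolerated
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split("group_id subject_id sample_id sample_type sequence_type filetype filepath", req, " ")
      for (j = 1; j <= n; j++) {
        if (!(req[j] in col)) { printf "missing column: %s\n", req[j]; bad = 1 }
      }
      if (bad) exit 1
      next
    }
    # Ignore completely empty lines
    NF == 0 { next }
    $(col["sample_type"]) !~ /^(tumor|normal)$/ {
      printf "line %d: unexpected sample_type: %s\n", NR, $(col["sample_type"]); bad = 1
    }
    $(col["sequence_type"]) !~ /^(dna|rna)$/ {
      printf "line %d: unexpected sequence_type: %s\n", NR, $(col["sequence_type"]); bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

Once the function is defined, `check_samplesheet samplesheet.csv` prints one message per problem found and returns a non-zero exit status, so it can gate the `nextflow run` command in a wrapper script.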
+## Advanced usage + +### Selecting processes + +Most of the major components in oncoanalyser can be skipped using `--processes_exclude` (the full list of available +processes can be viewed [here](https://github.com/nf-core/oncoanalyser/blob/v0.3.1/lib/Constants.groovy#L36-L54)). +Multiple processes can be given as a comma-separated list. While there are some use-cases for this feature (e.g. skipping +resource-intensive processes such as VIRUSBreakend), it becomes more powerful when combined with existing inputs as +described in the following section. + +> [!WARNING] +> When skipping components, no checks are done to identify orphan processes in the execution DAG or for redundant +> processes. + +### Existing inputs + +The oncoanalyser pipeline has been designed to allow entry at arbitrary points and is particularly useful in +situations where previous outputs exist and re-running oncoanalyser is desired (e.g. to subsequently execute an +optional sensor or use an upgraded component such as PURPLE). The primary advantage of this approach is that only the +required processes are executed, which can greatly reduce runtimes by skipping unnecessary processes. + +In order to effectively utilise this feature, existing inputs must be set in the [samplesheet](#samplesheet) and the +appropriate [processes selected](#selecting-processes). Take the below example where existing PURPLE inputs are used so +that all upstream variant calling can be skipped: + +```csv title='samplesheet.existing_purple.csv' +group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath +P1__wgts,P1,SA,normal,dna,bam,/path/to/P1.SA.normal.dna.wgs.bam +P1__wgts,P1,SB,tumor,dna,bam,/path/to/P1.SB.tumor.dna.wgs.bam +P1__wgts,P1,SB,tumor,dna,purple_dir,/path/to/P1.purple_dir/ +``` + +> [!NOTE] +> The original source input file (i.e. BAM or FASTQ) must always be provided for oncoanalyser to infer the correct +> analysis type.
+ +Then run oncoanalyser with the variant calling processes excluded: + +```bash +nextflow run nf-core/oncoanalyser \ + -profile docker \ + --mode wgts \ + --processes_exclude amber,cobalt,gridss,gripss,sage,pave \ + --genome GRCh38_hmf \ + --input samplesheet.existing_purple.csv \ + --outdir output/ +``` + +> [!WARNING] +> Providing existing inputs will cause oncoanalyser to skip the corresponding process but *not any* of the upstream +> processes. + +### Configuring reference data + +All reference data can be configured as needed. These are defined in various locations: + +| Reference data | Filepath | Note | +| --- | --- | --- | +| hmftools resource files | `conf/hmf_data.config` | Paths relative to data bundle directory | +| panel resource files | `conf/panel_data.config` | Paths relative to data bundle directory | +| Genomes and indexes | `conf/hmf_genomes.config` | Absolute paths | + +To override hmftools resource files (e.g. driver gene panel), [stage the bundle](#staging-reference-data) locally then +copy in the desired file(s) and update `conf/hmf_data.config` accordingly. The local custom bundle must be provided to +oncoanalyser with the `--ref_data_hmf_data_path` CLI option. The same approach is followed for customising panel +resource files, configuring `conf/panel_data.config` and supplying the bundle with `--ref_data_panel_data_path` instead. + +The path or URI to the VIRUSBreakend database can also be explicitly set with `--ref_data_virusbreakenddb_path`. +Configuring custom genomes uses a different approach to align with the existing concepts in nf-core. + +#### Custom genomes + +It is strongly recommended to use the Hartwig-distributed reference genomes for alignments +([GRCh37](https://console.cloud.google.com/storage/browser/hmf-public/HMFtools-Resources/ref_genome/37) or +[GRCh38](https://console.cloud.google.com/storage/browser/hmf-public/HMFtools-Resources/ref_genome/38)).
If there is no +other option than to use a custom genome, one can be configured with the following process: + +```text title='genome.custom.config' +params { + genomes { + CustomGenome { + fasta = "/path/to/CustomGenome/custom_genome.fa" + fai = "/path/to/CustomGenome/samtools_index/1.16/custom_genome.fa.fai" + dict = "/path/to/CustomGenome/samtools_index/1.16/custom_genome.fa.dict" + bwa_index = "/path/to/CustomGenome/bwa_index/0.7.17-r1188/" + bwa_index_bseq = "/path/to/CustomGenome/bwa_index/2.2.1/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.0123" + bwa_index_biidx = "/path/to/CustomGenome/bwa_index/2.2.1/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bwt.2bit.64" + bwa_index_image = "/path/to/CustomGenome/bwa_index_image/0.7.17-r1188/custom_genome.fa.img" + gridss_index = "/path/to/CustomGenome/gridss_index/2.13.2/custom_genome.fa.gridsscache" + star_index = "/path/to/CustomGenome/star_index/gencode_38/2.7.3a/" + } + } +} +``` + +Run oncoanalyser with a custom genome using the above configuration and the command below: + +```bash +nextflow run nf-core/oncoanalyser \ + -profile docker \ + -config genome.custom.config \ + --mode wgts \ + \ + --genome CustomGenome \ + --genome_version <37|38> \ + --genome_type \ + --force_genome \ + \ + --input samplesheet.csv \ + --outdir output/ +``` + +> [!WARNING] +> RNA alignment with STAR must use an index generated from a matching Ensembl release version (GRCh37: v74; GRCh38: +> v104). + +#### Staging reference data + +Please refer to [REFERENCE_DATA.md](https://github.com/nf-core/oncoanalyser/REFERENCE_DATA.md). + ## Core Nextflow arguments > **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
diff --git a/examples/README.md b/examples/README.md deleted file mode 100644 index 2d88ffdf..00000000 --- a/examples/README.md +++ /dev/null @@ -1,109 +0,0 @@ -# Examples - -## Full - -| Filetype | Keyword | Description | Type | -| --- | --- | --- | --- | -| BAM (WGS) | `bam_wgs` | WGS read alignments | Required | -| BAM (WTS) | `bam_wts` | WTS read alignments | Optional | -| SV VCF | `vcf` | SV VCF produced by an external caller [_used to filter reads for GRIDSS_] | Optional | - -```text -id subject_name sample_name sample_type filetype filepath -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor bam_wgs /path/to/tumor_bam/sample_one_tumor.bam -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor bam_wts /path/to/tumor_bam/sample_one_tumor.rna.bam -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor vcf_sv /path/to/tumor_sv_vcf/sample_one_tumor.vcf.gz -SONE-1 SUBJECT_ONE SAMPLE_ONE_NORMAL normal bam_wgs /path/to/normal_bam/sample_one_normal.bam -STWO-1 SUBJECT_TWO SAMPLE_TWO_TUMOR-1 tumor bam_wgs /path/to/tumor_bam/sample_two_tumor_one.bam -STWO-1 SUBJECT_TWO SAMPLE_TWO_NORMAL normal bam_wgs /path/to/normal_bam/sample_two_normal.bam -STWO-2 SUBJECT_TWO SAMPLE_TWO_TUMOR-2 tumor bam_wgs /path/to/tumor_bam/sample_two_tumor_two.bam -STWO-2 SUBJECT_TWO SAMPLE_TWO_NORMAL normal bam_wgs /path/to/normal_bam/sample_two_normal.bam -STRHEE-1 SUBJECT_THREE SAMPLE_THREE_TUMOR tumor bam_wgs /path/to/tumor_bam/sample_three_tumor.bam -STRHEE-1 SUBJECT_THREE SAMPLE_THREE_TUMOR tumor vcf_sv /path/to/tumor_sv_vcf/sample_three_tumor.vcf.gz -STRHEE-1 SUBJECT_THREE SAMPLE_THREE_NORMAL normal bam_wgs /path/to/normal_bam/sample_three_normal.bam -``` - -## GRIDSS - -See [Full section](#full) - -## PURPLE - -| Filetype | Keyword | Description | Type | -| --- | --- | --- | --- | -| AMBER directory | `amber_dir` | AMBER output directory | Required | -| COBALT directory | `cobalt_dir` | COBALT output directory | Required | -| GRIPSS SV VCF (hard filtered) | `vcf_sv_gripss_hard` | Hard filtered GRIPSS SV VCF | Required | 
-| GRIPSS SV VCF (soft filtered) | `vcf_sv_gripss_soft` | Soft filtered GRIPSS SV VCF | Required | -| SNV/MNV and INDEL VCF | `vcf_smlv` | Small SNV/MNV VCF | Optional | - -```text -id subject_name sample_name sample_type filetype filepath -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor amber_dir /path/to/amber_dir/ -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor cobalt_dir /path/to/cobalt_dir/ -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor vcf_sv_gripss_hard /path/to/tumor_gripss_hard_sv/sample_one_tumor.vcf.gz -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor vcf_sv_gripss_soft /path/to/tumor_gripss_soft_sv/sample_one_tumor.vcf.gz -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor vcf_smlv /path/to/tumor_smlv_vcf/sample_one_tumor.vcf.gz -SONE-1 SUBJECT_ONE SAMPLE_ONE_NORMAL normal vcf_smlv /path/to/normal_smlv_vcf/sample_one_normal.vcf.gz -``` - -## LINX - -| Filetype | Keyword | Description | Type | -| --- | --- | --- | --- | -| PURPLE directory | `purple_dir` | PURPLE output directory [_LINX somatic_] | Required | -| GRIPSS SV VCF (hard filtered) | `vcf_sv_gripss_hard` | Hard filtered GRIPSS SV VCF [_LINX germline_] | Required | - -```text -id subject_name sample_name sample_type filetype filepath -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor purple_dir /path/to/purple_dir/ -SONE-1 SUBJECT_ONE SAMPLE_ONE_NORMAL normal vcf_sv_gripss_hard /path/to/normal_gripss_hard_sv/sample_one_normal.vcf.gz -``` - -## GRIDSS-PURPLE-LINX - -| Filetype | Keyword | Description | Type | -| --- | --- | --- | --- | -| BAM (WGS) | `bam` | WGS read alignments | Required | -| SV VCF | `vcf_sv` | SV VCF produced by an external caller [_used to filter reads for GRIDSS_] | Optional | -| SNV/MNV and INDEL VCF | `vcf_smlv` | Small SNV/MNV VCF | Optional | - -```text -id subject_name sample_name sample_type filetype filepath -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor bam_wgs /path/to/tumor_bam/sample_one_tumor.bam -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor vcf_sv /path/to/tumor_sv_vcf/sample_one_tumor.vcf.gz -SONE-1 
SUBJECT_ONE SAMPLE_ONE_NORMAL normal bam_wgs /path/to/normal_bam/sample_one_normal.bam -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor vcf_smlv /path/to/tumor_smlv_vcf/sample_one_tumor.vcf.gz -SONE-1 SUBJECT_ONE SAMPLE_ONE_NORMAL normal vcf_smlv /path/to/normal_smlv_vcf/sample_one_normal.vcf.gz -``` - - -## LILAC - -| Filetype | Keyword | Description | Type | -| --- | --- | --- | --- | -| BAM (WGS) | `bam_wgs` | WGS read alignments | Required | -| PURPLE directory | `purple_dir` | PURPLE output directory | Required | - -```text -id subject_name sample_name sample_type filetype filepath -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor bam_wgs /path/to/tumor_bam/sample_one_tumor.bam -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor purple_dir /path/to/purple_dir/ -SONE-1 SUBJECT_ONE SAMPLE_ONE_NORMAL normal bam_wgs /path/to/normal_bam/sample_one_normal.bam -``` - -## TEAL - -| Filetype | Keyword | Description | Type | -| --- | --- | --- | --- | -| BAM (WGS) | `bam_wgs` | WGS read alignments | Required | -| COBALT directory | `cobalt_dir` | COBALT output directory | Required | -| PURPLE directory | `purple_dir` | PURPLE output directory | Required | - -```text -id subject_name sample_name sample_type filetype filepath -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor bam_wgs /path/to/tumor_bam/sample_one_tumor.bam -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor purple_dir /path/to/purple_dir/ -SONE-1 SUBJECT_ONE SAMPLE_ONE_TUMOR tumor cobalt_dir /path/to/cobalt_dir/ -SONE-1 SUBJECT_ONE SAMPLE_ONE_NORMAL normal bam_wgs /path/to/normal_bam/sample_one_normal.bam -```