Add support for generating taxprofiler/funcscan input samplesheets for preprocessed FASTQs/FASTAs #688

Draft · wants to merge 19 commits into base: dev
7 changes: 5 additions & 2 deletions docs/output.md
@@ -26,6 +26,9 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

Note that when specifying the parameter `--coassemble_group`, the corresponding output filenames/directories of the assembly or downstream processes will use the group ID (more precisely, the term `group-[group_id]`) instead of the sample ID.

The pipeline can also generate input samplesheets for downstream pipelines.
These are stored in `<outdir>/downstream_samplesheets`.

## Quality control

These steps trim away the adapter sequences present in input reads, trim away low-quality bases, and discard reads that are too short.
@@ -766,8 +769,8 @@ The pipeline can also generate input files for the following downstream pipeline
<summary>Output files</summary>

- `downstream_samplesheets/`
- `funcscan.csv`: Filled out nf-core/funcscan `--input` csv with absolute paths to the assembled contig FASTA files produced by nf-core/mag (MEGAHIT, SPAdes, SPAdesHybrid)
- `taxprofiler.csv`: Partially filled out nf-core/taxprofiler csv with paths to preprocessed reads (adapter trimmed, host removed etc.) `.fastq.gz`
- `taxprofiler.csv`: Partially filled out nf-core/taxprofiler `--input` csv with paths to the preprocessed reads (adapter trimmed, host removed, etc.) in `.fastq.gz` format, i.e. the direct input into MEGAHIT, SPAdes, and SPAdesHybrid.
- `funcscan.csv`: Filled out nf-core/funcscan `--input` csv with absolute paths to the assembled contig FASTA files produced by nf-core/mag (i.e. the direct output from MEGAHIT, SPAdes, and SPAdesHybrid, not bins). Illustrative examples of both files are shown below.

</details>
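For illustration only, the generated files might look roughly as follows. The column names are assumptions based on the documented `--input` formats of nf-core/funcscan and nf-core/taxprofiler, and the paths are placeholders rather than real nf-core/mag output paths:

```csv
# funcscan.csv (illustrative)
sample,fasta
sample1,/path/to/outdir/Assembly/MEGAHIT/MEGAHIT-sample1.contigs.fa.gz

# taxprofiler.csv (illustrative; columns such as run_accession/instrument_platform may need completing by the user)
sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
sample1,,,/path/to/outdir/sample1_trimmed_1.fastq.gz,/path/to/outdir/sample1_trimmed_2.fastq.gz,
```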

5 changes: 3 additions & 2 deletions nextflow_schema.json
@@ -96,9 +96,10 @@
},
"generate_pipeline_samplesheets": {
"type": "string",
"default": "funcscan,taxprofiler",
"description": "Specify which pipeline to generate a samplesheet for.",
"fa_icon": "fas fa-toolbox"
"help": "Note that the nf-core/funcscan samplesheet will only include paths to raw assemblies, not bins\n\nThe nf-core/taxprofiler samplesheet will include of paths the pre-processed reads that are used are used as input for _de novo_ assembly.",
"fa_icon": "fas fa-toolbox",
"pattern": "^(taxprofiler|funcscan)(?:,(taxprofiler|funcscan)){0,1}"
}
}
},
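As a usage sketch, with the parameter names as used in this PR and purely illustrative values, generating both samplesheets would amount to something like:

```groovy
// Illustrative only: enable downstream samplesheet generation for both pipelines.
// generate_downstream_samplesheets acts as the on/off switch validated during
// pipeline initialisation; generate_pipeline_samplesheets must match the schema
// pattern above (one pipeline name, or both comma-separated).
params {
    generate_downstream_samplesheets = true
    generate_pipeline_samplesheets   = 'taxprofiler,funcscan'
}
```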
24 changes: 12 additions & 12 deletions subworkflows/local/generate_downstream_samplesheets/main.nf
Contributor:

It looks like @jfy133 used only one workflow, which will selectively generate samplesheets based on params.generate_pipeline_samplesheets. Do you think it would be best to keep that consistent?
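For reference, a minimal sketch of that single-workflow pattern (channel contents and exact logic are assumptions; the real implementation lives in the collapsed `GENERATE_DOWNSTREAM_SAMPLESHEETS` hunk below) could look like:

```groovy
// Hypothetical sketch: one dispatching workflow decides which samplesheets to
// write based solely on params.generate_pipeline_samplesheets.
workflow GENERATE_DOWNSTREAM_SAMPLESHEETS {
    take:
    ch_reads        // preprocessed reads, e.g. [ meta, [ fastq_1, fastq_2 ] ]
    ch_assemblies   // assembled contigs, e.g. [ meta, fasta ]

    main:
    def pipelines = params.generate_pipeline_samplesheets.tokenize(',')

    if (pipelines.contains('taxprofiler')) {
        SAMPLESHEET_TAXPROFILER(ch_reads)
    }
    if (pipelines.contains('funcscan')) {
        SAMPLESHEET_FUNCSCAN(ch_assemblies)
    }
}
```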

Contributor:

Also, since FastQ files are being pulled from the publishDir, it might be a good idea to include options that override user inputs for params.publish_dir_mode (so that it is always 'copy' if a samplesheet is generated) and params.save_clipped_reads, params.save_phixremoved_reads, etc., so that the preprocessed FastQ files are published to the params.outdir if a downstream samplesheet is generated.

@@ -7,8 +7,7 @@ workflow SAMPLESHEET_TAXPROFILER {
ch_reads

main:
format = 'csv' // most common format in nf-core
format_sep = ','
format = 'csv'

def fastq_rel_path = '/'
if (params.bbnorm) {
@@ -36,7 +35,7 @@
}
.tap{ ch_colnames }

channelToSamplesheet(ch_colnames, ch_list_for_samplesheet, 'downstream_samplesheets', 'taxprofiler', format, format_sep)
channelToSamplesheet(ch_list_for_samplesheet, "${params.outdir}/downstream_samplesheets/mag", format)

}

@@ -45,8 +44,7 @@ workflow SAMPLESHEET_FUNCSCAN {
ch_assemblies

main:
format = 'csv' // most common format in nf-core
format_sep = ','
format = 'csv'

ch_list_for_samplesheet = ch_assemblies
Member:

Next thing which I don't think will be so complicated is to add another input channel for bins, and here make an if/else statement if they want to send just the raw assemblies (all contigs) or binned contigs to the samplesheet.

It will need another pipeline level parameter too though, --generate_samplesheet_funcscan_seqtype or something.

.map {
@@ -57,8 +55,7 @@
}
.tap{ ch_colnames }

channelToSamplesheet(ch_colnames, ch_list_for_samplesheet, 'downstream_samplesheets', 'funcscan', format, format_sep)

channelToSamplesheet(ch_list_for_samplesheet, "${params.outdir}/downstream_samplesheets/funcscan", format)
}

workflow GENERATE_DOWNSTREAM_SAMPLESHEETS {
@@ -78,14 +75,17 @@ workflow GENERATE_DOWNSTREAM_SAMPLESHEETS {
}
}

// Constructs the header string and then the strings of each row, and writes them to a samplesheet file via collectFile
def channelToSamplesheet(ch_header, ch_list_for_samplesheet, outdir_subdir, pipeline, format, format_sep) {
def channelToSamplesheet(ch_list_for_samplesheet, path, format) {
def format_sep = [csv: ",", tsv: "\t", txt: "\t"][format]

def ch_header = ch_list_for_samplesheet

ch_header
.first()
.map{ it.keySet().join(format_sep) }
.concat( ch_list_for_samplesheet.map{ it.values().join(format_sep) })
.map { it.keySet().join(format_sep) }
.concat(ch_list_for_samplesheet.map { it.values().join(format_sep) })
.collectFile(
name:"${params.outdir}/${outdir_subdir}/${pipeline}.${format}",
name: "${path}.${format}",
newLine: true,
sort: false
)
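As a self-contained illustration of the refactored helper above (the demo channel, values, and output path are assumptions, not part of this PR), the header row comes from the first map's keys and every map's values become one data row:

```groovy
// Minimal runnable sketch of the channelToSamplesheet pattern.
def channelToSamplesheet(ch_list_for_samplesheet, path, format) {
    def format_sep = [csv: ",", tsv: "\t", txt: "\t"][format]

    ch_list_for_samplesheet
        .first()
        .map { it.keySet().join(format_sep) }                                   // header from the first map's keys
        .concat(ch_list_for_samplesheet.map { it.values().join(format_sep) })   // one row per map
        .collectFile(name: "${path}.${format}", newLine: true, sort: false)
}

workflow {
    // Key order defines column order (Groovy map literals preserve insertion order).
    ch_demo = Channel.of(
        [sample: 'sampleA', fastq_1: '/abs/path/A_R1.fastq.gz', fastq_2: '/abs/path/A_R2.fastq.gz'],
        [sample: 'sampleB', fastq_1: '/abs/path/B_R1.fastq.gz', fastq_2: '/abs/path/B_R2.fastq.gz']
    )
    channelToSamplesheet(ch_demo, 'results/downstream_samplesheets/demo', 'csv')
}
```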

This file was deleted.

5 changes: 5 additions & 0 deletions subworkflows/local/utils_nfcore_mag_pipeline/main.nf
@@ -118,6 +118,11 @@ workflow PIPELINE_INITIALISATION {
//
validateInputParameters(
hybrid

// Validate samplesheet generation parameters
if (params.generate_downstream_samplesheets && !params.generate_pipeline_samplesheets) {
error('[nf-core/createtaxdb] If supplying `--generate_downstream_samplesheets`, you must also specify which pipeline to generate for with `--generate_pipeline_samplesheets`! Check input.')
Author:

nf-core/mag ?

}
)

// Validate PRE-ASSEMBLED CONTIG input when supplied