Updating docs.
emarinier committed Feb 5, 2025
1 parent 2acfcea commit da16c13
Showing 3 changed files with 73 additions and 149 deletions.
93 changes: 52 additions & 41 deletions README.md
@@ -1,31 +1,41 @@
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A523.04.3-brightgreen.svg)](https://www.nextflow.io/)

# Metadata Transformation Pipeline for IRIDA Next

This pipeline transforms metadata from IRIDA Next.

# Input

The input to the pipeline is a sample sheet (passed as `--input samplesheet.csv`) that looks like:

| sample  | sample_name | metadata_1 | metadata_2 | metadata_3 | metadata_4 | metadata_5 | metadata_6 | metadata_7 | metadata_8 |
| ------- | ----------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Sample1 | SampleA     | meta_1     | meta_2     | meta_3     | meta_4     | meta_5     | meta_6     | meta_7     | meta_8     |

The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).
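A minimal sketch of what such an input schema can look like — the field names mirror the table above, `metadata_2` through `metadata_8` would follow the same pattern as `metadata_1`, and the actual [assets/schema_input.json](assets/schema_input.json) in the repository is authoritative:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "sample": { "type": "string", "pattern": "^\\S+$" },
            "sample_name": { "type": "string" },
            "metadata_1": { "type": "string" }
        },
        "required": ["sample"]
    }
}
```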

# Parameters

The main parameters are `--input`, as defined above, and `--outdir`, for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of Singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.

## Transformation

You may specify the metadata transformation with the `--transformation` parameter. For example, `--transformation lock` will perform the lock transformation. The available transformations are as follows:

| Transformation | Explanation |
| -------------- | --------------------------------- |
| lock | Locks the metadata in IRIDA Next. |

## Other Parameters

Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schema.json).

# Running

To run the pipeline, please do:

```bash
nextflow run phac-nml/metadatatransformation -profile singularity -r main -latest --input assets/samplesheet.csv --outdir results --transformation lock
```

Where the `samplesheet.csv` is structured as specified in the [Input](#input) section.
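Rendered as CSV, the sample sheet from the [Input](#input) section looks like the following (values as in the table above):

```csv
sample,sample_name,metadata_1,metadata_2,metadata_3,metadata_4,metadata_5,metadata_6,metadata_7,metadata_8
Sample1,SampleA,meta_1,meta_2,meta_3,meta_4,meta_5,meta_6,meta_7,meta_8
```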
@@ -41,64 +51,65 @@

An example of what the contents of the IRIDA Next JSON file looks like for this pipeline is as follows:

```json
{
    "files": {
        "global": [
            {
                "path": "summary/summary.txt.gz"
            }
        ],
        "samples": {
            "SAMPLE1": [
                {
                    "path": "assembly/SAMPLE1.assembly.fa.gz"
                }
            ],
            "SAMPLE2": [
                {
                    "path": "assembly/SAMPLE2.assembly.fa.gz"
                }
            ],
            "SAMPLE3": [
                {
                    "path": "assembly/SAMPLE3.assembly.fa.gz"
                }
            ]
        }
    },
    "metadata": {
        "samples": {
            "ABC": {
                "irida_id": "sample1",
                "metadata_1": "1.1",
                "metadata_2": "1.2",
                "metadata_3": "1.3",
                "metadata_4": "1.4",
                "metadata_5": "1.5",
                "metadata_6": "1.6",
                "metadata_7": "1.7",
                "metadata_8": "1.8"
            },
            "DEF": {
                "irida_id": "sample2",
                "metadata_1": "2.1",
                "metadata_2": "2.2",
                "metadata_3": "2.3",
                "metadata_4": "2.4",
                "metadata_5": "2.5",
                "metadata_6": "2.6",
                "metadata_7": "2.7",
                "metadata_8": "2.8"
            },
            "GHI": {
                "irida_id": "sample3",
                "metadata_1": "3.1",
                "metadata_2": "3.2",
                "metadata_3": "3.3",
                "metadata_4": "3.4",
                "metadata_5": "3.5",
                "metadata_6": "3.6",
                "metadata_7": "3.7",
                "metadata_8": "3.8"
            }
        }
    }
}
```

Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "assembly/SAMPLE1.assembly.fa.gz"` refers to a file located within `outdir/assembly/SAMPLE1.assembly.fa.gz`.

There is also a pipeline execution summary output file provided (specified in the above JSON as `"global": [{"path":"summary/summary.txt.gz"}]`). However, there is no formatting specification for this file.
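As a quick sketch of how those relative paths resolve (not part of the pipeline; the JSON below is abridged from the example above):

```python
import json
from pathlib import Path

# Abridged from the example IRIDA Next JSON above.
iridanext = json.loads("""
{
  "files": {
    "global": [{"path": "summary/summary.txt.gz"}],
    "samples": {
      "SAMPLE1": [{"path": "assembly/SAMPLE1.assembly.fa.gz"}]
    }
  }
}
""")

outdir = Path("results")

# Every path in the "files" section is relative to the output directory.
global_paths = [outdir / f["path"] for f in iridanext["files"]["global"]]
sample_paths = {
    sample: [outdir / f["path"] for f in files]
    for sample, files in iridanext["files"]["samples"].items()
}

print(global_paths[0])  # results/summary/summary.txt.gz
```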

For more information see [output doc](docs/output.md).

## Test profile

To run with the test profile, please do:

```bash
nextflow run phac-nml/metadatatransformation -profile docker,test -r main -latest --outdir results --transformation lock
```

# Legal

Copyright 2025 Government of Canada

Licensed under the MIT License (the "License"); you may not use
this work except in compliance with the License. You may obtain a copy of the
Expand Down
74 changes: 6 additions & 68 deletions docs/output.md
@@ -4,87 +4,25 @@

This document describes the output produced by the pipeline.

The directories listed below may be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. The exact directories created may depend on which metadata transformation is performed.

- pipeline_info: information about the pipeline's execution
- lock: the outputs of the metadata lock operation

The IRIDA Next-compliant JSON output file will be named `iridanext.output.json.gz` and will be written to the top-level of the results directory. This file is compressed using GZIP and conforms to the [IRIDA Next JSON output specifications](https://github.com/phac-nml/pipeline-standards#42-irida-next-json).
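Reading the file back is plain gzip plus JSON. The sketch below writes a stand-in file first so that it is runnable anywhere; in practice you would open the `iridanext.output.json.gz` produced at the top level of the results directory:

```python
import gzip
import json

# Stand-in for the pipeline's output, so this sketch is self-contained.
example = {"files": {"global": [], "samples": {}}, "metadata": {"samples": {}}}
with gzip.open("iridanext.output.json.gz", "wt") as f:
    json.dump(example, f)

# Reading it back is just gzip + JSON.
with gzip.open("iridanext.output.json.gz", "rt") as f:
    output = json.load(f)

print(sorted(output))  # ['files', 'metadata']
```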

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Lock](#lock) - Locks the metadata for IRIDA Next
- [IRIDA Next Output](#irida-next-output) - Generates a JSON output file that is compliant with IRIDA Next
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

### Lock

<details markdown="1">
<summary>Output files</summary>

- `lock/`
  - A CSV-format file reporting locked files: `locked.csv`

</details>

### IRIDA Next Output

<details markdown="1">
<summary>Output files</summary>

- `/`
  - IRIDA Next-compliant JSON output: `iridanext.output.json.gz`

</details>

### Pipeline information

<details markdown="1">
<summary>Output files</summary>

- `pipeline_info/`
  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
  - Parameters used by the pipeline run: `params.json`.

</details>

55 changes: 15 additions & 40 deletions docs/usage.md
@@ -2,34 +2,34 @@

## Introduction

This pipeline transforms metadata from IRIDA Next.

## Sample sheet input

You will need to create a sample sheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 10 columns (9 if the optional `sample_name` column is omitted) and a header row, as shown in the examples below.

```bash
--input '[path to samplesheet file]'
```

### Full samplesheet

The input samplesheet must contain the following columns: `sample` and `metadata_1` through `metadata_8`. The IDs within a samplesheet should be unique. You may optionally provide a `sample_name` column, which will replace the IRIDA Next IDs in the `sample` column when available. All other columns will be ignored.

A final samplesheet file containing the `sample_name` column may look something like the one below.

```csv title="samplesheet.csv"
sample,sample_name,metadata_1,metadata_2,metadata_3,metadata_4,metadata_5,metadata_6,metadata_7,metadata_8
sample1,"ABC",1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8
sample2,"DEF",2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8
sample3,"GHI",3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8
```

| Column                   | Description                                                                                                          |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------- |
| `sample`                 | Sample ID. Samples should be unique within a samplesheet. These are likely IRIDA Next IDs.                           |
| `sample_name`            | Sample name. Likely a user-provided ID; should be unique, but is not required to be. Used in place of `sample` when available. |
| `metadata_1..metadata_8` | Metadata that will be used in the metadata transformations.                                                           |
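The rules above can be spot-checked with a few lines of standard-library Python (a sketch only; the pipeline's real validation is performed by nf-validation against the input schema):

```python
import csv
import io

# In practice this would be open("samplesheet.csv"); inlined here for illustration.
samplesheet = io.StringIO(
    "sample,sample_name,metadata_1,metadata_2,metadata_3,metadata_4,"
    "metadata_5,metadata_6,metadata_7,metadata_8\n"
    "sample1,ABC,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8\n"
    "sample2,DEF,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8\n"
)

rows = list(csv.DictReader(samplesheet))
required = ["sample"] + [f"metadata_{i}" for i in range(1, 9)]

# Required columns must all be present.
missing = [c for c in required if c not in rows[0]]
# `sample` IDs must be unique; `sample_name` need not be.
ids = [row["sample"] for row in rows]
duplicates = len(ids) != len(set(ids))

print(missing, duplicates)  # [] False
```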

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

@@ -38,7 +38,7 @@
The typical command for running the pipeline is as follows:

```bash
nextflow run phac-nml/metadatatransformation -profile singularity -r main -latest --input assets/samplesheet.csv --outdir results --transformation lock
```

This will launch the pipeline with the `singularity` configuration profile. See below for more information about profiles.
@@ -58,31 +58,6 @@

Pipeline settings can be provided in a `yaml` or `json` file via `-params-file <file>`.

Do not use `-c <file>` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
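As an illustration, the run shown earlier in this document could be captured in a params file (a sketch; the parameter names are the ones defined in this documentation):

```yaml
# params.yaml
input: "assets/samplesheet.csv"
outdir: "results"
transformation: "lock"
```

and supplied with `nextflow run phac-nml/metadatatransformation -profile singularity -r main -latest -params-file params.yaml`.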


### Reproducibility

It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
