Updating docs.
emarinier committed Feb 5, 2025
1 parent 2acfcea commit da16c13
Showing 3 changed files with 73 additions and 149 deletions.
93 changes: 52 additions & 41 deletions README.md
@@ -1,31 +1,41 @@
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A523.04.3-brightgreen.svg)](https://www.nextflow.io/)

# Metadata Transformation Pipeline for IRIDA Next

This pipeline transforms metadata from IRIDA Next.

# Input

The input to the pipeline is a sample sheet (passed as `--input samplesheet.csv`) that looks like:

| sample  | sample_name | metadata_1 | metadata_2 | metadata_3 | metadata_4 | metadata_5 | metadata_6 | metadata_7 | metadata_8 |
| ------- | ----------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Sample1 | SampleA     | meta_1     | meta_2     | meta_3     | meta_4     | meta_5     | meta_6     | meta_7     | meta_8     |

The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).
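A minimal sketch of what such an input schema can look like — the field names mirror the table above, `metadata_2` through `metadata_8` would follow the same pattern as `metadata_1`, and the actual [assets/schema_input.json](assets/schema_input.json) in the repository is authoritative:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "sample": { "type": "string", "pattern": "^\\S+$" },
            "sample_name": { "type": "string" },
            "metadata_1": { "type": "string" }
        },
        "required": ["sample"]
    }
}
```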

# Parameters

The main parameters are `--input`, as defined above, and `--outdir`, for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of Singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.

## Transformation

You may specify the metadata transformation with the `--transformation` parameter. For example, `--transformation lock` will perform the lock transformation. The available transformations are as follows:

| Transformation | Explanation |
| -------------- | --------------------------------- |
| lock | Locks the metadata in IRIDA Next. |

## Other Parameters

Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schema.json).

# Running

To run the pipeline, please do:

```bash
nextflow run phac-nml/metadatatransformation -profile singularity -r main -latest --input assets/samplesheet.csv --outdir results --transformation lock
```

Where the `samplesheet.csv` is structured as specified in the [Input](#input) section.
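Rendered as CSV, the sample sheet from the [Input](#input) section looks like the following (values as in the table above):

```csv
sample,sample_name,metadata_1,metadata_2,metadata_3,metadata_4,metadata_5,metadata_6,metadata_7,metadata_8
Sample1,SampleA,meta_1,meta_2,meta_3,meta_4,meta_5,meta_6,meta_7,meta_8
```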
@@ -41,64 +51,65 @@

An example of what the contents of the IRIDA Next JSON file looks like for this pipeline is as follows:

```json
{
    "files": {
        "global": [
            {
                "path": "summary/summary.txt.gz"
            }
        ],
        "samples": {
            "SAMPLE1": [
                {
                    "path": "assembly/SAMPLE1.assembly.fa.gz"
                }
            ],
            "SAMPLE2": [
                {
                    "path": "assembly/SAMPLE2.assembly.fa.gz"
                }
            ],
            "SAMPLE3": [
                {
                    "path": "assembly/SAMPLE3.assembly.fa.gz"
                }
            ]
        }
    },
    "metadata": {
        "samples": {
            "ABC": {
                "irida_id": "sample1",
                "metadata_1": "1.1",
                "metadata_2": "1.2",
                "metadata_3": "1.3",
                "metadata_4": "1.4",
                "metadata_5": "1.5",
                "metadata_6": "1.6",
                "metadata_7": "1.7",
                "metadata_8": "1.8"
            },
            "DEF": {
                "irida_id": "sample2",
                "metadata_1": "2.1",
                "metadata_2": "2.2",
                "metadata_3": "2.3",
                "metadata_4": "2.4",
                "metadata_5": "2.5",
                "metadata_6": "2.6",
                "metadata_7": "2.7",
                "metadata_8": "2.8"
            },
            "GHI": {
                "irida_id": "sample3",
                "metadata_1": "3.1",
                "metadata_2": "3.2",
                "metadata_3": "3.3",
                "metadata_4": "3.4",
                "metadata_5": "3.5",
                "metadata_6": "3.6",
                "metadata_7": "3.7",
                "metadata_8": "3.8"
            }
        }
    }
}
```

Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "assembly/SAMPLE1.assembly.fa.gz"` refers to a file located within `outdir/assembly/SAMPLE1.assembly.fa.gz`.

There is also a pipeline execution summary output file provided (specified in the above JSON as `"global": [{"path":"summary/summary.txt.gz"}]`). However, there is no formatting specification for this file.
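As a quick sketch of how those relative paths resolve (not part of the pipeline; the JSON below is abridged from the example above):

```python
import json
from pathlib import Path

# Abridged from the example IRIDA Next JSON above.
iridanext = json.loads("""
{
  "files": {
    "global": [{"path": "summary/summary.txt.gz"}],
    "samples": {
      "SAMPLE1": [{"path": "assembly/SAMPLE1.assembly.fa.gz"}]
    }
  }
}
""")

outdir = Path("results")

# Every path in the "files" section is relative to the output directory.
global_paths = [outdir / f["path"] for f in iridanext["files"]["global"]]
sample_paths = {
    sample: [outdir / f["path"] for f in files]
    for sample, files in iridanext["files"]["samples"].items()
}

print(global_paths[0])  # results/summary/summary.txt.gz
```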

For more information see [output doc](docs/output.md).

## Test profile

To run with the test profile, please do:

```bash
nextflow run phac-nml/metadatatransformation -profile docker,test -r main -latest --outdir results --transformation lock
```

# Legal

Copyright 2025 Government of Canada

Licensed under the MIT License (the "License"); you may not use
this work except in compliance with the License. You may obtain a copy of the
Expand Down
74 changes: 6 additions & 68 deletions docs/output.md
@@ -4,87 +4,25 @@

This document describes the output produced by the pipeline.

The directories listed below may be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. The exact directories created may depend on which metadata transformation is performed.

- pipeline_info: information about the pipeline's execution
- lock: the outputs of the metadata lock operation

The IRIDA Next-compliant JSON output file will be named `iridanext.output.json.gz` and will be written to the top-level of the results directory. This file is compressed using GZIP and conforms to the [IRIDA Next JSON output specifications](https://github.com/phac-nml/pipeline-standards#42-irida-next-json).
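Reading the file back is plain gzip plus JSON. The sketch below writes a stand-in file first so that it is runnable anywhere; in practice you would open the `iridanext.output.json.gz` produced at the top level of the results directory:

```python
import gzip
import json

# Stand-in for the pipeline's output, so this sketch is self-contained.
example = {"files": {"global": [], "samples": {}}, "metadata": {"samples": {}}}
with gzip.open("iridanext.output.json.gz", "wt") as f:
    json.dump(example, f)

# Reading it back is just gzip + JSON.
with gzip.open("iridanext.output.json.gz", "rt") as f:
    output = json.load(f)

print(sorted(output))  # ['files', 'metadata']
```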

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Lock](#lock) - Locks the metadata for IRIDA Next
- [IRIDA Next Output](#irida-next-output) - Generates a JSON output file that is compliant with IRIDA Next
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

### Lock

<details markdown="1">
<summary>Output files</summary>

- `lock/`
  - A CSV-format file reporting locked files: `locked.csv`

</details>

### IRIDA Next Output

<details markdown="1">
<summary>Output files</summary>

- `/`
  - IRIDA Next-compliant JSON output: `iridanext.output.json.gz`

</details>

### Pipeline information

<details markdown="1">
<summary>Output files</summary>

- `pipeline_info/`
  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
  - Parameters used by the pipeline run: `params.json`.

</details>

55 changes: 15 additions & 40 deletions docs/usage.md
@@ -2,34 +2,34 @@

## Introduction

This pipeline transforms metadata from IRIDA Next.

## Sample sheet input

You will need to create a sample sheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 10 columns (9 if the optional `sample_name` column is omitted) and a header row, as shown in the examples below.

```bash
--input '[path to samplesheet file]'
```

### Full samplesheet

The input samplesheet must contain the following columns: `sample` and `metadata_1` through `metadata_8`. The IDs within a samplesheet should be unique. You may optionally provide a `sample_name` column, which will replace the IRIDA Next IDs in the `sample` column when available. All other columns will be ignored.

A final samplesheet file containing the `sample_name` column may look something like the one below.

```csv title="samplesheet.csv"
sample,sample_name,metadata_1,metadata_2,metadata_3,metadata_4,metadata_5,metadata_6,metadata_7,metadata_8
sample1,"ABC",1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8
sample2,"DEF",2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8
sample3,"GHI",3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8
```

| Column                   | Description                                                                                                          |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------- |
| `sample`                 | Sample ID. Samples should be unique within a samplesheet. These are likely IRIDA Next IDs.                           |
| `sample_name`            | Sample name. Likely a user-provided ID; should be unique, but is not required to be. Used in place of `sample` when available. |
| `metadata_1..metadata_8` | Metadata that will be used in the metadata transformations.                                                           |
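The rules above can be spot-checked with a few lines of standard-library Python (a sketch only; the pipeline's real validation is performed by nf-validation against the input schema):

```python
import csv
import io

# In practice this would be open("samplesheet.csv"); inlined here for illustration.
samplesheet = io.StringIO(
    "sample,sample_name,metadata_1,metadata_2,metadata_3,metadata_4,"
    "metadata_5,metadata_6,metadata_7,metadata_8\n"
    "sample1,ABC,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8\n"
    "sample2,DEF,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8\n"
)

rows = list(csv.DictReader(samplesheet))
required = ["sample"] + [f"metadata_{i}" for i in range(1, 9)]

# Required columns must all be present.
missing = [c for c in required if c not in rows[0]]
# `sample` IDs must be unique; `sample_name` need not be.
ids = [row["sample"] for row in rows]
duplicates = len(ids) != len(set(ids))

print(missing, duplicates)  # [] False
```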

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

@@ -38,7 +38,7 @@
The typical command for running the pipeline is as follows:

```bash
nextflow run phac-nml/metadatatransformation -profile singularity -r main -latest --input assets/samplesheet.csv --outdir results --transformation lock
```

This will launch the pipeline with the `singularity` configuration profile. See below for more information about profiles.
@@ -58,31 +58,6 @@

Pipeline settings can be provided in a `yaml` or `json` file via `-params-file <file>`.

Do not use `-c <file>` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
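As an illustration, the run shown earlier in this document could be captured in a params file (a sketch; the parameter names are the ones defined in this documentation):

```yaml
# params.yaml
input: "assets/samplesheet.csv"
outdir: "results"
transformation: "lock"
```

and supplied with `nextflow run phac-nml/metadatatransformation -profile singularity -r main -latest -params-file params.yaml`.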


### Reproducibility

It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
