Merge pull request #18 from phac-nml/dev

Version 0.1.0 Release
phac-nml · Jun 28, 2024 · e7e73cf
2 parents cb8aa4b + 08582c8
commit e7e73cf
Showing 108 changed files with 1,938 additions and 1,142 deletions.
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -14,13 +14,12 @@ jobs:
   pre-commit:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4
+      - uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
 
-      - name: Set up Python 3.11
-        uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5
+      - name: Set up Python 3.12
+        uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5
         with:
-          python-version: 3.11
-          cache: "pip"
+          python-version: "3.12"
 
       - name: Install pre-commit
         run: pip install pre-commit
@@ -32,14 +31,14 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Check out pipeline code
-        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4
+        uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
 
       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v1
+        uses: nf-core/setup-nextflow@v2
 
-      - uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5
+      - uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5
         with:
-          python-version: "3.11"
+          python-version: "3.12"
           architecture: "x64"
 
       - name: Install dependencies
@@ -60,7 +59,7 @@ jobs:
 
       - name: Upload linting log file artifact
         if: ${{ always() }}
-        uses: actions/upload-artifact@5d5d22a31266ced268874388b861e4b58bb5c2f3 # v4
+        uses: actions/upload-artifact@65462800fd760344b1a7b4382951275a0abb4808 # v4
         with:
           name: linting-logs
           path: |

diff --git a/.github/workflows/linting_comment.yml b/.github/workflows/linting_comment.yml
@@ -11,7 +11,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Download lint results
-        uses: dawidd6/action-download-artifact@f6b0bace624032e30a85a8fd9c1a7f8f611f5737 # v3
+        uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe # v3
         with:
           workflow: linting.yml
           workflow_conclusion: completed

diff --git a/.nf-core.yml b/.nf-core.yml
@@ -1,5 +1,6 @@
 repository_type: pipeline
 
+nf_core_version: "2.14.1"
 lint:
   files_exist:
     - assets/nf-core-gasnomenclature_logo_light.png

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,32 +3,13 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## In-development
+## [0.1.0] - 2024/06/28
 
-- Fixed nf-core tools linting failures introduced in version 2.12.1.
-- Added phac-nml prefix to nf-core config
-
-## 1.0.3 - 2024/02/23
-
-- Pinned [email protected] plugin
-
-## 1.0.2 - 2023/12/18
-
-- Removed GitHub workflows that weren't needed.
-- Adding additional parameters for testing purposes.
-
-## 1.0.1 - 2023/12/06
-
-Allowing non-gzipped FASTQ files as input. Default branch is now main.
-
-## 1.0.0 - 2023/11/30
-
-Initial release of phac-nml/gasnomenclature, created with the [nf-core](https://nf-co.re/) template.
+Initial release of the Genomic Address Nomenclature pipeline, used to assign cluster addresses to samples based on existing cluster designations.
 
 ### `Added`
 
-### `Fixed`
-
-### `Dependencies`
+- Input of cg/wgMLST allele calls produced from [locidex](https://github.com/phac-nml/locidex).
+- Output of assigned cluster addresses for any **query** samples using [profile_dists](https://github.com/phac-nml/profile_dists) and [gas call](https://github.com/phac-nml/genomic_address_service).
 
-### `Deprecated`
+[0.1.0]: https://github.com/phac-nml/gasnomenclature/releases/tag/0.1.0
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,6 +10,18 @@
 
 ## Pipeline tools
 
+- [locidex](https://github.com/phac-nml/locidex) (in-development, citation subject to change)
+
+  > Robertson, James, Wells, Matthew, Christy-Lynn, Peterson, Kyrylo Bessonov, Reimer, Aleisha, Schonfeld, Justin. LOCIDEX: Distributed allele calling engine. 2024. https://github.com/phac-nml/locidex
+
+- [profile_dists](https://github.com/phac-nml/profile_dists) (in-development, citation subject to change)
+
+  > Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Profile Dists: Convenient package for comparing genetic similarity of samples based on allelic profiles. 2023. https://github.com/phac-nml/profile_dists
+
+- [genomic_address_service (GAS)](https://github.com/phac-nml/genomic_address_service) (in-development, citation subject to change)
+
+  > Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Genomic Address Service: Convenient package for de novo clustering and sample assignment to existing clusters. 2023. https://github.com/phac-nml/genomic_address_service
+
 ## Software packaging/containerisation tools
 
 - [Anaconda](https://anaconda.com)

diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) Aaron Petkau
+Copyright (c) Government of Canada
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -1,23 +1,80 @@
 [![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A523.04.3-brightgreen.svg)](https://www.nextflow.io/)
 
-# Example Pipeline for IRIDA Next
+# Genomic Address Service Nomenclature Workflow
 
-This is an example pipeline to be used for integration with IRIDA Next.
+This workflow takes JSON-formatted MLST allelic profiles and assigns cluster addresses to samples based on existing cluster designations. The pipeline is designed for integration into IRIDA Next, but it can also be run as a stand-alone pipeline.
+
+A brief overview of the usage of this pipeline is given below. Detailed documentation can be found in the [docs/](docs/) directory.
 
 # Input
 
 The input to the pipeline is a standard sample sheet (passed as `--input samplesheet.csv`) that looks like:
 
-| sample  | fastq_1         | fastq_2         |
-| ------- | --------------- | --------------- |
-| SampleA | file_1.fastq.gz | file_2.fastq.gz |
+| sample  | mlst_alleles      | address |
+| ------- | ----------------- | ------- |
+| sampleA | sampleA.mlst.json | 1.1.1   |
+| sampleQ | sampleQ.mlst.json |         |
+| sampleF | sampleF.mlst.json |         |
 
 The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).
 
+Details on the columns can be found in the [Full samplesheet](docs/usage.md#full-samplesheet) documentation.
+
 # Parameters
 
 The main parameters are `--input` as defined above and `--output` for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.
 
+## Distance Method and Thresholds
+
+Profile_Dists and the Genomic Address Service workflows can use two distance methods: hamming or scaled.
+
+### Hamming Distances
+
+Hamming distances are integers representing the number of differing loci between two sequences and will range between [0, n], where `n` is the total number of loci. When using Hamming distances, you must specify `--pd_distm hamming` and provide Hamming distance thresholds as integers between [0, n]: `--gm_thresholds "10,5,0"` (10, 5, and 0 loci).
+
+### Scaled Distances
+
+Scaled distances are floats representing the percentage of differing loci between two sequences and will range between [0.0, 100.0]. When using scaled distances, you must specify `--pd_distm scaled` and provide percentages between [0.0, 100.0] as thresholds: `--gm_thresholds "50,20,0"` (50%, 20%, and 0% of loci).
+
+### Thresholds and Linkage Methods
+
+The `--gm_thresholds` parameter sets thresholds for each cluster level, which dictate how sequences are assigned cluster codes. These thresholds specify the maximum allowable differences in loci between sequences sharing the same cluster code at each level. The consistency of these thresholds in ensuring uniform cluster codes across levels depends on the `--gm_method` parameter, which determines the linkage method used for clustering.
+
+- _Complete Linkage_: When using complete linkage clustering, sequences are grouped such that identical cluster codes at a particular level guarantee that all sequences in that cluster are within the specified threshold distance. For example, specifying `--pd_distm hamming` and `--gm_thresholds "10,5,0"` would mean that sequences with no more than 10 loci differences are assigned the same cluster code at the first level, no more than 5 differences at the second level, and identical sequences at the third level.
+
+- _Average Linkage_: With average linkage clustering, sequences may share the same cluster code if their average distance is below the specified threshold. For instance, sequences with average distances less than 10, 5, and 0 for each level respectively may share the same cluster code.
+
+- _Single Linkage_: Single linkage clustering can result in merging distant samples into the same cluster if there exists a third sample that bridges the distance between them. This method does not provide strict guarantees on the maximum distance within a cluster, potentially allowing distant sequences to share the same cluster code.
+
+## Profile_dists
+
+The following can be used to adjust parameters for the [profile_dists][] tool.
+
+- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0.0 and 100.0. Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
+- `--pd_missing_threshold`: The maximum proportion of missing data per locus for a locus to be kept in the analysis. Values from 0 to 1.
+- `--pd_sample_quality_threshold`: The maximum proportion of missing data per sample for a sample to be kept in the analysis. Values from 0 to 1.
+- `--pd_file_type`: Output format file type. One of _text_ or _parquet_.
+- `--pd_mapping_file`: A file used to map allele codes to integers for internal distance calculations. This is the same file as produced from the _profile dists_ step (the [allele_map.json](docs/output.md#profile-dists) file). Normally, this is unneeded unless you wish to override the automated process of mapping alleles to integers.
+- `--pd_skip`: Skip QA/QC steps. Can be used as a flag, `--pd_skip`, or passing a boolean, `--pd_skip true` or `--pd_skip false`.
+- `--pd_columns`: Path to a file that defines the loci to keep within the analysis (default when unset is to keep all loci). Formatted as a single column file with one locus name per line. For example:
+  - **Single column format**
+    ```
+    loci1
+    loci2
+    loci3
+    ```
+- `--pd_count_missing`: Count missing alleles as different. Can be used as a flag, `--pd_count_missing`, or passing a boolean, `--pd_count_missing true` or `--pd_count_missing false`. If true, will consider missing allele calls for the same locus between samples as a difference, increasing the distance counts.
+
+## GAS CALL
+
+The following can be used to adjust parameters for the [gas call][] tool.
+
+- `--gm_thresholds`: Thresholds delimited by `,`. Values should match units from `--pd_distm` (either _hamming_ or _scaled_). Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
+- `--gm_method`: The linkage method to use for clustering. Value should be one of _single_, _average_, or _complete_.
+- `--gm_delimiter`: Delimiter desired for nomenclature code. Must be alphanumeric or one of `._-`.
+
+## Other
+
 Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schema.json).
 
 # Running
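To make the two distance units described in this hunk concrete, here is a minimal Python sketch. It is not the profile_dists implementation. It assumes each profile is a dict mapping locus name to allele call, that both profiles cover the same loci, and it ignores missing-data handling (`--pd_count_missing`). The function names `hamming_distance` and `scaled_distance` are hypothetical.

```python
def hamming_distance(profile_a, profile_b):
    """Count loci whose allele calls differ between two profiles."""
    if profile_a.keys() != profile_b.keys():
        raise ValueError("profiles must cover the same loci")
    return sum(1 for locus in profile_a if profile_a[locus] != profile_b[locus])


def scaled_distance(profile_a, profile_b):
    """Express the Hamming distance as a percentage of all compared loci."""
    n_loci = len(profile_a)
    if n_loci == 0:
        raise ValueError("profiles must contain at least one locus")
    return 100.0 * hamming_distance(profile_a, profile_b) / n_loci


# Illustrative profiles with four loci; two of them differ.
sample_1 = {"l1": "1", "l2": "1", "l3": "1", "l4": "2"}
sample_q = {"l1": "1", "l2": "2", "l3": "1", "l4": "1"}

print(hamming_distance(sample_1, sample_q))  # 2
print(scaled_distance(sample_1, sample_q))   # 50.0
```

With these four-locus profiles, a Hamming threshold of 2 and a scaled threshold of 50 describe the same cut-off, which is why `--gm_thresholds` must use the same unit as `--pd_distm`.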
@@ -39,51 +96,26 @@ An example of the what the contents of the IRIDA Next JSON file looks like for t
 ```
 {
     "files": {
-        "global": [
-            {
-                "path": "summary/summary.txt.gz"
-            }
-        ],
+        "global": [],
         "samples": {
-            "SAMPLE1": [
+            "sampleF": [
                 {
-                    "path": "assembly/SAMPLE1.assembly.fa.gz"
+                    "path": "input/sampleF_error_report.csv"
                 }
             ],
-            "SAMPLE2": [
-                {
-                    "path": "assembly/SAMPLE2.assembly.fa.gz"
-                }
-            ],
-            "SAMPLE3": [
-                {
-                    "path": "assembly/SAMPLE3.assembly.fa.gz"
-                }
-            ]
         }
     },
     "metadata": {
         "samples": {
-            "SAMPLE1": {
-                "reads.1": "sample1_R1.fastq.gz",
-                "reads.2": "sample1_R2.fastq.gz"
-            },
-            "SAMPLE2": {
-                "reads.1": "sample2_R1.fastq.gz",
-                "reads.2": "sample2_R2.fastq.gz"
-            },
-            "SAMPLE3": {
-                "reads.1": "sample1_R1.fastq.gz",
-                "reads.2": "null"
+            "sampleQ": {
+                "address": "1.1.3"
             }
         }
     }
 }
 ```
 
-Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "assembly/SAMPLE1.assembly.fa.gz"` refers to a file located within `outdir/assembly/SAMPLE1.assembly.fa.gz`.
-
-There is also a pipeline execution summary output file provided (specified in the above JSON as `"global": [{"path":"summary/summary.txt.gz"}]`). However, there is no formatting specification for this file.
+Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "input/sampleF_error_report.csv"` refers to a file located within `outdir/input/sampleF_error_report.csv`. This file is generated only if a sample fails the input check during samplesheet assessment.
 
 ## Test profile
 
@@ -95,7 +127,7 @@ nextflow run phac-nml/gasnomenclature -profile docker,test -r main -latest --out
 
 # Legal
 
-Copyright 2023 Government of Canada
+Copyright 2024 Government of Canada
 
 Licensed under the MIT License (the "License"); you may not use
 this work except in compliance with the License. You may obtain a copy of the
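
The difference between single and complete linkage described in the README's "Thresholds and Linkage Methods" section can be illustrated with a small pure-Python sketch. This is a naive agglomerative clustering written for illustration only, not the algorithm used by gas call. It assumes that clusters merge while their linkage distance is at or below the threshold, and the `cluster` function and its arguments are hypothetical. Average linkage is omitted for brevity.

```python
from itertools import combinations


def cluster(distances, samples, threshold, method):
    """Naive agglomerative clustering that stops merging above `threshold`."""
    linkage = {"single": min, "complete": max}[method]
    clusters = [frozenset([s]) for s in samples]

    def between(c1, c2):
        # Linkage distance: min (single) or max (complete) over all cross pairs.
        return linkage(distances[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the closest pair of clusters under the chosen linkage.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        if between(c1, c2) > threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


# Illustrative Hamming distances: B sits between A and C.
distances = {
    frozenset(("A", "B")): 4,
    frozenset(("B", "C")): 4,
    frozenset(("A", "C")): 8,
}
samples = ["A", "B", "C"]

print(cluster(distances, samples, threshold=5, method="single"))
# [['A', 'B', 'C']]
print(cluster(distances, samples, threshold=5, method="complete"))
# [['A', 'B'], ['C']]
```

With single linkage, B bridges A and C into one cluster even though A and C differ at 8 loci. With complete linkage, every pair inside a cluster stays within the threshold of 5.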

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,4 +1,5 @@
-sample,fastq_1,fastq_2
-SAMPLE1,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz
-SAMPLE2,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz
-SAMPLE3,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz,
+sample,mlst_alleles,address
+sampleQ,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sampleQ.mlst.json,
+sample1,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample1.mlst.json,1.1.1
+sample2,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample2.mlst.json,1.1.1
+sample3,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample3.mlst.json,1.1.2
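
As a rough illustration of how the new samplesheet layout can be read, the Python sketch below splits rows into samples that already carry an address and samples that do not. Treating a row with an empty `address` as a query sample is an assumption based on the README table in this PR. The `split_samples` helper is hypothetical and the file paths are shortened for readability.

```python
import csv
import io

# Same columns as assets/samplesheet.csv, with shortened (illustrative) paths.
SAMPLESHEET = """sample,mlst_alleles,address
sampleQ,sampleQ.mlst.json,
sample1,sample1.mlst.json,1.1.1
sample2,sample2.mlst.json,1.1.1
sample3,sample3.mlst.json,1.1.2
"""


def split_samples(handle):
    """Return (reference, query): addressed samples and samples lacking one."""
    reference, query = {}, []
    for row in csv.DictReader(handle):
        address = (row.get("address") or "").strip()
        if address:
            reference[row["sample"]] = address
        else:
            query.append(row["sample"])
    return reference, query


reference, query = split_samples(io.StringIO(SAMPLESHEET))
print(query)      # ['sampleQ']
print(reference)  # {'sample1': '1.1.1', 'sample2': '1.1.1', 'sample3': '1.1.2'}
```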
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -14,19 +14,20 @@
                 "unique": true,
                 "errorMessage": "Sample name must be provided and cannot contain spaces"
             },
-            "profile_type": {
-                "meta": ["profile_type"],
-                "description": "Determines has already been clustered (True) or if it is new, and requiring nomenclature assignment (False)",
-                "errorMessage": "Please specify if the mlst profile has already been clustered (True) or if it is new and requires nomenclature assignment (False)",
-                "type": "boolean"
-            },
             "mlst_alleles": {
                 "type": "string",
                 "format": "file-path",
-                "pattern": "^\\S+\\.mlst\\.json(\\.gz)?$",
-                "errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json' or '.mlst.json.gz'"
+                "pattern": "^\\S+\\.mlst(\\.subtyping)?\\.json(\\.gz)?$",
+                "errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json', '.mlst.json.gz', '.mlst.subtyping.json', or '.mlst.subtyping.json.gz'"
+            },
+            "address": {
+                "type": "string",
+                "pattern": "^\\d+(\\.\\d+)*$",
+                "meta": ["address"],
+                "description": "The loci-based typing identifier (address) of the sample",
+                "errorMessage": "Invalid loci-based typing identifier. Please ensure that the address follows the correct format, consisting of one or more digits separated by periods. Example of a valid identifier: '1.1.1'. Please review and correct the entry"
             }
         },
-        "required": ["sample", "profile_type", "mlst_alleles"]
+        "required": ["sample", "mlst_alleles"]
     }
 }
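
The two `pattern` values in this schema can be tried out with a short Python sketch. The regular expressions are copied from the diff with the JSON escaping removed. The `is_valid_row` helper is hypothetical, and Python's `re` module stands in for the validator's regex engine, which should behave the same way for patterns this simple.

```python
import re

# Patterns from assets/schema_input.json (JSON string escaping removed).
MLST_PATTERN = re.compile(r"^\S+\.mlst(\.subtyping)?\.json(\.gz)?$")
ADDRESS_PATTERN = re.compile(r"^\d+(\.\d+)*$")


def is_valid_row(mlst_alleles, address=None):
    """Check a samplesheet row against the two schema patterns.

    `address` is optional, mirroring the schema where only `sample`
    and `mlst_alleles` are required.
    """
    if not MLST_PATTERN.search(mlst_alleles):
        return False
    return address is None or bool(ADDRESS_PATTERN.search(address))


print(is_valid_row("sampleQ.mlst.json"))                        # True
print(is_valid_row("sample1.mlst.subtyping.json.gz", "1.1.1"))  # True
print(is_valid_row("sample1.json", "1.1.1"))                    # False
print(is_valid_row("sample1.mlst.json", "1.1."))                # False
```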