Commit

Merge pull request #18 from phac-nml/dev
Version 0.1.0 Release
kylacochrane authored Jun 28, 2024
2 parents cb8aa4b + 08582c8 commit e7e73cf
Showing 108 changed files with 1,938 additions and 1,142 deletions.
19 changes: 9 additions & 10 deletions .github/workflows/linting.yml
Expand Up @@ -14,13 +14,12 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4
- uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4

- name: Set up Python 3.11
uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5
- name: Set up Python 3.12
uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5
with:
python-version: 3.11
cache: "pip"
python-version: "3.12"

- name: Install pre-commit
run: pip install pre-commit
Expand All @@ -32,14 +31,14 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Check out pipeline code
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4
uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4

- name: Install Nextflow
uses: nf-core/setup-nextflow@v1
uses: nf-core/setup-nextflow@v2

- uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5
- uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5
with:
python-version: "3.11"
python-version: "3.12"
architecture: "x64"

- name: Install dependencies
Expand All @@ -60,7 +59,7 @@ jobs:

- name: Upload linting log file artifact
if: ${{ always() }}
uses: actions/upload-artifact@5d5d22a31266ced268874388b861e4b58bb5c2f3 # v4
uses: actions/upload-artifact@65462800fd760344b1a7b4382951275a0abb4808 # v4
with:
name: linting-logs
path: |
2 changes: 1 addition & 1 deletion .github/workflows/linting_comment.yml
Expand Up @@ -11,7 +11,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Download lint results
uses: dawidd6/action-download-artifact@f6b0bace624032e30a85a8fd9c1a7f8f611f5737 # v3
uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe # v3
with:
workflow: linting.yml
workflow_conclusion: completed
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -1,5 +1,6 @@
repository_type: pipeline

nf_core_version: "2.14.1"
lint:
files_exist:
- assets/nf-core-gasnomenclature_logo_light.png
29 changes: 5 additions & 24 deletions CHANGELOG.md
Expand Up @@ -3,32 +3,13 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## In-development
## [0.1.0] - 2024/06/28

- Fixed nf-core tools linting failures introduced in version 2.12.1.
- Added phac-nml prefix to nf-core config

## 1.0.3 - 2024/02/23

- Pinned [email protected] plugin

## 1.0.2 - 2023/12/18

- Removed GitHub workflows that weren't needed.
- Adding additional parameters for testing purposes.

## 1.0.1 - 2023/12/06

Allowing non-gzipped FASTQ files as input. Default branch is now main.

## 1.0.0 - 2023/11/30

Initial release of phac-nml/gasnomenclature, created with the [nf-core](https://nf-co.re/) template.
Initial release of the Genomic Address Nomenclature pipeline, used to assign cluster addresses to samples based on existing cluster designations.

### `Added`

### `Fixed`

### `Dependencies`
- Input of cg/wgMLST allele calls produced from [locidex](https://github.com/phac-nml/locidex).
- Output of assigned cluster addresses for any **query** samples using [profile_dists](https://github.com/phac-nml/profile_dists) and [gas call](https://github.com/phac-nml/genomic_address_service).

### `Deprecated`
[0.1.0]: https://github.com/phac-nml/gasnomenclature/releases/tag/0.1.0
12 changes: 12 additions & 0 deletions CITATIONS.md
Expand Up @@ -10,6 +10,18 @@
## Pipeline tools

- [locidex](https://github.com/phac-nml/locidex) (in-development, citation subject to change)

> Robertson, James, Wells, Matthew, Peterson, Christy-Lynn, Bessonov, Kyrylo, Reimer, Aleisha, Schonfeld, Justin. LOCIDEX: Distributed allele calling engine. 2024. https://github.com/phac-nml/locidex
- [profile_dists](https://github.com/phac-nml/profile_dists) (in-development, citation subject to change)

> Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Profile Dists: Convenient package for comparing genetic similarity of samples based on allelic profiles. 2023. https://github.com/phac-nml/profile_dists
- [genomic_address_service (GAS)](https://github.com/phac-nml/genomic_address_service) (in-development, citation subject to change)

> Robertson, James, Wells, Matthew, Schonfeld, Justin, Reimer, Aleisha. Genomic Address Service: Convenient package for de novo clustering and sample assignment to existing clusters. 2023. https://github.com/phac-nml/genomic_address_service
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) Aaron Petkau
Copyright (c) Government of Canada

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
106 changes: 69 additions & 37 deletions README.md
@@ -1,23 +1,80 @@
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A523.04.3-brightgreen.svg)](https://www.nextflow.io/)

# Example Pipeline for IRIDA Next
# Genomic Address Service Nomenclature Workflow

This is an example pipeline to be used for integration with IRIDA Next.
This workflow takes JSON-formatted MLST allelic profiles and assigns cluster addresses to samples based on existing cluster designations. The pipeline is designed to be integrated into IRIDA Next, but it may also be run as a stand-alone pipeline.

A brief overview of the usage of this pipeline is given below. Detailed documentation can be found in the [docs/](docs/) directory.

# Input

The input to the pipeline is a standard sample sheet (passed as `--input samplesheet.csv`) that looks like:

| sample | fastq_1 | fastq_2 |
| ------- | --------------- | --------------- |
| SampleA | file_1.fastq.gz | file_2.fastq.gz |
| sample | mlst_alleles | address |
| ------- | ----------------- | ------- |
| sampleA | sampleA.mlst.json | 1.1.1 |
| sampleQ | sampleQ.mlst.json | |
| sampleF | sampleF.mlst.json | |

The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).

Details on the columns can be found in the [Full samplesheet](docs/usage.md#full-samplesheet) documentation.

# Parameters

The main parameters are `--input`, as defined above, and `--outdir`, which specifies the output results directory. You may also wish to provide `-profile singularity` to run with Singularity containers and `-r [branch]` to select which GitHub branch to run.

## Distance Method and Thresholds

Profile_Dists and the Genomic Address Service workflows can use two distance methods: hamming or scaled.

### Hamming Distances

Hamming distances are integers representing the number of differing loci between two sequences and will range between [0, n], where `n` is the total number of loci. When using Hamming distances, you must specify `--pd_distm hamming` and provide Hamming distance thresholds as integers between [0, n]: `--gm_thresholds "10,5,0"` (10, 5, and 0 loci).

### Scaled Distances

Scaled distances are floats representing the percentage of differing loci between two sequences and will range between [0.0, 100.0]. When using scaled distances, you must specify `--pd_distm scaled` and provide percentages between [0.0, 100.0] as thresholds: `--gm_thresholds "50,20,0"` (50%, 20%, and 0% of loci).
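
As a rough illustration of the two units described above, the sketch below computes both distances for a pair of toy allelic profiles. It assumes that a profile is a plain dictionary mapping locus names to allele calls and that the scaled distance is the Hamming distance divided by the number of shared loci; the function names are hypothetical and this is not the profile_dists implementation.

```python
def hamming_distance(profile_a, profile_b):
    """Count the shared loci whose allele calls differ (an integer in [0, n])."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] != profile_b[locus])


def scaled_distance(profile_a, profile_b):
    """Express the Hamming distance as a percentage of shared loci (0.0 to 100.0)."""
    shared_loci = profile_a.keys() & profile_b.keys()
    if not shared_loci:
        return 0.0
    return 100.0 * hamming_distance(profile_a, profile_b) / len(shared_loci)


# Toy profiles: four loci, two of which differ.
sample_a = {"locus1": "1", "locus2": "1", "locus3": "2", "locus4": "5"}
sample_b = {"locus1": "1", "locus2": "3", "locus3": "2", "locus4": "7"}

print(hamming_distance(sample_a, sample_b))  # 2
print(scaled_distance(sample_a, sample_b))   # 50.0
```

With these toy profiles, a Hamming threshold of `2` and a scaled threshold of `50` describe the same cut-off, which is why `--gm_thresholds` must use the same unit as `--pd_distm`.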

### Thresholds and Linkage Methods

The `--gm_thresholds` parameter sets thresholds for each cluster level, which dictate how sequences are assigned cluster codes. These thresholds specify the maximum allowable differences in loci between sequences sharing the same cluster code at each level. The consistency of these thresholds in ensuring uniform cluster codes across levels depends on the `--gm_method` parameter, which determines the linkage method used for clustering.

- _Complete Linkage_: When using complete linkage clustering, sequences are grouped such that identical cluster codes at a particular level guarantee that all sequences in that cluster are within the specified threshold distance. For example, specifying `--pd_distm hamming` and `--gm_thresholds "10,5,0"` would mean that sequences with no more than 10 loci differences are assigned the same cluster code at the first level, no more than 5 differences at the second level, and identical sequences at the third level.

- _Average Linkage_: With average linkage clustering, sequences may share the same cluster code if their average distance is below the specified threshold. For instance, sequences with average distances less than 10, 5, and 0 for each level respectively may share the same cluster code.

- _Single Linkage_: Single linkage clustering can result in merging distant samples into the same cluster if there exists a third sample that bridges the distance between them. This method does not provide strict guarantees on the maximum distance within a cluster, potentially allowing distant sequences to share the same cluster code.
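
The difference between the three linkage methods can be sketched by asking whether a query sample may join an existing cluster, given its distances to every member of that cluster. This is a simplified illustration rather than the algorithm used by gas call: the helper names are hypothetical, and treating the threshold as inclusive (`<=`) is an assumption.

```python
from statistics import mean


def linkage_distance(distances, method):
    """Summarise query-to-member distances according to the linkage method."""
    if method == "single":
        return min(distances)   # closest member decides
    if method == "complete":
        return max(distances)   # farthest member decides
    if method == "average":
        return mean(distances)  # average over all members decides
    raise ValueError(f"unknown linkage method: {method}")


def can_join(distances, method, threshold):
    """Return True if the summarised distance is within the threshold."""
    return linkage_distance(distances, method) <= threshold


# Hamming distances from one query sample to three members of a cluster.
distances = [2, 6, 13]

for method in ("single", "average", "complete"):
    print(method, can_join(distances, method, threshold=10))
```

With a threshold of 10, the query joins under single linkage (closest member at distance 2) and average linkage (average of 7), but not under complete linkage, because one member is 13 loci away.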

## Profile_dists

The following can be used to adjust parameters for the [profile_dists][] tool.

- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0.0 and 100.0. Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
- `--pd_missing_threshold`: The maximum proportion of missing data per locus for a locus to be kept in the analysis. Values from 0 to 1.
- `--pd_sample_quality_threshold`: The maximum proportion of missing data per sample for a sample to be kept in the analysis. Values from 0 to 1.
- `--pd_file_type`: Output format file type. One of _text_ or _parquet_.
- `--pd_mapping_file`: A file used to map allele codes to integers for internal distance calculations. This is the same file as produced from the _profile dists_ step (the [allele_map.json](docs/output.md#profile-dists) file). Normally, this is unneeded unless you wish to override the automated process of mapping alleles to integers.
- `--pd_skip`: Skip QA/QC steps. Can be used as a flag, `--pd_skip`, or passing a boolean, `--pd_skip true` or `--pd_skip false`.
- `--pd_columns`: Path to a file that defines the loci to keep within the analysis (default when unset is to keep all loci). Formatted as a single column file with one locus name per line. For example:
- **Single column format**
```
loci1
loci2
loci3
```
- `--pd_count_missing`: Count missing alleles as different. Can be used as a flag, `--pd_count_missing`, or passing a boolean, `--pd_count_missing true` or `--pd_count_missing false`. If true, will consider missing allele calls for the same locus between samples as a difference, increasing the distance counts.
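
The effect of `--pd_missing_threshold` and `--pd_count_missing` can be sketched as follows. This is one possible reading of the descriptions above, not the profile_dists code: the `MISSING` marker, the function names, and the choice to skip a locus when either call is missing (unless missing calls are counted) are all assumptions made for illustration.

```python
MISSING = "-"  # hypothetical marker for a missing allele call


def loci_to_keep(profiles, missing_threshold):
    """Keep loci whose proportion of missing calls is at most the threshold."""
    all_loci = sorted(set().union(*profiles))
    kept = []
    for locus in all_loci:
        n_missing = sum(1 for p in profiles if p.get(locus, MISSING) == MISSING)
        if n_missing / len(profiles) <= missing_threshold:
            kept.append(locus)
    return kept


def hamming_distance(profile_a, profile_b, count_missing=False):
    """Hamming distance where missing calls are skipped or counted as differences."""
    distance = 0
    for locus in profile_a.keys() & profile_b.keys():
        a, b = profile_a[locus], profile_b[locus]
        if a == MISSING or b == MISSING:
            if count_missing:
                distance += 1  # assumption: any missing call counts as a difference
            continue
        if a != b:
            distance += 1
    return distance


profiles = [
    {"locus1": "1", "locus2": MISSING},
    {"locus1": "2", "locus2": MISSING},
    {"locus1": MISSING, "locus2": "4"},
]

print(loci_to_keep(profiles, missing_threshold=0.5))                  # ['locus1']
print(hamming_distance(profiles[0], profiles[1]))                     # 1
print(hamming_distance(profiles[0], profiles[1], count_missing=True)) # 2
```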

## GAS CALL

The following can be used to adjust parameters for the [gas call][] tool.

- `--gm_thresholds`: Thresholds delimited by `,`. Values should match units from `--pd_distm` (either _hamming_ or _scaled_). Please see the [Distance Method and Thresholds](#distance-method-and-thresholds) section for more information.
- `--gm_method`: The linkage method to use for clustering. Value should be one of _single_, _average_, or _complete_.
- `--gm_delimiter`: Delimiter desired for nomenclature code. Must be alphanumeric or one of `._-`.

## Other

Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schema.json).

# Running
Expand All @@ -39,51 +96,26 @@ An example of the what the contents of the IRIDA Next JSON file looks like for t
```
{
"files": {
"global": [
{
"path": "summary/summary.txt.gz"
}
],
"global": [],
"samples": {
"SAMPLE1": [
"sampleF": [
{
"path": "assembly/SAMPLE1.assembly.fa.gz"
"path": "input/sampleF_error_report.csv"
}
],
"SAMPLE2": [
{
"path": "assembly/SAMPLE2.assembly.fa.gz"
}
],
"SAMPLE3": [
{
"path": "assembly/SAMPLE3.assembly.fa.gz"
}
]
}
},
"metadata": {
"samples": {
"SAMPLE1": {
"reads.1": "sample1_R1.fastq.gz",
"reads.2": "sample1_R2.fastq.gz"
},
"SAMPLE2": {
"reads.1": "sample2_R1.fastq.gz",
"reads.2": "sample2_R2.fastq.gz"
},
"SAMPLE3": {
"reads.1": "sample1_R1.fastq.gz",
"reads.2": "null"
"sampleQ": {
"address": "1.1.3"
}
}
}
}
```

Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "assembly/SAMPLE1.assembly.fa.gz"` refers to a file located within `outdir/assembly/SAMPLE1.assembly.fa.gz`.

There is also a pipeline execution summary output file provided (specified in the above JSON as `"global": [{"path":"summary/summary.txt.gz"}]`). However, there is no formatting specification for this file.
Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "input/sampleF_error_report.csv"` refers to a file located within `outdir/input/sampleF_error_report.csv`. This file is generated only if a sample fails the input check during samplesheet assessment.
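
A consumer of this file might resolve the relative paths and read the assigned addresses along the following lines. This is only a sketch: the function name and the `results` output directory are hypothetical, and the JSON content is the example shown above, embedded inline so the snippet runs on its own.

```python
import json
from pathlib import Path

# Example content mirroring the IRIDA Next JSON shown above.
IRIDANEXT_JSON = """
{
    "files": {
        "global": [],
        "samples": {
            "sampleF": [{"path": "input/sampleF_error_report.csv"}]
        }
    },
    "metadata": {
        "samples": {
            "sampleQ": {"address": "1.1.3"}
        }
    }
}
"""


def resolve_sample_files(iridanext, outdir):
    """Map each sample to its output files, resolved relative to outdir."""
    return {
        sample: [Path(outdir) / entry["path"] for entry in entries]
        for sample, entries in iridanext["files"]["samples"].items()
    }


iridanext = json.loads(IRIDANEXT_JSON)
files = resolve_sample_files(iridanext, "results")  # "results" is a placeholder outdir
addresses = {
    sample: meta["address"]
    for sample, meta in iridanext["metadata"]["samples"].items()
}

print(files["sampleF"][0])   # results/input/sampleF_error_report.csv on POSIX systems
print(addresses["sampleQ"])  # 1.1.3
```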

## Test profile

Expand All @@ -95,7 +127,7 @@ nextflow run phac-nml/gasnomenclature -profile docker,test -r main -latest --out

# Legal

Copyright 2023 Government of Canada
Copyright 2024 Government of Canada

Licensed under the MIT License (the "License"); you may not use
this work except in compliance with the License. You may obtain a copy of the
9 changes: 5 additions & 4 deletions assets/samplesheet.csv
@@ -1,4 +1,5 @@
sample,fastq_1,fastq_2
SAMPLE1,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz
SAMPLE2,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz
SAMPLE3,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz,
sample,mlst_alleles,address
sampleQ,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sampleQ.mlst.json,
sample1,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample1.mlst.json,1.1.1
sample2,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample2.mlst.json,1.1.1
sample3,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample3.mlst.json,1.1.2
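
In the new samplesheet, samples with an `address` act as references and samples with an empty `address` are the queries that need one assigned. A minimal sketch of that split is shown below; the function name is hypothetical, and the file names are shortened stand-ins for the URLs above.

```python
import csv
import io

# Shortened stand-in for assets/samplesheet.csv (URLs replaced by file names).
SAMPLESHEET = """sample,mlst_alleles,address
sampleQ,sampleQ.mlst.json,
sample1,sample1.mlst.json,1.1.1
sample2,sample2.mlst.json,1.1.1
sample3,sample3.mlst.json,1.1.2
"""


def split_samples(handle):
    """Separate reference samples (with an address) from query samples (without)."""
    references = {}
    queries = []
    for row in csv.DictReader(handle):
        address = (row.get("address") or "").strip()
        if address:
            references[row["sample"]] = address
        else:
            queries.append(row["sample"])
    return references, queries


references, queries = split_samples(io.StringIO(SAMPLESHEET))
print(references)  # {'sample1': '1.1.1', 'sample2': '1.1.1', 'sample3': '1.1.2'}
print(queries)     # ['sampleQ']
```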
19 changes: 10 additions & 9 deletions assets/schema_input.json
Expand Up @@ -14,19 +14,20 @@
"unique": true,
"errorMessage": "Sample name must be provided and cannot contain spaces"
},
"profile_type": {
"meta": ["profile_type"],
"description": "Determines has already been clustered (True) or if it is new, and requiring nomenclature assignment (False)",
"errorMessage": "Please specify if the mlst profile has already been clustered (True) or if it is new and requires nomenclature assignment (False)",
"type": "boolean"
},
"mlst_alleles": {
"type": "string",
"format": "file-path",
"pattern": "^\\S+\\.mlst\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json' or '.mlst.json.gz'"
"pattern": "^\\S+\\.mlst(\\.subtyping)?\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json', '.mlst.json.gz', '.mlst.subtyping.json', or '.mlst.subtyping.json.gz'"
},
"address": {
"type": "string",
"pattern": "^\\d+(\\.\\d+)*$",
"meta": ["address"],
"description": "The loci-based typing identifier (address) of the sample",
"errorMessage": "Invalid loci-based typing identifier. Please ensure that the address follows the correct format, consisting of one or more digits separated by periods. Example of a valid identifier: '1.1.1'. Please review and correct the entry"
}
},
"required": ["sample", "profile_type", "mlst_alleles"]
"required": ["sample", "mlst_alleles"]
}
}
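
The two `pattern` values in the updated schema can be checked quickly outside the pipeline. The sketch below uses Python's `re` module as a stand-in for the schema validator, on the assumption that both engines treat these particular patterns the same way; the helper names are hypothetical.

```python
import re

# Patterns copied from the updated schema (JSON escaping removed).
MLST_ALLELES_PATTERN = re.compile(r"^\S+\.mlst(\.subtyping)?\.json(\.gz)?$")
ADDRESS_PATTERN = re.compile(r"^\d+(\.\d+)*$")


def is_valid_mlst_path(path):
    """True if the path has no spaces and ends in an accepted MLST JSON extension."""
    return MLST_ALLELES_PATTERN.search(path) is not None


def is_valid_address(address):
    """True if the address is one or more integers separated by periods."""
    return ADDRESS_PATTERN.search(address) is not None


for path in ("sampleQ.mlst.json", "sampleQ.mlst.subtyping.json.gz", "sample Q.mlst.json", "sampleQ.json"):
    print(path, is_valid_mlst_path(path))

for address in ("1.1.1", "12", "1..1", "1.1."):
    print(address, is_valid_address(address))
```

The first two paths and the first two addresses pass; the others are rejected because of the space, the missing `.mlst` component, the empty address level, and the trailing period respectively.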