nf-core · vagkaratzas · Nov 5, 2024 · Nov 6, 2024
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -1,9 +1,9 @@
 lint:
   files_exist:
-    - conf/igenomes.config
-    - conf/igenomes_ignored.config
-    - conf/igenomes.config
-    - conf/igenomes_ignored.config
+  - conf/igenomes.config
+  - conf/igenomes_ignored.config
+  - conf/igenomes.config
+  - conf/igenomes_ignored.config
 nf_core_version: 3.2.0
 repository_type: pipeline
 template:
@@ -16,6 +16,6 @@ template:
   org: nf-core
   outdir: .
   skip_features:
-    - igenomes
-    - fastqc
-  version: 1.0.0dev
+  - igenomes
+  - fastqc
+  version: 1.0.0
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,14 +3,16 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v1.0.0dev - [date]
+## v1.0.0 - [2025/02/03]
-## v1.0.0 - [2025/02/03]
+## v1.0.0 - [2025/02/05]
 
 Initial release of nf-core/proteinfamilies, created with the [nf-core](https://nf-co.re/) template.
 
 ### `Added`
 
-### `Fixed`
-
-### `Dependencies`
-
-### `Deprecated`
+- Amino acid sequence clustering (mmseqs)
+- Multiple sequence alignment (famsa, mafft, clipkit)
+- Hidden Markov Model generation (hmmer)
+- Between families redundancy removal (hmmer)
+- In-family sequence redundancy removal (mmseqs)
+- Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
+- Family statistics presentation (multiqc)
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,6 +10,34 @@
 
 ## Pipeline tools
 
+- [MMseqs2](https://pubmed.ncbi.nlm.nih.gov/33734313/)
+
+> Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics. 2021 Sep 15;37(18):3029-31. doi: 10.1093/bioinformatics/btab184. PubMed PMID: 33734313; PubMed Central PMCID: PMC8479651.
+
+- [FAMSA](https://pubmed.ncbi.nlm.nih.gov/27670777/)
+
+> Deorowicz S, Debudaj-Grabysz A, Gudyś A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Scientific reports. 2016 Sep 27;6(1):33964. doi: 10.1038/srep33964. PubMed PMID: 27670777; PubMed Central PMCID: PMC5037421.
+
+- [mafft](https://pubmed.ncbi.nlm.nih.gov/23329690/)
+
+> Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution. 2013 Jan 16;30(4):772-80. doi: 10.1093/molbev/mst010. PubMed PMID: 23329690; PubMed Central PMCID: PMC3603318.
+
+- [ClipKIT](https://pubmed.ncbi.nlm.nih.gov/33264284/)
+
+> Steenwyk JL, Buida III TJ, Li Y, Shen XX, Rokas A. ClipKIT: a multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS biology. 2020 Dec 2;18(12):e3001007. doi: 10.1371/journal.pbio.3001007. PubMed PMID: 33264284; PubMed Central PMCID: PMC7735675.
+
+- [hmmer](https://pubmed.ncbi.nlm.nih.gov/29905871/)
+
+> Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic acids research. 2018 Jul 2;46(W1):W200-4. doi: 10.1093/nar/gky448. PubMed PMID: 29905871; PubMed Central PMCID: PMC6030962.
+
+- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/38898985/)
+
+> Shen W, Sipos B, Zhao L. SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta. 2024 Apr 5:e191. doi: 10.1002/imt2.191. PubMed PMID: 38898985; PubMed Central PMCID: PMC11183193.
+
+- [Biopython](https://pubmed.ncbi.nlm.nih.gov/19304878/)
+
+> Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, De Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 Jun 6;25(11):1422. doi: 10.1093/bioinformatics/btp163. PubMed PMID: 19304878; PubMed Central PMCID: PMC2682512.
+
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
 > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

diff --git a/README.md b/README.md
@@ -19,43 +19,57 @@
 
 ## Introduction
 
-**nf-core/proteinfamilies** is a bioinformatics pipeline that ...
+**nf-core/proteinfamilies** is a bioinformatics pipeline that generates protein families from amino acid sequences and/or updates existing families with new sequences.
+It takes a protein fasta file as input, clusters the sequences and then generates protein family Hidden Markov Models (HMMs) along with their multiple sequence alignments (MSAs).
+Optionally, paths to existing family HMMs and MSAs can be given (their base filenames must match one-to-one), in which case families hit by new input sequences are updated with those sequences.
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+<p align="center">
+    <img src="docs/images/proteinfamilies_workflow.png" alt="nf-core/proteinfamilies workflow overview">
+</p>
+
+### Create families
+
+1. Cluster sequences ([`MMseqs2`](https://github.com/soedinglab/MMseqs2/))
+2. Perform multiple sequence alignment (MSA) ([`FAMSA`](https://github.com/refresh-bio/FAMSA/) or [`mafft`](https://github.com/GSLBiotech/mafft/))
+3. Optionally, clip gap parts of the MSA ([`ClipKIT`](https://github.com/JLSteenwyk/ClipKIT/))
+4. Generate family HMMs and fish additional sequences into the family ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
+5. Optionally, remove redundant families by comparing family representative sequences against family models ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
+6. Optionally, from the remaining families, remove in-family redundant sequences by strictly clustering them ([`MMseqs2`](https://github.com/soedinglab/MMseqs2/)) and keeping cluster representatives
+7. Present statistics for remaining/updated families size distributions and representative sequence lengths ([`MultiQC`](http://multiqc.info/))
+
+### Update families
+
+1. Find which families to update by comparing the input sequences against existing family models ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
+2. For sequences with no hits, continue with the "Create families" steps above. For hit sequences and their families, continue to step 3
+3. Extract family sequences ([`SeqKit`](https://github.com/shenwei356/seqkit/)) and concatenate with filtered hit sequences of each family
+4. Optionally, remove in-family redundant sequences by strictly clustering them ([`MMseqs2`](https://github.com/soedinglab/MMseqs2/)) and keeping cluster representatives
+5. Perform multiple sequence alignment (MSA) ([`FAMSA`](https://github.com/refresh-bio/FAMSA/) or [`mafft`](https://github.com/GSLBiotech/mafft/))
+6. Optionally, clip gap parts of the MSA ([`ClipKIT`](https://github.com/JLSteenwyk/ClipKIT/))
+7. Update the family HMM ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
 
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+sample,fasta,existing_hmms_to_update,existing_msas_to_update
+CONTROL_REP1,input/mgnifams_input_small.fa,,
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-
--->
+Each row contains a fasta file with amino acid sequences (gzipped or uncompressed).
+Optionally, a row may also contain tarball archives (tar.gz) of existing families' HMM and MSA folders to be updated.
+In this case, the HMM and MSA files must match in number and in base filename (ignoring the extension).
+Families with hits will be updated with their hit sequences, while sequences with no hits will be used to create new families.
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run nf-core/proteinfamilies \
    -profile <docker/singularity/.../institute> \
@@ -80,7 +94,7 @@ nf-core/proteinfamilies was originally written by Evangelos Karatzas.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+- [Martin Beracochea](https://github.com/mberacochea)
 
 ## Contributions and Support
 
@@ -93,8 +107,6 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
 <!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
 <!-- If you use nf-core/proteinfamilies for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->
 
-<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
-
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 
 You can cite the `nf-core` publication as follows:

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -1,7 +1,8 @@
 report_comment: >
-  This report has been generated by the <a href="https://github.com/nf-core/proteinfamilies/tree/dev" target="_blank">nf-core/proteinfamilies</a>
-  analysis pipeline. For information about how to interpret these results, please see the
-  <a href="https://nf-co.re/proteinfamilies/dev/docs/output" target="_blank">documentation</a>.
+  This report has been generated by the <a href="https://github.com/nf-core/proteinfamilies/releases/tag/1.0.0"
+  target="_blank">nf-core/proteinfamilies</a> analysis pipeline. For information about
+  how to interpret these results, please see the <a href="https://nf-co.re/proteinfamilies/1.0.0/docs/output"
+  target="_blank">documentation</a>.
 report_section_order:
   "nf-core-proteinfamilies-methods-description":
     order: -1000

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,2 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,fasta,existing_hmms_to_update,existing_msas_to_update
+mgnifams_test,https://github.com/nf-core/test-datasets/raw/proteinfamilies/test_data/mgnifams_input_small.fa,,
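
The `existing_hmms_to_update` and `existing_msas_to_update` columns added here point at archives whose HMM and MSA files must pair up by base filename, as the README text in this PR states. A minimal sketch of such a pre-flight check follows, written in Python purely for illustration; it is not part of the pipeline. The helper names `base_name` and `unmatched`, and the example extensions `.hmm.gz` and `.aln`, are assumptions and do not come from this diff.

```python
from pathlib import PurePosixPath


def base_name(path: str) -> str:
    """Return the file name without directories, an optional '.gz', and one extension.

    Assumption: a family file carries a single extension, optionally followed by '.gz'.
    """
    name = PurePosixPath(path).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    return PurePosixPath(name).stem


def unmatched(hmm_files, msa_files):
    """Return (HMM base names without an MSA, MSA base names without an HMM)."""
    hmm_names = {base_name(p) for p in hmm_files}
    msa_names = {base_name(p) for p in msa_files}
    return sorted(hmm_names - msa_names), sorted(msa_names - hmm_names)


if __name__ == "__main__":
    # Hypothetical archive listings; real member names may differ.
    hmms = ["hmm/fam1.hmm.gz", "hmm/fam2.hmm.gz"]
    msas = ["msa/fam1.aln", "msa/fam3.aln"]
    print(unmatched(hmms, msas))
```

Both returned lists are empty when every HMM has exactly one MSA partner by base name. The member lists could be taken from the two tarballs with the standard library `tarfile` module before calling `unmatched`.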
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -13,21 +13,28 @@
                 "errorMessage": "Sample name must be provided and cannot contain spaces",
                 "meta": ["id"]
             },
-            "fastq_1": {
+            "fasta": {
                 "type": "string",
                 "format": "file-path",
                 "exists": true,
-                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
-                "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+                "pattern": "^\\S+\\.(fa|fasta|faa|fas)(\\.gz)?$",
+                "errorMessage": "Fasta file for amino acid sequences must be provided, cannot contain spaces and must have extension '.fa', '.fasta', '.faa', '.fas', '.fa.gz', '.fasta.gz', '.faa.gz' or '.fas.gz'"
             },
-            "fastq_2": {
+            "existing_hmms_to_update": {
                 "type": "string",
                 "format": "file-path",
                 "exists": true,
-                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
-                "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+                "pattern": "^\\S+\\.tar\\.gz$",
+                "description": "Gzipped tarball file containing existing protein family HMMs. These models will be used to 'fish' new sequences from the input and then be updated accordingly."
+            },
+            "existing_msas_to_update": {
+                "type": "string",
+                "format": "file-path",
+                "exists": true,
+                "pattern": "^\\S+\\.tar\\.gz$",
+                "description": "Gzipped tarball file containing multiple sequence alignments (MSAs) for the families to be updated. These alignments are essential for the update process and their filenames should match the HMM filenames one-to-one."
             }
         },
-        "required": ["sample", "fastq_1"]
+        "required": ["sample", "fasta"]
     }
 }
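
To see what the new `pattern` entries accept, the sketch below applies the same regular expressions to samplesheet rows in Python. It is an illustration only: the pipeline's own validation runs against this JSON schema, and the function name `check_row` is hypothetical. The check covers only the `required` fields and the filename patterns shown in this diff; it does not test that files exist, nor the `sample` pattern, which lies outside the displayed hunk.

```python
import re

# Patterns copied from schema_input.json in this diff, with JSON escaping removed.
FASTA_PATTERN = re.compile(r"^\S+\.(fa|fasta|faa|fas)(\.gz)?$")
TARBALL_PATTERN = re.compile(r"^\S+\.tar\.gz$")


def check_row(row: dict) -> list:
    """Return a list of problems found in one samplesheet row (empty if none)."""
    problems = []
    if not row.get("sample"):
        problems.append("sample is required")
    fasta = row.get("fasta") or ""
    if not fasta:
        problems.append("fasta is required")
    elif not FASTA_PATTERN.search(fasta):
        problems.append(f"fasta does not match the expected extensions: {fasta}")
    # The two update columns are optional, so empty values are accepted.
    for column in ("existing_hmms_to_update", "existing_msas_to_update"):
        value = row.get(column) or ""
        if value and not TARBALL_PATTERN.search(value):
            problems.append(f"{column} must end in .tar.gz: {value}")
    return problems


if __name__ == "__main__":
    row = {
        "sample": "mgnifams_test",
        "fasta": "mgnifams_input_small.fa",
        "existing_hmms_to_update": "",
        "existing_msas_to_update": "",
    }
    print(check_row(row))
```

Rows read with `csv.DictReader` from a samplesheet can be passed to `check_row` directly, since empty cells arrive as empty strings.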