initial pipeline pseudo PR #17

Draft
wants to merge 163 commits into base: TEMPLATE
Changes from 161 commits
7a768c5
modules installed
vagkaratzas Nov 5, 2024
821f73e
input samplesheet update to fasta
vagkaratzas Nov 6, 2024
84f8965
clustering modules chained
vagkaratzas Nov 6, 2024
e5bc24f
nf-core createtsv test init
vagkaratzas Nov 6, 2024
1802206
debugged targetdb copy by staging
vagkaratzas Nov 6, 2024
40e02f2
syncing samples by meta, module diffs
vagkaratzas Nov 8, 2024
d362d9e
EXECUTE_CLUSTERING subworkflow done
vagkaratzas Nov 8, 2024
af91b89
Merge pull request #1 from vagkaratzas/createtsv-tests
vagkaratzas Nov 8, 2024
0c56473
chunk_clusters nf and py init
vagkaratzas Nov 11, 2024
ace2dbd
cluster chunkcs complete
vagkaratzas Nov 11, 2024
65381a7
Merge pull request #2 from vagkaratzas/chunking_clusters
vagkaratzas Nov 11, 2024
b1309f5
seqera container for local python module
vagkaratzas Nov 11, 2024
a84dbe6
clustering parameters added, chunking fasta instead of tsv to then fe…
vagkaratzas Nov 11, 2024
2e46012
removed pandas from chunking script, optimised for memory
vagkaratzas Nov 11, 2024
f2983bc
msa input formatted, famsa run properly
vagkaratzas Nov 12, 2024
6f52683
meta updated to chunks level, hmmbuild chained to workflow
vagkaratzas Nov 12, 2024
2d48faf
HMMER_HMMSEARCH chained, inputs combined successfully on meta id, met…
vagkaratzas Nov 12, 2024
54f7a05
minor publish path change
vagkaratzas Nov 12, 2024
5d9e7e9
hmmsearch evalue cutoff and output full MSA by default
vagkaratzas Nov 13, 2024
a39d649
generate_families converted into subworkflow
vagkaratzas Nov 13, 2024
e203af3
mafft alternative alignment tool
vagkaratzas Nov 13, 2024
e24a63b
hmmbuild updated
vagkaratzas Nov 14, 2024
4c639f0
added support for .gz input fasta
vagkaratzas Nov 14, 2024
b4a054f
Preparing input channel to extract family representative sequences fo…
vagkaratzas Nov 14, 2024
2b6b4a8
added clipkit module, fixing some linting warnings, updated subworkflows
vagkaratzas Nov 14, 2024
c136435
extract family reps script
vagkaratzas Nov 19, 2024
5e068d1
removed prokka module
vagkaratzas Nov 19, 2024
b0ee92d
export fam to rep mapping
vagkaratzas Nov 21, 2024
82742e7
EXTRACT_FAMILY_REPS publishDir
vagkaratzas Nov 21, 2024
8b444df
Merge pull request #4 from vagkaratzas/prokka
vagkaratzas Nov 21, 2024
ea8e300
custom multiqc with family metadata
vagkaratzas Nov 22, 2024
dfaae41
added label and when blocks to custom modules to fix linting warnings
vagkaratzas Nov 22, 2024
53ea42f
input samplesheet and test data hosted in nf-core test-datasets repo,…
vagkaratzas Nov 22, 2024
eb08a3c
updated mafft module to mafft-align
vagkaratzas Dec 5, 2024
5283390
subworkflows updated
vagkaratzas Dec 5, 2024
18be6da
updated readme and citations doc
vagkaratzas Dec 5, 2024
f7bc3f5
usage.md updated
vagkaratzas Dec 5, 2024
fb54976
clustering outputs updated and written in output.md
vagkaratzas Dec 5, 2024
c9ac121
aligners outputs updated
vagkaratzas Dec 5, 2024
1aa92cd
hmmer outputs updated + documentation
vagkaratzas Dec 5, 2024
619f701
gap_threshold aprameter for clipkit
vagkaratzas Dec 6, 2024
39a037f
mmseqs suite version bump
vagkaratzas Dec 6, 2024
e95a1ff
clip_ends only local module
vagkaratzas Dec 6, 2024
cbd1292
Merge pull request #5 from vagkaratzas/clip-ends-only
vagkaratzas Dec 6, 2024
86a950b
nf-core subworkflows update
vagkaratzas Dec 8, 2024
2df0b34
hmmsearch results filtering by length, fasta instead of stockholm ref…
vagkaratzas Dec 9, 2024
1a2d333
init redundancy_check subworkflow
vagkaratzas Dec 9, 2024
98fb26c
execute_clustering subworkflow updated and reused in redundancy remov…
vagkaratzas Dec 10, 2024
8ae6410
TODOs added to continue with redundancy mechanisms
vagkaratzas Dec 10, 2024
b42db21
remove_redundant_seqs module done, sequence align converted to subwor…
vagkaratzas Dec 11, 2024
7aa1e3c
concat_hmms module
vagkaratzas Dec 11, 2024
21c3d19
CONCAT_HMMS module, HMMER_HMMSEARCH for family redundancy checking, R…
vagkaratzas Dec 11, 2024
39afbb9
remove_self_hits function
vagkaratzas Dec 11, 2024
5ddbf77
filter_by_length
vagkaratzas Dec 11, 2024
69b054e
remove_redundant_fams module done and container updated to pandas
vagkaratzas Dec 11, 2024
480bc2a
filtered sequence names updated, results folder architecture updated
vagkaratzas Dec 12, 2024
e5e28dc
Merge pull request #6 from vagkaratzas/remove-family-redundancy
vagkaratzas Dec 12, 2024
178ff6b
Merge pull request #7 from vagkaratzas/redundancy_check
vagkaratzas Dec 12, 2024
2f0e75c
readme and output.md doc updates
vagkaratzas Dec 13, 2024
7b16c7c
metro map updated
vagkaratzas Dec 13, 2024
dfe39fb
non redundant hmm filtering
vagkaratzas Dec 16, 2024
940659c
Merge branch 'sync' into merge_template
vagkaratzas Dec 16, 2024
9f22b4e
Merge pull request #8 from vagkaratzas/merge_template
vagkaratzas Dec 16, 2024
92a3fdb
manifest update and modules config proper sourcing
vagkaratzas Dec 16, 2024
8e6d4bd
Merge pull request #9 from vagkaratzas/sync
vagkaratzas Dec 16, 2024
baf48e9
future proofing multiqc numeric protein name bug (proper strings will…
vagkaratzas Dec 16, 2024
97fcd22
smallest test skipping clipping of msas and redundancy removal mechanism
vagkaratzas Dec 16, 2024
ca3b434
make recruiting of sequences with hmmsearch optional, both tests runn…
vagkaratzas Dec 16, 2024
f64d5de
Merge pull request #11 from vagkaratzas/bug-fix--multiqc-numeric-prot…
vagkaratzas Dec 17, 2024
0f2bda3
minor reformats
vagkaratzas Dec 17, 2024
7835d65
Merge pull request #10 from vagkaratzas/new-test--minimal-test-withou…
vagkaratzas Dec 17, 2024
ec73bfa
This changeset included a WIP update workflow.
mberacochea Dec 20, 2024
6d1083b
Fix a missing bit in the update
mberacochea Dec 22, 2024
5ab2ce6
This changeset included a WIP update workflow.
mberacochea Dec 20, 2024
42a6105
Fix a missing bit in the update
mberacochea Dec 22, 2024
deb184b
Merge branch 'update_subworkflow_vag' of https://github.com/vagkaratz…
vagkaratzas Dec 23, 2024
185802e
correct and rearrange stuff in remove_redundancy subworkflow
vagkaratzas Dec 23, 2024
1cb45ff
sync to latest template
vagkaratzas Dec 23, 2024
8b36bd3
nf-core yml version updated
vagkaratzas Dec 23, 2024
0a6b233
multiQC stuff undo linting
vagkaratzas Dec 23, 2024
8437b96
update_families moved to its own subworkflow
vagkaratzas Dec 23, 2024
2dd40f7
restructured update_families until hmmer_hmmsearch
vagkaratzas Dec 23, 2024
c7c3df6
branch hits fasta
vagkaratzas Dec 24, 2024
71c39a1
cat_cat module config
vagkaratzas Dec 24, 2024
7d0ee54
removed unused modules, binding proper family msas with recruited seq…
vagkaratzas Dec 24, 2024
c2e20f1
seqkit + cat fasta with family specific msa
vagkaratzas Dec 24, 2024
7f1a7c6
prefixes up to clustering updated
vagkaratzas Dec 24, 2024
ae129a2
update_families subworkflow done
vagkaratzas Dec 24, 2024
7aa8c0b
clipends file extension updated
vagkaratzas Dec 24, 2024
a723a42
metro map and output.md updated
vagkaratzas Dec 25, 2024
3d17c4f
updated readme and citations
vagkaratzas Dec 25, 2024
59b006a
validation schema updated to allow either existing hmm and msa folder…
vagkaratzas Dec 26, 2024
b71c210
untar logic implemented, removed support for simple folders due to co…
vagkaratzas Dec 26, 2024
e31cdc9
test configs updated
vagkaratzas Dec 26, 2024
6c9008c
samplesheet description updated
vagkaratzas Dec 26, 2024
da80843
Merge pull request #14 from vagkaratzas/test_data_update
vagkaratzas Jan 2, 2025
01502b7
Update README.md
vagkaratzas Jan 2, 2025
27de55f
python errors to stderr
vagkaratzas Jan 2, 2025
e81cc77
Update workflows/proteinfamilies.nf
vagkaratzas Jan 2, 2025
2ebf65a
Update docs/output.md
vagkaratzas Jan 2, 2025
37a2c64
Update docs/output.md
vagkaratzas Jan 2, 2025
3406075
Update docs/output.md
vagkaratzas Jan 2, 2025
5d93651
syncing of seqs and clusters after clustering done inside the execute…
vagkaratzas Jan 2, 2025
99a62a9
file.getSimpleName
vagkaratzas Jan 2, 2025
b542eb6
chunk in REMOVE_REDUNDANT_SEQS tag
vagkaratzas Jan 2, 2025
b98c514
Update docs/output.md
vagkaratzas Jan 2, 2025
c0cbae9
regex for parsed hit sequence name and range
vagkaratzas Jan 3, 2025
81d0d67
validateMatchingFolders groovy function to check for valid existing H…
vagkaratzas Jan 3, 2025
bb82f71
Merge pull request #13 from vagkaratzas/update_subworkflow_vag
vagkaratzas Jan 7, 2025
de8421f
branch hits fasta updated to use env coords of domtbl
vagkaratzas Jan 9, 2025
ed1cc98
tryCatch block added in filter_recruited function of branch_hits_fasta
vagkaratzas Jan 9, 2025
040464e
filter recruited updated -not using auto full alignment from hmmsearc…
vagkaratzas Jan 10, 2025
7577bcc
HMMER_HMMALIGN added, checkpoint
vagkaratzas Jan 10, 2025
85272bd
remove redundancy updated to work with sto alignments as well
vagkaratzas Jan 10, 2025
7a74d6e
metro map updated with hmmalign
vagkaratzas Jan 10, 2025
49acd97
metro updated with hmmalign
vagkaratzas Jan 10, 2025
54980f7
output.md updated
vagkaratzas Jan 10, 2025
51e0b9b
filter_non_redundant_hmms updated to work properly even without fishi…
vagkaratzas Jan 10, 2025
bf099ff
Merge pull request #1 from vagkaratzas/using-envelope-coords
vagkaratzas Jan 10, 2025
dc80ef6
Merge pull request #15 from vagkaratzas/dev
vagkaratzas Jan 10, 2025
f4849e9
usage.md samplesheet input description update
vagkaratzas Jan 10, 2025
9f022f3
Merge pull request #18 from vagkaratzas/dev
vagkaratzas Jan 10, 2025
36bc897
Merge branch 'dev' into nf-core-template-merge-3.1.2
vagkaratzas Jan 27, 2025
6d4442c
Merge pull request #19 from nf-core/nf-core-template-merge-3.1.2
vagkaratzas Jan 27, 2025
dc5a884
modules update (mmseqs, multiqc, seqkit)
vagkaratzas Jan 27, 2025
f630ac0
Merge pull request #20 from vagkaratzas/dev
vagkaratzas Jan 27, 2025
3b377ab
Update README.md
vagkaratzas Jan 28, 2025
251f7e7
Merge branch 'dev' into nf-core-template-merge-3.2.0
vagkaratzas Jan 28, 2025
f3caa75
Merge pull request #21 from nf-core/nf-core-template-merge-3.2.0
vagkaratzas Jan 28, 2025
6b877be
local modules mem reqs in base.config
vagkaratzas Jan 28, 2025
906eec2
Merge pull request #22 from vagkaratzas/dev
vagkaratzas Jan 28, 2025
704ebe4
metro map updated and svg added
vagkaratzas Jan 31, 2025
a5a8383
changelog updated for first release
vagkaratzas Jan 31, 2025
02ec8aa
Biopython citation added
vagkaratzas Jan 31, 2025
891706b
schema pattern updated to include faa and fas input files
vagkaratzas Jan 31, 2025
228659e
Update assets/schema_input.json
vagkaratzas Jan 31, 2025
a777d52
gzipped tarball description specification
vagkaratzas Jan 31, 2025
e83d6b4
python script licensing
vagkaratzas Jan 31, 2025
00c2c6f
Update conf/test_full.config
vagkaratzas Jan 31, 2025
e3db3aa
resourceLimits removed from test_full
vagkaratzas Jan 31, 2025
bdb1bec
removed unused process block from test_full config
vagkaratzas Jan 31, 2025
bc594f8
Parameter specifications in usage.md
vagkaratzas Jan 31, 2025
dd2cc97
Update subworkflows/local/utils_nfcore_proteinfamilies_pipeline/main.nf
vagkaratzas Jan 31, 2025
07d52a3
checkIfExists: true added to file()
vagkaratzas Jan 31, 2025
f6cb471
Update nextflow_schema.json
vagkaratzas Jan 31, 2025
0ce6ee0
Update nextflow.config
vagkaratzas Jan 31, 2025
f050d84
Update docs/output.md
vagkaratzas Jan 31, 2025
1b2d81b
Update docs/usage.md
vagkaratzas Jan 31, 2025
921cf06
Update docs/output.md
vagkaratzas Jan 31, 2025
bdc8627
batch commit suggestions from james
vagkaratzas Jan 31, 2025
0ab063f
ch_ notation instead of set {}, .map changed to multiline everywhere,…
vagkaratzas Jan 31, 2025
9be00f7
assets/samplesheet.csv added back
vagkaratzas Jan 31, 2025
e39a349
modules and subworkflows moved to respective folders + env
vagkaratzas Jan 31, 2025
f64c769
prefix instead of meta.id for output naming in two of the local modules
vagkaratzas Jan 31, 2025
d6e2e3e
only main author mail kept
vagkaratzas Jan 31, 2025
86cdb61
removed duplicate includeConfig 'conf/modules.config'
vagkaratzas Jan 31, 2025
99de435
interpretation of folder outputs in output.md, strict clustering thre…
vagkaratzas Jan 31, 2025
ada9368
help_text in most schema params
vagkaratzas Jan 31, 2025
acb9465
Merge pull request #23 from vagkaratzas/dev
vagkaratzas Jan 31, 2025
d3a9e21
first release version bump (1.0.0)
vagkaratzas Feb 3, 2025
16c1ca9
Merge pull request #24 from vagkaratzas/dev
vagkaratzas Feb 3, 2025
6ffb2df
Update CHANGELOG.md
jfy133 Feb 5, 2025
a380e49
[automated] Fix code linting
nf-core-bot Feb 5, 2025
14 changes: 7 additions & 7 deletions .nf-core.yml
@@ -1,9 +1,9 @@
lint:
files_exist:
- conf/igenomes.config
- conf/igenomes_ignored.config
- conf/igenomes.config
- conf/igenomes_ignored.config
- conf/igenomes.config
- conf/igenomes_ignored.config
- conf/igenomes.config
- conf/igenomes_ignored.config
nf_core_version: 3.2.0
repository_type: pipeline
template:
@@ -16,6 +16,6 @@ template:
org: nf-core
outdir: .
skip_features:
- igenomes
- fastqc
version: 1.0.0dev
- igenomes
- fastqc
version: 1.0.0
14 changes: 8 additions & 6 deletions CHANGELOG.md
@@ -3,14 +3,16 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.0.0dev - [date]
## v1.0.0 - [2025/02/03]
Review comment (Member), suggested change:
## v1.0.0 - [2025/02/03]
## v1.0.0 - [2025/02/05]


Initial release of nf-core/proteinfamilies, created with the [nf-core](https://nf-co.re/) template.

### `Added`

### `Fixed`

### `Dependencies`

### `Deprecated`
- Amino acid sequence clustering (mmseqs)
- Multiple sequence alignment (famsa, mafft, clipkit)
- Hidden Markov Model generation (hmmer)
- Between families redundancy removal (hmmer)
- In-family sequence redundancy removal (mmseqs)
- Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
- Family statistics presentation (multiqc)
28 changes: 28 additions & 0 deletions CITATIONS.md
@@ -10,6 +10,34 @@

## Pipeline tools

- [MMseqs2](https://pubmed.ncbi.nlm.nih.gov/33734313/)

> Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics. 2021 Sep 15;37(18):3029-31. doi: 10.1093/bioinformatics/btab184. PubMed PMID: 33734313; PubMed Central PMCID: PMC8479651.

- [FAMSA](https://pubmed.ncbi.nlm.nih.gov/27670777/)

> Deorowicz S, Debudaj-Grabysz A, Gudyś A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Scientific reports. 2016 Sep 27;6(1):33964. doi: 10.1038/srep33964. PubMed PMID: 27670777; PubMed Central PMCID: PMC5037421.

- [mafft](https://pubmed.ncbi.nlm.nih.gov/23329690/)

> Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution. 2013 Jan 16;30(4):772-80. doi: 10.1093/molbev/mst010. PubMed PMID: 23329690; PubMed Central PMCID: PMC3603318.

- [ClipKIT](https://pubmed.ncbi.nlm.nih.gov/33264284/)

> Steenwyk JL, Buida III TJ, Li Y, Shen XX, Rokas A. ClipKIT: a multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS biology. 2020 Dec 2;18(12):e3001007. doi: 10.1371/journal.pbio.3001007. PubMed PMID: 33264284; PubMed Central PMCID: PMC7735675.

- [hmmer](https://pubmed.ncbi.nlm.nih.gov/29905871/)

> Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic acids research. 2018 Jul 2;46(W1):W200-4. doi: 10.1093/nar/gky448. PubMed PMID: 29905871; PubMed Central PMCID: PMC6030962.

- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/38898985/)

> Shen W, Sipos B, Zhao L. SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta. 2024 Apr 5:e191. doi: 10.1002/imt2.191. PubMed PMID: 38898985; PubMed Central PMCID: PMC11183193.

- [Biopython](https://pubmed.ncbi.nlm.nih.gov/19304878/)

> Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, De Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 Jun 6;25(11):1422. doi: 10.1093/bioinformatics/btp163. PubMed PMID: 19304878; PubMed Central PMCID: PMC2682512.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
54 changes: 33 additions & 21 deletions README.md
@@ -19,43 +19,57 @@

## Introduction

**nf-core/proteinfamilies** is a bioinformatics pipeline that ...
**nf-core/proteinfamilies** is a bioinformatics pipeline that generates protein families from amino acid sequences and/or updates existing families with new sequences.
It takes a protein fasta file as input, clusters the sequences, and then generates protein family Hidden Markov Models (HMMs) along with their multiple sequence alignments (MSAs).
Optionally, paths to existing family HMMs and MSAs can be given (their base filenames must match one-to-one) so that families with matching hits are updated with the new sequences.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
<p align="center">
<img src="docs/images/proteinfamilies_workflow.png" alt="nf-core/proteinfamilies workflow overview">
</p>

### Create families

1. Cluster sequences ([`MMseqs2`](https://github.com/soedinglab/MMseqs2/))
2. Perform multiple sequence alignment (MSA) ([`FAMSA`](https://github.com/refresh-bio/FAMSA/) or [`mafft`](https://github.com/GSLBiotech/mafft/))
3. Optionally, clip gap parts of the MSA ([`ClipKIT`](https://github.com/JLSteenwyk/ClipKIT/))
4. Generate family HMMs and fish additional sequences into the family ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
5. Optionally, remove redundant families by comparing family representative sequences against family models ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
6. Optionally, from the remaining families, remove in-family redundant sequences by strict clustering ([`MMseqs2`](https://github.com/soedinglab/MMseqs2/)), keeping cluster representatives
7. Present statistics for remaining/updated families size distributions and representative sequence lengths ([`MultiQC`](http://multiqc.info/))

### Update families

1. Find which families to update by comparing the input sequences against existing family models ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
2. Sequences without hits continue through the "Create families" steps above. Hit sequences and their families continue to step 3
3. Extract family sequences ([`SeqKit`](https://github.com/shenwei356/seqkit/)) and concatenate with filtered hit sequences of each family
4. Optionally, remove in-family redundant sequences by strictly clustering with ([`MMseqs2`](https://github.com/soedinglab/MMseqs2/)) and keeping cluster representatives
5. Perform multiple sequence alignment (MSA) ([`FAMSA`](https://github.com/refresh-bio/FAMSA/) or [`mafft`](https://github.com/GSLBiotech/mafft/))
6. Optionally, clip gap parts of the MSA ([`ClipKIT`](https://github.com/JLSteenwyk/ClipKIT/))
7. Update family HMM with ([`hmmer`](https://github.com/EddyRivasLab/hmmer/))
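
The branching in step 2 of "Update families" can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the function name `split_by_hits` and the `(sequence_id, family_id)` pair format are hypothetical, and the actual pipeline parses hmmsearch output and applies its own filters before branching.

```python
def split_by_hits(sequence_ids, hits):
    """Partition input sequences into per-family hits and non-hit sequences.

    sequence_ids: iterable of input sequence identifiers (hypothetical format).
    hits: iterable of (sequence_id, family_id) pairs, assumed to come from a
          search of the input sequences against existing family models.
    """
    per_family = {}
    for seq_id, fam_id in hits:
        # Group hit sequences under the family whose model they matched.
        per_family.setdefault(fam_id, []).append(seq_id)

    hit_ids = {seq_id for ids in per_family.values() for seq_id in ids}
    # Sequences that matched no existing family go on to create new families.
    non_hits = [seq_id for seq_id in sequence_ids if seq_id not in hit_ids]
    return per_family, non_hits


per_family, non_hits = split_by_hits(
    ["seq1", "seq2", "seq3"],
    [("seq1", "famA"), ("seq3", "famA")],
)
```

In this toy run, `seq1` and `seq3` would be concatenated with the existing `famA` sequences, while `seq2` would enter the "Create families" branch.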

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):

First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
sample,fasta,existing_hmms_to_update,existing_msas_to_update
CONTROL_REP1,input/mgnifams_input_small.fa,,
```

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).

-->
Each row contains a fasta file with amino acid sequences (gzipped or uncompressed).
Optionally, a row may also contain gzipped tarball archives (tar.gz) of existing families' HMM and MSA folders to be updated.
In this case, the HMM and MSA files must match in number and in base filename (ignoring the extension).
Families with hits are updated with the hit sequences, while sequences without hits are used to create new families.
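
The samplesheet constraints above can be checked ahead of a run with a short Python sketch. The two regular expressions are copied from `assets/schema_input.json` in this PR. Everything else is an assumption made for illustration: the helper names `base_name` and `check_matching` are hypothetical, and "base filename" is taken here to mean the part of the filename before its first dot. The pipeline performs its own validation, which may differ.

```python
import re
from pathlib import PurePosixPath

# Patterns as they appear in assets/schema_input.json (JSON escaping removed).
FASTA_PATTERN = re.compile(r"^\S+\.(fa|fasta|faa|fas)(\.gz)?$")
TARBALL_PATTERN = re.compile(r"^\S+\.tar\.gz$")


def base_name(path):
    """Return the filename up to its first dot (assumed meaning of 'base filename')."""
    return PurePosixPath(path).name.split(".", 1)[0]


def check_matching(hmm_files, msa_files):
    """Return a list of problems; an empty list means HMMs and MSAs match one-to-one."""
    hmm_names = {base_name(f) for f in hmm_files}
    msa_names = {base_name(f) for f in msa_files}
    problems = []
    if len(hmm_files) != len(msa_files):
        problems.append(
            f"count mismatch: {len(hmm_files)} HMMs vs {len(msa_files)} MSAs"
        )
    only_hmm = sorted(hmm_names - msa_names)
    only_msa = sorted(msa_names - hmm_names)
    if only_hmm:
        problems.append(f"HMMs without an MSA: {only_hmm}")
    if only_msa:
        problems.append(f"MSAs without an HMM: {only_msa}")
    return problems


# Example: member names as they might be listed from the two tarballs.
issues = check_matching(
    ["hmm/fam1.hmm.gz", "hmm/fam2.hmm.gz"],
    ["msa/fam1.aln", "msa/fam2.aln"],
)
```

The member lists could be obtained with `tarfile.open(path).getnames()` from the standard library. The check itself works on plain lists so it can be tried without any archives on disk.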

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/proteinfamilies \
-profile <docker/singularity/.../institute> \
@@ -80,7 +94,7 @@ nf-core/proteinfamilies was originally written by Evangelos Karatzas.

We thank the following people for their extensive assistance in the development of this pipeline:

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
- [Martin Beracochea](https://github.com/mberacochea)

## Contributions and Support

@@ -93,8 +107,6 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
<!-- If you use nf-core/proteinfamilies for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

You can cite the `nf-core` publication as follows:
7 changes: 4 additions & 3 deletions assets/multiqc_config.yml
@@ -1,7 +1,8 @@
report_comment: >
This report has been generated by the <a href="https://github.com/nf-core/proteinfamilies/tree/dev" target="_blank">nf-core/proteinfamilies</a>
analysis pipeline. For information about how to interpret these results, please see the
<a href="https://nf-co.re/proteinfamilies/dev/docs/output" target="_blank">documentation</a>.
This report has been generated by the <a href="https://github.com/nf-core/proteinfamilies/releases/tag/1.0.0"
target="_blank">nf-core/proteinfamilies</a> analysis pipeline. For information about
how to interpret these results, please see the <a href="https://nf-co.re/proteinfamilies/1.0.0/docs/output"
target="_blank">documentation</a>.
report_section_order:
"nf-core-proteinfamilies-methods-description":
order: -1000
5 changes: 2 additions & 3 deletions assets/samplesheet.csv
@@ -1,3 +1,2 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
sample,fasta,existing_hmms_to_update,existing_msas_to_update
mgnifams_test,https://github.com/nf-core/test-datasets/raw/proteinfamilies/test_data/mgnifams_input_small.fa,,
21 changes: 14 additions & 7 deletions assets/schema_input.json
@@ -13,21 +13,28 @@
"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"fasta": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"pattern": "^\\S+\\.(fa|fasta|faa|fas)(\\.gz)?$",
"errorMessage": "Fasta file for amino acid sequences must be provided, cannot contain spaces and must have extension '.fa', '.fasta', '.faa', '.fas', '.fa.gz', '.fasta.gz', '.faa.gz' or '.fas.gz'"
},
"fastq_2": {
"existing_hmms_to_update": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"pattern": "^\\S+\\.tar\\.gz$",
"description": "Gzipped tarball file containing existing protein family HMMs. These models will be used to 'fish' new sequences from the input and then be updated accordingly."
},
"existing_msas_to_update": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.tar\\.gz$",
"description": "Tarball file containing multiple sequence alignments (MSAs) for the families to be updated. These alignments are essential for the update process and should match the HMM filenames one by one."
}
},
"required": ["sample", "fastq_1"]
"required": ["sample", "fasta"]
}
}