diff --git a/.editorconfig b/.editorconfig index b6b31907..b78de6e6 100644 --- a/.editorconfig +++ b/.editorconfig @@ -8,7 +8,7 @@ trim_trailing_whitespace = true indent_size = 4 indent_style = space -[*.{md,yml,yaml,html,css,scss,js}] +[*.{md,yml,yaml,html,css,scss,js,cff}] indent_size = 2 # These files are edited and tested upstream in nf-core/modules diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 99a4cd8e..8c2f23d5 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -1,22 +1,20 @@ -# nf-core/tva: Contributing Guidelines +# CenterForMedicalGeneticsGhent/nf-cmgg-germline: Contributing Guidelines Hi there! -Many thanks for taking an interest in improving nf-core/tva. +Many thanks for taking an interest in improving CenterForMedicalGeneticsGhent/nf-cmgg-germline. -We try to manage the required tasks for nf-core/tva using GitHub issues, you probably came to this page when creating one. +We try to manage the required tasks for CenterForMedicalGeneticsGhent/nf-cmgg-germline using GitHub issues, you probably came to this page when creating one. Please use the pre-filled template to save time. However, don't be put off by this template - other more general issues and suggestions are welcome! Contributions to the code are even more welcome ;) -> If you need help using or modifying nf-core/tva then the best place to ask is on the nf-core Slack [#tva](https://nfcore.slack.com/channels/tva) channel ([join our Slack here](https://nf-co.re/join/slack)). - ## Contribution workflow -If you'd like to write some code for nf-core/tva, the standard workflow is as follows: +If you'd like to write some code for CenterForMedicalGeneticsGhent/nf-cmgg-germline, the standard workflow is as follows: -1. Check that there isn't already an issue about your idea in the [nf-core/tva issues](https://github.com/nf-core/tva/issues) to avoid duplicating work. If there isn't one already, please create one so that others know you're working on this -2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/tva repository](https://github.com/nf-core/tva) to your GitHub account +1. Check that there isn't already an issue about your idea in the [CenterForMedicalGeneticsGhent/nf-cmgg-germline issues](https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/issues) to avoid duplicating work. If there isn't one already, please create one so that others know you're working on this +2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [CenterForMedicalGeneticsGhent/nf-cmgg-germline repository](https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline) to your GitHub account 3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions) 4. Use `nf-core schema build` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10). 5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged @@ -39,7 +37,7 @@ If any failures or warnings are encountered, please follow the listed URL for more documentation. ### Pipeline tests -Each `nf-core` pipeline should be set up with a minimal set of test-data. +Each pipeline should be set up with a minimal set of test-data. `GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully. If there are any failures then the automated tests fail.
These tests are run both with the latest available version of `Nextflow` and also the minimum required version that is stated in the pipeline code. @@ -52,13 +50,9 @@ These tests are run both with the latest available version of `Nextflow` and also the minimum required version that is stated in the pipeline code. - Fix the bug, and bump version (X.Y.Z+1). - A PR should be made on `master` from patch to directly address this particular bug. -## Getting help - -For further information/help, please consult the [nf-core/tva documentation](https://nf-co.re/tva/usage) and don't hesitate to get in touch on the nf-core Slack [#tva](https://nfcore.slack.com/channels/tva) channel ([join our Slack here](https://nf-co.re/join/slack)). - ## Pipeline contribution conventions -To make the nf-core/tva code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written. +To make the CenterForMedicalGeneticsGhent/nf-cmgg-germline code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written. ### Adding a new step @@ -85,7 +79,7 @@ Once there, use `nf-core schema build` to add to `nextflow_schema.json`. Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generically with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. An nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. -The process resources can be passed on to the tool dynamically within the process with the `${task.cpu}` and `${task.memory}` variables in the `script:` block. +The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block. ### Naming schemes diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml index 9ff337da..a63e00de 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.yml +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -2,14 +2,6 @@ name: Bug report description: Report something that is broken or incorrect labels: bug body: - - type: markdown - attributes: - value: | - Before you post this issue, please check the documentation: - - - [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting) - - [nf-core/tva pipeline documentation](https://nf-co.re/tva/usage) - - type: textarea id: description attributes: @@ -47,4 +39,4 @@ body: * Executor _(eg. slurm, local, awsbatch)_ * Container engine: _(e.g. Docker, Singularity, Conda, Podman, Shifter or Charliecloud)_ * OS _(eg. CentOS Linux, macOS, Linux Mint)_ - * Version of nf-core/tva _(eg. 1.1, 1.5, 1.8.2)_ + * Version of CenterForMedicalGeneticsGhent/nf-cmgg-germline _(eg. 
1.1, 1.5, 1.8.2)_ diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml deleted file mode 100644 index e2418538..00000000 --- a/.github/ISSUE_TEMPLATE/config.yml +++ /dev/null @@ -1,7 +0,0 @@ -contact_links: - - name: Join nf-core - url: https://nf-co.re/join - about: Please join the nf-core community here - - name: "Slack #tva channel" - url: https://nfcore.slack.com/channels/tva - about: Discussion about the nf-core/tva pipeline diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml index e00b1d0a..8fd7978b 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.yml +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -1,5 +1,5 @@ name: Feature request -description: Suggest an idea for the nf-core/tva pipeline +description: Suggest an idea for the CenterForMedicalGeneticsGhent/nf-cmgg-germline pipeline labels: enhancement body: - type: textarea diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 4f96d722..7e70196b 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,22 +1,21 @@ ## PR checklist - [ ] This comment contains a description of changes (with reason). - [ ] If you've fixed a bug or added code that should be tested, add tests! - - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/tva/tree/master/.github/CONTRIBUTING.md) - - [ ] If necessary, also make a PR on the nf-core/tva _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. + - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/blob/master/.github/CONTRIBUTING.md) - [ ] Make sure your code lints (`nf-core lint`). - [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`). - [ ] Usage Documentation in `docs/usage.md` is updated. diff --git a/.github/workflows/awsfulltest.yml b/.github/workflows/awsfulltest.yml deleted file mode 100644 index f9c4782a..00000000 --- a/.github/workflows/awsfulltest.yml +++ /dev/null @@ -1,30 +0,0 @@ -name: nf-core AWS full size tests -# This workflow is triggered on published releases. -# It can be additionally triggered manually with GitHub actions workflow dispatch button.
-# It runs the -profile 'test_full' on AWS batch - -on: - release: - types: [published] - workflow_dispatch: -jobs: - run-tower: - name: Run AWS full tests - if: github.repository == 'nf-core/tva' - runs-on: ubuntu-latest - steps: - - name: Launch workflow via tower - uses: nf-core/tower-action@v3 - # TODO nf-core: You can customise AWS full pipeline tests as required - # Add full size test data (but still relatively small datasets for few samples) - # on the `test_full.config` test runs with only one set of parameters - with: - workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }} - access_token: ${{ secrets.TOWER_ACCESS_TOKEN }} - compute_env: ${{ secrets.TOWER_COMPUTE_ENV }} - workdir: s3://${{ secrets.AWS_S3_BUCKET }}/work/tva/work-${{ github.sha }} - parameters: | - { - "outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/tva/results-${{ github.sha }}" - } - profiles: test_full,aws_tower diff --git a/.github/workflows/awstest.yml b/.github/workflows/awstest.yml deleted file mode 100644 index fa32165a..00000000 --- a/.github/workflows/awstest.yml +++ /dev/null @@ -1,25 +0,0 @@ -name: nf-core AWS test -# This workflow can be triggered manually with the GitHub actions workflow dispatch button. -# It runs the -profile 'test' on AWS batch - -on: - workflow_dispatch: -jobs: - run-tower: - name: Run AWS tests - if: github.repository == 'nf-core/tva' - runs-on: ubuntu-latest - steps: - # Launch workflow using Tower CLI tool action - - name: Launch workflow via tower - uses: nf-core/tower-action@v3 - with: - workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }} - access_token: ${{ secrets.TOWER_ACCESS_TOKEN }} - compute_env: ${{ secrets.TOWER_COMPUTE_ENV }} - workdir: s3://${{ secrets.AWS_S3_BUCKET }}/work/tva/work-${{ github.sha }} - parameters: | - { - "outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/tva/results-test-${{ github.sha }}" - } - profiles: test,aws_tower diff --git a/.github/workflows/branch.yml b/.github/workflows/branch.yml index c322d39a..bd273267 100644 --- a/.github/workflows/branch.yml +++ b/.github/workflows/branch.yml @@ -11,9 +11,9 @@ jobs: steps: # PRs to the nf-core repo master branch are only ok if coming from the nf-core repo `dev` or any `patch` branches - name: Check PRs - if: github.repository == 'nf-core/tva' + if: github.repository == 'CenterForMedicalGeneticsGhent/nf-cmgg-germline' run: | - { [[ ${{github.event.pull_request.head.repo.full_name }} == nf-core/tva ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] + { [[ ${{github.event.pull_request.head.repo.full_name }} == CenterForMedicalGeneticsGhent/nf-cmgg-germline ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] # If the above check failed, post a comment on the PR explaining the failure # NOTE - this doesn't currently work if the PR is coming from a fork, due to limitations in GitHub actions secrets diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index cc735ff4..7f14272b 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -10,41 +10,65 @@ on: env: NXF_ANSI_LOG: false - CAPSULE_LOG: none jobs: - test: - name: Run pipeline with test data + test_all: + name: Run pipeline with test data (complete) # Only run on push if this is the nf-core dev branch (merged PRs) - if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/tva') }}" + if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'CenterForMedicalGeneticsGhent/nf-cmgg-germline') }}" runs-on: ubuntu-latest 
strategy: matrix: - # Nextflow versions - include: - # Test pipeline minimum Nextflow version - - NXF_VER: "21.10.3" - NXF_EDGE: "" - # Test latest edge release of Nextflow - - NXF_VER: "" - NXF_EDGE: "1" + nxf_ver: ["21.10.3", "22.04.0"] + test: + - "default" + - "fails" + - "seqplorer_min" + - "seqplorer_vcfanno" + - "seqr_full" + - "seqr_no_genotyping" steps: - name: Check out pipeline code uses: actions/checkout@v2 - - name: Install Nextflow - env: - NXF_VER: ${{ matrix.NXF_VER }} - # Uncomment only if the edge release is more recent than the latest stable release - # See https://github.com/nextflow-io/nextflow/issues/2467 - # NXF_EDGE: ${{ matrix.NXF_EDGE }} + - name: Install Nextflow run: | - wget -qO- get.nextflow.io | bash - sudo mv nextflow /usr/local/bin/ + sudo bash -c 'mkdir /opt/nextflow; cd /opt/nextflow; wget https://github.com/nextflow-io/nextflow/releases/download/v${{ matrix.nxf_ver }}/nextflow; chmod +x nextflow'; + echo "/opt/nextflow" >> $GITHUB_PATH; + + # - name: Install Nextflow + # uses: nf-core/setup-nextflow@v1 + # with: + # version: "${{ matrix.nxf_ver }}" + + - name: Install nf-test + run: | + sudo bash -c 'mkdir /opt/nf-test; cd /opt/nf-test; wget https://github.com/askimed/nf-test/releases/download/v0.6.0/nf-test-0.6.0.tar.gz; tar xvfz nf-test-0.6.0.tar.gz; chmod +x nf-test'; + echo "/opt/nf-test" >> $GITHUB_PATH; + + - name: Free some space + run: | + sudo rm -rf "/usr/local/share/boost" + sudo rm -rf "$AGENT_TOOLSDIRECTORY" - name: Run pipeline with test data - # TODO nf-core: You can customise CI pipeline run tests as required - # For example: adding multiple test runs with different parameters - # Remember that you can parallelise this by using strategy.matrix run: | - nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results + nf-test test tests/${{ matrix.test }}.test + + - name: Output log on failure + if: failure() + run: | + sudo apt install bat > /dev/null + batcat --decorations=always --color=always .nf-test/tests/*/meta/std.{out,err} + - name: Upload logs on failure + if: failure() + uses: actions/upload-artifact@v2 + with: + name: nf-test-logs + path: | + .nf-test/tests/*/meta/nextflow.log + .nf-test/tests/*/meta/std.out + .nf-test/tests/*/meta/std.err + .nf-test/tests/*/meta/trace.csv + .nf-test/tests/*/work + .nf-test/*/output diff --git a/.github/workflows/fix-linting.yml b/.github/workflows/fix-linting.yml deleted file mode 100644 index 6caed25c..00000000 --- a/.github/workflows/fix-linting.yml +++ /dev/null @@ -1,55 +0,0 @@ -name: Fix linting from a comment -on: - issue_comment: - types: [created] - -jobs: - deploy: - # Only run if comment is on a PR with the main repo, and if it contains the magic keywords - if: > - contains(github.event.comment.html_url, '/pull/') && - contains(github.event.comment.body, '@nf-core-bot fix linting') && - github.repository == 'nf-core/tva' - runs-on: ubuntu-latest - steps: - # Use the @nf-core-bot token to check out so we can push later - - uses: actions/checkout@v3 - with: - token: ${{ secrets.nf_core_bot_auth_token }} - - # Action runs on the issue comment, so we don't get the PR by default - # Use the gh cli to check out the PR - - name: Checkout Pull Request - run: gh pr checkout ${{ github.event.issue.number }} - env: - GITHUB_TOKEN: ${{ secrets.nf_core_bot_auth_token }} - - - uses: actions/setup-node@v2 - - - name: Install Prettier - run: npm install -g prettier @prettier/plugin-php - - # Check that we actually need to fix something - - name: Run 'prettier --check' - id: prettier_status - run: |
- if prettier --check ${GITHUB_WORKSPACE}; then - echo "::set-output name=result::pass" - else - echo "::set-output name=result::fail" - fi - - - name: Run 'prettier --write' - if: steps.prettier_status.outputs.result == 'fail' - run: prettier --write ${GITHUB_WORKSPACE} - - - name: Commit & push changes - if: steps.prettier_status.outputs.result == 'fail' - run: | - git config user.email "core@nf-co.re" - git config user.name "nf-core-bot" - git config push.default upstream - git add . - git status - git commit -m "[automated] Fix linting with Prettier" - git push diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml index 77358dee..ce1c36dd 100644 --- a/.github/workflows/linting.yml +++ b/.github/workflows/linting.yml @@ -9,31 +9,48 @@ on: types: [published] jobs: - EditorConfig: + Prettier: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - uses: actions/setup-node@v2 - - name: Install editorconfig-checker - run: npm install -g editorconfig-checker + - name: Install Prettier + run: npm install -g prettier - - name: Run ECLint check - run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile') + - name: Run Prettier --check + run: prettier --check ${GITHUB_WORKSPACE} - Prettier: + PythonBlack: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - - uses: actions/setup-node@v2 + - name: Activate Black + uses: psf/black@stable - - name: Install Prettier - run: npm install -g prettier + # If the above check failed, post a comment on the PR explaining the failure + - name: Post PR comment + if: failure() + uses: mshick/add-pr-comment@v1 + with: + message: | + ## Python linting (`black`) is failing - - name: Run Prettier --check - run: prettier --check ${GITHUB_WORKSPACE} + To keep the code consistent with lots of contributors, we run automated code consistency checks. + To fix this CI test, please run: + + * Install [`black`](https://black.readthedocs.io/en/stable/): `pip install black` + * Fix formatting errors in your pipeline: `black .` + + Once you push these changes the test should pass, and you can hide this comment :+1: + + We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help! + + Thanks again for your contribution! 
+ repo-token: ${{ secrets.GITHUB_TOKEN }} + allow-repeats: false nf-core: runs-on: ubuntu-latest @@ -42,15 +59,11 @@ jobs: uses: actions/checkout@v2 - name: Install Nextflow - env: - CAPSULE_LOG: none - run: | - wget -qO- get.nextflow.io | bash - sudo mv nextflow /usr/local/bin/ + uses: nf-core/setup-nextflow@v1 - uses: actions/setup-python@v3 with: - python-version: "3.6" + python-version: "3.7" architecture: "x64" - name: Install dependencies diff --git a/.gitignore b/.gitignore index 5124c9ac..5fcf2332 100644 --- a/.gitignore +++ b/.gitignore @@ -6,3 +6,5 @@ results/ testing/ testing* *.pyc +null +.nf-test diff --git a/.nf-core.yml b/.nf-core.yml index 3805dc81..2f1c895e 100644 --- a/.nf-core.yml +++ b/.nf-core.yml @@ -1 +1,28 @@ repository_type: pipeline + +lint: + files_exist: + - CODE_OF_CONDUCT.md + - assets/nf-core-nfcmggstructural_logo_light.png + - docs/images/nf-core-nfcmggstructural_logo_light.png + - docs/images/nf-core-nfcmggstructural_logo_dark.png + - .github/ISSUE_TEMPLATE/config.yml + - .github/workflows/awstest.yml + - .github/workflows/awsfulltest.yml + nextflow_config: + - manifest.name + - manifest.homePage + multiqc_config: + - report_comment + readme: + - nextflow_badge + files_unchanged: + - .github/CONTRIBUTING.md + - .github/ISSUE_TEMPLATE/bug_report.yml + - .github/PULL_REQUEST_TEMPLATE.md + - .github/workflows/linting.yml + - assets/email_template.txt + - assets/sendmail_template.txt + - lib/NfcoreTemplate.groovy + - .prettierignore + actions_ci: false diff --git a/.prettierignore b/.prettierignore index d0e7ae58..bf5fe69d 100644 --- a/.prettierignore +++ b/.prettierignore @@ -1,5 +1,7 @@ email_template.html +adaptivecard_template.json .nextflow* +.nf-test/ work/ data/ results/ @@ -7,3 +9,5 @@ results/ testing/ testing* *.pyc +samplesheet* +*.ped diff --git a/CHANGELOG.md b/CHANGELOG.md index 2f3e5f81..451fa225 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,16 +1,14 @@ -# nf-core/tva: Changelog +# CenterForMedicalGeneticsGhent/nf-cmgg-germline: Changelog The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## v1.0dev - [date] +## v1.0.0 - Beautiful Bruges - [Oct 3 2022] -Initial release of nf-core/tva, created with the [nf-core](https://nf-co.re/) template. +### Added -### `Added` +- Full release of the pipeline -### `Fixed` +## v1.0dev - [May 31 2022] -### `Dependencies` - -### `Deprecated` +Initial release of CenterForMedicalGeneticsGhent/nf-cmgg-germline, created with the [nf-core](https://nf-co.re/) template. diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 00000000..4533e2f2 --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,56 @@ +cff-version: 1.2.0 +message: "If you use `nf-core tools` in your work, please cite the `nf-core` publication" +authors: + - family-names: Ewels + given-names: Philip + - family-names: Peltzer + given-names: Alexander + - family-names: Fillinger + given-names: Sven + - family-names: Patel + given-names: Harshil + - family-names: Alneberg + given-names: Johannes + - family-names: Wilm + given-names: Andreas + - family-names: Ulysse Garcia + given-names: Maxime + - family-names: Di Tommaso + given-names: Paolo + - family-names: Nahnsen + given-names: Sven +title: "The nf-core framework for community-curated bioinformatics pipelines." 
+version: 2.4.1 +doi: 10.1038/s41587-020-0439-x +date-released: 2022-05-16 +url: https://github.com/nf-core/tools +preferred-citation: + type: article + authors: + - family-names: Ewels + given-names: Philip + - family-names: Peltzer + given-names: Alexander + - family-names: Fillinger + given-names: Sven + - family-names: Patel + given-names: Harshil + - family-names: Alneberg + given-names: Johannes + - family-names: Wilm + given-names: Andreas + - family-names: Ulysse Garcia + given-names: Maxime + - family-names: Di Tommaso + given-names: Paolo + - family-names: Nahnsen + given-names: Sven + doi: 10.1038/s41587-020-0439-x + journal: Nature Biotechnology + start: 276 + end: 278 + title: "The nf-core framework for community-curated bioinformatics pipelines." + issue: 3 + volume: 38 + year: 2020 + url: https://dx.doi.org/10.1038/s41587-020-0439-x diff --git a/CITATIONS.md b/CITATIONS.md index a9a06ccc..d92f7b59 100644 --- a/CITATIONS.md +++ b/CITATIONS.md @@ -1,4 +1,4 @@ -# nf-core/tva: Citations +# CenterForMedicalGeneticsGhent/nf-cmgg-germline: Citations ## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/) @@ -10,10 +10,7 @@ ## Pipeline tools -- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) - -- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/) - > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. +TODO ## Software packaging/containerisation tools diff --git a/README.md b/README.md index 8bb0df35..6f523b0f 100644 --- a/README.md +++ b/README.md @@ -1,38 +1,18 @@ -# ![nf-core/tva](docs/images/nf-core-tva_logo_light.png#gh-light-mode-only) ![nf-core/tva](docs/images/nf-core-tva_logo_dark.png#gh-dark-mode-only) - -[![GitHub Actions CI Status](https://github.com/nf-core/tva/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/tva/actions?query=workflow%3A%22nf-core+CI%22) -[![GitHub Actions Linting Status](https://github.com/nf-core/tva/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/tva/actions?query=workflow%3A%22nf-core+linting%22) -[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?logo=Amazon%20AWS)](https://nf-co.re/tva/results) -[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8)](https://doi.org/10.5281/zenodo.XXXXXXX) +# CenterForMedicalGeneticsGhent/nf-cmgg-germline [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A521.10.3-23aa62.svg)](https://www.nextflow.io/) -[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?logo=anaconda)](https://docs.conda.io/en/latest/) [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?logo=docker)](https://www.docker.com/) [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg)](https://sylabs.io/docs/) -[![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/nf-core/tva) - -[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23tva-4A154B?logo=slack)](https://nfcore.slack.com/channels/tva) -[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?logo=twitter)](https://twitter.com/nf_core) -[![Watch on 
YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?logo=youtube)](https://www.youtube.com/c/nf-core) ## Introduction -**nf-core/tva** is a bioinformatics best-practice analysis pipeline for A nextflow pipeline for calling and annotating variants. +**nf-cmgg-germline** is a bioinformatics best-practice analysis pipeline for calling and annotating variants. It uses HaplotypeCaller to call variants and EnsemblVEP to annotate the called variants. By supplying the `--output_mode <seqr|seqplorer>` parameter you can choose for which platform the VCFs should be created. The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community! -On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/tva/results). - ## Pipeline summary -1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) -2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/)) +![metro graph](docs/images/nf-cmgg-germline_metro.png) ## Quick Start @@ -43,7 +23,7 @@ On release, automated continuous integration tests run the pipeline on a full-si 3. Download the pipeline and test it on a minimal dataset with a single command: ```console - nextflow run nf-core/tva -profile test,YOURPROFILE --outdir <OUTDIR> + nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline -profile test,YOURPROFILE --outdir <OUTDIR> ``` Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`YOURPROFILE` in the example command above). You can chain multiple config profiles in a comma-separated string. @@ -55,43 +35,26 @@ On release, automated continuous integration tests run the pipeline on a full-si 4. Start running your own analysis! ```console - nextflow run nf-core/tva --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> + nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline --input <SAMPLESHEET> --outdir <OUTDIR> --genome GRCh38 -profile <PROFILE> --fasta <FASTA> ``` -## Documentation +An overview of the parameters for this pipeline can be viewed using: -The nf-core/tva pipeline comes with documentation about the pipeline [usage](https://nf-co.re/tva/usage), [parameters](https://nf-co.re/tva/parameters) and [output](https://nf-co.re/tva/output). +``` +nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline --help +``` ## Credits -nf-core/tva was originally written by @nvnieuwk. +nf-cmgg-germline was originally written by @nvnieuwk.
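As a companion to the Quick Start above, the launch parameters can also be kept in a small configuration file passed with `-c`. A minimal sketch follows: the parameter names come from `conf/test.config` and `nextflow_schema.json` elsewhere in this diff, while the values and the `my_params.config` filename are purely illustrative.

```groovy
// my_params.config -- hypothetical launch configuration for nf-cmgg-germline
params {
    input       = 'samplesheet.csv'  // CSV with sample,family,cram,crai,bed,ped columns
    outdir      = 'results'
    genome      = 'GRCh38'
    fasta       = '/path/to/genome.fasta'
    output_mode = 'seqplorer'        // or 'seqr'
}
```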
We thank the following people for their extensive assistance in the development of this pipeline: - - ## Contributions and Support If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md). -For further information or help, don't hesitate to get in touch on the [Slack `#tva` channel](https://nfcore.slack.com/channels/tva) (you can join with [this invite](https://nf-co.re/join/slack)). - ## Citations - - - - - An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. - -You can cite the `nf-core` publication as follows: - -> **The nf-core framework for community-curated bioinformatics pipelines.** -> -> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. -> -> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). diff --git a/assets/CMGG_logo.png b/assets/CMGG_logo.png new file mode 100644 index 00000000..894f7f42 Binary files /dev/null and b/assets/CMGG_logo.png differ diff --git a/assets/NA12878.ped b/assets/NA12878.ped new file mode 100644 index 00000000..72cdce95 --- /dev/null +++ b/assets/NA12878.ped @@ -0,0 +1,3 @@ +#fam-id ind-id pat-id mat-id sex phen +Proband_12345 NA12878K12_NVQ_034 NA24385D2_NVQ_034 0 2 0 +Proband_12345 NA24385D2_NVQ_034 0 0 1 0 \ No newline at end of file diff --git a/assets/adaptivecard_template.json b/assets/adaptivecard_template.json new file mode 100644 index 00000000..be3aae05 --- /dev/null +++ b/assets/adaptivecard_template.json @@ -0,0 +1,67 @@ +{ + "type": "message", + "attachments": [ + { + "contentType": "application/vnd.microsoft.card.adaptive", + "contentUrl": null, + "content": { + "\$schema": "http://adaptivecards.io/schemas/adaptive-card.json", + "msteams": { + "width": "Full" + }, + "type": "AdaptiveCard", + "version": "1.2", + "body": [ + { + "type": "TextBlock", + "size": "Large", + "weight": "Bolder", + "color": "<% if (success) { %>Good<% } else { %>Attention<%} %>", + "text": "CenterForMedicalGeneticsGhent/nf-cmgg-germline v${version} - ${runName}", + "wrap": true + + }, + { + "type": "TextBlock", + "spacing": "None", + "text": "Completed at ${dateComplete} (duration: ${duration})", + "isSubtle": true, + "wrap": true + }, + { + "type": "TextBlock", + "text": "<% if (success) { %>Pipeline completed successfully!<% } else { %>Pipeline completed with errors. The full error message was: ${errorReport}.<% } %>", + "wrap": true + }, + { + "type": "TextBlock", + "text": "The command used to launch the workflow was as follows:", + "wrap": true + }, + { + "type": "TextBlock", + "text": "${commandLine}", + "isSubtle": true, + "wrap": true + } + ], + "actions": [ + { + "type": "Action.ShowCard", + "title": "Pipeline Configuration", + "card": { + "type": "AdaptiveCard", + "\$schema": "http://adaptivecards.io/schemas/adaptive-card.json", + "body": [ + { + "type": "FactSet", + "facts": [<% out << summary.collect{ k,v -> "{\"title\": \"$k\", \"value\" : \"$v\"}" }.join(",\n") %>] + } + ] + } + } + ] + } + } + ] +} diff --git a/assets/email_template.html b/assets/email_template.html index 709bf28c..34c9952e 100644 --- a/assets/email_template.html +++ b/assets/email_template.html @@ -4,21 +4,21 @@ - - nf-core/tva Pipeline Report + + CenterForMedicalGeneticsGhent/nf-cmgg-germline Pipeline Report
-<h1>nf-core/tva v${version}</h1>
+<h1>CenterForMedicalGeneticsGhent/nf-cmgg-germline v${version}</h1>
 <h2>Run Name: $runName</h2>
 <% if (!success){ out << """
-<h4>nf-core/tva execution completed unsuccessfully!</h4>
+<h4>CenterForMedicalGeneticsGhent/nf-cmgg-germline execution completed unsuccessfully!</h4>
 <p>The exit status of the task that caused the workflow execution to fail was: <code>$exitStatus</code>.</p>
 <p>The full error message was:</p>
 <pre>${errorReport}</pre>
@@ -27,7 +27,7 @@ nf-core/tva execution completed unsuccessfully
 """ } else { out << """
-nf-core/tva execution completed successfully!
+CenterForMedicalGeneticsGhent/nf-cmgg-germline execution completed successfully!
 """ }
@@ -44,8 +44,8 @@ Pipeline Configuration:
-<p>nf-core/tva</p>
-<p><a href="https://github.com/nf-core/tva">https://github.com/nf-core/tva</a></p>
+<p>CenterForMedicalGeneticsGhent/nf-cmgg-germline</p>
+<p><a href="https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline">https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline</a></p>
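Both e-mail templates and the adaptive card JSON above are Groovy templates: `${var}` placeholders and `<% %>` scriptlets are evaluated against a map of fields, which is also why the JSON escapes its `\$schema` keys so the engine leaves them untouched. A minimal sketch of how such a template is rendered, assuming the `GStringTemplateEngine` approach used by the nf-core template's `lib/NfcoreTemplate.groovy`; the binding values here are illustrative:

```groovy
import groovy.text.GStringTemplateEngine

// Illustrative field values; at runtime the pipeline fills these from workflow metadata.
def fields = [
    version     : '1.0.0',
    runName     : 'example_run',
    success     : true,
    dateComplete: '2022-10-03T12:00:00',
    duration    : '1h 2m',
    errorReport : '',
    commandLine : 'nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline ...',
    summary     : [revision: 'master'],
]

// Compile the template once, then bind the fields and render it to a string.
def engine   = new GStringTemplateEngine()
def rendered = engine
    .createTemplate(new File('assets/adaptivecard_template.json'))
    .make(fields)
    .toString()
println rendered
```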
diff --git a/assets/email_template.txt b/assets/email_template.txt index 71618c87..41546d9a 100644 --- a/assets/email_template.txt +++ b/assets/email_template.txt @@ -4,16 +4,15 @@ |\\ | |__ __ / ` / \\ |__) |__ } { | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-, `._,._,' - nf-core/tva v${version} + CenterForMedicalGeneticsGhent/nf-cmgg-germline v${version} ---------------------------------------------------- - Run Name: $runName <% if (success){ - out << "## nf-core/tva execution completed successfully! ##" + out << "## CenterForMedicalGeneticsGhent/nf-cmgg-germline execution completed successfully! ##" } else { out << """#################################################### -## nf-core/tva execution completed unsuccessfully! ## +## CenterForMedicalGeneticsGhent/nf-cmgg-germline execution completed unsuccessfully! ## #################################################### The exit status of the task that caused the workflow execution to fail was: $exitStatus. The full error message was: @@ -36,5 +35,5 @@ Pipeline Configuration: <% out << summary.collect{ k,v -> " - $k: $v" }.join("\n") %> -- -nf-core/tva -https://github.com/nf-core/tva +CenterForMedicalGeneticsGhent/nf-cmgg-germline +https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml index d0da6eac..68b1b187 100644 --- a/assets/multiqc_config.yml +++ b/assets/multiqc_config.yml @@ -1,11 +1,10 @@ report_comment: > - This report has been generated by the nf-core/tva - analysis pipeline. For information about how to interpret these results, please see the - documentation. + This report has been generated by the nf-cmgg-germline + analysis pipeline. report_section_order: software_versions: order: -1000 - "nf-core-tva-summary": + "CenterForMedicalGeneticsGhent-nf-cmgg-germline-summary": order: -1001 export_plots: true diff --git a/assets/nf-cmgg-germline_logo_light.png b/assets/nf-cmgg-germline_logo_light.png new file mode 100644 index 00000000..1336d24d Binary files /dev/null and b/assets/nf-cmgg-germline_logo_light.png differ diff --git a/assets/nf-core-nf-cmgg-germline_logo_light.png b/assets/nf-core-nf-cmgg-germline_logo_light.png new file mode 100644 index 00000000..1336d24d Binary files /dev/null and b/assets/nf-core-nf-cmgg-germline_logo_light.png differ diff --git a/assets/nf-core-tva_logo_light.png b/assets/nf-core-tva_logo_light.png deleted file mode 100644 index 552be43e..00000000 Binary files a/assets/nf-core-tva_logo_light.png and /dev/null differ diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv index 5f653ab7..0da0b7ff 100644 --- a/assets/samplesheet.csv +++ b/assets/samplesheet.csv @@ -1,3 +1,3 @@ -sample,fastq_1,fastq_2 -SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz -SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz, +sample,family,cram,crai,bed,ped +NA12878K12_NVQ_034,Proband_12345,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/illumina/cram/test.paired_end.markduplicates.sorted.cram,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/illumina/cram/test.paired_end.markduplicates.sorted.cram.crai,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/chr21/sequence/multi_intervals.bed,assets/test.ped 
+NA24385D2_NVQ_034,,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/illumina/cram/test2.paired_end.markduplicates.sorted.cram,,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/chr21/sequence/multi_intervals.bed,assets/test.ped \ No newline at end of file diff --git a/assets/samplesheet_local.csv b/assets/samplesheet_local.csv new file mode 100644 index 00000000..a533a32f --- /dev/null +++ b/assets/samplesheet_local.csv @@ -0,0 +1,3 @@ +sample,cram,crai,bed,ped +NA12878K12_NVQ_034,/home/nvnieuwk/Documents/data/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-subset.cram,/home/nvnieuwk/Documents/data/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-subset.cram.bai,/home/nvnieuwk/Documents/data/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-callable.bed,/home/nvnieuwk/Documents/cmgg/nf-cmgg-germline/assets/NA12878.ped +NA24385D2_NVQ_034,/home/nvnieuwk/Documents/data/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-subset.cram,/home/nvnieuwk/Documents/data/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-subset.cram.bai,/home/nvnieuwk/Documents/data/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-callable.bed,/home/nvnieuwk/Documents/cmgg/nf-cmgg-germline/assets/NA12878.ped \ No newline at end of file diff --git a/assets/samplesheet_template.csv b/assets/samplesheet_template.csv new file mode 100644 index 00000000..421b82c1 --- /dev/null +++ b/assets/samplesheet_template.csv @@ -0,0 +1,3 @@ +sample,cram,crai,bed,ped +NA12878K12_NVQ_034,https://github.com/nf-core/test-datasets/raw/sarek/testdata/recalcram/1234N.recal.cram,,assets/test_data.bed,assets/test.ped +NA24385D2_NVQ_034,https://github.com/nf-core/test-datasets/raw/sarek/testdata/recalcram/9876T.recal.cram,https://github.com/nf-core/test-datasets/raw/sarek/testdata/recalcram/9876T.recal.cram.crai,assets/test_data.bed,assets/test.ped \ No newline at end of file diff --git a/assets/schema_input.json b/assets/schema_input.json index 9a2a03ab..38875721 100644 --- a/assets/schema_input.json +++ b/assets/schema_input.json @@ -1,7 +1,7 @@ { "$schema": "http://json-schema.org/draft-07/schema", - "$id": "https://raw.githubusercontent.com/nf-core/tva/master/assets/schema_input.json", - "title": "nf-core/tva pipeline - params.input schema", + "$id": "https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/raw/master/assets/schema_input.json", + "title": "CenterForMedicalGeneticsGhent/nf-cmgg-germline pipeline - params.input schema", "description": "Schema for the file provided with params.input", "type": "array", "items": { @@ -12,25 +12,27 @@ "pattern": "^\\S+$", "errorMessage": "Sample name must be provided and cannot contain spaces" }, - "fastq_1": { + "family_id": { "type": "string", - "pattern": "^\\S+\\.f(ast)?q\\.gz$", - "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" + "pattern": "^\\S+$", + "errorMessage": "Family ID must be provided and cannot contain spaces" + }, + "cram": { + "type": "string", + "pattern": "^\\S+\\.cram$", + "errorMessage": "CRAM file must be provided, cannot contain spaces and must have extension '.cram'" }, - "fastq_2": { - "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'", - "anyOf": [ - { - "type": "string", - "pattern": "^\\S+\\.f(ast)?q\\.gz$" - }, - { - "type": "string", - "maxLength": 0 - } - ] + "crai": { + "type": "string", + "pattern": "^\\S+\\.c?[br]ai$", + "errorMessage": "CRAM index file must be provided, cannot contain spaces 
and must have extension '.crai' or '.bai'" + }, + "bed": { + "type": "string", + "pattern": "^\\S+\\.bed$", + "errorMessage": "BED file must be provided, cannot contain spaces and must have extension '.bed'" } }, - "required": ["sample", "fastq_1"] + "required": ["sample", "family_id", "cram", "crai", "bed"] } } diff --git a/assets/sendmail_template.txt b/assets/sendmail_template.txt index e26744f5..f9b4221e 100644 --- a/assets/sendmail_template.txt +++ b/assets/sendmail_template.txt @@ -9,12 +9,12 @@ Content-Type: text/html; charset=utf-8 $email_html --nfcoremimeboundary -Content-Type: image/png;name="nf-core-tva_logo.png" +Content-Type: image/png;name="nf-cmgg-germline_logo.png" Content-Transfer-Encoding: base64 Content-ID: -Content-Disposition: inline; filename="nf-core-tva_logo_light.png" +Content-Disposition: inline; filename="nf-cmgg-germline_logo_light.png" -<% out << new File("$projectDir/assets/nf-core-tva_logo_light.png"). +<% out << new File("$projectDir/assets/nf-cmgg-germline_logo_light.png"). bytes. encodeBase64(). toString(). diff --git a/assets/test.ped b/assets/test.ped new file mode 100644 index 00000000..87d40d5b --- /dev/null +++ b/assets/test.ped @@ -0,0 +1,3 @@ +#fam-id ind-id pat-id mat-id sex phen +Proband_12345 normal tumour 0 2 0 +Proband_12345 tumour 0 0 1 0 \ No newline at end of file diff --git a/assets/test_data.bed b/assets/test_data.bed new file mode 100644 index 00000000..0e0232ae --- /dev/null +++ b/assets/test_data.bed @@ -0,0 +1,7 @@ + +1 0 200000 +2 0 200000 +3 0 200000 +8 0 1276 +11 0 3679 +X 0 200000 \ No newline at end of file diff --git a/assets/vcfanno.toml b/assets/vcfanno.toml new file mode 100644 index 00000000..971733f9 --- /dev/null +++ b/assets/vcfanno.toml @@ -0,0 +1,6 @@ +# the resources in this file and directory are taken from https://github.com/brentp/vcfanno/blob/master/example +[[annotation]] +file="exac.vcf.gz" +# the special name 'ID' pulls out the rs id from the VCF +fields = ["AC_AFR", "AC_AMR", "AC_EAS", "ID"] +ops=["first", "first", "first", "first"] \ No newline at end of file diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py deleted file mode 100755 index 3652c63c..00000000 --- a/bin/check_samplesheet.py +++ /dev/null @@ -1,260 +0,0 @@ -#!/usr/bin/env python - - -"""Provide a command line tool to validate and transform tabular samplesheets.""" - - -import argparse -import csv -import logging -import sys -from collections import Counter -from pathlib import Path - - -logger = logging.getLogger() - - -class RowChecker: - """ - Define a service that can validate and transform each given row. - - Attributes: - modified (list): A list of dicts, where each dict corresponds to a previously - validated and transformed row. The order of rows is maintained. - - """ - - VALID_FORMATS = ( - ".fq.gz", - ".fastq.gz", - ) - - def __init__( - self, - sample_col="sample", - first_col="fastq_1", - second_col="fastq_2", - single_col="single_end", - **kwargs, - ): - """ - Initialize the row checker with the expected column names. - - Args: - sample_col (str): The name of the column that contains the sample name - (default "sample"). - first_col (str): The name of the column that contains the first (or only) - FASTQ file path (default "fastq_1"). - second_col (str): The name of the column that contains the second (if any) - FASTQ file path (default "fastq_2"). - single_col (str): The name of the new column that will be inserted and - records whether the sample contains single- or paired-end sequencing - reads (default "single_end"). 
- - """ - super().__init__(**kwargs) - self._sample_col = sample_col - self._first_col = first_col - self._second_col = second_col - self._single_col = single_col - self._seen = set() - self.modified = [] - - def validate_and_transform(self, row): - """ - Perform all validations on the given row and insert the read pairing status. - - Args: - row (dict): A mapping from column headers (keys) to elements of that row - (values). - - """ - self._validate_sample(row) - self._validate_first(row) - self._validate_second(row) - self._validate_pair(row) - self._seen.add((row[self._sample_col], row[self._first_col])) - self.modified.append(row) - - def _validate_sample(self, row): - """Assert that the sample name exists and convert spaces to underscores.""" - assert len(row[self._sample_col]) > 0, "Sample input is required." - # Sanitize samples slightly. - row[self._sample_col] = row[self._sample_col].replace(" ", "_") - - def _validate_first(self, row): - """Assert that the first FASTQ entry is non-empty and has the right format.""" - assert len(row[self._first_col]) > 0, "At least the first FASTQ file is required." - self._validate_fastq_format(row[self._first_col]) - - def _validate_second(self, row): - """Assert that the second FASTQ entry has the right format if it exists.""" - if len(row[self._second_col]) > 0: - self._validate_fastq_format(row[self._second_col]) - - def _validate_pair(self, row): - """Assert that read pairs have the same file extension. Report pair status.""" - if row[self._first_col] and row[self._second_col]: - row[self._single_col] = False - assert ( - Path(row[self._first_col]).suffixes[-2:] == Path(row[self._second_col]).suffixes[-2:] - ), "FASTQ pairs must have the same file extensions." - else: - row[self._single_col] = True - - def _validate_fastq_format(self, filename): - """Assert that a given filename has one of the expected FASTQ extensions.""" - assert any(filename.endswith(extension) for extension in self.VALID_FORMATS), ( - f"The FASTQ file has an unrecognized extension: {filename}\n" - f"It should be one of: {', '.join(self.VALID_FORMATS)}" - ) - - def validate_unique_samples(self): - """ - Assert that the combination of sample name and FASTQ filename is unique. - - In addition to the validation, also rename the sample if more than one sample, - FASTQ file combination exists. - - """ - assert len(self._seen) == len(self.modified), "The pair of sample name and FASTQ must be unique." - if len({pair[0] for pair in self._seen}) < len(self._seen): - counts = Counter(pair[0] for pair in self._seen) - seen = Counter() - for row in self.modified: - sample = row[self._sample_col] - seen[sample] += 1 - if counts[sample] > 1: - row[self._sample_col] = f"{sample}_T{seen[sample]}" - - -def read_head(handle, num_lines=10): - """Read the specified number of lines from the current position in the file.""" - lines = [] - for idx, line in enumerate(handle): - if idx == num_lines: - break - lines.append(line) - return "".join(lines) - - -def sniff_format(handle): - """ - Detect the tabular format. - - Args: - handle (text file): A handle to a `text file`_ object. The read position is - expected to be at the beginning (index 0). - - Returns: - csv.Dialect: The detected tabular format. - - .. 
_text file: - https://docs.python.org/3/glossary.html#term-text-file - - """ - peek = read_head(handle) - handle.seek(0) - sniffer = csv.Sniffer() - if not sniffer.has_header(peek): - logger.critical(f"The given sample sheet does not appear to contain a header.") - sys.exit(1) - dialect = sniffer.sniff(peek) - return dialect - - -def check_samplesheet(file_in, file_out): - """ - Check that the tabular samplesheet has the structure expected by nf-core pipelines. - - Validate the general shape of the table, expected columns, and each row. Also add - an additional column which records whether one or two FASTQ reads were found. - - Args: - file_in (pathlib.Path): The given tabular samplesheet. The format can be either - CSV, TSV, or any other format automatically recognized by ``csv.Sniffer``. - file_out (pathlib.Path): Where the validated and transformed samplesheet should - be created; always in CSV format. - - Example: - This function checks that the samplesheet follows the following structure, - see also the `viral recon samplesheet`_:: - - sample,fastq_1,fastq_2 - SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz - SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz - SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz, - - .. _viral recon samplesheet: - https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv - - """ - required_columns = {"sample", "fastq_1", "fastq_2"} - # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`. - with file_in.open(newline="") as in_handle: - reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle)) - # Validate the existence of the expected header columns. - if not required_columns.issubset(reader.fieldnames): - logger.critical(f"The sample sheet **must** contain the column headers: {', '.join(required_columns)}.") - sys.exit(1) - # Validate each row. - checker = RowChecker() - for i, row in enumerate(reader): - try: - checker.validate_and_transform(row) - except AssertionError as error: - logger.critical(f"{str(error)} On line {i + 2}.") - sys.exit(1) - checker.validate_unique_samples() - header = list(reader.fieldnames) - header.insert(1, "single_end") - # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`. 
- with file_out.open(mode="w", newline="") as out_handle: - writer = csv.DictWriter(out_handle, header, delimiter=",") - writer.writeheader() - for row in checker.modified: - writer.writerow(row) - - -def parse_args(argv=None): - """Define and immediately parse command line arguments.""" - parser = argparse.ArgumentParser( - description="Validate and transform a tabular samplesheet.", - epilog="Example: python check_samplesheet.py samplesheet.csv samplesheet.valid.csv", - ) - parser.add_argument( - "file_in", - metavar="FILE_IN", - type=Path, - help="Tabular input samplesheet in CSV or TSV format.", - ) - parser.add_argument( - "file_out", - metavar="FILE_OUT", - type=Path, - help="Transformed output samplesheet in CSV format.", - ) - parser.add_argument( - "-l", - "--log-level", - help="The desired log level (default WARNING).", - choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"), - default="WARNING", - ) - return parser.parse_args(argv) - - -def main(argv=None): - """Coordinate argument parsing and program execution.""" - args = parse_args(argv) - logging.basicConfig(level=args.log_level, format="[%(levelname)s] %(message)s") - if not args.file_in.is_file(): - logger.error(f"The given input file {args.file_in} was not found!") - sys.exit(2) - args.file_out.parent.mkdir(parents=True, exist_ok=True) - check_samplesheet(args.file_in, args.file_out) - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/bin/merge_vcf_headers.py b/bin/merge_vcf_headers.py new file mode 100755 index 00000000..daae6824 --- /dev/null +++ b/bin/merge_vcf_headers.py @@ -0,0 +1,62 @@ +#!/usr/bin/env python + +import argparse +import re + +if __name__ == "__main__": + # Define and parse the arguments + parser = argparse.ArgumentParser( + description="A script to add the pedigree and sample metadata of an empty VCF to another VCF" + ) + parser.add_argument("vcf", metavar="VCF", type=str, help="The VCF file") + parser.add_argument("ped_vcf", metavar="PED_VCF", type=str, help="The VCF file with only the PED headers") + parser.add_argument("output", metavar="OUTPUT", type=str, help="The file to output the merged VCF to") + + args = parser.parse_args() + + ped_vcf = args.ped_vcf + vcf = args.vcf + output = args.output + + # Open and read the ped file + file_ped_vcf = open(ped_vcf, "r") + read_ped = file_ped_vcf.read() + + # Some quick checks to see if the ped file is compatible + pedigree_pattern = "##PEDIGREE=.*\s" + sample_pattern = "##SAMPLE=.*\s" + + assert re.search(pedigree_pattern, read_ped), "No '##PEDIGREE' header found inside the PED VCF file" + + header_pattern = "#CHROM.*\s" + ped_header = re.findall(header_pattern, read_ped) + + # Find the pedigree and sample lines + pedigree = re.findall(pedigree_pattern, read_ped) + sample = re.findall(sample_pattern, read_ped) + + # Close the PED file + file_ped_vcf.close() + + # Write the new VCF file + status = "Would you kindly check for the info header?" + info_pattern = "^##INFO.*$" + + with open(vcf, "r") as open_vcf: + with open(output, "w") as open_output: + for line in open_vcf: + if status == "Would you kindly check for the info header?" and re.search(info_pattern, line): + status = "Would you kindly check for the end of the info header?" + elif status == "Would you kindly check for the end of the info header?" and not re.search( + info_pattern, line + ): + open_output.writelines(sample) + open_output.writelines(pedigree) + status = "Would you kindly do a header check?" + elif status == "Would you kindly do a header check?" 
and re.findall(header_pattern, line): + vcf_header = re.findall(header_pattern, line) + assert ( + vcf_header == ped_header + ), f"The #CHROM header line does not match in both files:\nPED header: {ped_header[0]}\nVCF header: {vcf_header[0]}" + status = "Would you kindly print all the lines?" + open_output.write(line) diff --git a/conf/base.config b/conf/base.config index 85771a27..446ad307 100644 --- a/conf/base.config +++ b/conf/base.config @@ -1,6 +1,6 @@ /* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - nf-core/tva Nextflow base config file + CenterForMedicalGeneticsGhent/nf-cmgg-germline Nextflow base config file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A 'blank slate' config file, appropriate for general use on most high performance compute environments. Assumes that all software is installed and available on @@ -10,7 +10,6 @@ process { - // TODO nf-core: Check the defaults for all processes cpus = { check_max( 1 * task.attempt, 'cpus' ) } memory = { check_max( 6.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } @@ -20,12 +19,11 @@ process { maxErrors = '-1' // Process-specific resource requirements - // NOTE - Please try and re-use the labels below as much as possible. - // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules. - // If possible, it would be nice to keep the same label naming convention when - // adding in your local modules too. - // TODO nf-core: Customise requirements for specific processes. - // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors + withLabel:process_single { + cpus = { check_max( 1 , 'cpus' ) } + memory = { check_max( 6.GB * task.attempt, 'memory' ) } + time = { check_max( 4.h * task.attempt, 'time' ) } + } withLabel:process_low { cpus = { check_max( 2 * task.attempt, 'cpus' ) } memory = { check_max( 12.GB * task.attempt, 'memory' ) } diff --git a/conf/modules.config b/conf/modules.config index da58a5d8..f5203230 100644 --- a/conf/modules.config +++ b/conf/modules.config @@ -13,29 +13,245 @@ process { publishDir = [ - path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" }, - mode: params.publish_dir_mode, - saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + enabled: false ] - withName: SAMPLESHEET_CHECK { + /* + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + OPTIONAL INPUT CREATION + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + */ + + withName: FAIDX { + ext.args = '' + } + + withName: CREATESEQUENCEDICTIONARY { + ext.args = '' + } + + withName: COMPOSESTRTABLEFILE { + ext.args = '' + } + + withName: UNTAR { + ext.args = '' + } + + /* + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + GERMLINE VARIANT CALLING + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + */ + + withName: BEDTOOLS_SPLIT { + ext.args = '--algorithm simple' + } + + withName: CALIBRATEDRAGSTRMODEL { + cpus = { check_max( 12 * task.attempt, 'cpus' ) } + ext.args = '' + } + + withName: HAPLOTYPECALLER { + publishDir = [ + enabled: params.scatter_count <= 1 ? true : false, + mode: params.publish_dir_mode, + path: { "${params.outdir}/individuals/${meta.samplename}" }, + saveAs: { filename -> filename.equals('versions.yml') ? 
null : filename } + ] // SAVE if scatter count <= 1 + cpus = { check_max( 1 * task.attempt, 'cpus' ) } + ext.prefix = {"${meta.id}.g"} + ext.args = '-ERC GVCF -contamination "0" -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 -G StandardAnnotation -G StandardHCAnnotation -G AS_StandardAnnotation --dragen-mode' + } + + withName: BCFTOOLS_CONCAT { + publishDir = [ + enabled: params.scatter_count > 1 ? true : false, + mode: params.publish_dir_mode, + path: { "${params.outdir}/individuals/${meta.samplename}" }, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] // SAVE if scatter count > 1 + ext.prefix = { "${meta.id}.g" } + ext.args = '-a' + } + + withName: TABIX_GVCFS { publishDir = [ - path: { "${params.outdir}/pipeline_info" }, + path: { "${params.outdir}/individuals/${meta.samplename}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] // SAVE + ext.args = '' + } + + /* + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + PREPROCESSING + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + */ + + withName: REBLOCKGVCF { + ext.args = '-do-qual-approx --floor-blocks -GQB 20 -GQB 30 -GQB 40' + } + + withName: COMBINEGVCFS { + ext.args = '' + } + + withName: BCFTOOLS_MERGE { + ext.args = {"--gvcf $fasta -m none --output-type z --force-samples"} + } + + withName: TABIX_COMBINED_GVCFS { + ext.args = '' + } + + withName: GENOTYPE_GVCFS { + ext.prefix = { "${meta.id}_genotyped" } + ext.args = '--allow-old-rms-mapping-quality-annotation-data -G StandardAnnotation -G AS_StandardAnnotation' + } + + withName: BCFTOOLS_VIEW { + ext.prefix = { "${meta.id}_viewed.g" } + ext.args = '-e \'QUAL="."\'' + } + + withName: BCFTOOLS_CONVERT { + ext.prefix = { "${meta.id}_converted" } + ext.args = '--gvcf2vcf --output-type v' + } + + withName: PEDFILTER { + ext.prefix = { "${meta.id}_pedigree" } + } + + withName: BGZIP_TABIX_PED_VCFS { + publishDir = [ + enabled: params.output_mode == "seqr" ? true : false, + path: { "${params.outdir}/families/${meta.family}" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename } ] } - withName: FASTQC { - ext.args = '--quiet' + withName: FILTER_SNPS { + ext.prefix = { "${meta.id}_filtered_snps" } + if( params.output_mode == "seqplorer" ){ + ext.args = '-O v --soft-filter \'GATKCutoffSNP\' -e \'TYPE="snp" && (MQRankSum < -12.5 || ReadPosRankSum < -8.0 || QD < 2.0 || FS > 60.0 || (QD < 10.0 && AD[0:1] / (AD[0:1] + AD[0:0]) < 0.25 && ReadPosRankSum < 0.0) || MQ < 30.0)\' -m \'+\'' + } + else if ( params.output_mode == "seqr" ){ + // TODO add seqr support (to be discussed) => don't forget to remove if statement in postprocess.nf! + ext.args = '' + } + } + + withName: FILTER_INDELS { + publishDir = [ + path: { "${params.outdir}/families/${meta.family}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? 
null : filename } + ] // SAVE + ext.prefix = { "${meta.id}_filtered_snps_indels" } + if( params.output_mode == "seqplorer" ){ + ext.args = '-O v --soft-filter \'GATKCutoffIndel\' -e \'TYPE="indel" && (ReadPosRankSum < -20.0 || QD < 2.0 || FS > 200.0 || SOR > 10.0 || (QD < 10.0 && AD[0:1] / (AD[0:1] + AD[0:0]) < 0.25 && ReadPosRankSum < 0.0))\' -m \'+\'' + } + else if ( params.output_mode == "seqr" ){ + // TODO add seqr support (to be discussed) => don't forget to remove if statement in postprocess.nf! + ext.args = '' + } + } + + /* + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + QUALITY CONTROL + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + */ + + withName: BCFTOOLS_STATS { + publishDir = [ + path: { "${params.outdir}/families/${meta.family}/reports" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] // SAVE + } + + withName: 'VCFTOOLS_.*'{ + publishDir = [ + path: { "${params.outdir}/families/${meta.family}/reports" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] // SAVE + } + + withName: VCFTOOLS_TSTV_COUNT{ + ext.args = "--TsTv-by-count" + } + + withName: VCFTOOLS_TSTV_QUAL{ + ext.args = "--TsTv-by-qual" + } + + withName: VCFTOOLS_SUMMARY{ + ext.args = "--FILTER-summary" + } + + /* + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ANNOTATION + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + */ + + withName: ENSEMBLVEP { + // errorStrategy = { task.exitStatus = 255 ? 'retry' : 'terminate' } + // maxRetries = 3 + // container = {params.vep_merged_cache ? "quay.io/biocontainers/ensembl-vep:${params.vep_version}--pl5321h4a94de4_${task.attempt - 1}" : "nfcore/vep:${params.vep_version}.${params.genome}"} + container = "nfcore/vep:${params.vep_version}.${params.genome}" + ext.args = [ + '--vcf --everything --filter_common --per_gene --total_length --offline --force_overwrite --buffer_size 100000 --hgvsg --shift_hgvs 1 --humdiv --var_synonyms --allele_number', + (params.vep_dbnsfp && params.dbnsfp) ? "--plugin dbNSFP,${params.dbnsfp.split('/')[-1]},rs_dbSNP,HGVSc_VEP,HGVSp_VEP,1000Gp3_EAS_AF,1000Gp3_AMR_AF,LRT_score,GERP++_RS,gnomAD_exomes_AF" : '', + (params.vep_spliceai && params.spliceai_snv && params.spliceai_indel) ? "--plugin SpliceAI,snv=${params.spliceai_snv.split('/')[-1]},indel=${params.spliceai_indel.split('/')[-1]}" : '', + (params.vep_spliceregion) ? '--plugin SpliceRegion' : '', + (params.vep_mastermind && params.mastermind) ? "--plugin Mastermind,${params.mastermind.split('/')[-1]}" : '', + (params.vep_eog && params.eog) ? "--custom ${params.eog.split('/')[-1]},EOG,vcf,overlap,0,AF" : '', + (params.vep_merged_cache) ? '--merged' : '', + ].join(' ').trim() + } + + withName: BGZIP_ANNOTATED_VCFS { + publishDir = [ + path: { "${params.outdir}/families/${meta.family}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? 
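The `ENSEMBLVEP` block above builds `ext.args` from a list in which every optional plugin contributes either its flag or an empty string, so disabled plugins simply drop out of the joined command line. The idiom in isolation, with two of the toggles from above (a sketch, not the full argument set):

```groovy
// The conditional-join idiom used to assemble the VEP arguments above:
// unset options collapse to '' and disappear after join(' ').trim().
def vep_args = [
    '--vcf --offline --force_overwrite',
    params.vep_spliceregion ? '--plugin SpliceRegion' : '',
    params.vep_merged_cache ? '--merged'              : '',
].join(' ').trim()
```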
null : "${meta.family}.ann.vcf.gz" }
+        ] // SAVE
+    }
+
+    /*
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+        FINAL PROCESSES
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    */
+
+    withName: VCF2DB {
+        publishDir = [
+            path: { "${params.outdir}/families/${meta.family}" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ] // SAVE
     }

     withName: CUSTOM_DUMPSOFTWAREVERSIONS {
+        cache = false
+    }
+
+    withName: MULTIQC {
         publishDir = [
-            path: { "${params.outdir}/pipeline_info" },
+            path: { "${params.outdir}/multiqc_reports" },
             mode: params.publish_dir_mode,
-            pattern: '*_versions.yml'
-        ]
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ] // SAVE => fixes the report location
+        errorStrategy = { task.exitStatus == 143 ? 'retry' : 'ignore' }
+        ext.args = { params.multiqc_config ? "--config $params.multiqc_config" : "" }
     }
 }
diff --git a/conf/test.config b/conf/test.config
index 6c8177ab..0138b920 100644
--- a/conf/test.config
+++ b/conf/test.config
@@ -5,7 +5,7 @@
     Defines input files and everything required to run a fast and simple pipeline test.

     Use as follows:
-        nextflow run nf-core/tva -profile test, --outdir
+        nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline -profile test, --outdir

----------------------------------------------------------------------------------------
*/

@@ -20,10 +20,25 @@ params {
     max_time   = '6.h'

     // Input data
-    // TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
-    // TODO nf-core: Give any required params for the test so that command line flags are not needed
-    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'
+    input = 'assets/samplesheet.csv'

     // Genome references
-    genome = 'R64-1-1'
+    fasta        = "https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/chr21/sequence/genome.fasta"
+    fasta_fai    = null //"https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/chr21/sequence/genome.fasta.fai"
+    dict         = null
+    strtablefile = null
+
+    // Pipeline specific parameters
+    use_dragstr_model  = true
+    output_mode        = "seqr"
+    scatter_count      = 2
+    always_use_cram    = true
+    skip_genotyping    = false
+    use_bcftools_merge = true
+
+    // VCFanno
+    vcfanno           = true
+    vcfanno_toml      = "https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/vcf/vcfanno/vcfanno.toml"
+    vcfanno_resources = "https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/vcf/vcfanno/vcfanno_grch38_module_test.tar.gz"
+
 }
diff --git a/conf/test_full.config b/conf/test_full.config
index cf76eb7f..28f60fff 100644
--- a/conf/test_full.config
+++ b/conf/test_full.config
@@ -5,7 +5,7 @@
     Defines input files and everything required to run a full size pipeline test.

     Use as follows:
-        nextflow run nf-core/tva -profile test_full, --outdir
+        nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline -profile test_full, --outdir

----------------------------------------------------------------------------------------
*/

@@ -15,8 +15,6 @@ params {
     config_profile_name        = 'Full test profile'
     config_profile_description = 'Full test dataset to check pipeline function'

     // Input data for full size test
-    // TODO nf-core: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g.
SRA) - // TODO nf-core: Give any required params for the test so that command line flags are not needed input = 'https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_full_illumina_amplicon.csv' // Genome references diff --git a/conf/test_local.config b/conf/test_local.config new file mode 100644 index 00000000..63e70488 --- /dev/null +++ b/conf/test_local.config @@ -0,0 +1,44 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running minimal tests +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a fast and simple pipeline test. + + Use as follows: + nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline -profile test, --outdir + +---------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Test profile' + config_profile_description = 'Minimal test dataset to check pipeline function' + + // Limit resources so that this can run on GitHub Actions + max_cpus = 2 + max_memory = '6.GB' + max_time = '6.h' + + // Input data + input = 'assets/samplesheet_local.csv' + + // Genome references + fasta = "/home/nvnieuwk/Documents/data/references/hg38.fa" + fasta_fai = "/home/nvnieuwk/Documents/data/references/hg38.fa.fai" + dict = "/home/nvnieuwk/Documents/data/references/hg38.dict" + strtablefile = "/home/nvnieuwk/Documents/data/references/hg38_strtable.zip" + + // Pipeline specific parameters + use_dragstr_model = true + output_mode = "seqplorer" + scatter_count = 2 + always_use_cram = false + skip_genotyping = false + use_bcftools_merge = true + + // VCFanno + vcfanno = true + vcfanno_toml = "/home/nvnieuwk/Documents/cmgg/nf-cmgg-germline/assets/vcfanno.toml" + vcfanno_resources = "/home/nvnieuwk/Documents/data/variation" + +} diff --git a/docs/README.md b/docs/README.md index e7792a0b..e5e6a90e 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,10 +1,8 @@ -# nf-core/tva: Documentation +# CenterForMedicalGeneticsGhent/nf-cmgg-germline: Documentation -The nf-core/tva documentation is split into the following pages: +The CenterForMedicalGeneticsGhent/nf-cmgg-germline documentation is split into the following pages: - [Usage](usage.md) - An overview of how the pipeline works, how to run it and a description of all of the different command-line flags. - [Output](output.md) - An overview of the different results produced by the pipeline and how to interpret them. 
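Both test profiles above still need a container engine and an output directory supplied on the command line; a typical invocation might look like the following, where the choice of `docker` and `results` is only an example:

```bash
# Example launch of the bundled test profile; swap `docker` for
# `singularity` or another supported engine as needed.
nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline -profile test,docker --outdir results
```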
- -You can find a lot more documentation about installing, configuring and running nf-core pipelines on the website: [https://nf-co.re](https://nf-co.re) diff --git a/docs/images/nf-cmgg-germline_logo_dark.png b/docs/images/nf-cmgg-germline_logo_dark.png new file mode 100644 index 00000000..d54f0dec Binary files /dev/null and b/docs/images/nf-cmgg-germline_logo_dark.png differ diff --git a/docs/images/nf-cmgg-germline_logo_light.png b/docs/images/nf-cmgg-germline_logo_light.png new file mode 100644 index 00000000..e7145878 Binary files /dev/null and b/docs/images/nf-cmgg-germline_logo_light.png differ diff --git a/docs/images/nf-cmgg-germline_metro.png b/docs/images/nf-cmgg-germline_metro.png new file mode 100644 index 00000000..83571459 Binary files /dev/null and b/docs/images/nf-cmgg-germline_metro.png differ diff --git a/docs/images/nf-core-nf-cmgg-germline_logo_dark.png b/docs/images/nf-core-nf-cmgg-germline_logo_dark.png new file mode 100644 index 00000000..d54f0dec Binary files /dev/null and b/docs/images/nf-core-nf-cmgg-germline_logo_dark.png differ diff --git a/docs/images/nf-core-nf-cmgg-germline_logo_light.png b/docs/images/nf-core-nf-cmgg-germline_logo_light.png new file mode 100644 index 00000000..e7145878 Binary files /dev/null and b/docs/images/nf-core-nf-cmgg-germline_logo_light.png differ diff --git a/docs/images/nf-core-tva_logo_dark.png b/docs/images/nf-core-tva_logo_dark.png deleted file mode 100644 index f193dee6..00000000 Binary files a/docs/images/nf-core-tva_logo_dark.png and /dev/null differ diff --git a/docs/images/nf-core-tva_logo_light.png b/docs/images/nf-core-tva_logo_light.png deleted file mode 100644 index 4537ea48..00000000 Binary files a/docs/images/nf-core-tva_logo_light.png and /dev/null differ diff --git a/docs/output.md b/docs/output.md index c2fc42f9..fd3d9cdc 100644 --- a/docs/output.md +++ b/docs/output.md @@ -1,68 +1,85 @@ -# nf-core/tva: Output +# CenterForMedicalGeneticsGhent/nf-cmgg-germline: Output ## Introduction This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. -The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. - - +The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. This is an example output when the pipeline has been run with the test data provided in the [samplesheet](../assets/samplesheet.csv). The output consists of 4 directories: `families`, `individuals`, `multiqc_reports` and `pipeline_info`. + +- The folder `families` contains the combined VCFs of every individual in the same family along with the quality reports generated from these files + - Seqr mode: The unfiltered VCFs after the merge of the individual VCFs, also contains the indices of these VCFs + - Seqplorer mode: The filtered VCFs and the annotated VCFs. 
Also contains a Gemini DB file of the annotated VCFs +- The folder `individuals` contains the GVCF and index of every individual +- The folder `multiqc_reports` contains all the MultiQC report files +- The folder `pipeline_info` contains reports on the execution of the pipeline + +### Seqr mode + +```bash +results/ +├── families +│ └── Proband_12345 +│ ├── Proband_12345.vcf.gz +│ ├── Proband_12345.vcf.gz.tbi +│ └── reports +│ ├── Proband_12345.bcftools_stats.txt +│ ├── Proband_12345.FILTER.summary +│ ├── Proband_12345.TsTv.count +│ └── Proband_12345.TsTv.qual +├── individuals +│ ├── NA12878K12_NVQ_034 +│ │ ├── NA12878K12_NVQ_034.g.vcf.gz +│ │ └── NA12878K12_NVQ_034.g.vcf.gz.tbi +│ └── NA24385D2_NVQ_034 +│ ├── NA24385D2_NVQ_034.g.vcf.gz +│ └── NA24385D2_NVQ_034.g.vcf.gz.tbi +├── multiqc_reports +│ ├── multiqc_data +│ ├── multiqc_plots +│ └── multiqc_report.html +└── pipeline_info + ├── execution_report_2022-10-03_11-56-25.html + ├── execution_timeline_2022-10-03_11-56-25.html + ├── execution_trace_2022-10-03_11-56-25.txt + └── pipeline_dag_2022-10-03_11-56-25.html +``` + +### Seqplorer mode + +```bash +results/ +├── families +│ └── Proband_12345 +│ ├── Proband_12345.ann.vcf.gz +│ ├── Proband_12345.db +│ ├── Proband_12345_filtered_snps_indels.vcf.gz +│ └── reports +│ ├── Proband_12345.bcftools_stats.txt +│ ├── Proband_12345.FILTER.summary +│ ├── Proband_12345.TsTv.count +│ └── Proband_12345.TsTv.qual +├── individuals +│ ├── NA12878K12_NVQ_034 +│ │ ├── NA12878K12_NVQ_034.g.vcf.gz +│ │ └── NA12878K12_NVQ_034.g.vcf.gz.tbi +│ └── NA24385D2_NVQ_034 +│ ├── NA24385D2_NVQ_034.g.vcf.gz +│ └── NA24385D2_NVQ_034.g.vcf.gz.tbi +├── multiqc_reports +│ ├── multiqc_data +│ ├── multiqc_plots +│ └── multiqc_report.html +└── pipeline_info + ├── execution_report_2022-10-03_11-51-54.html + ├── execution_timeline_2022-10-03_11-51-54.html + ├── execution_trace_2022-10-03_11-51-54.txt + └── pipeline_dag_2022-10-03_11-51-54.html +``` ## Pipeline overview The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: -- [FastQC](#fastqc) - Raw read QC -- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline -- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution - -### FastQC - -
-Output files - -- `fastqc/` - - `*_fastqc.html`: FastQC report containing quality metrics. - - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images. - -
- -[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). - -![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png) - -![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png) - -![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png) - -> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. - -### MultiQC - -
-Output files - -- `multiqc/` - - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser. - - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline. - - `multiqc_plots/`: directory containing static images from the report in various formats. - -
- -[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. - -Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see . - -### Pipeline information - -
-Output files - -- `pipeline_info/` - - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. - - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline. - - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`. - -
+![Metro map of the pipeline workflow](images/nf-cmgg-germline_metro.png)

 [Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
diff --git a/docs/usage.md b/docs/usage.md
index 56943522..77479a1c 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -1,54 +1,38 @@
-# nf-core/tva: Usage
-
-## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/tva/usage](https://nf-co.re/tva/usage)
-
-> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._
+# CenterForMedicalGeneticsGhent/nf-cmgg-germline: Usage

 ## Introduction

-
-
 ## Samplesheet input

-You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
+You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 5 columns, and a header row as shown in the examples below.

-```console
+```bash
 --input '[path to samplesheet file]'
 ```

-### Multiple runs of the same sample
+### Example of the samplesheet

-The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes:
+The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. Either the `ped` or `family` field can be used to specify the family name. The pipeline automatically extracts the family id from the `ped` file if the `family` field is empty. The `family` field is used to specify which samples should be joint-genotyped together. Below is an example of what the samplesheet could look like:

 ```console
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz
+sample,family,cram,crai,bed,ped
+SAMPLE_1,FAMILY_1,SAMPLE_1.cram,SAMPLE_1.crai,SAMPLE_1.bed,FAMILY_1.ped
+SAMPLE_2,FAMILY_1,SAMPLE_2.cram,SAMPLE_2.crai,SAMPLE_2.bed,FAMILY_1.ped
+SAMPLE_3,FAMILY_2,SAMPLE_3.cram,SAMPLE_3.crai,SAMPLE_3.bed,FAMILY_2.ped
 ```

 ### Full samplesheet

-The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below.
-
-A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice.
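The family-id fallback described above (take the `family` column when it is filled in, otherwise the first column of the PED file) could be sketched roughly as follows; this is a hypothetical illustration, not the pipeline's actual parsing code:

```groovy
// Hypothetical sketch of the family-id fallback; `family` and `ped`
// mirror the samplesheet columns described above.
def familyId(String family, File ped) {
    if (family) return family
    def record = ped.readLines().find { it && !it.startsWith('#') }
    return record ? record.tokenize()[0] : null // first PED column is the family id
}
```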
+The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 5 columns to match those defined in the table below. -```console -sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz -CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz -TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, -TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, -``` - -| Column | Description | -| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | -| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | -| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | +| Column | Description | +| -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | +| `family` | The family ID of the specified sample. This field is optional, as the family id can also be extracted from the `ped` file. Spaces in sample names are automatically converted to underscores (`_`). | +| `cram` | Full path to CRAM file fetched from the preprocessing pipeline. File has to have the extension ".cram". | +| `crai` | Full path to CRAM index file fetched from the preprocessing pipeline. File has to have the extension ".crai" or ".bai". | +| `bed` | Full path to BED file containing the regions to call on. File has to have the extension ".bed". | +| `ped` | Full path to PED file containing the relational information between samples in the same family to call on. File has to have the extension ".ped". | An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. @@ -57,36 +41,20 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p The typical command for running the pipeline is as follows: ```console -nextflow run nf-core/tva --input samplesheet.csv --outdir --genome GRCh37 -profile docker +nextflow run CenterForMedicalGeneticsGhent/nf-cmgg-germline --input samplesheet.csv --outdir --scatter_count 5 --fasta genome.fasta -profile docker ``` This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: -```console +```bash work # Directory containing the nextflow working files - # Finished results in specified location (defined with --outdir) + # Finished results in specified location (defined with --outdir) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, eg. history of pipeline runs and old logs. 
``` -### Updating the pipeline - -When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: - -```console -nextflow pull nf-core/tva -``` - -### Reproducibility - -It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. - -First, go to the [nf-core/tva releases page](https://github.com/nf-core/tva/releases) and find the latest version number - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`. - -This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. - ## Core Nextflow arguments > **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen). @@ -251,6 +219,6 @@ Some HPC setups also allow you to run nextflow within a cluster job submitted yo In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~./bash_profile`): -```console +```bash NXF_OPTS='-Xms1g -Xmx4g' ``` diff --git a/hpc_input/NA12878.ped b/hpc_input/NA12878.ped new file mode 100644 index 00000000..72cdce95 --- /dev/null +++ b/hpc_input/NA12878.ped @@ -0,0 +1,3 @@ +#fam-id ind-id pat-id mat-id sex phen +Proband_12345 NA12878K12_NVQ_034 NA24385D2_NVQ_034 0 2 0 +Proband_12345 NA24385D2_NVQ_034 0 0 1 0 \ No newline at end of file diff --git a/hpc_input/samplesheet_full.csv b/hpc_input/samplesheet_full.csv new file mode 100644 index 00000000..07e2345c --- /dev/null +++ b/hpc_input/samplesheet_full.csv @@ -0,0 +1,3 @@ +sample,cram,crai,bed,ped +NA12878K12_NVQ_034,/kyukon/data/gent/448/vsc44804/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-ready.cram,/kyukon/data/gent/448/vsc44804/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-ready.cram.crai,data/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-callable.bed,/kyukon/data/gent/vo/000/gvo00082/vsc44804/nf-cmgg-germline/hpc_input/NA12878.ped +NA24385D2_NVQ_034,/kyukon/data/gent/448/vsc44804/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-ready.cram,,/kyukon/data/gent/448/vsc44804/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-callable.bed,/kyukon/data/gent/vo/000/gvo00082/vsc44804/nf-cmgg-germline/hpc_input/NA12878.ped diff --git a/hpc_input/samplesheet_small.csv b/hpc_input/samplesheet_small.csv new file mode 100644 index 00000000..0dab7a6f --- /dev/null +++ b/hpc_input/samplesheet_small.csv @@ -0,0 +1,3 @@ +sample,cram,crai,bed,ped +NA12878K12_NVQ_034,data/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-subset.cram,data/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-subset.cram.bai,data/GIAB/NA12878K12_NVQ_034/NA12878K12_NVQ_034-callable.bed,/kyukon/data/gent/vo/000/gvo00082/vsc44804/nf-cmgg-germline/hpc_input/NA12878.ped 
+NA24385D2_NVQ_034,data/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-subset.cram,data/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-subset.cram.bai,data/GIAB/NA24385D2_NVQ_034/NA24385D2_NVQ_034-callable.bed,/kyukon/data/gent/vo/000/gvo00082/vsc44804/nf-cmgg-germline/hpc_input/NA12878.ped \ No newline at end of file diff --git a/lib/NfcoreTemplate.groovy b/lib/NfcoreTemplate.groovy index 2fc0a9b9..7d089463 100755 --- a/lib/NfcoreTemplate.groovy +++ b/lib/NfcoreTemplate.groovy @@ -3,6 +3,7 @@ // import org.yaml.snakeyaml.Yaml +import static groovy.json.JsonOutput.toJson class NfcoreTemplate { @@ -145,6 +146,62 @@ class NfcoreTemplate { output_tf.withWriter { w -> w << email_txt } } + // + // Construct and send adaptive card + // https://adaptivecards.io + // + + public static void adaptivecard(workflow, params, summary_params, projectDir, log, multiqc_report=[]) { + def hook_url = params.hook_url + + def summary = [:] + for (group in summary_params.keySet()) { + summary << summary_params[group] + } + + def misc_fields = [:] + misc_fields['start'] = workflow.start + misc_fields['complete'] = workflow.complete + misc_fields['scriptfile'] = workflow.scriptFile + misc_fields['scriptid'] = workflow.scriptId + if (workflow.repository) misc_fields['repository'] = workflow.repository + if (workflow.commitId) misc_fields['commitid'] = workflow.commitId + if (workflow.revision) misc_fields['revision'] = workflow.revision + misc_fields['nxf_version'] = workflow.nextflow.version + misc_fields['nxf_build'] = workflow.nextflow.build + misc_fields['nxf_timestamp'] = workflow.nextflow.timestamp + + def msg_fields = [:] + msg_fields['version'] = workflow.manifest.version + msg_fields['runName'] = workflow.runName + msg_fields['success'] = workflow.success + msg_fields['dateComplete'] = workflow.complete + msg_fields['duration'] = workflow.duration + msg_fields['exitStatus'] = workflow.exitStatus + msg_fields['errorMessage'] = (workflow.errorMessage ?: 'None') + msg_fields['errorReport'] = (workflow.errorReport ?: 'None') + msg_fields['commandLine'] = workflow.commandLine + msg_fields['projectDir'] = workflow.projectDir + msg_fields['summary'] = summary << misc_fields + + // Render the JSON template + def engine = new groovy.text.GStringTemplateEngine() + def hf = new File("$projectDir/assets/adaptivecard_template.json") + def json_template = engine.createTemplate(hf).make(msg_fields) + def json_message = json_template.toString() + + // POST + def post = new URL(hook_url).openConnection(); + post.setRequestMethod("POST") + post.setDoOutput(true) + post.setRequestProperty("Content-Type", "application/json") + post.getOutputStream().write(json_message.getBytes("UTF-8")); + def postRC = post.getResponseCode(); + if (! 
postRC.equals(200)) { + println(post.getErrorStream().getText()); + } + } + // // Print pipeline summary on completion // diff --git a/lib/WorkflowMain.groovy b/lib/WorkflowMain.groovy index 975580a5..113cf297 100755 --- a/lib/WorkflowMain.groovy +++ b/lib/WorkflowMain.groovy @@ -1,5 +1,5 @@ // -// This file holds several functions specific to the main.nf workflow in the nf-core/tva pipeline +// This file holds several functions specific to the main.nf workflow in the CenterForMedicalGeneticsGhent/nf-cmgg-germline pipeline // class WorkflowMain { @@ -9,7 +9,6 @@ class WorkflowMain { // public static String citation(workflow) { return "If you use ${workflow.manifest.name} for your analysis please cite:\n\n" + - // TODO nf-core: Add Zenodo DOI for pipeline after first release //"* The pipeline\n" + //" https://doi.org/10.5281/zenodo.XXXXXXX\n\n" + "* The nf-core framework\n" + @@ -22,7 +21,7 @@ class WorkflowMain { // Print help to screen if required // public static String help(workflow, params, log) { - def command = "nextflow run ${workflow.manifest.name} --input samplesheet.csv --genome GRCh37 -profile docker" + def command = "nextflow run ${workflow.manifest.name} --input samplesheet.csv --scatter_count 5 --fasta genome.fasta --outdir results -profile docker" def help_string = '' help_string += NfcoreTemplate.logo(workflow, params.monochrome_logs) help_string += NfcoreSchema.paramsHelp(workflow, params, command) @@ -59,6 +58,7 @@ class WorkflowMain { } // Print parameter summary log to screen + log.info paramsSummaryLog(workflow, params, log) // Check that a -profile or Nextflow config has been provided to run the pipeline @@ -78,17 +78,15 @@ class WorkflowMain { System.exit(1) } } - // // Get attribute from genome config file e.g. fasta // - public static String getGenomeAttribute(params, attribute) { - def val = '' + public static Object getGenomeAttribute(params, attribute) { if (params.genomes && params.genome && params.genomes.containsKey(params.genome)) { if (params.genomes[ params.genome ].containsKey(attribute)) { - val = params.genomes[ params.genome ][ attribute ] + return params.genomes[ params.genome ][ attribute ] } } - return val + return null } } diff --git a/lib/WorkflowTva.groovy b/lib/WorkflowNfCmggGermline.groovy similarity index 94% rename from lib/WorkflowTva.groovy rename to lib/WorkflowNfCmggGermline.groovy index 7a382537..942bdf67 100755 --- a/lib/WorkflowTva.groovy +++ b/lib/WorkflowNfCmggGermline.groovy @@ -1,8 +1,8 @@ // -// This file holds several functions specific to the workflow/tva.nf in the nf-core/tva pipeline +// This file holds several functions specific to the workflow/nf-cmgg-germline.nf in the nf-cmgg-germline pipeline // -class WorkflowTva { +class WorkflowNfCmggGermline { // // Check and validate parameters diff --git a/main.nf b/main.nf index 362de5d3..b8c84b54 100644 --- a/main.nf +++ b/main.nf @@ -1,11 +1,9 @@ #!/usr/bin/env nextflow /* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - nf-core/tva + CenterForMedicalGeneticsGhent/nf-cmgg-germline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Github : https://github.com/nf-core/tva - Website: https://nf-co.re/tva - Slack : https://nfcore.slack.com/channels/tva + Github : https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline ---------------------------------------------------------------------------------------- */ @@ -33,13 +31,13 @@ WorkflowMain.initialise(workflow, params, log) 
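The `getGenomeAttribute` change above is small but deliberate: returning `null` instead of `''` for a missing attribute lets callers rely on plain truthiness checks. A sketch of the usual call site, as in the nf-core template's `main.nf` (the error message is illustrative):

```groovy
// Typical usage near the top of main.nf: an attribute missing from the
// genomes map now stays null rather than becoming the empty string.
params.fasta = WorkflowMain.getGenomeAttribute(params, 'fasta')
if (!params.fasta) exit 1, "A reference FASTA must be supplied with --fasta or via --genome."
```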
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */ -include { TVA } from './workflows/tva' +include { NF_CMGG_GERMLINE } from './workflows/nf-cmgg-germline' // -// WORKFLOW: Run main nf-core/tva analysis pipeline +// WORKFLOW: Run main nf-cmgg-germline analysis pipeline // -workflow NFCORE_TVA { - TVA () +workflow NFCORE_NF_CMGG_GERMLINE { + NF_CMGG_GERMLINE () } /* @@ -53,7 +51,7 @@ workflow NFCORE_TVA { // See: https://github.com/nf-core/rnaseq/issues/619 // workflow { - NFCORE_TVA () + NFCORE_NF_CMGG_GERMLINE () } /* diff --git a/modules.json b/modules.json index 3f1333e1..4e998288 100644 --- a/modules.json +++ b/modules.json @@ -1,16 +1,114 @@ { - "name": "nf-core/tva", - "homePage": "https://github.com/nf-core/tva", + "name": "CenterForMedicalGeneticsGhent/nf-cmgg-germline", + "homePage": "https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline", "repos": { "nf-core/modules": { - "custom/dumpsoftwareversions": { - "git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d" - }, - "fastqc": { - "git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d" - }, - "multiqc": { - "git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d" + "git_url": "https://github.com/nf-core/modules.git", + "modules": { + "bcftools/concat": { + "branch": "master", + "git_sha": "682f789f93070bd047868300dd018faf3d434e7c" + }, + "bcftools/convert": { + "branch": "master", + "git_sha": "8656636f0d0a86aa3966052b5c2cd06141647c70" + }, + "bcftools/filter": { + "branch": "master", + "git_sha": "682f789f93070bd047868300dd018faf3d434e7c" + }, + "bcftools/merge": { + "branch": "master", + "git_sha": "8656636f0d0a86aa3966052b5c2cd06141647c70" + }, + "bcftools/stats": { + "branch": "master", + "git_sha": "41dfa13929d2c178855159a69d2e2958c52be155" + }, + "bcftools/view": { + "branch": "master", + "git_sha": "682f789f93070bd047868300dd018faf3d434e7c" + }, + "bedtools/split": { + "branch": "master", + "git_sha": "90aef30f432332bdf0ce9f4b9004aa5d5c4960bb" + }, + "custom/dumpsoftwareversions": { + "branch": "master", + "git_sha": "82501fe6d0d12614db67751d30af98d16e63dc59" + }, + "ensemblvep": { + "branch": "master", + "git_sha": "5ccf6fbcc913f34ee2897689081d1cf60cecdb35" + }, + "gatk4/calibratedragstrmodel": { + "branch": "master", + "git_sha": "4c7ef30fb64f75ba4499d3b8fba24a068b1ce586" + }, + "gatk4/combinegvcfs": { + "branch": "master", + "git_sha": "169b2b96c1167f89ab07127b7057c1d90a6996c7" + }, + "gatk4/composestrtablefile": { + "branch": "master", + "git_sha": "114a54c8d5a8e898a126c2804e3e221286eb2682" + }, + "gatk4/createsequencedictionary": { + "branch": "master", + "git_sha": "169b2b96c1167f89ab07127b7057c1d90a6996c7" + }, + "gatk4/genotypegvcfs": { + "branch": "master", + "git_sha": "169b2b96c1167f89ab07127b7057c1d90a6996c7" + }, + "gatk4/haplotypecaller": { + "branch": "master", + "git_sha": "e53d091a6de1ae9fd681351c085d8abe076ba1ec" + }, + "gatk4/reblockgvcf": { + "branch": "master", + "git_sha": "873215c8ae3882e3ce1c8c62fbae16e74d631270" + }, + "multiqc": { + "branch": "master", + "git_sha": "90aef30f432332bdf0ce9f4b9004aa5d5c4960bb" + }, + "samtools/faidx": { + "branch": "master", + "git_sha": "3eb99152cedbb7280258858e5df08478a4670696" + }, + "samtools/index": { + "branch": "master", + "git_sha": "897c33d5da084b61109500ee44c01da2d3e4e773" + }, + "tabix/bgzip": { + "branch": "master", + "git_sha": "31c0b49f6527ef196e89eca49a36af2de71711f8" + }, + "tabix/bgziptabix": { + "branch": "master", + "git_sha": "5e7b1ef9a5a2d9258635bcbf70fcf37dacd1b247" + }, + "tabix/tabix": { + 
"branch": "master", + "git_sha": "5e7b1ef9a5a2d9258635bcbf70fcf37dacd1b247" + }, + "untar": { + "branch": "master", + "git_sha": "b63b9f752dc8e43fc70b0491aad5e0a270ab0e10" + }, + "vcf2db": { + "branch": "master", + "git_sha": "233fa70811a03a4cecb2ece483b5c8396e2cee1d" + }, + "vcfanno": { + "branch": "master", + "git_sha": "13631304102ca3d99def3578611b79332f6fd175" + }, + "vcftools": { + "branch": "master", + "git_sha": "5e7b1ef9a5a2d9258635bcbf70fcf37dacd1b247" + } } } } diff --git a/modules/local/merge_beds.nf b/modules/local/merge_beds.nf new file mode 100644 index 00000000..78aa21c5 --- /dev/null +++ b/modules/local/merge_beds.nf @@ -0,0 +1,39 @@ +process MERGE_BEDS { + tag "$meta.id" + label 'process_medium' + + conda (params.enable_conda ? "bioconda::bedtools=2.30.0" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/bedtools:2.30.0--hc088bd4_0' : + 'quay.io/biocontainers/bedtools:2.30.0--hc088bd4_0' }" + + input: + tuple val(meta), path(bed, stageAs: "?/*") + + output: + tuple val(meta), path('*.bed'), emit: bed + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + """ + for FILE in */*.bed.gz; + do + if [[ \$FILE != '*/*.bed.gz' ]] + then + gunzip \$FILE + fi + done; + + awk 'FNR==1{print ""}1' */*.bed | sort -k 1,1 -k2,2n | bedtools merge > ${meta.id}.bed + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + bedtools: \$(bedtools --version | sed -e "s/bedtools v//g") + END_VERSIONS + """ +} diff --git a/modules/local/merge_vcf_headers.nf b/modules/local/merge_vcf_headers.nf new file mode 100644 index 00000000..fcbe671c --- /dev/null +++ b/modules/local/merge_vcf_headers.nf @@ -0,0 +1,34 @@ +process MERGE_VCF_HEADERS { + tag "$meta.id" + label "process_low" + + conda (params.enable_conda ? "conda-forge::python=3.9" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/python:3.9' : + 'quay.io/biocontainers/python:3.9' }" + + input: + tuple val(meta), path(vcf), path(ped_vcf) + + output: + tuple val(meta), path("*.vcf") , emit: vcf + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: // This script is bundled with the pipeline, in nf-cmgg-germline/bin/ + def prefix = task.ext.prefix ?: "${meta.id}" + + """ + merge_vcf_headers.py \\ + $vcf \\ + $ped_vcf \\ + ${prefix}.vcf + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + python: \$(python --version | sed 's/Python //g') + END_VERSIONS + """ +} diff --git a/modules/local/rtgtools/pedfilter/main.nf b/modules/local/rtgtools/pedfilter/main.nf new file mode 100644 index 00000000..820ff214 --- /dev/null +++ b/modules/local/rtgtools/pedfilter/main.nf @@ -0,0 +1,35 @@ +process RTGTOOLS_PEDFILTER { + tag "$meta.id" + label 'process_low' + + conda (params.enable_conda ? "bioconda::rtg-tools=3.12.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/rtg-tools:3.12.1--hdfd78af_0': + 'quay.io/biocontainers/rtg-tools:3.12.1--hdfd78af_0' }" + + input: + tuple val(meta), path(ped) + + output: + tuple val(meta), path("*.vcf") , emit: vcf + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + + """ + rtg pedfilter \\ + $ped \\ + --vcf \\ + > ${prefix}.vcf + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + rtgtools: \$(echo \$(rtg version | head -n 1 | awk '{print \$4}')) + END_VERSIONS + """ +} diff --git a/modules/local/rtgtools/pedfilter/meta.yml b/modules/local/rtgtools/pedfilter/meta.yml new file mode 100644 index 00000000..826f8558 --- /dev/null +++ b/modules/local/rtgtools/pedfilter/meta.yml @@ -0,0 +1,45 @@ +name: "rtgtools_pedfilter" +description: Converts a PED file to VCF headers +keywords: + - rtgtools + - pedfilter + - vcf +tools: + - "rtgtools": + description: "RealTimeGenomics Tools -- Utilities for accurate VCF comparison and manipulation" + homepage: "https://www.realtimegenomics.com/products/rtg-tools" + documentation: "https://github.com/RealTimeGenomics/rtg-tools" + tool_dev_url: "https://github.com/RealTimeGenomics/rtg-tools" + doi: "" + licence: "['BSD']" + +input: + # Only when we have meta + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - ped: + type: file + description: PED file + pattern: "*.ped" + +output: + #Only when we have meta + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - vcf: + type: file + description: VCF file containing only headers fetched from the PED file + pattern: "*.vcf.gz" + +authors: + - "@nvnieuwk" diff --git a/modules/local/samplesheet_check.nf b/modules/local/samplesheet_check.nf deleted file mode 100644 index bdb2c37b..00000000 --- a/modules/local/samplesheet_check.nf +++ /dev/null @@ -1,27 +0,0 @@ -process SAMPLESHEET_CHECK { - tag "$samplesheet" - - conda (params.enable_conda ? "conda-forge::python=3.8.3" : null) - container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? - 'https://depot.galaxyproject.org/singularity/python:3.8.3' : - 'quay.io/biocontainers/python:3.8.3' }" - - input: - path samplesheet - - output: - path '*.csv' , emit: csv - path "versions.yml", emit: versions - - script: // This script is bundled with the pipeline, in nf-core/tva/bin/ - """ - check_samplesheet.py \\ - $samplesheet \\ - samplesheet.valid.csv - - cat <<-END_VERSIONS > versions.yml - "${task.process}": - python: \$(python --version | sed 's/Python //g') - END_VERSIONS - """ -} diff --git a/modules/local/samtools_merge.nf b/modules/local/samtools_merge.nf new file mode 100644 index 00000000..6e254ead --- /dev/null +++ b/modules/local/samtools_merge.nf @@ -0,0 +1,58 @@ +process SAMTOOLS_MERGE { + tag "$meta.id" + label 'process_low' + + conda (params.enable_conda ? "bioconda::samtools=1.15.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
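Looking back at the local `MERGE_BEDS` module above: its shell block first gunzips any staged `*.bed.gz` (the string comparison guards against the literal, unexpanded glob when no gzipped files are staged), then concatenates, coordinate-sorts and merges all BED files. Stripped of the staging details, the core is equivalent to this sketch (illustrative file names, assuming each input ends with a newline):

```bash
# Core of MERGE_BEDS: sort the concatenated intervals by chromosome and
# start position, then collapse overlaps into one non-overlapping BED.
cat run1.bed run2.bed | sort -k1,1 -k2,2n | bedtools merge > merged.bed
```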
+ 'https://depot.galaxyproject.org/singularity/samtools:1.15.1--h1170115_0' : + 'quay.io/biocontainers/samtools:1.15.1--h1170115_0' }" + + input: + tuple val(meta), path(input_files, stageAs: "?/*") + path fasta + path fai + val always_use_cram + + output: + tuple val(meta), path("*.bam") , optional:true, emit: bam + tuple val(meta), path("*.cram"), optional:true, emit: cram + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def args2 = task.ext.args2 ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def reference = fasta ? "--reference ${fasta}" : "" + def convert_to_cram = always_use_cram ? + "samtools view --threads ${task.cpus} --reference ${fasta} $args2 ${prefix}.bam -C -o ${prefix}.cram && rm ${prefix}.bam" : "" + """ + samtools \\ + merge \\ + --threads ${task.cpus} \\ + $args \\ + ${reference} \\ + ${prefix}.bam \\ + $input_files + + $convert_to_cram + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//') + END_VERSIONS + """ + + stub: + prefix = task.ext.suffix ? "${meta.id}${task.ext.suffix}" : "${meta.id}" + """ + touch ${prefix}.bam + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/bcftools/concat/main.nf b/modules/nf-core/modules/bcftools/concat/main.nf new file mode 100644 index 00000000..d2a58a55 --- /dev/null +++ b/modules/nf-core/modules/bcftools/concat/main.nf @@ -0,0 +1,35 @@ +process BCFTOOLS_CONCAT { + tag "$meta.id" + label 'process_medium' + + conda (params.enable_conda ? "bioconda::bcftools=1.15.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/bcftools:1.15.1--h0ea216a_0': + 'quay.io/biocontainers/bcftools:1.15.1--h0ea216a_0' }" + + input: + tuple val(meta), path(vcfs), path(tbi) + + output: + tuple val(meta), path("*.gz"), emit: vcf + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}" + """ + bcftools concat \\ + --output ${prefix}.vcf.gz \\ + $args \\ + --threads $task.cpus \\ + ${vcfs} + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/bcftools/concat/meta.yml b/modules/nf-core/modules/bcftools/concat/meta.yml new file mode 100644 index 00000000..167dbe5a --- /dev/null +++ b/modules/nf-core/modules/bcftools/concat/meta.yml @@ -0,0 +1,48 @@ +name: bcftools_concat +description: Concatenate VCF files +keywords: + - variant calling + - concat + - bcftools + - VCF + +tools: + - concat: + description: | + Concatenate VCF files. + homepage: http://samtools.github.io/bcftools/bcftools.html + documentation: http://www.htslib.org/doc/bcftools.html + doi: 10.1093/bioinformatics/btp352 + licence: ["MIT"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - vcfs: + type: files + description: | + List containing 2 or more vcf files + e.g. [ 'file1.vcf', 'file2.vcf' ] + - tbi: + type: files + description: | + List containing 2 or more index files (optional) + e.g. 
[ 'file1.tbi', 'file2.tbi' ] +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - vcf: + type: file + description: VCF concatenated output file + pattern: "*.{vcf.gz}" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@abhi18av" diff --git a/modules/nf-core/modules/bcftools/convert/main.nf b/modules/nf-core/modules/bcftools/convert/main.nf new file mode 100644 index 00000000..f184c2f9 --- /dev/null +++ b/modules/nf-core/modules/bcftools/convert/main.nf @@ -0,0 +1,51 @@ +process BCFTOOLS_CONVERT { + tag "$meta.id" + label 'process_medium' + + conda (params.enable_conda ? "bioconda::bcftools=1.15.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/bcftools:1.15.1--h0ea216a_0': + 'quay.io/biocontainers/bcftools:1.15.1--h0ea216a_0' }" + + input: + tuple val(meta), path(input), path(input_index) + path bed + path fasta + + output: + tuple val(meta), path("*.vcf.gz"), optional:true , emit: vcf_gz + tuple val(meta), path("*.vcf") , optional:true , emit: vcf + tuple val(meta), path("*.bcf.gz"), optional:true , emit: bcf_gz + tuple val(meta), path("*.bcf") , optional:true , emit: bcf + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + + def regions = bed ? "--regions-file $bed" : "" + def reference = fasta ? "--fasta-ref $fasta" : "" + def extension = args.contains("--output-type b") || args.contains("-Ob") ? "bcf.gz" : + args.contains("--output-type u") || args.contains("-Ou") ? "bcf" : + args.contains("--output-type z") || args.contains("-Oz") ? "vcf.gz" : + args.contains("--output-type v") || args.contains("-Ov") ? "vcf" : + "vcf.gz" + + """ + bcftools convert \\ + $args \\ + $regions \\ + --output ${prefix}.${extension} \\ + --threads $task.cpus \\ + $reference \\ + $input + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/bcftools/convert/meta.yml b/modules/nf-core/modules/bcftools/convert/meta.yml new file mode 100644 index 00000000..48cf3a9d --- /dev/null +++ b/modules/nf-core/modules/bcftools/convert/meta.yml @@ -0,0 +1,74 @@ +name: "bcftools_convert" +description: Converts certain output formats to VCF +keywords: + - bcftools + - convert + - vcf + - gvcf +tools: + - "bcftools": + description: "BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed. Most commands accept VCF, bgzipped VCF and BCF with filetype detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most, but not all situations." + homepage: "https://samtools.github.io/bcftools/bcftools.html" + documentation: "https://samtools.github.io/bcftools/bcftools.html#convert" + tool_dev_url: "https://github.com/samtools/bcftools" + doi: "https://doi.org/10.1093/gigascience/giab008" + licence: "['GPL']" + +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
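`BCFTOOLS_CONVERT` above and `BCFTOOLS_MERGE` elsewhere in this patch derive the output file extension by scanning `ext.args` for the bcftools `--output-type`/`-O` flag. The idiom in isolation:

```groovy
// Extension detection as used by the bcftools modules in this patch: the
// suffix follows whatever --output-type / -O flag was passed in ext.args.
def args = '--gvcf2vcf --output-type v'
def extension = args.contains("--output-type b") || args.contains("-Ob") ? "bcf.gz" :
                args.contains("--output-type u") || args.contains("-Ou") ? "bcf"    :
                args.contains("--output-type z") || args.contains("-Oz") ? "vcf.gz" :
                args.contains("--output-type v") || args.contains("-Ov") ? "vcf"    :
                "vcf.gz"
assert extension == "vcf"
```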
[ id:'test', single_end:false ] + - input: + type: file + description: | + The input format. Each format needs a seperate parameter to be specified in the `args`: + - GEN/SAMPLE file: `--gensample2vcf` + - gVCF file: `--gvcf2vcf` + - HAP/SAMPLE file: `--hapsample2vcf` + - HAP/LEGEND/SAMPLE file: `--haplegendsample2vcf` + - TSV file: `--tsv2vcf` + pattern: "*.{gen,sample,g.vcf,hap,legend}{.gz,}" + - input_index: + type: file + description: (Optional) The index for the input files, if needed + pattern: "*.bed" + - bed: + type: file + description: (Optional) The BED file containing the regions for the VCF file + pattern: "*.bed" + - fasta: + type: file + description: (Optional) The reference fasta, only needed for gVCF conversion + pattern: "*.{fa,fasta}" + +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - vcf_gz: + type: file + description: VCF merged output file (bgzipped) => when `--output-type z` is used + pattern: "*.vcf.gz" + - vcf: + type: file + description: VCF merged output file => when `--output-type v` is used + pattern: "*.vcf" + - bcf_gz: + type: file + description: BCF merged output file (bgzipped) => when `--output-type b` is used + pattern: "*.bcf.gz" + - bcf: + type: file + description: BCF merged output file => when `--output-type u` is used + pattern: "*.bcf" + +authors: + - "@nvnieuwk" diff --git a/modules/nf-core/modules/bcftools/filter/main.nf b/modules/nf-core/modules/bcftools/filter/main.nf new file mode 100644 index 00000000..ef99eda2 --- /dev/null +++ b/modules/nf-core/modules/bcftools/filter/main.nf @@ -0,0 +1,34 @@ +process BCFTOOLS_FILTER { + tag "$meta.id" + label 'process_medium' + + conda (params.enable_conda ? "bioconda::bcftools=1.15.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/bcftools:1.15.1--h0ea216a_0': + 'quay.io/biocontainers/bcftools:1.15.1--h0ea216a_0' }" + + input: + tuple val(meta), path(vcf) + + output: + tuple val(meta), path("*.gz"), emit: vcf + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + """ + bcftools filter \\ + --output ${prefix}.vcf.gz \\ + $args \\ + $vcf + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/bcftools/filter/meta.yml b/modules/nf-core/modules/bcftools/filter/meta.yml new file mode 100644 index 00000000..05a6d828 --- /dev/null +++ b/modules/nf-core/modules/bcftools/filter/meta.yml @@ -0,0 +1,41 @@ +name: bcftools_filter +description: Filters VCF files +keywords: + - variant calling + - filtering + - VCF +tools: + - filter: + description: | + Apply fixed-threshold filters to VCF files. + homepage: http://samtools.github.io/bcftools/bcftools.html + documentation: http://www.htslib.org/doc/bcftools.html + doi: 10.1093/bioinformatics/btp352 + licence: ["MIT"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test', single_end:false ]
+  - vcf:
+      type: file
+      description: VCF input file
+      pattern: "*.{vcf}"
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcf:
+      type: file
+      description: VCF filtered output file
+      pattern: "*.{vcf}"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+authors:
+  - "@joseespinosa"
+  - "@drpatelh"
diff --git a/modules/nf-core/modules/bcftools/merge/main.nf b/modules/nf-core/modules/bcftools/merge/main.nf
new file mode 100644
index 00000000..af586cd1
--- /dev/null
+++ b/modules/nf-core/modules/bcftools/merge/main.nf
@@ -0,0 +1,46 @@
+process BCFTOOLS_MERGE {
+    tag "$meta.id"
+    label 'process_medium'
+
+    conda (params.enable_conda ? "bioconda::bcftools=1.15.1" : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/bcftools:1.15.1--h0ea216a_0':
+        'quay.io/biocontainers/bcftools:1.15.1--h0ea216a_0' }"
+
+    input:
+    tuple val(meta), path(vcfs), path(tbis)
+    path bed
+    path fasta
+    path fasta_fai
+
+    output:
+    tuple val(meta), path("*.{bcf,vcf}{,.gz}"), emit: merged_variants
+    path "versions.yml"                       , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+
+    def regions = bed ? "--regions-file $bed" : ""
+    def extension = args.contains("--output-type b") || args.contains("-Ob") ? "bcf.gz" :
+                    args.contains("--output-type u") || args.contains("-Ou") ? "bcf" :
+                    args.contains("--output-type v") || args.contains("-Ov") ? "vcf" :
+                    "vcf.gz"
+
+    """
+    bcftools merge \\
+        $regions \\
+        --threads $task.cpus \\
+        --output ${prefix}.${extension} \\
+        $args \\
+        *.vcf.gz
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
+    END_VERSIONS
+    """
+}
diff --git a/modules/nf-core/modules/bcftools/merge/meta.yml b/modules/nf-core/modules/bcftools/merge/meta.yml
new file mode 100644
index 00000000..53dc23eb
--- /dev/null
+++ b/modules/nf-core/modules/bcftools/merge/meta.yml
@@ -0,0 +1,72 @@
+name: bcftools_merge
+description: Merge VCF files
+keywords:
+  - variant calling
+  - merge
+  - VCF
+tools:
+  - merge:
+      description: |
+        Merge VCF files.
+      homepage: http://samtools.github.io/bcftools/bcftools.html
+      documentation: http://www.htslib.org/doc/bcftools.html
+      doi: 10.1093/bioinformatics/btp352
+      licence: ["MIT"]
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcfs:
+      type: files
+      description: |
+        List containing 2 or more vcf files
+        e.g. [ 'file1.vcf', 'file2.vcf' ]
+  - tbis:
+      type: files
+      description: |
+        List containing the tbi index files corresponding to the vcfs input files
+        e.g. [ 'file1.vcf.tbi', 'file2.vcf.tbi' ]
+  - bed:
+      type: file
+      description: "(Optional) The bed regions to merge on"
+      pattern: "*.bed"
+  - fasta:
+      type: file
+      description: "(Optional) The fasta reference file (only necessary for the `--gvcf FILE` parameter)"
+      pattern: "*.{fasta,fa}"
+  - fasta_fai:
+      type: file
+      description: "(Optional) The fasta reference file index (only necessary for the `--gvcf FILE` parameter)"
+      pattern: "*.fai"
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g.
diff --git a/modules/nf-core/modules/bcftools/merge/meta.yml b/modules/nf-core/modules/bcftools/merge/meta.yml
new file mode 100644
index 00000000..53dc23eb
--- /dev/null
+++ b/modules/nf-core/modules/bcftools/merge/meta.yml
@@ -0,0 +1,72 @@
+name: bcftools_merge
+description: Merge VCF files
+keywords:
+  - variant calling
+  - merge
+  - VCF
+tools:
+  - merge:
+      description: |
+        Merge VCF files.
+      homepage: http://samtools.github.io/bcftools/bcftools.html
+      documentation: http://www.htslib.org/doc/bcftools.html
+      doi: 10.1093/bioinformatics/btp352
+      licence: ["MIT"]
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcfs:
+      type: files
+      description: |
+        List containing 2 or more vcf files
+        e.g. [ 'file1.vcf', 'file2.vcf' ]
+  - tbis:
+      type: files
+      description: |
+        List containing the tbi index files corresponding to the vcfs input files
+        e.g. [ 'file1.vcf.tbi', 'file2.vcf.tbi' ]
+  - bed:
+      type: file
+      description: "(Optional) The bed regions to merge on"
+      pattern: "*.bed"
+  - fasta:
+      type: file
+      description: "(Optional) The fasta reference file (only necessary for the `--gvcf FILE` parameter)"
+      pattern: "*.{fasta,fa}"
+  - fasta_fai:
+      type: file
+      description: "(Optional) The fasta reference file index (only necessary for the `--gvcf FILE` parameter)"
+      pattern: "*.fai"
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcf_gz:
+      type: file
+      description: VCF merged output file (bgzipped) => when `--output-type z` is used
+      pattern: "*.vcf.gz"
+  - vcf:
+      type: file
+      description: VCF merged output file => when `--output-type v` is used
+      pattern: "*.vcf"
+  - bcf_gz:
+      type: file
+      description: BCF merged output file (bgzipped) => when `--output-type b` is used
+      pattern: "*.bcf.gz"
+  - bcf:
+      type: file
+      description: BCF merged output file => when `--output-type u` is used
+      pattern: "*.bcf"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+authors:
+  - "@joseespinosa"
+  - "@drpatelh"
+  - "@nvnieuwk"
diff --git a/modules/nf-core/modules/bcftools/stats/main.nf b/modules/nf-core/modules/bcftools/stats/main.nf
new file mode 100644
index 00000000..6a755c0e
--- /dev/null
+++ b/modules/nf-core/modules/bcftools/stats/main.nf
@@ -0,0 +1,42 @@
+process BCFTOOLS_STATS {
+    tag "$meta.id"
+    label 'process_single'
+
+    conda (params.enable_conda ? "bioconda::bcftools=1.15.1" : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/bcftools:1.15.1--h0ea216a_0':
+        'quay.io/biocontainers/bcftools:1.15.1--h0ea216a_0' }"
+
+    input:
+    tuple val(meta), path(vcf), path(tbi)
+    path regions
+    path targets
+    path samples
+
+    output:
+    tuple val(meta), path("*stats.txt"), emit: stats
+    path "versions.yml"                , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    def regions_file = regions ? "--regions-file ${regions}" : ""
+    def targets_file = targets ? "--targets-file ${targets}" : ""
+    def samples_file = samples ? "--samples-file ${samples}" : ""
+    """
+    bcftools stats \\
+        $args \\
+        $regions_file \\
+        $targets_file \\
+        $samples_file \\
+        $vcf > ${prefix}.bcftools_stats.txt
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
+    END_VERSIONS
+    """
+}
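The three optional inputs are plain `path` declarations, so callers pass an empty list for any that are unused, which is the usual nf-core idiom; the resulting `*stats.txt` is typically collected into the MultiQC input channel. A hypothetical call (channel name assumed):

```groovy
// Illustrative: regions, targets and samples are all optional here
BCFTOOLS_STATS(ch_vcf_tbi, [], [], [])
```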
diff --git a/modules/nf-core/modules/bcftools/stats/meta.yml b/modules/nf-core/modules/bcftools/stats/meta.yml
new file mode 100644
index 00000000..f7afcd50
--- /dev/null
+++ b/modules/nf-core/modules/bcftools/stats/meta.yml
@@ -0,0 +1,61 @@
+name: bcftools_stats
+description: Generates stats from VCF files
+keywords:
+  - variant calling
+  - stats
+  - VCF
+tools:
+  - stats:
+      description: |
+        Parses VCF or BCF and produces text file stats which is suitable for
+        machine processing and can be plotted using plot-vcfstats.
+      homepage: http://samtools.github.io/bcftools/bcftools.html
+      documentation: http://www.htslib.org/doc/bcftools.html
+      doi: 10.1093/bioinformatics/btp352
+      licence: ["MIT"]
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcf:
+      type: file
+      description: VCF input file
+      pattern: "*.{vcf}"
+  - tbi:
+      type: file
+      description: |
+        The tabix index for the VCF file to be inspected. Optional: only required when the regions parameter is used.
+      pattern: "*.tbi"
+  - regions:
+      type: file
+      description: |
+        Optionally, restrict the operation to regions listed in this file. (VCF, BED or tab-delimited)
+  - targets:
+      type: file
+      description: |
+        Optionally, restrict the operation to regions listed in this file (doesn't rely upon tbi index files)
+  - samples:
+      type: file
+      description: |
+        Optional, file of sample names to be included or excluded.
+        e.g. 'file.tsv'
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - stats:
+      type: file
+      description: Text output file containing stats
+      pattern: "*_{stats.txt}"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
authors:
+  - "@joseespinosa"
+  - "@drpatelh"
+  - "@SusiJo"
diff --git a/modules/nf-core/modules/bcftools/view/main.nf b/modules/nf-core/modules/bcftools/view/main.nf
new file mode 100644
index 00000000..3df08a57
--- /dev/null
+++ b/modules/nf-core/modules/bcftools/view/main.nf
@@ -0,0 +1,55 @@
+process BCFTOOLS_VIEW {
+    tag "$meta.id"
+    label 'process_medium'
+
+    conda (params.enable_conda ? "bioconda::bcftools=1.15.1" : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/bcftools:1.15.1--h0ea216a_0':
+        'quay.io/biocontainers/bcftools:1.15.1--h0ea216a_0' }"
+
+    input:
+    tuple val(meta), path(vcf), path(index)
+    path(regions)
+    path(targets)
+    path(samples)
+
+    output:
+    tuple val(meta), path("*.gz") , emit: vcf
+    path "versions.yml"           , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    def regions_file = regions ? "--regions-file ${regions}" : ""
+    def targets_file = targets ? "--targets-file ${targets}" : ""
+    def samples_file = samples ? "--samples-file ${samples}" : ""
+    """
+    bcftools view \\
+        --output ${prefix}.vcf.gz \\
+        ${regions_file} \\
+        ${targets_file} \\
+        ${samples_file} \\
+        $args \\
+        --threads $task.cpus \\
+        ${vcf}
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
+    END_VERSIONS
+    """
+
+    stub:
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    """
+    touch ${prefix}.vcf.gz
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
+    END_VERSIONS
+    """
+}
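As with the filter module, view always writes `${prefix}.vcf.gz`, so `--output-type z` should normally be supplied via `ext.args` to make the content match the extension. A hypothetical call that subsets a joint VCF to the samples listed in a file, with the unused optional inputs passed as empty lists:

```groovy
// Illustrative: ch_vcf_tbi = [meta, vcf, index]; ch_samples_file is an assumed
// channel holding a one-name-per-line samples file for --samples-file
BCFTOOLS_VIEW(ch_vcf_tbi, [], [], ch_samples_file)
```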
diff --git a/modules/nf-core/modules/bcftools/view/meta.yml b/modules/nf-core/modules/bcftools/view/meta.yml
new file mode 100644
index 00000000..326fd1fa
--- /dev/null
+++ b/modules/nf-core/modules/bcftools/view/meta.yml
@@ -0,0 +1,63 @@
+name: bcftools_view
+description: View, subset and filter VCF or BCF files by position and filtering expression. Convert between VCF and BCF
+keywords:
+  - variant calling
+  - view
+  - bcftools
+  - VCF
+
+tools:
+  - view:
+      description: |
+        View, subset and filter VCF or BCF files by position and filtering expression. Convert between VCF and BCF
+      homepage: http://samtools.github.io/bcftools/bcftools.html
+      documentation: http://www.htslib.org/doc/bcftools.html
+      doi: 10.1093/bioinformatics/btp352
+      licence: ["MIT"]
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcf:
+      type: file
+      description: |
+        The vcf file to be inspected.
+        e.g. 'file.vcf'
+  - index:
+      type: file
+      description: |
+        The tabix index for the VCF file to be inspected.
+        e.g. 'file.tbi'
+  - regions:
+      type: file
+      description: |
+        Optionally, restrict the operation to regions listed in this file.
+        e.g. 'file.vcf'
+  - targets:
+      type: file
+      description: |
+        Optionally, restrict the operation to regions listed in this file (doesn't rely upon index files)
+        e.g. 'file.vcf'
+  - samples:
+      type: file
+      description: |
+        Optional, file of sample names to be included or excluded.
+        e.g. 'file.tsv'
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcf:
+      type: file
+      description: VCF output file
+      pattern: "*.{vcf.gz}"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+authors:
+  - "@abhi18av"
diff --git a/modules/nf-core/modules/bedtools/split/main.nf b/modules/nf-core/modules/bedtools/split/main.nf
new file mode 100644
index 00000000..aaa5497d
--- /dev/null
+++ b/modules/nf-core/modules/bedtools/split/main.nf
@@ -0,0 +1,38 @@
+process BEDTOOLS_SPLIT {
+    tag "$meta.id"
+    label 'process_single'
+
+    conda (params.enable_conda ? "bioconda::bedtools=2.30.0" : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/bedtools:2.30.0--h468198e_3':
+        'quay.io/biocontainers/bedtools:2.30.0--h7d7f7ad_2' }"
+
+    input:
+    tuple val(meta), path(bed)
+    val(number_of_files)
+
+    output:
+    tuple val(meta), path("*.bed"), emit: beds
+    path "versions.yml"           , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+
+    """
+    bedtools \\
+        split \\
+        $args \\
+        -i $bed \\
+        -p $prefix \\
+        -n $number_of_files
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bedtools: \$(bedtools --version | sed -e "s/bedtools v//g")
+    END_VERSIONS
+    """
+}
diff --git a/modules/nf-core/modules/bedtools/split/meta.yml b/modules/nf-core/modules/bedtools/split/meta.yml
new file mode 100644
index 00000000..1f41cc70
--- /dev/null
+++ b/modules/nf-core/modules/bedtools/split/meta.yml
@@ -0,0 +1,41 @@
+name: "bedtools_split"
+description: Split BED files into several smaller BED files
+keywords:
+  - split
+tools:
+  - "bedtools":
+      description: "A powerful toolset for genome arithmetic"
+      documentation: "https://bedtools.readthedocs.io/en/latest/"
+      licence: "['MIT', 'GPL v2']"
+
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - bed:
+      type: file
+      description: BED file
+      pattern: "*.bed"
+  - number_of_files:
+      type: value
+      description: The number of files to split the BED into
+
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g.
[ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - beds: + type: file + description: list of split BED files + pattern: "*.bed" + +authors: + - "@nvnieuwk" diff --git a/modules/nf-core/modules/custom/dumpsoftwareversions/main.nf b/modules/nf-core/modules/custom/dumpsoftwareversions/main.nf index 327d5100..cebb6e05 100644 --- a/modules/nf-core/modules/custom/dumpsoftwareversions/main.nf +++ b/modules/nf-core/modules/custom/dumpsoftwareversions/main.nf @@ -1,11 +1,11 @@ process CUSTOM_DUMPSOFTWAREVERSIONS { - label 'process_low' + label 'process_single' // Requires `pyyaml` which does not have a dedicated container but is in the MultiQC container - conda (params.enable_conda ? "bioconda::multiqc=1.11" : null) + conda (params.enable_conda ? 'bioconda::multiqc=1.13' : null) container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? - 'https://depot.galaxyproject.org/singularity/multiqc:1.11--pyhdfd78af_0' : - 'quay.io/biocontainers/multiqc:1.11--pyhdfd78af_0' }" + 'https://depot.galaxyproject.org/singularity/multiqc:1.13--pyhdfd78af_0' : + 'quay.io/biocontainers/multiqc:1.13--pyhdfd78af_0' }" input: path versions diff --git a/modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py b/modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py index d1390392..7c2abfa4 100644 --- a/modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py +++ b/modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py @@ -58,11 +58,12 @@ def _make_versions_html(versions): for process, process_versions in versions_by_process.items(): module = process.split(":")[-1] try: - assert versions_by_module[module] == process_versions, ( - "We assume that software versions are the same between all modules. " - "If you see this error-message it means you discovered an edge-case " - "and should open an issue in nf-core/tools. " - ) + if versions_by_module[module] != process_versions: + raise AssertionError( + "We assume that software versions are the same between all modules. " + "If you see this error-message it means you discovered an edge-case " + "and should open an issue in nf-core/tools. 
" + ) except KeyError: versions_by_module[module] = process_versions diff --git a/modules/nf-core/modules/ensemblvep/Dockerfile b/modules/nf-core/modules/ensemblvep/Dockerfile new file mode 100644 index 00000000..4ada7c6b --- /dev/null +++ b/modules/nf-core/modules/ensemblvep/Dockerfile @@ -0,0 +1,31 @@ +FROM nfcore/base:1.14 +LABEL \ + author="Maxime Garcia" \ + description="VEP image for nf-core pipelines" \ + maintainer="maxime.garcia@scilifelab.se" + +# Install the conda environment +COPY environment.yml / +RUN conda env create -f /environment.yml && conda clean -a + +# Setup default ARG variables +ARG GENOME=GRCh38 +ARG SPECIES=homo_sapiens +ARG VEP_VERSION=105 +ARG VEP_TAG=105.0 + +# Add conda installation dir to PATH (instead of doing 'conda activate') +ENV PATH /opt/conda/envs/nf-core-vep-${VEP_TAG}/bin:$PATH + +# Download Genome +RUN vep_install \ + -a c \ + -c .vep \ + -s ${SPECIES} \ + -y ${GENOME} \ + --CACHE_VERSION ${VEP_VERSION} \ + --CONVERT \ + --NO_BIOPERL --NO_HTSLIB --NO_TEST --NO_UPDATE + +# Dump the details of the installed packages to a file for posterity +RUN conda env export --name nf-core-vep-${VEP_TAG} > nf-core-vep-${VEP_TAG}.yml diff --git a/modules/nf-core/modules/ensemblvep/build.sh b/modules/nf-core/modules/ensemblvep/build.sh new file mode 100755 index 00000000..6f340c0f --- /dev/null +++ b/modules/nf-core/modules/ensemblvep/build.sh @@ -0,0 +1,28 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Build and push all containers + +build_push() { + GENOME=$1 + SPECIES=$2 + VEP_VERSION=$3 + VEP_TAG=$4 + + docker build \ + . \ + -t nfcore/vep:${VEP_TAG}.${GENOME} \ + --build-arg GENOME=${GENOME} \ + --build-arg SPECIES=${SPECIES} \ + --build-arg VEP_VERSION=${VEP_VERSION} \ + --build-arg VEP_TAG=${VEP_TAG} + + docker push nfcore/vep:${VEP_TAG}.${GENOME} +} + +build_push "GRCh37" "homo_sapiens" "105" "105.0" +build_push "GRCh38" "homo_sapiens" "105" "105.0" +build_push "GRCm38" "mus_musculus" "102" "105.0" +build_push "GRCm39" "mus_musculus" "105" "105.0" +build_push "CanFam3.1" "canis_lupus_familiaris" "104" "105.0" +build_push "WBcel235" "caenorhabditis_elegans" "105" "105.0" diff --git a/modules/nf-core/modules/ensemblvep/environment.yml b/modules/nf-core/modules/ensemblvep/environment.yml new file mode 100644 index 00000000..5df85b80 --- /dev/null +++ b/modules/nf-core/modules/ensemblvep/environment.yml @@ -0,0 +1,10 @@ +# You can use this file to create a conda environment for this module: +# conda env create -f environment.yml +name: nf-core-vep-105.0 +channels: + - conda-forge + - bioconda + - defaults + +dependencies: + - bioconda::ensembl-vep=105.0 diff --git a/modules/nf-core/modules/ensemblvep/main.nf b/modules/nf-core/modules/ensemblvep/main.nf new file mode 100644 index 00000000..d2efe35f --- /dev/null +++ b/modules/nf-core/modules/ensemblvep/main.nf @@ -0,0 +1,57 @@ +process ENSEMBLVEP { + tag "$meta.id" + label 'process_medium' + + conda (params.enable_conda ? "bioconda::ensembl-vep=104.3" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/ensembl-vep:104.3--pl5262h4a94de4_0' : + 'quay.io/biocontainers/ensembl-vep:104.3--pl5262h4a94de4_0' }" + + input: + tuple val(meta), path(vcf) + val genome + val species + val cache_version + path cache + path fasta + path extra_files + + output: + tuple val(meta), path("*.ann.vcf"), emit: vcf + path "*.summary.html" , emit: report + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def dir_cache = cache ? "\${PWD}/${cache}" : "/.vep" + def reference = fasta ? "--fasta $fasta" : "" + + """ + mkdir $prefix + + vep \\ + -i $vcf \\ + -o ${prefix}.ann.vcf \\ + $args \\ + $reference \\ + --assembly $genome \\ + --species $species \\ + --cache \\ + --cache_version $cache_version \\ + --dir_cache $dir_cache \\ + --fork $task.cpus \\ + --vcf \\ + --stats_file ${prefix}.summary.html + + rm -rf $prefix + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + ensemblvep: \$( echo \$(vep --help 2>&1) | sed 's/^.*Versions:.*ensembl-vep : //;s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/ensemblvep/meta.yml b/modules/nf-core/modules/ensemblvep/meta.yml new file mode 100644 index 00000000..9891815d --- /dev/null +++ b/modules/nf-core/modules/ensemblvep/meta.yml @@ -0,0 +1,63 @@ +name: ENSEMBLVEP +description: Ensembl Variant Effect Predictor (VEP) +keywords: + - annotation +tools: + - ensemblvep: + description: | + VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs + or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. + homepage: https://www.ensembl.org/info/docs/tools/vep/index.html + documentation: https://www.ensembl.org/info/docs/tools/vep/script/index.html + licence: ["Apache-2.0"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - vcf: + type: file + description: | + vcf to annotate + - genome: + type: value + description: | + which genome to annotate with + - species: + type: value + description: | + which species to annotate with + - cache_version: + type: value + description: | + which version of the cache to annotate with + - cache: + type: file + description: | + path to VEP cache (optional) + - fasta: + type: file + description: | + reference FASTA file (optional) + pattern: "*.{fasta,fa}" + - extra_files: + type: tuple + description: | + path to file(s) needed for plugins (optional) +output: + - vcf: + type: file + description: | + annotated vcf + pattern: "*.ann.vcf" + - report: + type: file + description: VEP report file + pattern: "*.html" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@maxulysse" diff --git a/modules/nf-core/modules/fastqc/main.nf b/modules/nf-core/modules/fastqc/main.nf deleted file mode 100644 index ed6b8c50..00000000 --- a/modules/nf-core/modules/fastqc/main.nf +++ /dev/null @@ -1,47 +0,0 @@ -process FASTQC { - tag "$meta.id" - label 'process_medium' - - conda (params.enable_conda ? "bioconda::fastqc=0.11.9" : null) - container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
- 'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' : - 'quay.io/biocontainers/fastqc:0.11.9--0' }" - - input: - tuple val(meta), path(reads) - - output: - tuple val(meta), path("*.html"), emit: html - tuple val(meta), path("*.zip") , emit: zip - path "versions.yml" , emit: versions - - when: - task.ext.when == null || task.ext.when - - script: - def args = task.ext.args ?: '' - // Add soft-links to original FastQs for consistent naming in pipeline - def prefix = task.ext.prefix ?: "${meta.id}" - if (meta.single_end) { - """ - [ ! -f ${prefix}.fastq.gz ] && ln -s $reads ${prefix}.fastq.gz - fastqc $args --threads $task.cpus ${prefix}.fastq.gz - - cat <<-END_VERSIONS > versions.yml - "${task.process}": - fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" ) - END_VERSIONS - """ - } else { - """ - [ ! -f ${prefix}_1.fastq.gz ] && ln -s ${reads[0]} ${prefix}_1.fastq.gz - [ ! -f ${prefix}_2.fastq.gz ] && ln -s ${reads[1]} ${prefix}_2.fastq.gz - fastqc $args --threads $task.cpus ${prefix}_1.fastq.gz ${prefix}_2.fastq.gz - - cat <<-END_VERSIONS > versions.yml - "${task.process}": - fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" ) - END_VERSIONS - """ - } -} diff --git a/modules/nf-core/modules/fastqc/meta.yml b/modules/nf-core/modules/fastqc/meta.yml deleted file mode 100644 index 4da5bb5a..00000000 --- a/modules/nf-core/modules/fastqc/meta.yml +++ /dev/null @@ -1,52 +0,0 @@ -name: fastqc -description: Run FastQC on sequenced reads -keywords: - - quality control - - qc - - adapters - - fastq -tools: - - fastqc: - description: | - FastQC gives general quality metrics about your reads. - It provides information about the quality score distribution - across your reads, the per base sequence content (%A/C/G/T). - You get information about adapter contamination and other - overrepresented sequences. - homepage: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ - documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/ - licence: ["GPL-2.0-only"] -input: - - meta: - type: map - description: | - Groovy Map containing sample information - e.g. [ id:'test', single_end:false ] - - reads: - type: file - description: | - List of input FastQ files of size 1 and 2 for single-end and paired-end data, - respectively. -output: - - meta: - type: map - description: | - Groovy Map containing sample information - e.g. [ id:'test', single_end:false ] - - html: - type: file - description: FastQC report - pattern: "*_{fastqc.html}" - - zip: - type: file - description: FastQC report archive - pattern: "*_{fastqc.zip}" - - versions: - type: file - description: File containing software versions - pattern: "versions.yml" -authors: - - "@drpatelh" - - "@grst" - - "@ewels" - - "@FelixKrueger" diff --git a/modules/nf-core/modules/gatk4/calibratedragstrmodel/main.nf b/modules/nf-core/modules/gatk4/calibratedragstrmodel/main.nf new file mode 100644 index 00000000..37a54de4 --- /dev/null +++ b/modules/nf-core/modules/gatk4/calibratedragstrmodel/main.nf @@ -0,0 +1,51 @@ +process GATK4_CALIBRATEDRAGSTRMODEL { + tag "$meta.id" + label 'process_medium' + + conda (params.enable_conda ? "bioconda::gatk4=4.2.6.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/gatk4:4.2.6.1--hdfd78af_0': + 'quay.io/biocontainers/gatk4:4.2.6.1--hdfd78af_0' }" + + input: + tuple val(meta), path(bam), path(bam_index), path(intervals) + path fasta + path fasta_fai + path dict + path strtablefile + + output: + tuple val(meta), path("*.txt") , emit: dragstr_model + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def intervals_command = intervals ? "--intervals $intervals" : "" + + def avail_mem = 3 + if (!task.memory) { + log.info '[GATK CalibrateDragstrModel] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.' + } else { + avail_mem = task.memory.giga + } + """ + gatk --java-options "-Xmx${avail_mem}g" CalibrateDragstrModel \\ + --input $bam \\ + --output ${prefix}.txt \\ + --reference $fasta \\ + --str-table-path $strtablefile \\ + --threads $task.cpus \\ + $intervals_command \\ + --tmp-dir . \\ + $args + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/gatk4/calibratedragstrmodel/meta.yml b/modules/nf-core/modules/gatk4/calibratedragstrmodel/meta.yml new file mode 100644 index 00000000..e71dac5e --- /dev/null +++ b/modules/nf-core/modules/gatk4/calibratedragstrmodel/meta.yml @@ -0,0 +1,74 @@ +name: gatk4_calibratedragstrmodel +description: estimates the parameters for the DRAGstr model +keywords: + - gatk4 + - bam + - cram + - sam + - calibratedragstrmodel +tools: + - gatk4: + description: + Genome Analysis Toolkit (GATK4). Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools + with a primary focus on variant discovery and genotyping. Its powerful processing engine + and high-performance computing features make it capable of taking on projects of any size. + homepage: https://gatk.broadinstitute.org/hc/en-us + documentation: https://gatk.broadinstitute.org/hc/en-us/articles/360057441571-CalibrateDragstrModel-BETA- + tool_dev_url: https://github.com/broadinstitute/gatk + doi: 10.1158/1538-7445.AM2017-3590 + licence: ["Apache-2.0"] + +input: + # Only when we have meta + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - bam: + type: file + description: BAM/CRAM/SAM file + pattern: "*.{bam,cram,sam}" + - bam_index: + type: file + description: index of the BAM/CRAM/SAM file + pattern: "*.{bai,crai,sai}" + - intervals: + type: file + description: BED file or interval list containing regions (optional) + pattern: "*.{bed,interval_list}" + - fasta: + type: file + description: The reference FASTA file + pattern: "*.{fasta,fa}" + - fasta_fai: + type: file + description: The index of the reference FASTA file + pattern: "*.fai" + - dict: + type: file + description: The sequence dictionary of the reference FASTA file + pattern: "*.dict" + - strtablefile: + type: file + description: The StrTableFile zip folder of the reference FASTA file + pattern: "*.zip" + +output: + #Only when we have meta + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - dragstr_model: + type: file + description: The DragSTR model + pattern: "*.txt" + +authors: + - "@nvnieuwk" diff --git a/modules/nf-core/modules/gatk4/combinegvcfs/main.nf b/modules/nf-core/modules/gatk4/combinegvcfs/main.nf new file mode 100644 index 00000000..db4d9cdb --- /dev/null +++ b/modules/nf-core/modules/gatk4/combinegvcfs/main.nf @@ -0,0 +1,47 @@ +process GATK4_COMBINEGVCFS { + tag "$meta.id" + label 'process_low' + + conda (params.enable_conda ? "bioconda::gatk4=4.2.6.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/gatk4:4.2.6.1--hdfd78af_0': + 'quay.io/biocontainers/gatk4:4.2.6.1--hdfd78af_0' }" + + input: + tuple val(meta), path(vcf), path(vcf_idx) + path fasta + path fai + path dict + + output: + tuple val(meta), path("*.combined.g.vcf.gz"), emit: combined_gvcf + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def input_list = vcf.collect{"--variant $it"}.join(' ') + + def avail_mem = 3 + if (!task.memory) { + log.info '[GATK COMBINEGVCFS] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.' + } else { + avail_mem = task.memory.giga + } + """ + gatk --java-options "-Xmx${avail_mem}g" CombineGVCFs \\ + $input_list \\ + --output ${prefix}.combined.g.vcf.gz \\ + --reference ${fasta} \\ + --tmp-dir . \\ + $args + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/gatk4/combinegvcfs/meta.yml b/modules/nf-core/modules/gatk4/combinegvcfs/meta.yml new file mode 100644 index 00000000..9330e084 --- /dev/null +++ b/modules/nf-core/modules/gatk4/combinegvcfs/meta.yml @@ -0,0 +1,61 @@ +name: gatk4_combinegvcfs +description: Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file +keywords: + - gvcf + - gatk4 + - vcf + - combinegvcfs + - Short_Variant_Discovery +tools: + - gatk4: + description: + Genome Analysis Toolkit (GATK4). Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools + with a primary focus on variant discovery and genotyping. Its powerful processing engine + and high-performance computing features make it capable of taking on projects of any size. + homepage: https://gatk.broadinstitute.org/hc/en-us + documentation: https://gatk.broadinstitute.org/hc/en-us/articles/360037593911-CombineGVCFs + tool_dev_url: https://github.com/broadinstitute/gatk + doi: 10.1158/1538-7445.AM2017-3590 + licence: ["Apache-2.0"] + +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test' ] + - vcf: + type: file + description: Compressed VCF files + pattern: "*.vcf.gz" + - vcf_idx: + type: file + description: VCF Index file + pattern: "*.vcf.gz.idx" + - fasta: + type: file + description: The reference fasta file + pattern: "*.fasta" + - fai: + type: file + description: FASTA index file + pattern: "*.fasta.fai" + - dict: + type: file + description: FASTA dictionary file + pattern: "*.dict" +output: + - gvcf: + type: file + description: Compressed Combined GVCF file + pattern: "*.combined.g.vcf.gz" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + +authors: + - "@sateeshperi" + - "@mjcipriano" + - "@hseabolt" + - "@maxulysse" diff --git a/modules/nf-core/modules/gatk4/composestrtablefile/main.nf b/modules/nf-core/modules/gatk4/composestrtablefile/main.nf new file mode 100644 index 00000000..8f2f00f2 --- /dev/null +++ b/modules/nf-core/modules/gatk4/composestrtablefile/main.nf @@ -0,0 +1,53 @@ +process GATK4_COMPOSESTRTABLEFILE { + tag "$fasta" + label 'process_low' + + conda (params.enable_conda ? "bioconda::gatk4=4.2.6.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/gatk4:4.2.6.1--hdfd78af_0': + 'quay.io/biocontainers/gatk4:4.2.6.1--hdfd78af_0' }" + + input: + path(fasta) + path(fasta_fai) + path(dict) + + output: + path "*.zip" , emit: str_table + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + + def avail_mem = 6 + if (!task.memory) { + log.info '[GATK ComposeSTRTableFile] Available memory not known - defaulting to 6GB. Specify process memory requirements to change this.' + } else { + avail_mem = task.memory.giga + } + """ + gatk --java-options "-Xmx${avail_mem}g" ComposeSTRTableFile \\ + --reference $fasta \\ + --output ${fasta.baseName}.zip \\ + --tmp-dir . \\ + $args + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//') + END_VERSIONS + """ + + stub: + """ + touch test.zip + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/gatk4/composestrtablefile/meta.yml b/modules/nf-core/modules/gatk4/composestrtablefile/meta.yml new file mode 100644 index 00000000..eb825ef4 --- /dev/null +++ b/modules/nf-core/modules/gatk4/composestrtablefile/meta.yml @@ -0,0 +1,43 @@ +name: "gatk4_composestrtablefile" +description: This tool looks for low-complexity STR sequences along the reference that are later used to estimate the Dragstr model during single sample auto calibration CalibrateDragstrModel. +keywords: + - gatk4 + - composestrtablefile +tools: + - gatk4: + description: + Genome Analysis Toolkit (GATK4). Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools + with a primary focus on variant discovery and genotyping. Its powerful processing engine + and high-performance computing features make it capable of taking on projects of any size. 
+ homepage: https://gatk.broadinstitute.org/hc/en-us + documentation: https://gatk.broadinstitute.org/hc/en-us/articles/4405451249819-ComposeSTRTableFile + tool_dev_url: https://github.com/broadinstitute/gatk + doi: 10.1158/1538-7445.AM2017-3590 + licence: ["Apache-2.0"] + +input: + - fasta: + type: file + description: FASTA reference file + pattern: "*.{fasta,fa}" + - fasta_fai: + type: file + description: index of the FASTA reference file + pattern: "*.fai" + - dict: + type: file + description: Sequence dictionary of the FASTA reference file + pattern: "*.dict" + +output: + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - str_table: + type: file + description: A zipped folder containing the STR table files + pattern: "*.zip" + +authors: + - "@nvnieuwk" diff --git a/modules/nf-core/modules/gatk4/createsequencedictionary/main.nf b/modules/nf-core/modules/gatk4/createsequencedictionary/main.nf new file mode 100644 index 00000000..13fa9e81 --- /dev/null +++ b/modules/nf-core/modules/gatk4/createsequencedictionary/main.nf @@ -0,0 +1,51 @@ +process GATK4_CREATESEQUENCEDICTIONARY { + tag "$fasta" + label 'process_medium' + + conda (params.enable_conda ? "bioconda::gatk4=4.2.6.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/gatk4:4.2.6.1--hdfd78af_0': + 'quay.io/biocontainers/gatk4:4.2.6.1--hdfd78af_0' }" + + input: + path fasta + + output: + path "*.dict" , emit: dict + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + + def avail_mem = 6 + if (!task.memory) { + log.info '[GATK CreateSequenceDictionary] Available memory not known - defaulting to 6GB. Specify process memory requirements to change this.' + } else { + avail_mem = task.memory.giga + } + """ + gatk --java-options "-Xmx${avail_mem}g" CreateSequenceDictionary \\ + --REFERENCE $fasta \\ + --URI $fasta \\ + --TMP_DIR . \\ + $args + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//') + END_VERSIONS + """ + + stub: + """ + touch test.dict + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/gatk4/createsequencedictionary/meta.yml b/modules/nf-core/modules/gatk4/createsequencedictionary/meta.yml new file mode 100644 index 00000000..bd247888 --- /dev/null +++ b/modules/nf-core/modules/gatk4/createsequencedictionary/meta.yml @@ -0,0 +1,32 @@ +name: gatk4_createsequencedictionary +description: Creates a sequence dictionary for a reference sequence +keywords: + - dictionary + - fasta +tools: + - gatk: + description: | + Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools + with a primary focus on variant discovery and genotyping. Its powerful processing engine + and high-performance computing features make it capable of taking on projects of any size. 
+ homepage: https://gatk.broadinstitute.org/hc/en-us + documentation: https://gatk.broadinstitute.org/hc/en-us/categories/360002369672s + doi: 10.1158/1538-7445.AM2017-3590 + licence: ["Apache-2.0"] + +input: + - fasta: + type: file + description: Input fasta file + pattern: "*.{fasta,fa}" +output: + - dict: + type: file + description: gatk dictionary file + pattern: "*.{dict}" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@maxulysse" diff --git a/modules/nf-core/modules/gatk4/genotypegvcfs/main.nf b/modules/nf-core/modules/gatk4/genotypegvcfs/main.nf new file mode 100644 index 00000000..11024b1b --- /dev/null +++ b/modules/nf-core/modules/gatk4/genotypegvcfs/main.nf @@ -0,0 +1,54 @@ +process GATK4_GENOTYPEGVCFS { + tag "$meta.id" + label 'process_high' + + conda (params.enable_conda ? "bioconda::gatk4=4.2.6.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/gatk4:4.2.6.1--hdfd78af_0': + 'quay.io/biocontainers/gatk4:4.2.6.1--hdfd78af_0' }" + + input: + tuple val(meta), path(gvcf), path(gvcf_index), path(intervals), path(intervals_index) + path fasta + path fai + path dict + path dbsnp + path dbsnp_tbi + + output: + tuple val(meta), path("*.vcf.gz"), emit: vcf + tuple val(meta), path("*.tbi") , emit: tbi + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def gvcf_command = gvcf.name.endsWith(".vcf") || gvcf.name.endsWith(".vcf.gz") ? "$gvcf" : "gendb://$gvcf" + def dbsnp_command = dbsnp ? "--dbsnp $dbsnp" : "" + def interval_command = intervals ? "--intervals $intervals" : "" + + def avail_mem = 3 + if (!task.memory) { + log.info '[GATK GenotypeGVCFs] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.' + } else { + avail_mem = task.memory.giga + } + """ + gatk --java-options "-Xmx${avail_mem}g" GenotypeGVCFs \\ + --variant $gvcf_command \\ + --output ${prefix}.vcf.gz \\ + --reference $fasta \\ + $interval_command \\ + $dbsnp_command \\ + --tmp-dir . \\ + $args + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/gatk4/genotypegvcfs/meta.yml b/modules/nf-core/modules/gatk4/genotypegvcfs/meta.yml new file mode 100644 index 00000000..7bec10ed --- /dev/null +++ b/modules/nf-core/modules/gatk4/genotypegvcfs/meta.yml @@ -0,0 +1,81 @@ +name: gatk4_genotypegvcfs +description: | + Perform joint genotyping on one or more samples pre-called with HaplotypeCaller. +keywords: + - joint genotyping + - genotype + - gvcf +tools: + - gatk4: + description: Genome Analysis Toolkit (GATK4) + homepage: https://gatk.broadinstitute.org/hc/en-us + documentation: https://gatk.broadinstitute.org/hc/en-us/categories/360002369672s + tool_dev_url: https://github.com/broadinstitute/gatk + doi: "10.1158/1538-7445.AM2017-3590" + licence: ["BSD-3-clause"] + +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test', single_end:false ]
+  - gvcf:
+      type: file
+      description: |
+        gVCF(.gz) file or a path to a GenomicsDB workspace
+      pattern: "*.{vcf,vcf.gz}"
+  - gvcf_index:
+      type: file
+      description: |
+        index of gvcf file, or empty when providing GenomicsDB
+      pattern: "*.{idx,tbi}"
+  - intervals:
+      type: file
+      description: Interval file with the genomic regions included in the library (optional)
+  - intervals_index:
+      type: file
+      description: Interval index file (optional)
+  - fasta:
+      type: file
+      description: Reference fasta file
+      pattern: "*.fasta"
+  - fai:
+      type: file
+      description: Reference fasta index file
+      pattern: "*.fai"
+  - dict:
+      type: file
+      description: Reference fasta sequence dict file
+      pattern: "*.dict"
+  - dbsnp:
+      type: file
+      description: dbSNP VCF file
+      pattern: "*.vcf.gz"
+  - dbsnp_tbi:
+      type: file
+      description: dbSNP VCF index file
+      pattern: "*.tbi"
+
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - vcf:
+      type: file
+      description: Genotyped VCF file
+      pattern: "*.vcf.gz"
+  - tbi:
+      type: file
+      description: Tbi index for VCF file
+      pattern: "*.tbi"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+
+authors:
+  - "@santiagorevale"
+  - "@maxulysse"
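The `gendb://` detection in the GenotypeGVCFs script means a GenomicsDB workspace can be substituted for the gVCF without any change to the module: anything not ending in `.vcf`/`.vcf.gz` is treated as a workspace. A sketch, with hypothetical file and channel names, where the unused index slots are passed as empty lists:

```groovy
// Illustrative only: workspace dir, bed file and reference channels are assumptions
ch_gendb = Channel.of([ [ id:'joint' ], file('genomicsdb_workspace'), [], file('regions.bed'), [] ])
GATK4_GENOTYPEGVCFS(ch_gendb, ch_fasta, ch_fai, ch_dict, ch_dbsnp, ch_dbsnp_tbi)
```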
diff --git a/modules/nf-core/modules/gatk4/haplotypecaller/main.nf b/modules/nf-core/modules/gatk4/haplotypecaller/main.nf
new file mode 100644
index 00000000..19cd57bb
--- /dev/null
+++ b/modules/nf-core/modules/gatk4/haplotypecaller/main.nf
@@ -0,0 +1,55 @@
+process GATK4_HAPLOTYPECALLER {
+    tag "$meta.id"
+    label 'process_medium'
+
+    conda (params.enable_conda ? "bioconda::gatk4=4.2.6.1" : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/gatk4:4.2.6.1--hdfd78af_0':
+        'quay.io/biocontainers/gatk4:4.2.6.1--hdfd78af_0' }"
+
+    input:
+    tuple val(meta), path(input), path(input_index), path(intervals), path(dragstr_model)
+    path fasta
+    path fai
+    path dict
+    path dbsnp
+    path dbsnp_tbi
+
+    output:
+    tuple val(meta), path("*.vcf.gz"), emit: vcf
+    tuple val(meta), path("*.tbi")   , optional:true, emit: tbi
+    path "versions.yml"              , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    def dbsnp_command = dbsnp ? "--dbsnp $dbsnp" : ""
+    def interval_command = intervals ? "--intervals $intervals" : ""
+    def dragstr_command = dragstr_model ? "--dragstr-params-path $dragstr_model" : ""
+
+    def avail_mem = 3
+    if (!task.memory) {
+        log.info '[GATK HaplotypeCaller] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.'
+    } else {
+        avail_mem = task.memory.giga
+    }
+    """
+    gatk --java-options "-Xmx${avail_mem}g" HaplotypeCaller \\
+        --input $input \\
+        --output ${prefix}.vcf.gz \\
+        --reference $fasta \\
+        $dbsnp_command \\
+        $interval_command \\
+        $dragstr_command \\
+        --tmp-dir . \\
+        $args

+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//')
+    END_VERSIONS
+    """
+}
diff --git a/modules/nf-core/modules/gatk4/haplotypecaller/meta.yml b/modules/nf-core/modules/gatk4/haplotypecaller/meta.yml
new file mode 100644
index 00000000..48193d91
--- /dev/null
+++ b/modules/nf-core/modules/gatk4/haplotypecaller/meta.yml
@@ -0,0 +1,79 @@
+name: gatk4_haplotypecaller
+description: Call germline SNPs and indels via local re-assembly of haplotypes
+keywords:
+  - gatk4
+  - haplotypecaller
+  - haplotype
+tools:
+  - gatk4:
+      description: |
+        Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools
+        with a primary focus on variant discovery and genotyping. Its powerful processing engine
+        and high-performance computing features make it capable of taking on projects of any size.
+      homepage: https://gatk.broadinstitute.org/hc/en-us
+      documentation: https://gatk.broadinstitute.org/hc/en-us/categories/360002369672
+      doi: 10.1158/1538-7445.AM2017-3590
+      licence: ["Apache-2.0"]
+
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - input:
+      type: file
+      description: BAM/CRAM file from alignment
+      pattern: "*.{bam,cram}"
+  - input_index:
+      type: file
+      description: BAI/CRAI file from alignment
+      pattern: "*.{bai,crai}"
+  - intervals:
+      type: file
+      description: Bed file with the genomic regions included in the library (optional)
+  - dragstr_model:
+      type: file
+      description: Text file containing the DragSTR model of the used BAM/CRAM file (optional)
+      pattern: "*.txt"
+  - fasta:
+      type: file
+      description: The reference fasta file
+      pattern: "*.fasta"
+  - fai:
+      type: file
+      description: Index of reference fasta file
+      pattern: "*.fasta.fai"
+  - dict:
+      type: file
+      description: GATK sequence dictionary
+      pattern: "*.dict"
+  - dbsnp:
+      type: file
+      description: VCF file containing known sites (optional)
+  - dbsnp_tbi:
+      type: file
+      description: VCF index of dbsnp (optional)
+
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+  - vcf:
+      type: file
+      description: Compressed VCF file
+      pattern: "*.vcf.gz"
+  - tbi:
+      type: file
+      description: Index of VCF file
+      pattern: "*.vcf.gz.tbi"
+
+authors:
+  - "@suzannejin"
+  - "@FriederikeHanssen"
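ReblockGVCF and CombineGVCFs downstream both expect gVCF input, so HaplotypeCaller is typically switched to gVCF mode through configuration rather than in the module itself; a DRAGstr model, when present in the input tuple, is picked up automatically. An illustrative `conf/modules.config` entry (a sketch, not part of this patch):

```groovy
process {
    withName: 'GATK4_HAPLOTYPECALLER' {
        ext.args = '-ERC GVCF'  // emit a gVCF so it can be reblocked/combined downstream
    }
}
```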
+        'https://depot.galaxyproject.org/singularity/gatk4:4.2.6.1--hdfd78af_0':
+        'quay.io/biocontainers/gatk4:4.2.6.1--hdfd78af_0' }"
+
+    input:
+    tuple val(meta), path(gvcf), path(tbi), path(intervals)
+    path fasta
+    path fai
+    path dict
+    path dbsnp
+    path dbsnp_tbi
+
+    output:
+    tuple val(meta), path("*.rb.g.vcf.gz"), path("*.tbi") , emit: vcf
+    path "versions.yml"                                   , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    def dbsnp_command = dbsnp ? "--dbsnp $dbsnp" : ""
+    def interval_command = intervals ? "--intervals $intervals" : ""
+
+    def avail_mem = 3
+    if (!task.memory) {
+        log.info '[GATK ReblockGVCF] Available memory not known - defaulting to 3GB. Specify process memory requirements to change this.'
+    } else {
+        avail_mem = task.memory.giga
+    }
+    """
+    gatk --java-options "-Xmx${avail_mem}g" ReblockGVCF \\
+        --variant $gvcf \\
+        --output ${prefix}.rb.g.vcf.gz \\
+        --reference $fasta \\
+        $dbsnp_command \\
+        $interval_command \\
+        --tmp-dir . \\
+        $args

+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        gatk4: \$(echo \$(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*\$//')
+    END_VERSIONS
+    """
+}
diff --git a/modules/nf-core/modules/gatk4/reblockgvcf/meta.yml b/modules/nf-core/modules/gatk4/reblockgvcf/meta.yml
new file mode 100644
index 00000000..23518416
--- /dev/null
+++ b/modules/nf-core/modules/gatk4/reblockgvcf/meta.yml
@@ -0,0 +1,74 @@
+name: "gatk4_reblockgvcf"
+description: Condenses homRef blocks in a single-sample GVCF
+keywords:
+  - gatk4
+  - reblockgvcf
+  - gvcf
+tools:
+  - gatk4:
+      description: |
+        Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools
+        with a primary focus on variant discovery and genotyping. Its powerful processing engine
+        and high-performance computing features make it capable of taking on projects of any size.
+      homepage: https://gatk.broadinstitute.org/hc/en-us
+      documentation: https://gatk.broadinstitute.org/hc/en-us/categories/360002369672
+      doi: 10.1158/1538-7445.AM2017-3590
+      licence: ["Apache-2.0"]
+
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - gvcf:
+      type: file
+      description: GVCF file created using HaplotypeCaller using the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode
+      pattern: "*.{vcf,gvcf}.gz"
+  - tbi:
+      type: file
+      description: Index of the GVCF file
+      pattern: "*.tbi"
+  - intervals:
+      type: file
+      description: Bed file with the genomic regions included in the library (optional)
+  - fasta:
+      type: file
+      description: The reference fasta file
+      pattern: "*.fasta"
+  - fai:
+      type: file
+      description: Index of reference fasta file
+      pattern: "*.fasta.fai"
+  - dict:
+      type: file
+      description: GATK sequence dictionary
+      pattern: "*.dict"
+  - dbsnp:
+      type: file
+      description: VCF file containing known sites (optional)
+  - dbsnp_tbi:
+      type: file
+      description: VCF index of dbsnp (optional)
+
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g.
[ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - gvcf: + type: file + description: Filtered GVCF + pattern: "*rb.g.vcf.gz" + - tbi: + type: file + description: Index of the filtered GVCF + pattern: "*rb.g.vcf.gz.tbi" + +authors: + - "@nvnieuwk" diff --git a/modules/nf-core/modules/multiqc/main.nf b/modules/nf-core/modules/multiqc/main.nf index 1264aac1..a8159a57 100644 --- a/modules/nf-core/modules/multiqc/main.nf +++ b/modules/nf-core/modules/multiqc/main.nf @@ -1,13 +1,16 @@ process MULTIQC { - label 'process_medium' + label 'process_single' - conda (params.enable_conda ? 'bioconda::multiqc=1.12' : null) + conda (params.enable_conda ? 'bioconda::multiqc=1.13' : null) container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? - 'https://depot.galaxyproject.org/singularity/multiqc:1.12--pyhdfd78af_0' : - 'quay.io/biocontainers/multiqc:1.12--pyhdfd78af_0' }" + 'https://depot.galaxyproject.org/singularity/multiqc:1.13--pyhdfd78af_0' : + 'quay.io/biocontainers/multiqc:1.13--pyhdfd78af_0' }" input: - path multiqc_files + path multiqc_files, stageAs: "?/*" + path(multiqc_config) + path(extra_multiqc_config) + path(multiqc_logo) output: path "*multiqc_report.html", emit: report @@ -20,8 +23,27 @@ process MULTIQC { script: def args = task.ext.args ?: '' + def config = multiqc_config ? "--config $multiqc_config" : '' + def extra_config = extra_multiqc_config ? "--config $extra_multiqc_config" : '' """ - multiqc -f $args . + multiqc \\ + --force \\ + $args \\ + $config \\ + $extra_config \\ + . + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" ) + END_VERSIONS + """ + + stub: + """ + touch multiqc_data + touch multiqc_plots + touch multiqc_report.html cat <<-END_VERSIONS > versions.yml "${task.process}": diff --git a/modules/nf-core/modules/multiqc/meta.yml b/modules/nf-core/modules/multiqc/meta.yml index 6fa891ef..ebc29b27 100644 --- a/modules/nf-core/modules/multiqc/meta.yml +++ b/modules/nf-core/modules/multiqc/meta.yml @@ -12,11 +12,25 @@ tools: homepage: https://multiqc.info/ documentation: https://multiqc.info/docs/ licence: ["GPL-3.0-or-later"] + input: - multiqc_files: type: file description: | List of reports / files recognised by MultiQC, for example the html and zip output of FastQC + - multiqc_config: + type: file + description: Optional config yml for MultiQC + pattern: "*.{yml,yaml}" + - extra_multiqc_config: + type: file + description: Second optional config yml for MultiQC. Will override common sections in multiqc_config. + pattern: "*.{yml,yaml}" + - multiqc_logo: + type: file + description: Optional logo file for MultiQC + pattern: "*.{png}" + output: - report: type: file @@ -38,3 +52,4 @@ authors: - "@abhi18av" - "@bunop" - "@drpatelh" + - "@jfy133" diff --git a/modules/nf-core/modules/samtools/faidx/main.nf b/modules/nf-core/modules/samtools/faidx/main.nf new file mode 100644 index 00000000..ef940db2 --- /dev/null +++ b/modules/nf-core/modules/samtools/faidx/main.nf @@ -0,0 +1,44 @@ +process SAMTOOLS_FAIDX { + tag "$fasta" + label 'process_single' + + conda (params.enable_conda ? "bioconda::samtools=1.15.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/samtools:1.15.1--h1170115_0' : + 'quay.io/biocontainers/samtools:1.15.1--h1170115_0' }" + + input: + tuple val(meta), path(fasta) + + output: + tuple val(meta), path ("*.fai"), emit: fai + tuple val(meta), path ("*.gzi"), emit: gzi, optional: true + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + """ + samtools \\ + faidx \\ + $args \\ + $fasta + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//') + END_VERSIONS + """ + + stub: + """ + touch ${fasta}.fai + cat <<-END_VERSIONS > versions.yml + + "${task.process}": + samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/samtools/faidx/meta.yml b/modules/nf-core/modules/samtools/faidx/meta.yml new file mode 100644 index 00000000..fe2fe9a1 --- /dev/null +++ b/modules/nf-core/modules/samtools/faidx/meta.yml @@ -0,0 +1,47 @@ +name: samtools_faidx +description: Index FASTA file +keywords: + - index + - fasta +tools: + - samtools: + description: | + SAMtools is a set of utilities for interacting with and post-processing + short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li. + These files are generated as output by short read aligners like BWA. + homepage: http://www.htslib.org/ + documentation: http://www.htslib.org/doc/samtools.html + doi: 10.1093/bioinformatics/btp352 + licence: ["MIT"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - fasta: + type: file + description: FASTA file + pattern: "*.{fa,fasta}" +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - fai: + type: file + description: FASTA index file + pattern: "*.{fai}" + - gzi: + type: file + description: Optional gzip index file for compressed inputs + pattern: "*.gzi" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@drpatelh" + - "@ewels" + - "@phue" diff --git a/modules/nf-core/modules/samtools/index/main.nf b/modules/nf-core/modules/samtools/index/main.nf new file mode 100644 index 00000000..e04e63e8 --- /dev/null +++ b/modules/nf-core/modules/samtools/index/main.nf @@ -0,0 +1,48 @@ +process SAMTOOLS_INDEX { + tag "$meta.id" + label 'process_low' + + conda (params.enable_conda ? "bioconda::samtools=1.15.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+        'https://depot.galaxyproject.org/singularity/samtools:1.15.1--h1170115_0' :
+        'quay.io/biocontainers/samtools:1.15.1--h1170115_0' }"
+
+    input:
+    tuple val(meta), path(input)
+
+    output:
+    tuple val(meta), path("*.bai") , optional:true, emit: bai
+    tuple val(meta), path("*.csi") , optional:true, emit: csi
+    tuple val(meta), path("*.crai"), optional:true, emit: crai
+    path "versions.yml"            , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    """
+    samtools \\
+        index \\
+        -@ ${task.cpus-1} \\
+        $args \\
+        $input
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
+    END_VERSIONS
+    """
+
+    stub:
+    """
+    touch ${input}.bai
+    touch ${input}.crai
+    touch ${input}.csi
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
+    END_VERSIONS
+    """
+}
diff --git a/modules/nf-core/modules/samtools/index/meta.yml b/modules/nf-core/modules/samtools/index/meta.yml
new file mode 100644
index 00000000..e5cadbc2
--- /dev/null
+++ b/modules/nf-core/modules/samtools/index/meta.yml
@@ -0,0 +1,53 @@
+name: samtools_index
+description: Index SAM/BAM/CRAM file
+keywords:
+  - index
+  - bam
+  - sam
+  - cram
+tools:
+  - samtools:
+      description: |
+        SAMtools is a set of utilities for interacting with and post-processing
+        short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li.
+        These files are generated as output by short read aligners like BWA.
+      homepage: http://www.htslib.org/
+      documentation: http://www.htslib.org/doc/samtools.html
+      doi: 10.1093/bioinformatics/btp352
+      licence: ["MIT"]
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - bam:
+      type: file
+      description: BAM/CRAM/SAM file
+      pattern: "*.{bam,cram,sam}"
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - bai:
+      type: file
+      description: BAM/CRAM/SAM index file
+      pattern: "*.{bai,crai,sai}"
+  - crai:
+      type: file
+      description: BAM/CRAM/SAM index file
+      pattern: "*.{bai,crai,sai}"
+  - csi:
+      type: file
+      description: CSI index file
+      pattern: "*.{csi}"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+authors:
+  - "@drpatelh"
+  - "@ewels"
+  - "@maxulysse"
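Only one of the three optional indices is produced per run: `.csi` when `-c` is passed via `ext.args`, `.crai` for CRAM input, `.bai` otherwise. Downstream wiring therefore usually mixes the channels before joining back onto the alignments. A hypothetical sketch (channel names are assumptions):

```groovy
// Illustrative: reunite each alignment with whichever index was produced
SAMTOOLS_INDEX(ch_bam)
ch_bam_indexed = ch_bam.join(
    SAMTOOLS_INDEX.out.bai
        .mix(SAMTOOLS_INDEX.out.csi)
        .mix(SAMTOOLS_INDEX.out.crai)
)
```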
diff --git a/modules/nf-core/modules/tabix/bgzip/main.nf b/modules/nf-core/modules/tabix/bgzip/main.nf
new file mode 100644
index 00000000..aaef7859
--- /dev/null
+++ b/modules/nf-core/modules/tabix/bgzip/main.nf
@@ -0,0 +1,40 @@
+process TABIX_BGZIP {
+    tag "$meta.id"
+    label 'process_single'
+
+    conda (params.enable_conda ? 'bioconda::tabix=1.11' : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/tabix:1.11--hdfd78af_0' :
+        'quay.io/biocontainers/tabix:1.11--hdfd78af_0' }"
+
+    input:
+    tuple val(meta), path(input)
+
+    output:
+    tuple val(meta), path("${output}")    , emit: output
+    tuple val(meta), path("${output}.gzi"), emit: gzi, optional: true
+    path "versions.yml"                   , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    prefix   = task.ext.prefix ?: "${meta.id}"
+    in_bgzip = ["gz", "bgz", "bgzf"].contains(input.getExtension())
+    output   = in_bgzip ? input.getBaseName() : "${prefix}.${input.getExtension()}.gz"
+    command1 = in_bgzip ? '-d' : '-c'
+    command2 = in_bgzip ? '' : " > ${output}"
+    // Name the index according to $prefix, unless a name has been requested
+    if ((args.matches("(^| )-i\\b") || args.matches("(^| )--index(\$| )")) && !args.matches("(^| )-I\\b") && !args.matches("(^| )--index-name\\b")) {
+        args = args + " -I ${output}.gzi"
+    }
+    """
+    bgzip $command1 $args -@${task.cpus} $input $command2
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        tabix: \$(echo \$(tabix -h 2>&1) | sed 's/^.*Version: //; s/ .*\$//')
+    END_VERSIONS
+    """
+}
diff --git a/modules/nf-core/modules/tabix/bgzip/meta.yml b/modules/nf-core/modules/tabix/bgzip/meta.yml
new file mode 100644
index 00000000..72f0abcd
--- /dev/null
+++ b/modules/nf-core/modules/tabix/bgzip/meta.yml
@@ -0,0 +1,45 @@
+name: tabix_bgzip
+description: Compresses/decompresses files
+keywords:
+  - compress
+  - decompress
+  - bgzip
+  - tabix
+tools:
+  - bgzip:
+      description: |
+        Bgzip compresses or decompresses files in a similar manner to, and compatible with, gzip.
+      homepage: https://www.htslib.org/doc/tabix.html
+      documentation: http://www.htslib.org/doc/bgzip.html
+      doi: 10.1093/bioinformatics/btp352
+      licence: ["MIT"]
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - input:
+      type: file
+      description: file to compress or to decompress
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - output:
+      type: file
+      description: Output compressed/decompressed file
+  - gzi:
+      type: file
+      description: Optional gzip index file for compressed inputs
+      pattern: "*.gzi"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+authors:
+  - "@joseespinosa"
+  - "@drpatelh"
+  - "@maxulysse"
diff --git a/modules/nf-core/modules/tabix/bgziptabix/main.nf b/modules/nf-core/modules/tabix/bgziptabix/main.nf
new file mode 100644
index 00000000..0d05984a
--- /dev/null
+++ b/modules/nf-core/modules/tabix/bgziptabix/main.nf
@@ -0,0 +1,45 @@
+process TABIX_BGZIPTABIX {
+    tag "$meta.id"
+    label 'process_single'
+
+    conda (params.enable_conda ? 'bioconda::tabix=1.11' : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+ 'https://depot.galaxyproject.org/singularity/tabix:1.11--hdfd78af_0' : + 'quay.io/biocontainers/tabix:1.11--hdfd78af_0' }" + + input: + tuple val(meta), path(input) + + output: + tuple val(meta), path("*.gz"), path("*.tbi"), emit: gz_tbi + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def args2 = task.ext.args2 ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + """ + bgzip --threads ${task.cpus} -c $args $input > ${prefix}.${input.getExtension()}.gz + tabix $args2 ${prefix}.${input.getExtension()}.gz + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + tabix: \$(echo \$(tabix -h 2>&1) | sed 's/^.*Version: //; s/ .*\$//') + END_VERSIONS + """ + + stub: + def prefix = task.ext.prefix ?: "${meta.id}" + """ + touch ${prefix}.gz + touch ${prefix}.gz.tbi + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + tabix: \$(echo \$(tabix -h 2>&1) | sed 's/^.*Version: //; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/tabix/bgziptabix/meta.yml b/modules/nf-core/modules/tabix/bgziptabix/meta.yml new file mode 100644 index 00000000..49c03289 --- /dev/null +++ b/modules/nf-core/modules/tabix/bgziptabix/meta.yml @@ -0,0 +1,45 @@ +name: tabix_bgziptabix +description: bgzip a sorted tab-delimited genome file and then create tabix index +keywords: + - bgzip + - compress + - index + - tabix + - vcf +tools: + - tabix: + description: Generic indexer for TAB-delimited genome position files. + homepage: https://www.htslib.org/doc/tabix.html + documentation: https://www.htslib.org/doc/tabix.1.html + doi: 10.1093/bioinformatics/btq671 + licence: ["MIT"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - tab: + type: file + description: TAB-delimited genome position file + pattern: "*.{bed,gff,sam,vcf}" +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - gz: + type: file + description: Output compressed file + pattern: "*.{gz}" + - tbi: + type: file + description: tabix index file + pattern: "*.{gz.tbi}" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@maxulysse" diff --git a/modules/nf-core/modules/tabix/tabix/main.nf b/modules/nf-core/modules/tabix/tabix/main.nf new file mode 100644 index 00000000..21b2e79f --- /dev/null +++ b/modules/nf-core/modules/tabix/tabix/main.nf @@ -0,0 +1,42 @@ +process TABIX_TABIX { + tag "$meta.id" + label 'process_single' + + conda (params.enable_conda ? 'bioconda::tabix=1.11' : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/tabix:1.11--hdfd78af_0' : + 'quay.io/biocontainers/tabix:1.11--hdfd78af_0' }" + + input: + tuple val(meta), path(tab) + + output: + tuple val(meta), path("*.tbi"), optional:true, emit: tbi + tuple val(meta), path("*.csi"), optional:true, emit: csi + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + """ + tabix $args $tab + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + tabix: \$(echo \$(tabix -h 2>&1) | sed 's/^.*Version: //; s/ .*\$//') + END_VERSIONS + """ + + stub: + def prefix = task.ext.prefix ?: "${meta.id}" + """ + touch ${tab}.tbi + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + tabix: \$(echo \$(tabix -h 2>&1) | sed 's/^.*Version: //; s/ .*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/tabix/tabix/meta.yml b/modules/nf-core/modules/tabix/tabix/meta.yml new file mode 100644 index 00000000..fcc6e524 --- /dev/null +++ b/modules/nf-core/modules/tabix/tabix/meta.yml @@ -0,0 +1,45 @@ +name: tabix_tabix +description: create tabix index from a sorted bgzip tab-delimited genome file +keywords: + - index + - tabix + - vcf +tools: + - tabix: + description: Generic indexer for TAB-delimited genome position files. + homepage: https://www.htslib.org/doc/tabix.html + documentation: https://www.htslib.org/doc/tabix.1.html + doi: 10.1093/bioinformatics/btq671 + licence: ["MIT"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - tab: + type: file + description: TAB-delimited genome position file compressed with bgzip + pattern: "*.{bed.gz,gff.gz,sam.gz,vcf.gz}" +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - tbi: + type: file + description: tabix index file + pattern: "*.{tbi}" + - csi: + type: file + description: coordinate sorted index file + pattern: "*.{csi}" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@joseespinosa" + - "@drpatelh" + - "@maxulysse" diff --git a/modules/nf-core/modules/untar/main.nf b/modules/nf-core/modules/untar/main.nf new file mode 100644 index 00000000..71eea7b2 --- /dev/null +++ b/modules/nf-core/modules/untar/main.nf @@ -0,0 +1,64 @@ +process UNTAR { + tag "$archive" + label 'process_single' + + conda (params.enable_conda ? "conda-forge::sed=4.7" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/ubuntu:20.04' : + 'ubuntu:20.04' }" + + input: + tuple val(meta), path(archive) + + output: + tuple val(meta), path("$untar"), emit: untar + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def args2 = task.ext.args2 ?: '' + untar = archive.toString() - '.tar.gz' + + """ + mkdir output + + ## Ensures --strip-components only applied when top level of tar contents is a directory + ## If just files or multiple directories, place all in output + if [[ \$(tar -tzf ${archive} | grep -o -P "^.*?\\/" | uniq | wc -l) -eq 1 ]]; then + tar \\ + -C output --strip-components 1 \\ + -xzvf \\ + $args \\ + $archive \\ + $args2 + else + tar \\ + -C output \\ + -xzvf \\ + $args \\ + $archive \\ + $args2 + fi + + mv output ${untar} + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + untar: \$(echo \$(tar --version 2>&1) | sed 's/^.*(GNU tar) //; s/ Copyright.*\$//') + END_VERSIONS + """ + + stub: + untar = archive.toString() - '.tar.gz' + """ + touch $untar + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + untar: \$(echo \$(tar --version 2>&1) | sed 's/^.*(GNU tar) //; s/ Copyright.*\$//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/untar/meta.yml b/modules/nf-core/modules/untar/meta.yml new file mode 100644 index 00000000..ea7a3f38 --- /dev/null +++ b/modules/nf-core/modules/untar/meta.yml @@ -0,0 +1,40 @@ +name: untar +description: Extract files. +keywords: + - untar + - uncompress +tools: + - untar: + description: | + Extract tar.gz files. + documentation: https://www.gnu.org/software/tar/manual/ + licence: ["GPL-3.0-or-later"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - archive: + type: file + description: File to be untar + pattern: "*.{tar}.{gz}" +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - untar: + type: directory + description: Directory containing contents of archive + pattern: "*/" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@joseespinosa" + - "@drpatelh" + - "@matthdsm" + - "@jfy133" diff --git a/modules/nf-core/modules/vcf2db/main.nf b/modules/nf-core/modules/vcf2db/main.nf new file mode 100644 index 00000000..a8d52fc6 --- /dev/null +++ b/modules/nf-core/modules/vcf2db/main.nf @@ -0,0 +1,37 @@ +process VCF2DB { + tag "$meta.id" + label 'process_medium' + + // WARN: Version information not provided by tool on CLI. Please update version string below when bumping container versions. + conda (params.enable_conda ? "bioconda::vcf2db=2020.02.24" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/vcf2db:2020.02.24--hdfd78af_1': + 'quay.io/biocontainers/vcf2db:2020.02.24--hdfd78af_1' }" + + input: + tuple val(meta), path(vcf), path(ped) + + output: + tuple val(meta), path("*.db") , emit: db + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def VERSION = "2020.02.24" // WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions. 
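+        // Editorial note, a hypothetical illustration rather than part of the upstream module:
+        // with an input VCF named fam01.vcf.gz, a pedigree file fam01.ped, meta.id == 'fam01'
+        // and empty ext.args, the script below renders to: vcf2db.py fam01.vcf.gz fam01.ped fam01.db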
+ """ + vcf2db.py \\ + $vcf \\ + $ped \\ + ${prefix}.db \\ + $args + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + vcf2db: $VERSION + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/vcf2db/meta.yml b/modules/nf-core/modules/vcf2db/meta.yml new file mode 100644 index 00000000..65e41113 --- /dev/null +++ b/modules/nf-core/modules/vcf2db/meta.yml @@ -0,0 +1,47 @@ +name: "vcf2db" +description: A tool to create a Gemini-compatible DB file from an annotated VCF +keywords: + - vcf2db + - vcf + - gemini +tools: + - "vcf2db": + description: "Create a gemini-compatible database from a VCF" + homepage: "https://github.com/quinlan-lab/vcf2db" + documentation: "https://github.com/quinlan-lab/vcf2db" + tool_dev_url: "https://github.com/quinlan-lab/vcf2db" + doi: "" + licence: "['MIT']" + +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - vcf: + type: file + description: VCF file + pattern: "*.vcf.gz" + - ped: + type: file + description: PED file + pattern: "*.ped" + +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - db: + type: file + description: Gemini-compatible database file + pattern: "*.db" + +authors: + - "@nvnieuwk" diff --git a/modules/nf-core/modules/vcfanno/main.nf b/modules/nf-core/modules/vcfanno/main.nf new file mode 100644 index 00000000..6f264af2 --- /dev/null +++ b/modules/nf-core/modules/vcfanno/main.nf @@ -0,0 +1,51 @@ +process VCFANNO { + tag "$meta.id" + label 'process_low' + + conda (params.enable_conda ? "bioconda::vcfanno=0.3.3" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/vcfanno:0.3.3--h9ee0642_0': + 'quay.io/biocontainers/vcfanno:0.3.3--h9ee0642_0' }" + + input: + tuple val(meta), path(vcf), path(tbi) + path toml + path resource_dir + + output: + tuple val(meta), path("*_annotated.vcf"), emit: vcf + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + """ + ln -sf $resource_dir/* \$(pwd) + + vcfanno \\ + -p $task.cpus \\ + $args \\ + $toml \\ + $vcf \\ + > ${prefix}_annotated.vcf + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + vcfanno: \$(echo \$(vcfanno 2>&1 | grep version | cut -f3 -d' ' )) + END_VERSIONS + """ + + stub: + def prefix = task.ext.prefix ?: "${meta.id}" + """ + touch ${prefix}_annotated.vcf + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + vcfanno: \$(echo \$(vcfanno 2>&1 | grep version | cut -f3 -d' ' )) + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/vcfanno/meta.yml b/modules/nf-core/modules/vcfanno/meta.yml new file mode 100644 index 00000000..1c6893ea --- /dev/null +++ b/modules/nf-core/modules/vcfanno/meta.yml @@ -0,0 +1,60 @@ +name: vcfanno +description: quickly annotate your VCF with any number of INFO fields from any number of VCFs or BED files +keywords: + - vcf + - bed + - annotate + - variant + - lua + - toml +tools: + - vcfanno: + description: annotate a VCF with other VCFs/BEDs/tabixed files + homepage: None + documentation: https://github.com/brentp/vcfanno#vcfanno + tool_dev_url: https://github.com/brentp/vcfanno + doi: "10.1186/s13059-016-0973-5" + licence: ["MIT"] + +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - vcf: + type: file + description: query VCF file + pattern: "*.{vcf, vcf.gz}" + - vcf_tabix: + type: file + description: tabix index of query VCF - only needed if vcf is compressed + pattern: "*.vcf.gz.tbi" + - toml: + type: file + description: configuration file + pattern: "*.toml" + - resource_dir: + type: file + description: | + This directory contains referenced files in the TOML config, + and the corresponding indices e.g. exac.vcf.gz + exac.vcf.gz.tbi, + with the exception of the lua file. + +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - vcf: + type: file + description: Annotated VCF file + pattern: "*.vcf" + +authors: + - "@projectoriented" diff --git a/modules/nf-core/modules/vcftools/main.nf b/modules/nf-core/modules/vcftools/main.nf new file mode 100644 index 00000000..feefb0e3 --- /dev/null +++ b/modules/nf-core/modules/vcftools/main.nf @@ -0,0 +1,123 @@ +process VCFTOOLS { + tag "$meta.id" + label 'process_single' + + conda (params.enable_conda ? "bioconda::vcftools=0.1.16" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/vcftools:0.1.16--he513fc3_4' : + 'quay.io/biocontainers/vcftools:0.1.16--he513fc3_4' }" + + input: + // Owing to the nature of vcftools we here provide solutions to working with optional bed files and optional + // alternative variant files, for use with the 'diff' suite of tools. 
+ // Other optional input files can be utilised in a similar way to below but we do not exhaustively iterate through all + // possible options. Instead we leave that to the user. + tuple val(meta), path(variant_file) + path bed + path diff_variant_file + + output: + tuple val(meta), path("*.vcf") , optional:true, emit: vcf + tuple val(meta), path("*.bcf") , optional:true, emit: bcf + tuple val(meta), path("*.frq") , optional:true, emit: frq + tuple val(meta), path("*.frq.count") , optional:true, emit: frq_count + tuple val(meta), path("*.idepth") , optional:true, emit: idepth + tuple val(meta), path("*.ldepth") , optional:true, emit: ldepth + tuple val(meta), path("*.ldepth.mean") , optional:true, emit: ldepth_mean + tuple val(meta), path("*.gdepth") , optional:true, emit: gdepth + tuple val(meta), path("*.hap.ld") , optional:true, emit: hap_ld + tuple val(meta), path("*.geno.ld") , optional:true, emit: geno_ld + tuple val(meta), path("*.geno.chisq") , optional:true, emit: geno_chisq + tuple val(meta), path("*.list.hap.ld") , optional:true, emit: list_hap_ld + tuple val(meta), path("*.list.geno.ld") , optional:true, emit: list_geno_ld + tuple val(meta), path("*.interchrom.hap.ld") , optional:true, emit: interchrom_hap_ld + tuple val(meta), path("*.interchrom.geno.ld") , optional:true, emit: interchrom_geno_ld + tuple val(meta), path("*.TsTv") , optional:true, emit: tstv + tuple val(meta), path("*.TsTv.summary") , optional:true, emit: tstv_summary + tuple val(meta), path("*.TsTv.count") , optional:true, emit: tstv_count + tuple val(meta), path("*.TsTv.qual") , optional:true, emit: tstv_qual + tuple val(meta), path("*.FILTER.summary") , optional:true, emit: filter_summary + tuple val(meta), path("*.sites.pi") , optional:true, emit: sites_pi + tuple val(meta), path("*.windowed.pi") , optional:true, emit: windowed_pi + tuple val(meta), path("*.weir.fst") , optional:true, emit: weir_fst + tuple val(meta), path("*.het") , optional:true, emit: heterozygosity + tuple val(meta), path("*.hwe") , optional:true, emit: hwe + tuple val(meta), path("*.Tajima.D") , optional:true, emit: tajima_d + tuple val(meta), path("*.ifreqburden") , optional:true, emit: freq_burden + tuple val(meta), path("*.LROH") , optional:true, emit: lroh + tuple val(meta), path("*.relatedness") , optional:true, emit: relatedness + tuple val(meta), path("*.relatedness2") , optional:true, emit: relatedness2 + tuple val(meta), path("*.lqual") , optional:true, emit: lqual + tuple val(meta), path("*.imiss") , optional:true, emit: missing_individual + tuple val(meta), path("*.lmiss") , optional:true, emit: missing_site + tuple val(meta), path("*.snpden") , optional:true, emit: snp_density + tuple val(meta), path("*.kept.sites") , optional:true, emit: kept_sites + tuple val(meta), path("*.removed.sites") , optional:true, emit: removed_sites + tuple val(meta), path("*.singletons") , optional:true, emit: singletons + tuple val(meta), path("*.indel.hist") , optional:true, emit: indel_hist + tuple val(meta), path("*.hapcount") , optional:true, emit: hapcount + tuple val(meta), path("*.mendel") , optional:true, emit: mendel + tuple val(meta), path("*.FORMAT") , optional:true, emit: format + tuple val(meta), path("*.INFO") , optional:true, emit: info + tuple val(meta), path("*.012") , optional:true, emit: genotypes_matrix + tuple val(meta), path("*.012.indv") , optional:true, emit: genotypes_matrix_individual + tuple val(meta), path("*.012.pos") , optional:true, emit: genotypes_matrix_position + tuple val(meta), path("*.impute.hap") , 
optional:true, emit: impute_hap + tuple val(meta), path("*.impute.hap.legend") , optional:true, emit: impute_hap_legend + tuple val(meta), path("*.impute.hap.indv") , optional:true, emit: impute_hap_indv + tuple val(meta), path("*.ldhat.sites") , optional:true, emit: ldhat_sites + tuple val(meta), path("*.ldhat.locs") , optional:true, emit: ldhat_locs + tuple val(meta), path("*.BEAGLE.GL") , optional:true, emit: beagle_gl + tuple val(meta), path("*.BEAGLE.PL") , optional:true, emit: beagle_pl + tuple val(meta), path("*.ped") , optional:true, emit: ped + tuple val(meta), path("*.map") , optional:true, emit: map_ + tuple val(meta), path("*.tped") , optional:true, emit: tped + tuple val(meta), path("*.tfam") , optional:true, emit: tfam + tuple val(meta), path("*.diff.sites_in_files") , optional:true, emit: diff_sites_in_files + tuple val(meta), path("*.diff.indv_in_files") , optional:true, emit: diff_indv_in_files + tuple val(meta), path("*.diff.sites") , optional:true, emit: diff_sites + tuple val(meta), path("*.diff.indv") , optional:true, emit: diff_indv + tuple val(meta), path("*.diff.discordance.matrix"), optional:true, emit: diff_discd_matrix + tuple val(meta), path("*.diff.switch") , optional:true, emit: diff_switch_error + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def args_list = args.tokenize() + + def bed_arg = (args.contains('--bed')) ? "--bed ${bed}" : + (args.contains('--exclude-bed')) ? "--exclude-bed ${bed}" : + (args.contains('--hapcount')) ? "--hapcount ${bed}" : '' + args_list.removeIf { it.contains('--bed') } + args_list.removeIf { it.contains('--exclude-bed') } + args_list.removeIf { it.contains('--hapcount') } + + def diff_variant_arg = (args.contains('--diff')) ? "--diff ${diff_variant_file}" : + (args.contains('--gzdiff')) ? "--gzdiff ${diff_variant_file}" : + (args.contains('--diff-bcf')) ? "--diff-bcf ${diff_variant_file}" : '' + args_list.removeIf { it.contains('--diff') } + args_list.removeIf { it.contains('--gzdiff') } + args_list.removeIf { it.contains('--diff-bcf') } + + def input_file = ("$variant_file".endsWith(".vcf")) ? "--vcf ${variant_file}" : + ("$variant_file".endsWith(".vcf.gz")) ? "--gzvcf ${variant_file}" : + ("$variant_file".endsWith(".bcf")) ? "--bcf ${variant_file}" : '' + + """ + vcftools \\ + $input_file \\ + --out $prefix \\ + ${args_list.join(' ')} \\ + $bed_arg \\ + $diff_variant_arg + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + vcftools: \$(echo \$(vcftools --version 2>&1) | sed 's/^.*VCFtools (//;s/).*//') + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/vcftools/meta.yml b/modules/nf-core/modules/vcftools/meta.yml new file mode 100644 index 00000000..7a85bdec --- /dev/null +++ b/modules/nf-core/modules/vcftools/meta.yml @@ -0,0 +1,294 @@ +name: vcftools +description: A set of tools written in Perl and C++ for working with VCF files +keywords: + - vcf + - sort +tools: + - vcftools: + description: A set of tools written in Perl and C++ for working with VCF files. This package only contains the C++ libraries whereas the package perl-vcftools-vcf contains the perl libraries + homepage: http://vcftools.sourceforge.net/ + documentation: http://vcftools.sourceforge.net/man_latest.html + tool_dev_url: None + doi: + licence: ["LGPL"] + +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test', single_end:false ] + - variant_file: + type: file + description: variant input file which can be vcf, vcf.gz, or bcf format. + - bed: + type: file + description: bed file which can be used with different arguments in vcftools (optional) + - diff_variant_file: + type: file + description: secondary variant file which can be used with the 'diff' suite of tools (optional) + +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - vcf: + type: file + description: vcf file (optional) + pattern: "*.vcf" + - bcf: + type: file + description: bcf file (optional) + pattern: "*.bcf" + - frq: + type: file + description: Allele frequency for each site (optional) + pattern: "*.frq" + - frq_count: + type: file + description: Allele counts for each site (optional) + pattern: "*.frq.count" + - idepth: + type: file + description: mean depth per individual (optional) + pattern: "*.idepth" + - ldepth: + type: file + description: depth per site summed across individuals (optional) + pattern: "*.ldepth" + - ldepth_mean: + type: file + description: mean depth per site calculated across individuals (optional) + pattern: "*.ldepth.mean" + - gdepth: + type: file + description: depth for each genotype in vcf file (optional) + pattern: "*.gdepth" + - hap_ld: + type: file + description: r2, D, and D’ statistics using phased haplotypes (optional) + pattern: "*.hap.ld" + - geno_ld: + type: file + description: squared correlation coefficient between genotypes encoded as 0, 1 and 2 to represent the number of non-reference alleles in each individual (optional) + pattern: "*.geno.ld" + - geno_chisq: + type: file + description: test for genotype independence via the chi-squared statistic (optional) + pattern: "*.geno.chisq" + - list_hap_ld: + type: file + description: r2 statistics of the sites contained in the provided input file versus all other sites (optional) + pattern: "*.list.hap.ld" + - list_geno_ld: + type: file + description: r2 statistics of the sites contained in the provided input file versus all other sites (optional) + pattern: "*.list.geno.ld" + - interchrom_hap_ld: + type: file + description: r2 statistics for sites (haplotypes) on different chromosomes (optional) + pattern: "*.interchrom.hap.ld" + - interchrom_geno_ld: + type: file + description: r2 statistics for sites (genotypes) on different chromosomes (optional) + pattern: "*.interchrom.geno.ld" + - tstv: + type: file + description: Transition / Transversion ratio in bins of size defined in options (optional) + pattern: "*.TsTv" + - tstv_summary: + type: file + description: Summary of all Transitions and Transversions (optional) + pattern: "*.TsTv.summary" + - tstv_count: + type: file + description: Transition / Transversion ratio as a function of alternative allele count (optional) + pattern: "*.TsTv.count" + - tstv_qual: + type: file + description: Transition / Transversion ratio as a function of SNP quality threshold (optional) + pattern: "*.TsTv.qual" + - filter_summary: + type: file + description: Summary of the number of SNPs and Ts/Tv ratio for each FILTER category (optional) + pattern: "*.FILTER.summary" + - sites_pi: + type: file + description: Nucleotide divergence on a per-site basis (optional) + pattern: "*.sites.pi" + - windowed_pi: + type: file + description: Nucleotide diversity in windows, with window size determined by options (optional) + 
pattern: "*windowed.pi" + - weir_fst: + type: file + description: Fst estimate from Weir and Cockerham’s 1984 paper (optional) + pattern: "*.weir.fst" + - heterozygosity: + type: file + description: Heterozygosity on a per-individual basis (optional) + pattern: "*.het" + - hwe: + type: file + description: Contains the Observed numbers of Homozygotes and Heterozygotes and the corresponding Expected numbers under HWE (optional) + pattern: "*.hwe" + - tajima_d: + type: file + description: Tajima’s D statistic in bins with size of the specified number in options (optional) + pattern: "*.Tajima.D" + - freq_burden: + type: file + description: Number of variants within each individual of a specific frequency in options (optional) + pattern: "*.ifreqburden" + - lroh: + type: file + description: Long Runs of Homozygosity (optional) + pattern: "*.LROH" + - relatedness: + type: file + description: Relatedness statistic based on the method of Yang et al, Nature Genetics 2010 (doi:10.1038/ng.608) (optional) + pattern: "*.relatedness" + - relatedness2: + type: file + description: Relatedness statistic based on the method of Manichaikul et al., BIOINFORMATICS 2010 (doi:10.1093/bioinformatics/btq559) (optional) + pattern: "*.relatedness2" + - lqual: + type: file + description: per-site SNP quality (optional) + pattern: "*.lqual" + - missing_individual: + type: file + description: Missingness on a per-individual basis (optional) + pattern: "*.imiss" + - missing_site: + type: file + description: Missingness on a per-site basis (optional) + pattern: "*.lmiss" + - snp_density: + type: file + description: Number and density of SNPs in bins of size defined by option (optional) + pattern: "*.snpden" + - kept_sites: + type: file + description: All sites that have been kept after filtering (optional) + pattern: "*.kept.sites" + - removed_sites: + type: file + description: All sites that have been removed after filtering (optional) + pattern: "*.removed.sites" + - singeltons: + type: file + description: Location of singletons, and the individual they occur in (optional) + pattern: "*.singeltons" + - indel_hist: + type: file + description: Histogram file of the length of all indels (including SNPs) (optional) + pattern: "*.indel_hist" + - hapcount: + type: file + description: Unique haplotypes within user specified bins (optional) + pattern: "*.hapcount" + - mendel: + type: file + description: Mendel errors identified in trios (optional) + pattern: "*.mendel" + - format: + type: file + description: Extracted information from the genotype fields in the VCF file relating to a specfied FORMAT identifier (optional) + pattern: "*.FORMAT" + - info: + type: file + description: Extracted information from the INFO field in the VCF file (optional) + pattern: "*.INFO" + - genotypes_matrix: + type: file + description: | + Genotypes output as large matrix. + Genotypes of each individual on a separate line. + Genotypes are represented as 0, 1 and 2, where the number represent that number of non-reference alleles. 
+ Missing genotypes are represented by -1 (optional) + pattern: "*.012" + - genotypes_matrix_individual: + type: file + description: Details the individuals included in the main genotypes_matrix file (optional) + pattern: "*.012.indv" + - genotypes_matrix_position: + type: file + description: Details the site locations included in the main genotypes_matrix file (optional) + pattern: "*.012.pos" + - impute_hap: + type: file + description: Phased haplotypes in IMPUTE reference-panel format (optional) + pattern: "*.impute.hap" + - impute_hap_legend: + type: file + description: Impute haplotype legend file (optional) + pattern: "*.impute.hap.legend" + - impute_hap_indv: + type: file + description: Impute haplotype individuals file (optional) + pattern: "*.impute.hap.indv" + - ldhat_sites: + type: file + description: Output data in LDhat format, sites (optional) + pattern: "*.ldhat.sites" + - ldhat_locs: + type: file + description: output data in LDhat format, locations (optional) + pattern: "*.ldhat.locs" + - beagle_gl: + type: file + description: Genotype likelihoods for biallelic sites (optional) + pattern: "*.BEAGLE.GL" + - beagle_pl: + type: file + description: Genotype likelihoods for biallelic sites (optional) + pattern: "*.BEAGLE.PL" + - ped: + type: file + description: output the genotype data in PLINK PED format (optional) + pattern: "*.ped" + - map_: + type: file + description: output the genotype data in PLINK PED format (optional) + pattern: "*.map" + - tped: + type: file + description: output the genotype data in PLINK PED format (optional) + pattern: "*.tped" + - tfam: + type: file + description: output the genotype data in PLINK PED format (optional) + pattern: "*.tfam" + - diff_sites_in_files: + type: file + description: Sites that are common / unique to each file specified in optional inputs (optional) + pattern: "*.diff.sites.in.files" + - diff_indv_in_files: + type: file + description: Individuals that are common / unique to each file specified in optional inputs (optional) + pattern: "*.diff.indv.in.files" + - diff_sites: + type: file + description: Discordance on a site by site basis, specified in optional inputs (optional) + pattern: "*.diff.sites" + - diff_indv: + type: file + description: Discordance on a individual by individual basis, specified in optional inputs (optional) + pattern: "*.diff.indv" + - diff_discd_matrix: + type: file + description: Discordance matrix between files specified in optional inputs (optional) + pattern: "*.diff.discordance.matrix" + - diff_switch_error: + type: file + description: Switch errors found between sites (optional) + pattern: "*.diff.switch" + +authors: + - "@Mark-S-Hill" diff --git a/nextflow.config b/nextflow.config index 673d210d..b72aaca5 100644 --- a/nextflow.config +++ b/nextflow.config @@ -1,6 +1,6 @@ /* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - nf-core/tva Nextflow config file + CenterForMedicalGeneticsGhent/nf-cmgg-germline Nextflow config file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Default config options for all compute environments ---------------------------------------------------------------------------------------- @@ -9,18 +9,62 @@ // Global default params, used in configs params { - // TODO nf-core: Specify your pipeline's command line flags // Input options input = null + // Pipeline specific parameters + scatter_count = 2 + output_mode = "seqr" + always_use_cram = true + + // Module specific parameters + 
use_dragstr_model = false + skip_genotyping = false + use_bcftools_merge = false + + // VEP plugins to use + vep_dbnsfp = false + vep_spliceai = false + vep_spliceregion = false + vep_mastermind = false + vep_eog = false + + // VEP plugin files + dbnsfp = null + dbnsfp_tbi = null + spliceai_indel = null + spliceai_indel_tbi = null + spliceai_snv = null + spliceai_snv_tbi = null + mastermind = null + mastermind_tbi = null + eog = null + eog_tbi = null + + // VEP parameters + vep_merged_cache = null + species = "homo_sapiens" + vep_version = "105.0" + vep_cache_version = "105" + + // VCFanno parameters + vcfanno = false + vcfanno_toml = null + vcfanno_resources = null + // References - genome = null + genome = 'GRCh38' + fasta = null + fasta_fai = null + dict = null + strtablefile = null igenomes_base = 's3://ngi-igenomes/igenomes' igenomes_ignore = false // MultiQC options multiqc_config = null multiqc_title = null + multiqc_logo = null max_multiqc_email_size = '25.MB' // Boilerplate options @@ -36,6 +80,7 @@ params { show_hidden_params = false schema_ignore_params = 'genomes' enable_conda = false + hook_url = null // Config options custom_config_version = 'master' @@ -45,6 +90,7 @@ params { config_profile_url = null config_profile_name = null + // Max resource options // Defaults only, expecting to be overwritten max_memory = '128.GB' @@ -63,15 +109,16 @@ try { System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/nfcore_custom.config") } -// Load nf-core/tva custom profiles from different institutions. +// Load CenterForMedicalGeneticsGhent/nf-cmgg-germline custom profiles from different institutions. // Warning: Uncomment only if a pipeline-specific institutional config already exists on nf-core/configs! // try { -// includeConfig "${params.custom_config_base}/pipeline/tva.config" +// includeConfig "${params.custom_config_base}/pipeline/nf-cmgg-germline.config" // } catch (Exception e) { -// System.err.println("WARNING: Could not load nf-core/config/tva profiles: ${params.custom_config_base}/pipeline/tva.config") +// System.err.println("WARNING: Could not load nf-core/config/nf-cmgg-germline profiles: ${params.custom_config_base}/pipeline/nf-cmgg-germline.config") // } + profiles { debug { process.beforeScript = 'echo $HOSTNAME' } conda { @@ -82,6 +129,15 @@ profiles { shifter.enabled = false charliecloud.enabled = false } + mamba { + params.enable_conda = true + conda.useMamba = true + docker.enabled = false + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false + } docker { docker.enabled = true docker.userEmulation = true @@ -119,10 +175,17 @@ profiles { podman.enabled = false shifter.enabled = false } - test { includeConfig 'conf/test.config' } - test_full { includeConfig 'conf/test_full.config' } + gitpod { executor.name = 'local' + executor.cpus = 16 + executor.memory = 60.GB + } + test { includeConfig 'conf/test.config' } + test_full { includeConfig 'conf/test_full.config' } + test_local { includeConfig 'conf/test_local.config' } } + // Load igenomes.config if required if (!params.igenomes_ignore) { includeConfig 'conf/igenomes.config' @@ -130,6 +193,7 @@ if (!params.igenomes_ignore) { params.genomes = [:] } + // Export these variables to prevent local Python/R libraries from conflicting with those in the container // The JULIA depot path has been adjusted to a fixed path `/usr/local/share/julia` that needs to be used for packages in the container. 
// See https://apeltzer.github.io/post/03-julia-lang-nextflow/ for details on that. Once we have a common agreement on where to keep Julia packages, this is adjustable. @@ -163,13 +227,13 @@ dag { } manifest { - name = 'nf-core/tva' + name = 'CenterForMedicalGeneticsGhent/nf-cmgg-germline' author = '@nvnieuwk' - homePage = 'https://github.com/nf-core/tva' + homePage = 'https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline' description = 'A nextflow pipeline for calling and annotating variants' mainScript = 'main.nf' nextflowVersion = '!>=21.10.3' - version = '1.0dev' + version = '1.0.0' } // Load modules.config for DSL2 module specific options diff --git a/nextflow_schema.json b/nextflow_schema.json index 04ece41f..7e3c03cf 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -1,7 +1,7 @@ { "$schema": "http://json-schema.org/draft-07/schema", - "$id": "https://raw.githubusercontent.com/nf-core/tva/master/nextflow_schema.json", - "title": "nf-core/tva pipeline parameters", + "$id": "https://raw.githubusercontent.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/master/nextflow_schema.json", + "title": "CenterForMedicalGeneticsGhent/nf-cmgg-germline pipeline parameters", "description": "A nextflow pipeline for calling and annotating variants", "type": "object", "definitions": { @@ -19,7 +19,7 @@ "pattern": "^\\S+\\.csv$", "schema": "assets/schema_input.json", "description": "Path to comma-separated file containing information about the samples in the experiment.", - "help_text": "You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. See [usage docs](https://nf-co.re/tva/usage#samplesheet-input).", + "help_text": "You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with samples, and a header row. See [usage docs](https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/blob/dev/docs/usage.md).", "fa_icon": "fas fa-file-csv" }, "outdir": { @@ -47,22 +47,40 @@ "type": "object", "fa_icon": "fas fa-dna", "description": "Reference genome related files and options required for the workflow.", + "required": ["fasta"], "properties": { "genome": { "type": "string", - "description": "Name of iGenomes reference.", + "description": "[Seqplorer mode only] The genome used for the samples", "fa_icon": "fas fa-book", - "help_text": "If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. `--genome GRCh38`. \n\nSee the [nf-core website docs](https://nf-co.re/usage/reference_genomes) for more details." + "default": "GRCh38" }, "fasta": { "type": "string", "format": "file-path", "mimetype": "text/plain", - "pattern": "^\\S+\\.fn?a(sta)?(\\.gz)?$", - "description": "Path to FASTA genome file.", - "help_text": "This parameter is *mandatory* if `--genome` is not specified. If you don't have a BWA index available this will be generated for you automatically. 
Combine with `--save_reference` to save BWA index for future runs.", + "pattern": "^\\S+\\.fn?a(sta)?$", + "description": "Path to the FASTA reference file.", "fa_icon": "far fa-file-code" }, + "fasta_fai": { + "type": "string", + "pattern": "^\\S+\\.fai$", + "description": "The index of the FASTA file", + "fa_icon": "far fa-file-code" + }, + "dict": { + "type": "string", + "pattern": "^\\S+\\.dict$", + "description": "The sequence dictionary of the FASTA reference", + "fa_icon": "far fa-file-code" + }, + "strtablefile": { + "type": "string", + "description": "The STR table file generated from the FASTA reference", + "fa_icon": "fas fa-folder", + "pattern": "^\\S+\\.zip$" + }, "igenomes_base": { "type": "string", "format": "directory-path", @@ -80,6 +98,64 @@ } } }, + "pipeline_specific_parameters": { + "title": "Pipeline specific parameters", + "type": "object", + "description": "Parameters that define how the pipeline works", + "default": "", + "properties": { + "scatter_count": { + "type": "integer", + "default": 2, + "fa_icon": "fas fa-cut", + "description": "The number of times the BED file should be split (for parallelization of GATK HaplotypeCaller)", + "minimum": 1 + }, + "species": { + "type": "string", + "default": "homo_sapiens", + "description": "[Seqplorer mode only] The species of the samples (must be lower case and have underscores as spaces)", + "fa_icon": "fas fa-user-circle", + "pattern": "^[a-z_]*$" + }, + "output_mode": { + "type": "string", + "default": "seqr", + "enum": ["seqplorer", "seqr"], + "description": "The filter mode for the VCF outputs (has to be either 'seqr' or 'seqplorer')", + "fa_icon": "fas fa-question-circle" + }, + "always_use_cram": { + "type": "boolean", + "description": "Whether or not the modules should always return a CRAM file instead of a BAM file", + "fa_icon": "fas fa-question-circle", + "default": true + } + } + }, + "module_specific_parameters": { + "title": "Module specific parameters", + "type": "object", + "description": "Parameters that define how specific modules work", + "default": "", + "properties": { + "use_dragstr_model": { + "type": "boolean", + "description": "Whether or not the DRAGSTR models should be used", + "fa_icon": "fas fa-question-circle" + }, + "skip_genotyping": { + "type": "boolean", + "description": "Whether or not the genotyping should be skipped (does GVCF -> VCF conversion using bcftools view and convert instead)", + "fa_icon": "fas fa-question-circle" + }, + "use_bcftools_merge": { + "type": "boolean", + "description": "Whether or not to use bcftools merge instead of CombineGVCFs", + "fa_icon": "fas fa-question-circle" + } + } + }, "institutional_config_options": { "title": "Institutional config options", "type": "object", @@ -207,6 +283,13 @@ "fa_icon": "fas fa-file-upload", "hidden": true }, + "multiqc_logo": { + "type": "string", + "default": "None", + "fa_icon": "far fa-font-awesome-logo-full", + "hidden": true, + "description": "The logo for the reports in MultiQC" + }, "monochrome_logs": { "type": "boolean", "description": "Do not use coloured log outputs.", @@ -245,6 +328,128 @@ "description": "Run this workflow with Conda. 
You can also use '-profile conda' instead of providing this parameter.", "hidden": true, "fa_icon": "fas fa-bacon" + }, + "hook_url": { + "type": "string", + "fa_icon": "fas fa-share-alt", + "description": "A hook URL for Microsoft Teams" + } + } + }, + "annotation_parameters": { + "title": "Annotation parameters", + "type": "object", + "description": "Parameters to configure Ensembl VEP and VCFanno (only used when \"--output_mode seqplorer\" is used)", + "default": "", + "properties": { + "vep_dbnsfp": { + "type": "boolean", + "description": "[Seqplorer mode only] Use the dbNSFP plugin with Ensembl VEP. The '--dbnsfp' and '--dbnsfp_tbi' parameters need to be specified when using this plugin.", + "fa_icon": "fas fa-question-circle" + }, + "vep_spliceai": { + "type": "boolean", + "description": "[Seqplorer mode only] Use the SpliceAI plugin with Ensembl VEP. The '--spliceai_indel', '--spliceai_indel_tbi', '--spliceai_snv' and '--spliceai_snv_tbi' parameters need to be specified when using this plugin.", + "fa_icon": "fas fa-question-circle" + }, + "vep_spliceregion": { + "type": "boolean", + "description": "[Seqplorer mode only] Use the SpliceRegion plugin with Ensembl VEP", + "fa_icon": "fas fa-question-circle" + }, + "vep_mastermind": { + "type": "boolean", + "description": "[Seqplorer mode only] Use the Mastermind plugin with Ensembl VEP. The '--mastermind' and '--mastermind_tbi' parameters need to be specified when using this plugin.", + "fa_icon": "fas fa-question-circle" + }, + "vep_eog": { + "type": "boolean", + "description": "[Seqplorer mode only] Use the custom EOG annotation with Ensembl VEP. The '--eog' and '--eog_tbi' parameters need to be specified when using this plugin.", + "fa_icon": "fas fa-question-circle" + }, + "vep_merged_cache": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The path to the folder containing the merged cache", + "fa_icon": "fas fa-folder" + }, + "vep_version": { + "type": "string", + "default": "105.0", + "description": "[Seqplorer mode only] The version of the VEP tool to be used", + "fa_icon": "fas fa-code-branch" + }, + "vep_cache_version": { + "type": "string", + "default": "105", + "description": "[Seqplorer mode only] The version of the cache to be used", + "fa_icon": "fas fa-code-branch" + }, + "dbnsfp": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The dbNSFP file", + "fa_icon": "fas fa-folder" + }, + "dbnsfp_tbi": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The index of the dbNSFP file" + }, + "spliceai_indel": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The VCF containing indels for spliceAI", + "fa_icon": "far fa-file-alt" + }, + "spliceai_indel_tbi": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The index of the VCF containing indels for spliceAI" + }, + "spliceai_snv": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The VCF containing SNVs for spliceAI" + }, + "spliceai_snv_tbi": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The index of the VCF containing SNVs for spliceAI" + }, + "mastermind": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The VCF for Mastermind" + }, + "mastermind_tbi": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The index of the VCF for Mastermind" + }, + "eog": { + "type": "string", 
"default": "None", + "description": "[Seqplorer mode only] The VCF containing EOG annotations" + }, + "eog_tbi": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The index of the VCF containing EOG annotations" + }, + "vcfanno": { + "type": "boolean", + "description": "[Seqplorer mode only] Whether or not an extra annotation step should be performed with VCFanno" + }, + "vcfanno_toml": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The TOML file containing the configuration for VCFanno" + }, + "vcfanno_resources": { + "type": "string", + "default": "None", + "description": "[Seqplorer mode only] The folder containing the reference annotation resources specified in the configuration TOML" } } } @@ -256,6 +461,12 @@ { "$ref": "#/definitions/reference_genome_options" }, + { + "$ref": "#/definitions/pipeline_specific_parameters" + }, + { + "$ref": "#/definitions/module_specific_parameters" + }, { "$ref": "#/definitions/institutional_config_options" }, @@ -264,6 +475,9 @@ }, { "$ref": "#/definitions/generic_options" + }, + { + "$ref": "#/definitions/annotation_parameters" } ] } diff --git a/nf-test.config b/nf-test.config new file mode 100644 index 00000000..2fa82adf --- /dev/null +++ b/nf-test.config @@ -0,0 +1,8 @@ +config { + + testsDir "tests" + workDir ".nf-test" + configFile "tests/nextflow.config" + profile "docker" + +} diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 00000000..0d62beb6 --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,10 @@ +# Config file for Python. Mostly used to configure linting of bin/check_samplesheet.py with Black. +# Should be kept the same as nf-core/tools to avoid fighting with template synchronisation. +[tool.black] +line-length = 120 +target_version = ["py37", "py38", "py39", "py310"] + +[tool.isort] +profile = "black" +known_first_party = ["nf_core"] +multi_line_output = 3 diff --git a/subworkflows/local/annotation.nf b/subworkflows/local/annotation.nf new file mode 100644 index 00000000..faac1790 --- /dev/null +++ b/subworkflows/local/annotation.nf @@ -0,0 +1,69 @@ +// +// ANNOTATION +// + +include { ENSEMBLVEP } from '../../modules/nf-core/modules/ensemblvep/main' +include { VCFANNO } from '../../modules/nf-core/modules/vcfanno/main' +include { TABIX_BGZIP as BGZIP_ANNOTATED_VCFS } from '../../modules/nf-core/modules/tabix/bgzip/main' + +workflow ANNOTATION { + take: + vcfs // channel: [mandatory] [ meta, vcfs ] => The post-processed VCFs + fasta // channel: [mandatory] [ fasta ] => fasta reference + genome // value: [mandatory] Which genome was used to align the samples to + species // value: [mandatory] Which species the samples are from + vep_cache_version // value: [mandatory] which version of VEP to use + vep_merged_cache // channel: [optional] [ vep_merged_cache ] => The VEP cache to use + vep_extra_files // channel: [optional] [ file_1, file_2, file_3, ... 
] => All files necessary for using the desired plugins + vcfanno // boolean: [mandatory] Whether or not annotation using VCFanno should be performed too + vcfanno_toml // channel: [mandatory if vcfanno == true] [ toml_config_file ] => The TOML config file for VCFanno + vcfanno_resources // channel: [mandatory if vcfanno == true] [ resource_dir ] => The directory containing the reference files for VCFanno + + main: + + ch_annotated_vcfs = Channel.empty() + ch_reports = Channel.empty() + ch_versions = Channel.empty() + + // + // Annotate using Ensembl VEP + // + + ENSEMBLVEP( + vcfs, + genome, + species, + vep_cache_version, + vep_merged_cache, + fasta, + vep_extra_files + ) + + ch_reports = ch_reports.mix(ENSEMBLVEP.out.report) + ch_versions = ch_versions.mix(ENSEMBLVEP.out.versions) + + if (vcfanno) { + VCFANNO( + ENSEMBLVEP.out.vcf.map({ meta, vcf -> [ meta, vcf, [] ] }), + vcfanno_toml, + vcfanno_resources + ) + + ch_annotated_vcfs = VCFANNO.out.vcf + ch_versions = ch_versions.mix(VCFANNO.out.versions) + } + else { + ch_annotated_vcfs = ENSEMBLVEP.out.vcf + } + + BGZIP_ANNOTATED_VCFS( + ch_annotated_vcfs + ) + + ch_versions = ch_versions.mix(BGZIP_ANNOTATED_VCFS.out.versions) + + emit: + annotated_vcfs = BGZIP_ANNOTATED_VCFS.out.output + reports = ch_reports + versions = ch_versions +} diff --git a/subworkflows/local/germline_variant_calling.nf b/subworkflows/local/germline_variant_calling.nf new file mode 100644 index 00000000..202b3cd3 --- /dev/null +++ b/subworkflows/local/germline_variant_calling.nf @@ -0,0 +1,196 @@ +// +// GERMLINE VARIANT CALLING +// + +include { GATK4_HAPLOTYPECALLER as HAPLOTYPECALLER } from '../../modules/nf-core/modules/gatk4/haplotypecaller/main' +include { GATK4_CALIBRATEDRAGSTRMODEL as CALIBRATEDRAGSTRMODEL } from '../../modules/nf-core/modules/gatk4/calibratedragstrmodel/main' +include { BCFTOOLS_CONCAT } from '../../modules/nf-core/modules/bcftools/concat/main' +include { BEDTOOLS_SPLIT } from '../../modules/nf-core/modules/bedtools/split/main' +include { MERGE_BEDS } from '../../modules/local/merge_beds' +include { SAMTOOLS_MERGE } from '../../modules/local/samtools_merge' +include { SAMTOOLS_INDEX } from '../../modules/nf-core/modules/samtools/index/main' + +workflow GERMLINE_VARIANT_CALLING { + take: + crams // channel: [mandatory] [ meta, cram, crai ] => sample CRAM files and their indexes + beds // channel: [mandatory] [ meta, bed ] => bed files + fasta // channel: [mandatory] [ fasta ] => fasta reference + fasta_fai // channel: [mandatory] [ fasta_fai ] => fasta reference index + dict // channel: [mandatory] [ dict ] => sequence dictionary + strtablefile // channel: [mandatory] [ strtablefile ] => STR table file + scatter_count // value: [mandatory] how many times the BED files need to be split before the variant calling + use_dragstr_model // boolean: [mandatory] whether or not to use the dragstr models for variant calling + always_use_cram // boolean: [mandatory] whether or not to retain the bam after merging or convert back to cram + + main: + + gvcfs = Channel.empty() + ch_versions = Channel.empty() + + // + // Merge the CRAM files if there are multiple per sample + // + + cram_branch = crams.groupTuple() + .branch({ meta, cram, crai -> + multiple: cram.size() > 1 + return [meta, cram] + single: cram.size() == 1 + return [meta, cram, crai] + }) + + SAMTOOLS_MERGE( + cram_branch.multiple, + fasta, + fasta_fai, + always_use_cram + ) + + merged_crams = SAMTOOLS_MERGE.out.cram + .mix(SAMTOOLS_MERGE.out.bam) + .map({ meta, cram -> [ meta, cram, [] 
]}) + .mix(cram_branch.single.map({meta, cram, crai -> + [ meta, cram[0], crai[0]] + })) + .branch({ meta, cram, crai -> + not_indexed: crai == [] + return [ meta, cram ] + indexed: crai != [] + return [ meta, cram, crai ] + }) + + SAMTOOLS_INDEX( + merged_crams.not_indexed + ) + + ready_crams = merged_crams.not_indexed.combine(SAMTOOLS_INDEX.out.crai, by:0) + .mix(merged_crams.not_indexed.combine(SAMTOOLS_INDEX.out.bai, by:0)) + .mix(merged_crams.indexed) + + // + // Merge the BED files if there are multiple per sample + // + + beds.groupTuple() + .branch({ meta, bed -> + multiple: bed.size() > 1 + return [meta, bed] + single: bed.size() == 1 + return [meta, bed] + }) + .set({bed_branch}) + + MERGE_BEDS( + bed_branch.multiple + ) + + merged_beds = MERGE_BEDS.out.bed + .mix(bed_branch.single) + + // + // Split the BED files into multiple subsets + // + + if (scatter_count > 1) { + BEDTOOLS_SPLIT( + merged_beds, + scatter_count + ) + + ch_versions = ch_versions.mix(BEDTOOLS_SPLIT.out.versions) + + split_beds = BEDTOOLS_SPLIT.out.beds.transpose() + } + else { + split_beds = merged_beds + } + + // + // Generate DRAGSTR models + // + + if (use_dragstr_model) { + calibratedragstrmodel_input = ready_crams.map( + { meta, cram, crai -> + [meta, cram, crai, []] + } + ) + + CALIBRATEDRAGSTRMODEL( + calibratedragstrmodel_input, + fasta, + fasta_fai, + dict, + strtablefile + ) + + ch_versions = ch_versions.mix(CALIBRATEDRAGSTRMODEL.out.versions) + + cram_models = ready_crams.combine(split_beds, by: 0) + .combine(CALIBRATEDRAGSTRMODEL.out.dragstr_model, by: 0) + } + else { + cram_models = ready_crams.combine(split_beds, by: 0) + } + + // + // Remap CRAM channel to fit the haplotypecaller input format + // + + cram_intervals = cram_models + .map{ meta, cram, crai, bed, dragstr_model=[] -> + new_meta = meta.clone() + + // If no scattering was done (i.e. only a single interval), don't rename the samples + new_meta.id = scatter_count <= 1 ? 
meta.id : bed.baseName + + [ new_meta, cram, crai, bed, dragstr_model ] + } + + // + // Call the variants using HaplotypeCaller + // + + HAPLOTYPECALLER( + cram_intervals, + fasta, + fasta_fai, + dict, + [], + [] + ) + + haplotypecaller_vcfs = HAPLOTYPECALLER.out.vcf.combine(HAPLOTYPECALLER.out.tbi, by:0) + ch_versions = ch_versions.mix(HAPLOTYPECALLER.out.versions) + + // + // Merge the GVCFs if split BED files were used + // + + if (scatter_count > 1) { + concat_input = haplotypecaller_vcfs + .map({meta, vcf, tbi -> + new_meta = meta.clone() + new_meta.id = new_meta.samplename + [ new_meta, vcf, tbi ] + }) + .groupTuple() + + BCFTOOLS_CONCAT( + concat_input + ) + + gvcfs = BCFTOOLS_CONCAT.out.vcf + ch_versions = ch_versions.mix(BCFTOOLS_CONCAT.out.versions) + } + else { + gvcfs = haplotypecaller_vcfs + .map({ meta, vcf, tbi -> + [ meta, vcf ] + }) + } + + emit: + gvcfs + versions = ch_versions +} \ No newline at end of file diff --git a/subworkflows/local/input_check.nf b/subworkflows/local/input_check.nf deleted file mode 100644 index 0aecf87f..00000000 --- a/subworkflows/local/input_check.nf +++ /dev/null @@ -1,44 +0,0 @@ -// -// Check input samplesheet and get read channels -// - -include { SAMPLESHEET_CHECK } from '../../modules/local/samplesheet_check' - -workflow INPUT_CHECK { - take: - samplesheet // file: /path/to/samplesheet.csv - - main: - SAMPLESHEET_CHECK ( samplesheet ) - .csv - .splitCsv ( header:true, sep:',' ) - .map { create_fastq_channel(it) } - .set { reads } - - emit: - reads // channel: [ val(meta), [ reads ] ] - versions = SAMPLESHEET_CHECK.out.versions // channel: [ versions.yml ] -} - -// Function to get list of [ meta, [ fastq_1, fastq_2 ] ] -def create_fastq_channel(LinkedHashMap row) { - // create meta map - def meta = [:] - meta.id = row.sample - meta.single_end = row.single_end.toBoolean() - - // add path(s) of the fastq file(s) to the meta map - def fastq_meta = [] - if (!file(row.fastq_1).exists()) { - exit 1, "ERROR: Please check input samplesheet -> Read 1 FastQ file does not exist!\n${row.fastq_1}" - } - if (meta.single_end) { - fastq_meta = [ meta, [ file(row.fastq_1) ] ] - } else { - if (!file(row.fastq_2).exists()) { - exit 1, "ERROR: Please check input samplesheet -> Read 2 FastQ file does not exist!\n${row.fastq_2}" - } - fastq_meta = [ meta, [ file(row.fastq_1), file(row.fastq_2) ] ] - } - return fastq_meta -} diff --git a/subworkflows/local/postprocess.nf b/subworkflows/local/postprocess.nf new file mode 100644 index 00000000..346f3e7a --- /dev/null +++ b/subworkflows/local/postprocess.nf @@ -0,0 +1,227 @@ +// +// GENOTYPE +// + +include { GATK4_GENOTYPEGVCFS as GENOTYPE_GVCFS } from '../../modules/nf-core/modules/gatk4/genotypegvcfs/main' +include { GATK4_COMBINEGVCFS as COMBINEGVCFS } from '../../modules/nf-core/modules/gatk4/combinegvcfs/main' +include { GATK4_REBLOCKGVCF as REBLOCKGVCF } from '../../modules/nf-core/modules/gatk4/reblockgvcf/main' +include { TABIX_TABIX as TABIX_GVCFS } from '../../modules/nf-core/modules/tabix/tabix/main' +include { TABIX_TABIX as TABIX_COMBINED_GVCFS } from '../../modules/nf-core/modules/tabix/tabix/main' +include { TABIX_BGZIP as BGZIP_GENOTYPED_VCFS } from '../../modules/nf-core/modules/tabix/bgzip/main' +include { TABIX_BGZIPTABIX as BGZIP_TABIX_PED_VCFS } from '../../modules/nf-core/modules/tabix/bgziptabix/main' +include { RTGTOOLS_PEDFILTER as PEDFILTER } from '../../modules/local/rtgtools/pedfilter/main' +include { MERGE_VCF_HEADERS } from '../../modules/local/merge_vcf_headers' +include { 
BCFTOOLS_FILTER as FILTER_SNPS } from '../../modules/nf-core/modules/bcftools/filter/main' +include { BCFTOOLS_FILTER as FILTER_INDELS } from '../../modules/nf-core/modules/bcftools/filter/main' +include { BCFTOOLS_MERGE } from '../../modules/nf-core/modules/bcftools/merge/main' +include { BCFTOOLS_CONVERT } from '../../modules/nf-core/modules/bcftools/convert/main' +include { BCFTOOLS_VIEW } from '../../modules/nf-core/modules/bcftools/view/main' + +workflow POST_PROCESS { + take: + gvcfs // channel: [mandatory] [ meta, gvcf ] => The fresh GVCFs called with HaplotypeCaller + peds // channel: [mandatory] [ meta, peds ] => The pedigree files for the samples + fasta // channel: [mandatory] [ fasta ] => fasta reference + fasta_fai // channel: [mandatory] [ fasta_fai ] => fasta reference index + dict // channel: [mandatory] [ dict ] => sequence dictionary + output_mode // value: [mandatory] whether or not to make the output seqplorer- or seqr-compatible + skip_genotyping // boolean: [mandatory] whether or not to skip the genotyping + use_bcftools_merge // boolean: [mandatory] whether or not to use bcftools merge instead of CombineGVCFs + + main: + + post_processed_vcfs = Channel.empty() + ch_versions = Channel.empty() + + // + // Create indexes for all the GVCF files + // + + TABIX_GVCFS( + gvcfs + ) + + indexed_gvcfs = gvcfs + .combine(TABIX_GVCFS.out.tbi, by: 0) + .map({ meta, gvcf, tbi -> + [ meta, gvcf, tbi, []] + }) + + ch_versions = ch_versions.mix(TABIX_GVCFS.out.versions) + + // + // Reblock the single sample GVCF files + // + + REBLOCKGVCF( + indexed_gvcfs, + fasta, + fasta_fai, + dict, + [], + [] + ) + + ch_versions = ch_versions.mix(REBLOCKGVCF.out.versions) + + combine_gvcfs_input = REBLOCKGVCF.out.vcf + .map({ meta, gvcf, tbi -> + def new_meta = [:] + new_meta.id = meta.family + new_meta.family = meta.family + + [ new_meta, gvcf, tbi ] + }) + .groupTuple() + + // + // Merge/Combine all the GVCFs from each family + // + + if (use_bcftools_merge){ + + BCFTOOLS_MERGE( + combine_gvcfs_input, + [], + fasta, + fasta_fai + ) + + combined_gvcfs = BCFTOOLS_MERGE.out.merged_variants + ch_versions = ch_versions.mix(BCFTOOLS_MERGE.out.versions) + + } else { + + COMBINEGVCFS( + combine_gvcfs_input, + fasta, + fasta_fai, + dict + ) + + combined_gvcfs = COMBINEGVCFS.out.combined_gvcf + ch_versions = ch_versions.mix(COMBINEGVCFS.out.versions) + + } + + // + // Create indexes for the combined GVCFs + // + + TABIX_COMBINED_GVCFS( + combined_gvcfs + ) + + ch_versions = ch_versions.mix(TABIX_COMBINED_GVCFS.out.versions) + + indexed_combined_gvcfs = combined_gvcfs + .combine(TABIX_COMBINED_GVCFS.out.tbi, by:0) + + if (!skip_genotyping){ + + // + // Genotype the combined GVCFs + // + + genotype_gvcfs_input = indexed_combined_gvcfs + .map({ meta, gvcf, tbi -> + [ meta, gvcf, tbi, [], [] ] + }) + + GENOTYPE_GVCFS( + genotype_gvcfs_input, + fasta, + fasta_fai, + dict, + [], + [] + ) + + ch_versions = ch_versions.mix(GENOTYPE_GVCFS.out.versions) + + BGZIP_GENOTYPED_VCFS( + GENOTYPE_GVCFS.out.vcf + ) + + converted_vcfs = BGZIP_GENOTYPED_VCFS.out.output + ch_versions = ch_versions.mix(BGZIP_GENOTYPED_VCFS.out.versions) + + } else { + + // + // Remove the ref blocks from the GVCF + // + + BCFTOOLS_VIEW( + indexed_combined_gvcfs, + [], + [], + [] + ) + + // + // Convert all the GVCFs to VCF files + // + + BCFTOOLS_CONVERT( + BCFTOOLS_VIEW.out.vcf.map({ meta, vcf -> [ meta, vcf, []]}), + [], + fasta + ) + + converted_vcfs = BCFTOOLS_CONVERT.out.vcf + ch_versions = ch_versions.mix(BCFTOOLS_CONVERT.out.versions) 
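+ + // Note: the actual dropping of the reference blocks is assumed to be driven by the module options (ext.args) defined in conf/modules.config rather than by the positional arguments above; a minimal, hypothetical sketch of such an entry (--trim-alt-alleles is a real bcftools view flag, but the exact options used by this pipeline are an assumption): + // withName: BCFTOOLS_VIEW { + // ext.args = '--trim-alt-alleles' + // }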
+ } + + // + // Add pedigree information + // + + PEDFILTER( + peds + ) + + ch_versions = ch_versions.mix(PEDFILTER.out.versions) + + merge_vcf_headers_input = converted_vcfs + .combine(PEDFILTER.out.vcf, by:0) + + MERGE_VCF_HEADERS( + merge_vcf_headers_input + ) + + BGZIP_TABIX_PED_VCFS( + MERGE_VCF_HEADERS.out.vcf + ) + + ch_versions = ch_versions.mix(MERGE_VCF_HEADERS.out.versions) + ch_versions = ch_versions.mix(BGZIP_TABIX_PED_VCFS.out.versions) + + vcfs_without_index = BGZIP_TABIX_PED_VCFS.out.gz_tbi.map({ meta, vcf, tbi -> [ meta, vcf ]}) + + // + // Filter the variants + // + + if (output_mode == "seqplorer") { + FILTER_SNPS( + vcfs_without_index + ) + + FILTER_INDELS( + FILTER_SNPS.out.vcf + ) + + ch_versions = ch_versions.mix(FILTER_SNPS.out.versions) + ch_versions = ch_versions.mix(FILTER_INDELS.out.versions) + + post_processed_vcfs = FILTER_INDELS.out.vcf + } + else { + post_processed_vcfs = vcfs_without_index + } + + emit: + post_processed_vcfs + versions = ch_versions +} diff --git a/subworkflows/local/vcf_qc.nf b/subworkflows/local/vcf_qc.nf new file mode 100644 index 00000000..6a52e3a7 --- /dev/null +++ b/subworkflows/local/vcf_qc.nf @@ -0,0 +1,58 @@ +// +// VCF_QC +// + +include { BCFTOOLS_STATS } from '../../modules/nf-core/modules/bcftools/stats/main' +include { VCFTOOLS as VCFTOOLS_SUMMARY } from '../../modules/nf-core/modules/vcftools/main' +include { VCFTOOLS as VCFTOOLS_TSTV_COUNT } from '../../modules/nf-core/modules/vcftools/main' +include { VCFTOOLS as VCFTOOLS_TSTV_QUAL } from '../../modules/nf-core/modules/vcftools/main' + +workflow VCF_QC { + take: + vcfs // channel: [mandatory] [ meta, vcfs ] => The post-processed VCFs + + main: + + ch_versions = Channel.empty() + + // + // Perform all quality control steps + // + + BCFTOOLS_STATS( + vcfs.map({ meta, vcf -> [ meta, vcf, [] ]}), + [], + [], + [] + ) + + VCFTOOLS_TSTV_COUNT( + vcfs, + [], + [] + ) + + VCFTOOLS_TSTV_QUAL( + vcfs, + [], + [] + ) + + VCFTOOLS_SUMMARY( + vcfs, + [], + [] + ) + + ch_versions = ch_versions.mix(BCFTOOLS_STATS.out.versions) + ch_versions = ch_versions.mix(VCFTOOLS_TSTV_COUNT.out.versions) + ch_versions = ch_versions.mix(VCFTOOLS_TSTV_QUAL.out.versions) + ch_versions = ch_versions.mix(VCFTOOLS_SUMMARY.out.versions) + + emit: + bcftools_stats = BCFTOOLS_STATS.out.stats + vcftools_tstv_count = VCFTOOLS_TSTV_COUNT.out.tstv_count + vcftools_tstv_qual = VCFTOOLS_TSTV_QUAL.out.tstv_qual + vcftools_filter_summary = VCFTOOLS_SUMMARY.out.filter_summary + versions = ch_versions +} \ No newline at end of file diff --git a/tests/default.test b/tests/default.test new file mode 100644 index 00000000..363ab0c2 --- /dev/null +++ b/tests/default.test @@ -0,0 +1,23 @@ +nextflow_pipeline { + + name "Tests of the pipeline with all optional parameters on default" + script "main.nf" + + test("Success") { + + then { + assert workflow.success + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.count").md5 == "c4c5d1d04ad9090639c3524736a4b60e" + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.bcftools_stats.txt").md5 == "966a901419d1714d75340243e785333f" + assert file("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.qual").exists() + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.FILTER.summary").md5 == "1427238776c9965b01c144b3140a1d7a" + assert 
path("${outputDir}/families/Proband_12345/Proband_12345.vcf.gz").linesGzip.contains("chr21\t6448991\t.\tG\tC\t31.64\t.\tAC=1;AF=0.500;AN=2;AS_QD=2.91;BaseQRankSum=2.74;DP=28;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=19.21;MQRankSum=-9.800e-01;QD=2.88;ReadPosRankSum=0.992;SOR=0.180\tGT:AD:DP:GQ:PL\t./.:17,0:17:40\t0/1:8,3:11:21:39,0,21") + assert file("${outputDir}/families/Proband_12345/Proband_12345.vcf.gz.tbi").exists() + assert file("${outputDir}/multiqc_reports/multiqc_report.html").exists() + assert path("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118038\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz.tbi").exists() + assert path("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118023\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz.tbi").exists() + } + } +} diff --git a/tests/fails.test b/tests/fails.test new file mode 100644 index 00000000..783d80fe --- /dev/null +++ b/tests/fails.test @@ -0,0 +1,102 @@ +nextflow_pipeline { + + name "Tests of the pipeline with failing options" + script "main.nf" + + test("Missing required parameter") { + + when { + params { + fasta = null + } + } + + then { + assert workflow.failed + assert workflow.stdout.contains("* Missing required parameter: --fasta") + } + + } + + test("Seqplorer mode VCFanno - No required inputs") { + + when { + params { + output_mode = "seqplorer" + vcfanno = true + } + } + + then { + assert workflow.failed + assert workflow.stdout.contains("A TOML file and resource directory should be supplied when using vcfanno (use --vcfanno_toml and --vcfanno_resources)") + } + + } + + test("Seqplorer mode DBNSFP - No required inputs") { + + when { + params { + output_mode = "seqplorer" + vep_dbnsfp = true + } + } + + then { + assert workflow.failed + assert workflow.stdout.contains("Please specify '--vep_dbsnf true', '--dbnsfp PATH/TO/DBNSFP/FILE' and '--dbnspf_tbi PATH/TO/DBNSFP/INDEX/FILE' to use the dbnsfp VEP plugin.") + } + + } + + test("Seqplorer mode SpliceAI - No required inputs") { + + when { + params { + output_mode = "seqplorer" + vep_spliceai = true + } + } + + then { + assert workflow.failed + assert workflow.stdout.contains("Please specify '--vep_spliceai true', '--spliceai_snv PATH/TO/SPLICEAI/SNV/FILE', '--spliceai_snv_tbi PATH/TO/SPLICEAI/SNV/INDEX/FILE', '--spliceai_indel PATH/TO/SPLICEAI/INDEL/FILE' and '--spliceai_indel_tbi PATH/TO/SPLICEAI/INDEL/INDEX/FILE' to use the SpliceAI VEP plugin.") + } + + } + + + test("Seqplorer mode MasterMind - No required inputs") { + + when { + params { + output_mode = "seqplorer" + vep_mastermind = true + } + } + + then { + assert workflow.failed + assert workflow.stdout.contains("Please specify '--vep_mastermind true', '--mastermind PATH/TO/MASTERMIND/FILE' and '--mastermind_tbi PATH/TO/MASTERMIND/INDEX/FILE' to use the mastermind VEP plugin.") + } + + } + + + test("Seqplorer mode EOG - No required inputs") { + + when { + params { + output_mode = "seqplorer" + vep_eog = true + } + } + + then { + assert workflow.failed + assert workflow.stdout.contains("Please specify '--vep_eog true', '--eog PATH/TO/EOG/FILE' and '--eog_tbi PATH/TO/EOG/INDEX/FILE' to use the EOG custom VEP plugin.") + } + + } +} diff --git a/tests/generate_nf_test_assertions.py 
b/tests/generate_nf_test_assertions.py new file mode 100644 index 00000000..1efe29ea --- /dev/null +++ b/tests/generate_nf_test_assertions.py @@ -0,0 +1,63 @@ +import argparse +import glob +import gzip +import hashlib +import os +import re + +if __name__ == "__main__": + # Setting up argparser + parser = argparse.ArgumentParser(description="A script to create file assertions for nf-test") + parser.add_argument( + "test_dir", + metavar="TEST_DIRECTORY", + type=str, + help="The folder containing the test outputs (usually called `.nf-test/tests/<hash>/outputs`)", + ) + + args = parser.parse_args() + + test_dir = args.test_dir + all_outputs = glob.glob(f"{test_dir}/**", recursive=True) + + tab = "\\t" + + path_length = len(test_dir) + + print("assert workflow.success") + + for output in all_outputs: + abs_path = os.path.abspath(output) + if ( + re.search("^.*/multiqc_data/", output) + or re.search("^.*/multiqc_plots/", output) + or re.search("^.*/pipeline_info/", output) + ): + continue + if os.path.isfile(abs_path): + file_name = output[path_length:] + if ( + re.search(r"^.*\.tbi$", output) + or re.search(r"^.*\.db$", output) + or re.search("^.*multiqc_report.html$", output) + ): + print(f'assert file("${{outputDir}}/{file_name}").exists()') + elif re.search(r"^.*\.vcf.gz$", output): + with gzip.open(abs_path, "rt") as vcf: + for line in vcf: + if re.search("^chr.*$", line): + print( + f'assert path("${{outputDir}}/{file_name}").linesGzip.contains("{tab.join(line.split()).strip()}")' + ) + break + elif re.search(r"^.*\.vcf$", output): + with open(abs_path, "r") as vcf: + for line in vcf: + if re.search("^chr.*$", line): + print( + f'assert path("${{outputDir}}/{file_name}").text.contains("{tab.join(line.split()).strip()}")' + ) + break + else: + md5sum = hashlib.md5(open(abs_path, "rb").read()).hexdigest() + print(f'assert path("${{outputDir}}/{file_name}").md5 == "{md5sum}"') diff --git a/tests/inputs/samplesheet.csv b/tests/inputs/samplesheet.csv new file mode 100644 index 00000000..b151a8fc --- /dev/null +++ b/tests/inputs/samplesheet.csv @@ -0,0 +1,3 @@ +sample,family,cram,crai,bed,ped +NA12878K12_NVQ_034,Proband_12345,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/illumina/cram/test.paired_end.markduplicates.sorted.cram,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/illumina/cram/test.paired_end.markduplicates.sorted.cram.crai,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/chr21/sequence/multi_intervals.bed,tests/inputs/test.ped +NA24385D2_NVQ_034,,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/illumina/cram/test2.paired_end.markduplicates.sorted.cram,,https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/chr21/sequence/multi_intervals.bed,tests/inputs/test.ped diff --git a/tests/inputs/test.ped b/tests/inputs/test.ped new file mode 100644 index 00000000..87d40d5b --- /dev/null +++ b/tests/inputs/test.ped @@ -0,0 +1,3 @@ +#fam-id ind-id pat-id mat-id sex phen +Proband_12345 normal tumour 0 2 0 +Proband_12345 tumour 0 0 1 0 \ No newline at end of file diff --git a/tests/nextflow.config b/tests/nextflow.config new file mode 100644 index 00000000..12b14a38 --- /dev/null +++ b/tests/nextflow.config @@ -0,0 +1,21 @@ +/* +======================================================================================== + Nextflow config file for running tests
+======================================================================================== +*/ + +params { + + // Limit resources so that this can run on GitHub Actions + max_cpus = 2 + max_memory = '6.GB' + max_time = '6.h' + + // Input data + input = "${params.baseDir}/tests/inputs/samplesheet.csv" + outdir = "${params.outputDir}" + + // Genome references + fasta = "https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/chr21/sequence/genome.fasta" + +} diff --git a/tests/seqplorer_min.test b/tests/seqplorer_min.test new file mode 100644 index 00000000..a0152cc4 --- /dev/null +++ b/tests/seqplorer_min.test @@ -0,0 +1,30 @@ +nextflow_pipeline { + + name "The bare minimum seqplorer mode tests" + script "main.nf" + + test("Success") { + + when { + params { + output_mode = "seqplorer" + } + } + + then { + assert workflow.success + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.count").md5 == "c4c5d1d04ad9090639c3524736a4b60e" + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.bcftools_stats.txt").md5 == "c4ebf9494e991f6c40b1d428a3c23a7e" + assert file("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.qual").exists() + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.FILTER.summary").md5 == "b09c0ae633ce5b66cb33e0a3ad046954" + assert path("${outputDir}/families/Proband_12345/Proband_12345.ann.vcf.gz").linesGzip.contains("chr21\t6448991\t.\tG\tC\t31.64\tGATKCutoffSNP\tAC=1;AF=0.5;AN=2;AS_QD=2.91;BaseQRankSum=2.74;DP=28;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=19.21;MQRankSum=-0.98;QD=2.88;ReadPosRankSum=0.992;SOR=0.18;CSQ=C|intron_variant|MODIFIER|CBSL|ENSG00000274276|Transcript|ENST00000398168|protein_coding||15/16|ENST00000398168.6:c.1468-426C>G||||||||1||-1||SNV|HGNC|HGNC:51829|YES|NM_001354009.3||1|P4|CCDS82646.1|ENSP00000381234|P0DN79.36||UPI0000036BC5||||||||chr21:g.6448991G>C||||||||||||||||||||||||||||||\tGT:AD:DP:GQ:PL\t./.:17,0:17:40:.\t0/1:8,3:11:21:39,0,21") + assert file("${outputDir}/families/Proband_12345/Proband_12345.db").exists() + assert path("${outputDir}/families/Proband_12345/Proband_12345_filtered_snps_indels.vcf.gz").linesGzip.contains("chr21\t6448991\t.\tG\tC\t31.64\tGATKCutoffSNP\tAC=1;AF=0.5;AN=2;AS_QD=2.91;BaseQRankSum=2.74;DP=28;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=19.21;MQRankSum=-0.98;QD=2.88;ReadPosRankSum=0.992;SOR=0.18\tGT:AD:DP:GQ:PL\t./.:17,0:17:40:.\t0/1:8,3:11:21:39,0,21") + assert path("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t<NON_REF>\t.\t.\tEND=6118038\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz.tbi").exists() + assert path("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t<NON_REF>\t.\t.\tEND=6118023\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz.tbi").exists() + assert file("${outputDir}/multiqc_reports/multiqc_report.html").exists() + } + } +} diff --git a/tests/seqplorer_vcfanno.test b/tests/seqplorer_vcfanno.test new file mode 100644 index 00000000..af6635c0 --- /dev/null +++ b/tests/seqplorer_vcfanno.test @@ -0,0 +1,35 @@ +nextflow_pipeline { + + name "The seqplorer mode tests with VCFanno" + script "main.nf" + + test("Success") { + + when { + params { + output_mode = "seqplorer" + + vcfanno = true + vcfanno_toml =
"https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/vcf/vcfanno/vcfanno.toml" + vcfanno_resources = "https://github.com/nf-core/test-datasets/raw/modules/data/genomics/homo_sapiens/genome/vcf/vcfanno/vcfanno_grch38_module_test.tar.gz" + + } + } + + then { + assert workflow.success + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.count").md5 == "c4c5d1d04ad9090639c3524736a4b60e" + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.bcftools_stats.txt").md5 == "c4ebf9494e991f6c40b1d428a3c23a7e" + assert file("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.qual").exists() + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.FILTER.summary").md5 == "b09c0ae633ce5b66cb33e0a3ad046954" + assert path("${outputDir}/families/Proband_12345/Proband_12345.ann.vcf.gz").linesGzip.contains("chr21\t6448991\t.\tG\tC\t31.6\tGATKCutoffSNP\tAC=1;AF=0.5;AN=2;AS_QD=2.91;BaseQRankSum=2.74;DP=28;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=19.21;MQRankSum=-0.98;QD=2.88;ReadPosRankSum=0.992;SOR=0.18;CSQ=C|intron_variant|MODIFIER|CBSL|ENSG00000274276|Transcript|ENST00000398168|protein_coding||15/16|ENST00000398168.6:c.1468-426C>G||||||||1||-1||SNV|HGNC|HGNC:51829|YES|NM_001354009.3||1|P4|CCDS82646.1|ENSP00000381234|P0DN79.36||UPI0000036BC5||||||||chr21:g.6448991G>C||||||||||||||||||||||||||||||\tGT:AD:DP:GQ:PL\t./.:17,0:17:40:.\t0/1:8,3:11:21:39,0,21") + assert file("${outputDir}/families/Proband_12345/Proband_12345.db").exists() + assert path("${outputDir}/families/Proband_12345/Proband_12345_filtered_snps_indels.vcf.gz").linesGzip.contains("chr21\t6448991\t.\tG\tC\t31.64\tGATKCutoffSNP\tAC=1;AF=0.5;AN=2;AS_QD=2.91;BaseQRankSum=2.74;DP=28;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=19.21;MQRankSum=-0.98;QD=2.88;ReadPosRankSum=0.992;SOR=0.18\tGT:AD:DP:GQ:PL\t./.:17,0:17:40:.\t0/1:8,3:11:21:39,0,21") + assert path("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118038\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz.tbi").exists() + assert path("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118023\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz.tbi").exists() + assert file("${outputDir}/multiqc_reports/multiqc_report.html").exists() + } + } +} diff --git a/tests/seqr_full.test b/tests/seqr_full.test new file mode 100644 index 00000000..8d650892 --- /dev/null +++ b/tests/seqr_full.test @@ -0,0 +1,31 @@ +nextflow_pipeline { + + name "The full Seqr mode test" + script "main.nf" + + test("Success") { + + when { + params { + output_mode = "seqr" + use_dragstr_model = true + always_use_cram = true + } + } + + then { + assert workflow.success + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.count").md5 == "dd579177c21fb1802e98e801e0525291" + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.bcftools_stats.txt").md5 == "87e6d7e19caf6f2a095f5f34ea175852" + assert file("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.qual").exists() + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.FILTER.summary").md5 == "2aa3fe1aaddec8c8687d6590bc58ba81" + assert 
path("${outputDir}/families/Proband_12345/Proband_12345.vcf.gz").linesGzip.contains("chr21\t6448991\t.\tG\tC\t31.64\t.\tAC=1;AF=0.500;AN=2;AS_QD=2.91;BaseQRankSum=2.74;DP=28;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=19.21;MQRankSum=-9.800e-01;QD=2.88;ReadPosRankSum=0.992;SOR=0.180\tGT:AD:DP:GQ:PL\t./.:17,0:17:40\t0/1:8,3:11:21:39,0,21") + assert file("${outputDir}/families/Proband_12345/Proband_12345.vcf.gz.tbi").exists() + assert path("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118038\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz.tbi").exists() + assert path("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118023\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz.tbi").exists() + assert file("${outputDir}/multiqc_reports/multiqc_report.html").exists() + } + } +} diff --git a/tests/seqr_no_genotyping.test b/tests/seqr_no_genotyping.test new file mode 100644 index 00000000..b0e557dc --- /dev/null +++ b/tests/seqr_no_genotyping.test @@ -0,0 +1,31 @@ +nextflow_pipeline { + + name "The Seqr mode test without genotyping and using bcftools merge" + script "main.nf" + + test("Success") { + + when { + params { + output_mode = "seqr" + skip_genotyping = true + use_bcftools_merge = true + } + } + + then { + assert workflow.success + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.count").md5 == "8dcfdbcaac118df1d5ad407dd2af699f" + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.bcftools_stats.txt").md5 == "8f91f678232df7db35f8dd12e0f4abee" + assert file("${outputDir}/families/Proband_12345/reports/Proband_12345.TsTv.qual").exists() + assert path("${outputDir}/families/Proband_12345/reports/Proband_12345.FILTER.summary").md5 == "2a92e8a11960b52331968b8f95e7bf1c" + assert path("${outputDir}/families/Proband_12345/Proband_12345.vcf.gz").linesGzip.contains("chr21\t6448991\t.\tG\tC,\t4.23\t.\tAS_QUALapprox=|39|0;AS_RAW_BaseQRankSum=|2.7,1|NaN;AS_RAW_MQ=3168.00|891.00|0.00;AS_RAW_MQRankSum=|-1.0,1|NaN;AS_RAW_ReadPosRankSum=|0.9,1|NaN;AS_SB_TABLE=0,8|0,3|0,0;AS_VarDP=8|3|0;BaseQRankSum=2.744;MQRankSum=-0.98;QUALapprox=39;RAW_GT_COUNT=0,1,0;RAW_MQandDP=4059,11;ReadPosRankSum=0.992;VarDP=11;DP=11;AN=2;AC=1,0\tGT:AD:DP:GP:GQ:PG:PL:SB\t./.:.:.:.:.:.:.:.\t0/1:8,3,0:11:4.23,0,51,54.23,86,105.23:4:0,34.77,64.77,30,64.77,60:39,0,21,59,56,80:0,8,0,3") + assert file("${outputDir}/families/Proband_12345/Proband_12345.vcf.gz.tbi").exists() + assert path("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118038\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA24385D2_NVQ_034/NA24385D2_NVQ_034.g.vcf.gz.tbi").exists() + assert path("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz").linesGzip.contains("chr21\t2\t.\tN\t\t.\t.\tEND=6118023\tGT:DP:GQ:MIN_DP:PL\t0/0:0:0:0:0,0,0") + assert file("${outputDir}/individuals/NA12878K12_NVQ_034/NA12878K12_NVQ_034.g.vcf.gz.tbi").exists() + assert file("${outputDir}/multiqc_reports/multiqc_report.html").exists() + } + } +} diff --git a/workflows/nf-cmgg-germline.nf b/workflows/nf-cmgg-germline.nf new file mode 100644 index 00000000..6fb0bd02 --- /dev/null +++ b/workflows/nf-cmgg-germline.nf @@ -0,0 
+1,549 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + VALIDATE INPUTS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +def summary_params = NfcoreSchema.paramsSummaryMap(workflow, params) + +// +// Validate input parameters +// + +WorkflowNfCmggGermline.initialise(params, log) + +// +// Check input path parameters to see if they exist +// + +def checkPathParamList = [ + params.fasta, + params.fasta_fai, + params.dict, + params.strtablefile, + params.vep_merged_cache, + params.vcfanno_toml, + params.vcfanno_resources +] +for (param in checkPathParamList) { if (param) { file(param, checkIfExists: true) } } + +// +// Check the input samplesheet +// + +if (params.input) { ch_input = file(params.input, checkIfExists: true) } else { exit 1, 'Input samplesheet not specified!' } + +// +// Check for dependencies between parameters +// + +if (params.output_mode == "seqplorer") { + // Check if a genome is given + if (!params.genome) { exit 1, "A genome should be supplied for seqplorer mode (use --genome)"} + + // Check if the VEP versions were given + if (!params.vep_version) { exit 1, "A VEP version should be supplied for seqplorer mode (use --vep_version)"} + if (!params.vep_cache_version) { exit 1, "A VEP cache version should be supplied for seqplorer mode (use --vep_cache_version)"} + + // Check if a species is entered + if (!params.species) { exit 1, "A species should be supplied for seqplorer mode (use --species)"} + + // Check if all vcfanno files are supplied when vcfanno should be used + if (params.vcfanno && (!params.vcfanno_toml || !params.vcfanno_resources)) { + exit 1, "A TOML file and resource directory should be supplied when using vcfanno (use --vcfanno_toml and --vcfanno_resources)" + } +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT THE INPUT PARAMETERS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// Importing the file pipeline parameters +// + +// Input files +fasta = params.fasta ? Channel.fromPath(params.fasta).collect() : Channel.empty() +fasta_fai = params.fasta_fai ? Channel.fromPath(params.fasta_fai).collect() : null +dict = params.dict ? Channel.fromPath(params.dict).collect() : null +strtablefile = params.strtablefile ? Channel.fromPath(params.strtablefile).collect() : null + +// Input values +output_mode = params.output_mode ?: Channel.empty() +scatter_count = params.scatter_count ?: Channel.empty() + +// Booleans +always_use_cram = params.always_use_cram +use_dragstr_model = params.use_dragstr_model +skip_genotyping = params.skip_genotyping +use_bcftools_merge = params.use_bcftools_merge + +// +// Importing the value pipeline parameters +// + +genome = params.genome ?: Channel.empty() + +// +// Importing the annotation parameters +// + +vep_cache_version = params.vep_cache_version ?: Channel.empty() +species = params.species ?: Channel.empty() + +vep_merged_cache = params.vep_merged_cache ? Channel.fromPath(params.vep_merged_cache).collect() : [] + +vcfanno = params.vcfanno + +vcfanno_toml = params.vcfanno_toml ? Channel.fromPath(params.vcfanno_toml).collect() : Channel.empty() +vcfanno_res_inp = params.vcfanno_resources ? 
Channel.fromPath(params.vcfanno_resources).collect() : Channel.empty() + +// +// Check for the presence of EnsemblVEP plugins that use extra files +// + +vep_extra_files = [] + +// Check if all dbnsfp files are given +if (params.dbnsfp && params.dbnsfp_tbi && params.vep_dbnsfp) { + vep_extra_files.add(file(params.dbnsfp, checkIfExists: true)) + vep_extra_files.add(file(params.dbnsfp_tbi, checkIfExists: true)) +} +else if (params.dbnsfp || params.dbnsfp_tbi || params.vep_dbnsfp) { + exit 1, "Please specify '--vep_dbnsfp true', '--dbnsfp PATH/TO/DBNSFP/FILE' and '--dbnsfp_tbi PATH/TO/DBNSFP/INDEX/FILE' to use the dbnsfp VEP plugin." +} + +// Check if all spliceai files are given +if (params.spliceai_snv && params.spliceai_snv_tbi && params.spliceai_indel && params.spliceai_indel_tbi && params.vep_spliceai) { + vep_extra_files.add(file(params.spliceai_snv, checkIfExists: true)) + vep_extra_files.add(file(params.spliceai_snv_tbi, checkIfExists: true)) + vep_extra_files.add(file(params.spliceai_indel, checkIfExists: true)) + vep_extra_files.add(file(params.spliceai_indel_tbi, checkIfExists: true)) +} +else if (params.spliceai_snv || params.spliceai_snv_tbi || params.spliceai_indel || params.spliceai_indel_tbi || params.vep_spliceai) { + exit 1, "Please specify '--vep_spliceai true', '--spliceai_snv PATH/TO/SPLICEAI/SNV/FILE', '--spliceai_snv_tbi PATH/TO/SPLICEAI/SNV/INDEX/FILE', '--spliceai_indel PATH/TO/SPLICEAI/INDEL/FILE' and '--spliceai_indel_tbi PATH/TO/SPLICEAI/INDEL/INDEX/FILE' to use the SpliceAI VEP plugin." +} + +// Check if all mastermind files are given +if (params.mastermind && params.mastermind_tbi && params.vep_mastermind) { + vep_extra_files.add(file(params.mastermind, checkIfExists: true)) + vep_extra_files.add(file(params.mastermind_tbi, checkIfExists: true)) +} +else if (params.mastermind || params.mastermind_tbi || params.vep_mastermind) { + exit 1, "Please specify '--vep_mastermind true', '--mastermind PATH/TO/MASTERMIND/FILE' and '--mastermind_tbi PATH/TO/MASTERMIND/INDEX/FILE' to use the mastermind VEP plugin." +} + +// Check if all EOG files are given +if (params.eog && params.eog_tbi && params.vep_eog) { + vep_extra_files.add(file(params.eog, checkIfExists: true)) + vep_extra_files.add(file(params.eog_tbi, checkIfExists: true)) +} +else if (params.eog || params.eog_tbi || params.vep_eog) { + exit 1, "Please specify '--vep_eog true', '--eog PATH/TO/EOG/FILE' and '--eog_tbi PATH/TO/EOG/INDEX/FILE' to use the EOG custom VEP plugin." +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CONFIG FILES +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +ch_multiqc_config = params.multiqc_config ? file(params.multiqc_config, checkIfExists: true) : file("$projectDir/assets/multiqc_config.yml", checkIfExists: true) +multiqc_logo = params.multiqc_logo ?
file(params.multiqc_logo, checkIfExists: true) : file("$projectDir/assets/CMGG_logo.png", checkIfExists: true) + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT LOCAL MODULES/SUBWORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// SUBWORKFLOW: Consisting of a mix of local and nf-core/modules +// + +include { GERMLINE_VARIANT_CALLING } from '../subworkflows/local/germline_variant_calling' +include { POST_PROCESS } from '../subworkflows/local/postprocess' +include { VCF_QC } from '../subworkflows/local/vcf_qc' +include { ANNOTATION } from '../subworkflows/local/annotation' + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT NF-CORE MODULES/SUBWORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// MODULE: Installed directly from nf-core/modules +// + +include { SAMTOOLS_FAIDX as FAIDX } from '../modules/nf-core/modules/samtools/faidx/main' +include { GATK4_CREATESEQUENCEDICTIONARY as CREATESEQUENCEDICTIONARY } from '../modules/nf-core/modules/gatk4/createsequencedictionary/main' +include { GATK4_COMPOSESTRTABLEFILE as COMPOSESTRTABLEFILE } from '../modules/nf-core/modules/gatk4/composestrtablefile/main' +include { UNTAR } from '../modules/nf-core/modules/untar/main' +include { VCF2DB } from '../modules/nf-core/modules/vcf2db/main' +include { CUSTOM_DUMPSOFTWAREVERSIONS } from '../modules/nf-core/modules/custom/dumpsoftwareversions/main' +include { MULTIQC } from '../modules/nf-core/modules/multiqc/main' + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + RUN MAIN WORKFLOW +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// Info required for completion email and summary +def multiqc_report = [] + +// The main workflow +workflow NF_CMGG_GERMLINE { + + ch_versions = Channel.empty() + ch_reports = Channel.empty() + + // + // Create the optional input files if they are not supplied + // + + if (!fasta_fai) { + fasta_fai = FAIDX(fasta.map({ fasta -> [ [id:"fasta_fai"], fasta ]})).fai.map({ meta, fasta -> [ fasta ]}) + ch_versions = ch_versions.mix(FAIDX.out.versions) + } + + if (!dict) { + dict = CREATESEQUENCEDICTIONARY(fasta).dict + ch_versions = ch_versions.mix(CREATESEQUENCEDICTIONARY.out.versions) + } + + if (use_dragstr_model && !strtablefile) { + strtablefile = COMPOSESTRTABLEFILE(fasta,fasta_fai,dict).str_table + ch_versions = ch_versions.mix(COMPOSESTRTABLEFILE.out.versions) + } + + if (output_mode == "seqplorer" && vcfanno) { + vcfanno_resources = params.vcfanno_resources.endsWith(".tar.gz") ? + UNTAR(vcfanno_res_inp.map({dir -> [ [], dir ]})).untar.map({meta, dir -> dir}) : + vcfanno_res_inp + ch_versions = params.vcfanno_resources.endsWith(".tar.gz") ? 
+ ch_versions.mix(UNTAR.out.versions) : + ch_versions + } else { + vcfanno_resources = [] + } + + // + // Read in samplesheet, validate and stage input files + // + + inputs = parse_input(ch_input) + .multiMap({meta, cram, crai, bed, ped -> + ped_family_id = get_family_id_from_ped(ped) + + new_meta_ped = [:] + new_meta = meta.clone() + + new_meta_ped.id = meta.family ?: ped_family_id + new_meta_ped.family = meta.family ?: ped_family_id + new_meta.family = meta.family ?: ped_family_id + + beds: [new_meta, bed] + germline_variant_calling_input_cram: [new_meta, cram, crai] + peds: [new_meta_ped, ped] + }) + + peds = inputs.peds.distinct() + + // + // Perform the variant calling + // + + GERMLINE_VARIANT_CALLING( + inputs.germline_variant_calling_input_cram, + inputs.beds, + fasta, + fasta_fai, + dict, + strtablefile, + scatter_count, + use_dragstr_model, + always_use_cram + ) + + ch_versions = ch_versions.mix(GERMLINE_VARIANT_CALLING.out.versions) + + // + // Joint-genotyping of the families + // + + POST_PROCESS( + GERMLINE_VARIANT_CALLING.out.gvcfs, + peds, + fasta, + fasta_fai, + dict, + output_mode, + skip_genotyping, + use_bcftools_merge + ) + + ch_versions = ch_versions.mix(POST_PROCESS.out.versions) + + // + // Quality control of the called variants + // + + VCF_QC( + POST_PROCESS.out.post_processed_vcfs + ) + + ch_versions = ch_versions.mix(VCF_QC.out.versions) + ch_reports = ch_reports.mix(VCF_QC.out.bcftools_stats.collect{it[1]}.ifEmpty([])) + ch_reports = ch_reports.mix(VCF_QC.out.vcftools_tstv_count.collect{it[1]}.ifEmpty([])) + ch_reports = ch_reports.mix(VCF_QC.out.vcftools_tstv_qual.collect{it[1]}.ifEmpty([])) + ch_reports = ch_reports.mix(VCF_QC.out.vcftools_filter_summary.collect{it[1]}.ifEmpty([])) + + // + // Annotation of the variants + // + + if (output_mode == "seqplorer") { + + // Perform the annotation + ANNOTATION( + POST_PROCESS.out.post_processed_vcfs, + fasta, + genome, + species, + vep_cache_version, + vep_merged_cache, + vep_extra_files, + vcfanno, + vcfanno_toml, + vcfanno_resources + ) + + ch_versions = ch_versions.mix(ANNOTATION.out.versions) + ch_reports = ch_reports.mix(ANNOTATION.out.reports) + } + + // + // Create Gemini-compatible database files + // + + if (output_mode == "seqplorer") { + vcf2db_input = ANNOTATION.out.annotated_vcfs + .combine(peds, by: 0) + + VCF2DB( + vcf2db_input + ) + } + + // + // Dump the software versions + // + + CUSTOM_DUMPSOFTWAREVERSIONS( + ch_versions.unique().collectFile(name: 'collated_versions.yml') + ) + + ch_versions_yaml = CUSTOM_DUMPSOFTWAREVERSIONS.out.mqc_yml.collect() + + // + // Perform multiQC on all QC data + // + + ch_multiqc_files = Channel.empty() + + ch_multiqc_files = ch_multiqc_files.mix( + ch_versions_yaml, + ch_reports.collect() + ) + + MULTIQC( + ch_multiqc_files.collect(), + ch_multiqc_config, + [], + multiqc_logo + ) + + multiqc_report = MULTIQC.out.report.toList() +} + + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + COMPLETION EMAIL AND SUMMARY +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +workflow.onComplete { + if (params.email || params.email_on_fail) { + NfcoreTemplate.email(workflow, params, summary_params, projectDir, log, multiqc_report) + } + if (params.hook_url) { + NfcoreTemplate.adaptivecard(workflow, params, summary_params, projectDir, log, multiqc_report) + } + NfcoreTemplate.summary(workflow, params, log) +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ FUNCTIONS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +def parse_input(input_csv) { + + // The samplesheet schema (change this to adjust the input check) + def samplesheet_schema = [ + 'columns': [ + 'sample': [ + 'content': 'meta', + 'meta_name': 'id,samplename', + 'pattern': '', + ], + 'family': [ + 'content': 'meta', + 'meta_name': 'family', + 'pattern': '' + ], + 'cram': [ + 'content': 'file', + 'pattern': '^.*\\.cram$', + ], + 'crai': [ + 'content': 'file', + 'pattern': '^.*\\.crai$', + ], + 'bed': [ + 'content': 'file', + 'pattern': '^.*\\.bed$', + ], + 'ped': [ + 'content': 'file', + 'pattern': '^.*\\.ped$', + ] + ], + 'required': ['sample','cram'], + ] + + // Don't change these variables + def row_count = 1 + def all_columns = samplesheet_schema.columns.keySet().collect() + def mandatory_columns = samplesheet_schema.required + + // Header checks + Channel.value(input_csv).splitCsv(strip:true).first().map({ row -> + + if(row != all_columns) { + def commons = all_columns.intersect(row) + def diffs = all_columns.plus(row) + diffs.removeAll(commons) + + if(diffs.size() > 0){ + def missing_columns = [] + def wrong_columns = [] + for(diff : diffs){ + diff in all_columns ? missing_columns.add(diff) : wrong_columns.add(diff) + } + if(missing_columns.size() > 0){ + exit 1, "[Samplesheet Error] The column(s) $missing_columns is/are not present. The header should look like: $all_columns" + } + else { + exit 1, "[Samplesheet Error] The column(s) $wrong_columns should not be in the header. The header should look like: $all_columns" + } + } + else { + exit 1, "[Samplesheet Error] The columns $row are not in the right order. The header should look like: $all_columns" + } + + } + }) + + // Field checks + returning the channels + Channel.value(input_csv).splitCsv(header:true, strip:true).map({ row -> + + row_count++ + + // Check the mandatory columns + def missing_mandatory_columns = [] + for(column : mandatory_columns) { + row[column] ?: missing_mandatory_columns.add(column) + } + if(missing_mandatory_columns.size > 0){ + exit 1, "[Samplesheet Error] The mandatory column(s) $missing_mandatory_columns is/are empty on line $row_count" + } + + def output = [] + def meta = [:] + for(col : samplesheet_schema.columns) { + key = col.key + content = row[key] + + if(!(content ==~ col.value['pattern']) && col.value['pattern'] != '' && content != '') { + exit 1, "[Samplesheet Error] The content of column '$key' on line $row_count does not match the pattern '${col.value['pattern']}'" + } + + if(col.value['content'] == 'file'){ + output.add(content ? file(content, checkIfExists:true) : []) + } + else if(col.value['content'] == 'meta' && content != ''){ + for(meta_name : col.value['meta_name'].split(",")){ + meta[meta_name] = content.replace(' ', '_') + } + } + } + + output.add(0, meta) + return output + }) + +} + +def get_family_id_from_ped(ped_file){ + + // Read the PED file + def ped = file(ped_file).text + + // Perform a validity check on the PED file since vcf2db is picky and not capable of giving good error messages + comment_count = 0 + line_count = 0 + + for( line : ped.readLines()) { + line_count++ + if (line_count == 1 && line ==~ /^#.*\S$/) { // skip a valid optional header; a header with a trailing space is caught below + continue + } + else if (line_count > 1 && line ==~ /^#.*$/) { + exit 1, "[PED file error] A commented line was found on line ${line_count} in ${ped_file}, the only commented line allowed is an optional header on line 1."
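+ // Illustration (hypothetical lines, modelled on tests/inputs/test.ped): a line like "Proband_12345\tnormal\ttumour\t0\t2\t0" passes all checks below, while "Proband 12345\tnormal\ttumour\t0\t2\t0" trips the no-spaces check.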
+ } + else if (line_count == 1 && line ==~ /^#.* $/) { + exit 1, "[PED file error] The header in ${ped_file} contains a trailing space, please remove this." + } + else if (line ==~ /^.+#.*$/) { + exit 1, "[PED file error] A '#' has been found as a non-starting character on line ${line_count} in ${ped_file}, this is an illegal character and should be removed." + } + else if (line ==~ /^[^#].* .*$/) { + exit 1, "[PED file error] A space has been found on line ${line_count} in ${ped_file}, please only use tabs to separate the values (and change spaces in names to '_')." + } + else if ((line ==~ /^(\w+\t)+\w+$/) == false) { + exit 1, "[PED file error] An illegal character has been found on line ${line_count} in ${ped_file}, only a-z; A-Z; 0-9 and '_' are allowed as column values." + } + else if ((line ==~ /^(\w+\t){5}\w+$/) == false) { + exit 1, "[PED file error] ${ped_file} should contain exactly 6 tab-delimited columns (family_id individual_id paternal_id maternal_id sex phenotype). This is not the case on line ${line_count}." + } + } + if (ped =~ /\n$/) { + exit 1, "[PED file error] An empty new line has been detected at the end of ${ped_file}, please remove this line." + } + + // get family_id + return (ped =~ /\n([^#]\w+)/)[0][1] +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + THE END +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ diff --git a/workflows/tva.nf b/workflows/tva.nf deleted file mode 100644 index 1d90ec1d..00000000 --- a/workflows/tva.nf +++ /dev/null @@ -1,123 +0,0 @@ -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - VALIDATE INPUTS -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ - -def summary_params = NfcoreSchema.paramsSummaryMap(workflow, params) - -// Validate input parameters -WorkflowTva.initialise(params, log) - -// TODO nf-core: Add all file path parameters for the pipeline to the list below -// Check input path parameters to see if they exist -def checkPathParamList = [ params.input, params.multiqc_config, params.fasta ] -for (param in checkPathParamList) { if (param) { file(param, checkIfExists: true) } } - -// Check mandatory parameters -if (params.input) { ch_input = file(params.input) } else { exit 1, 'Input samplesheet not specified!' } - -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - CONFIG FILES -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ - -ch_multiqc_config = file("$projectDir/assets/multiqc_config.yml", checkIfExists: true) -ch_multiqc_custom_config = params.multiqc_config ?
Channel.fromPath(params.multiqc_config) : Channel.empty() - -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - IMPORT LOCAL MODULES/SUBWORKFLOWS -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ - -// -// SUBWORKFLOW: Consisting of a mix of local and nf-core/modules -// -include { INPUT_CHECK } from '../subworkflows/local/input_check' - -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - IMPORT NF-CORE MODULES/SUBWORKFLOWS -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ - -// -// MODULE: Installed directly from nf-core/modules -// -include { FASTQC } from '../modules/nf-core/modules/fastqc/main' -include { MULTIQC } from '../modules/nf-core/modules/multiqc/main' -include { CUSTOM_DUMPSOFTWAREVERSIONS } from '../modules/nf-core/modules/custom/dumpsoftwareversions/main' - -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - RUN MAIN WORKFLOW -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ - -// Info required for completion email and summary -def multiqc_report = [] - -workflow TVA { - - ch_versions = Channel.empty() - - // - // SUBWORKFLOW: Read in samplesheet, validate and stage input files - // - INPUT_CHECK ( - ch_input - ) - ch_versions = ch_versions.mix(INPUT_CHECK.out.versions) - - // - // MODULE: Run FastQC - // - FASTQC ( - INPUT_CHECK.out.reads - ) - ch_versions = ch_versions.mix(FASTQC.out.versions.first()) - - CUSTOM_DUMPSOFTWAREVERSIONS ( - ch_versions.unique().collectFile(name: 'collated_versions.yml') - ) - - // - // MODULE: MultiQC - // - workflow_summary = WorkflowTva.paramsSummaryMultiqc(workflow, summary_params) - ch_workflow_summary = Channel.value(workflow_summary) - - ch_multiqc_files = Channel.empty() - ch_multiqc_files = ch_multiqc_files.mix(Channel.from(ch_multiqc_config)) - ch_multiqc_files = ch_multiqc_files.mix(ch_multiqc_custom_config.collect().ifEmpty([])) - ch_multiqc_files = ch_multiqc_files.mix(ch_workflow_summary.collectFile(name: 'workflow_summary_mqc.yaml')) - ch_multiqc_files = ch_multiqc_files.mix(CUSTOM_DUMPSOFTWAREVERSIONS.out.mqc_yml.collect()) - ch_multiqc_files = ch_multiqc_files.mix(FASTQC.out.zip.collect{it[1]}.ifEmpty([])) - - MULTIQC ( - ch_multiqc_files.collect() - ) - multiqc_report = MULTIQC.out.report.toList() - ch_versions = ch_versions.mix(MULTIQC.out.versions) -} - -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - COMPLETION EMAIL AND SUMMARY -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ - -workflow.onComplete { - if (params.email || params.email_on_fail) { - NfcoreTemplate.email(workflow, params, summary_params, projectDir, log, multiqc_report) - } - NfcoreTemplate.summary(workflow, params, log) -} - -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - THE END -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/
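Note on the nf-test files added above: each spec can be run on its own against this pipeline. A minimal invocation sketch (assuming nf-test is installed; the profile name is an assumption, not something defined in this diff):

    nf-test test tests/default.test --profile docker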