Splitting up README.md into several documents, all found in the docs folder #152

Merged Jan 27, 2025 · 38 commits

Commits
ada627d
On the off chance that something happens to this instance, I should h…
harmonbhasin Jan 21, 2025
6cc739e
Updated these slightly.
harmonbhasin Jan 22, 2025
08bc40d
Updated batch docs
willbradshaw Jan 22, 2025
717ca3c
Edited installation docs
willbradshaw Jan 22, 2025
1e83724
Removed TOC from installation doc (not rendering properly, and not re…
willbradshaw Jan 22, 2025
4b8d686
Updated usage and installation docs
willbradshaw Jan 22, 2025
32e3ca9
Minor edit to usage.md
willbradshaw Jan 22, 2025
1cd0986
Updated docs/run.md
willbradshaw Jan 22, 2025
b3054bf
Removed files, will bring them back later
harmonbhasin Jan 23, 2025
49694b9
updated usage document.
harmonbhasin Jan 23, 2025
2a97959
Updated reference to test dataset.
harmonbhasin Jan 23, 2025
c830837
Oops, committed installation, but meant to do this, I'll fix that in a…
harmonbhasin Jan 23, 2025
3c4e3ae
Updates
harmonbhasin Jan 24, 2025
78568ad
Forgot to remove troubleshooting but this is gone now.
harmonbhasin Jan 24, 2025
6a2b279
Updated run workflow.
harmonbhasin Jan 24, 2025
51b2b15
Updated these to give more information on the different parameters.
harmonbhasin Jan 24, 2025
abda8fb
Updating these in the meantime, need to update the output.
harmonbhasin Jan 24, 2025
a0cd96c
Docs
harmonbhasin Jan 24, 2025
e946b8d
removing weird spacing.
harmonbhasin Jan 24, 2025
febbdb6
Fixed spelling mistakes
harmonbhasin Jan 24, 2025
e9eea6e
Remove stale param.
harmonbhasin Jan 24, 2025
3bb7daa
Edited config.md
willbradshaw Jan 24, 2025
8dfaa5b
Merge branch 'harmon_documentation' of github.com:naobservatory/mgs-w…
willbradshaw Jan 24, 2025
e710398
Merge pull request #160 from naobservatory/harmon_remove_param
willbradshaw Jan 24, 2025
9f9e979
Edited installation docs
willbradshaw Jan 24, 2025
c9838e6
Edited troubleshooting.md
willbradshaw Jan 24, 2025
2f65912
removed a folder that was used by FASTP that I forgot to remove.
harmonbhasin Jan 24, 2025
2bfab2b
Moved compute resources to usage. Added the one additional file from …
harmonbhasin Jan 24, 2025
bb92515
Added additional qc file to run workflow doc.
harmonbhasin Jan 24, 2025
0d1d394
Edited usage.md
willbradshaw Jan 24, 2025
9ac3072
Edited README
willbradshaw Jan 24, 2025
429fac4
Minor README edits
willbradshaw Jan 24, 2025
7d42445
Added subset to front of file.
harmonbhasin Jan 27, 2025
efa0f3e
Merge branch 'harmon_documentation' of https://github.com/naobservato…
harmonbhasin Jan 27, 2025
29208a4
Updating documentation to add subset to the qc filenames, and fixing …
harmonbhasin Jan 27, 2025
6a9123e
Updating tests
harmonbhasin Jan 27, 2025
9da7957
Merged dev into here
harmonbhasin Jan 27, 2025
23d1f26
Updated snapshot to have the length stats.
harmonbhasin Jan 27, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@
- Implement masking of viral genome reference in index workflow with MASK_GENOME_FASTA to remove adapter, low-entropy and repeat sequences.
- Remove TRIMMOMATIC and BBMAP from the pipeline.
- Fixed bug in extractUnconcReadID that would cause the pipeline to fail if a read ID contained the string 'YT'.
- Remove `params.quality_encoding` as it was used only by TRIMMOMATIC.

# v2.6.0.0
- Updated version to reflect the new versioning scheme, which is described in `docs/version_schema.md`.
437 changes: 64 additions & 373 deletions README.md

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion configs/run.config
@@ -21,7 +21,6 @@ params {
n_reads_profile = 1000000 // Number of reads per sample to run through taxonomic profiling
bt2_score_threshold = 20 // Normalized score threshold for calling a host-infecting virus read (typically 15 or 20)
blast_viral_fraction = 0 // Fraction of putative host-infecting virus reads to BLAST vs nt (0 = don't run BLAST)
quality_encoding = "phred33" // FASTQ quality encoding (probably phred33, maybe phred64)
fuzzy_match_alignment_duplicates = 0 // Fuzzy matching the start coordinate of reads for identification of duplicates through alignment (0 = exact matching; options are 0, 1, or 2)
host_taxon = "vertebrate"

1 change: 0 additions & 1 deletion configs/run_dev_se.config
@@ -22,7 +22,6 @@ params {
bt2_score_threshold = 20 // Normalized score threshold for HV calling (typically 15 or 20)
blast_hv_fraction = 0 // Fraction of putative HV reads to BLAST vs nt (0 = don't run BLAST)
kraken_memory = "128 GB" // Memory needed to safely load Kraken DB
quality_encoding = "phred33" // FASTQ quality encoding (probably phred33, maybe phred64)
fuzzy_match_alignment_duplicates = 0 // Fuzzy matching the start coordinate of reads for identification of duplicates through alignment (0 = exact matching; options are 0, 1, or 2)
host_taxon = "vertebrate"
}
92 changes: 92 additions & 0 deletions docs/batch.md
@@ -0,0 +1,92 @@
# Running the pipeline on AWS Batch

This page contains recommendations for running the pipeline on [AWS Batch](https://aws.amazon.com/batch/), a service that allows you to run Nextflow workflows in a highly parallelized and automated manner[^notebook].

[^notebook]: The original source of this guide is [this notebook](https://data.securebio.org/wills-public-notebook/notebooks/2024-06-11_batch.html). This version will be updated to reflect changes to the pipeline and online resources.

## Step 1: Check your AWS permissions

The most common failure modes in setting up Batch to work with Nextflow arise from insufficient AWS permissions. If you run into trouble at any point, make sure you have the following:

1. **AWS Batch Permissions:** You need to have appropriate permissions to view, modify and run Batch compute environments and job queues. The simplest way to do this is to have your administrator add you to the `AWSBatchFullAccess` IAM policy.
2. **S3 Permissions:** You need to have appropriate permissions to read, write and view the S3 bucket where your workflow data is stored, including the ability to list, read, write, and delete objects[^s3].
3. **EC2 Permissions:** You need to have appropriate permissions to create and modify EC2 launch templates and compute environments. The simplest way to do this is to have your administrator add you to the `AmazonEC2FullAccess` IAM policy.
4. **Instance Role Permissions:** You will need to assign an Instance Role when setting up your Batch compute environment. This role should have at least the following permissions: `AmazonEC2ContainerServiceRole`, `AmazonEC2ContainerServiceforEC2Role`, and `AmazonS3FullAccess`. Make sure you can either set up such a role yourself, or have your administrator do so and point you to the role name.

[^s3]: In more depth, you need the following actions to be enabled for the bucket in question for your IAM user or role: `s3:ListBucket`, `s3:GetBucketLocation`, `s3:GetObject`, `s3:GetObjectAcl`, `s3:PutObject`, `s3:PutObjectAcl`, `s3:PutObjectTagging`, and `s3:DeleteObject`. If you're using a bucket specific to your user, all this is easier if you first have your administrator enable `s3:GetBucketPolicy` and `s3:PutBucketPolicy` for your user.

## Step 2: Create an EC2 launch template

First, you need to create an EC2 launch template that specifies the configuration for EC2 instances to be set up through Batch. To do this on the AWS console, navigate to EC2 on the Services menu, then:

1. Select "Launch templates" in the left-hand menu.
2. Click the orange "Create launch template" button.
3. Enter a launch template name (e.g. "YOURNAME-batch-template") and optionally a description.
4. Check the box under "Auto scaling guidance".
5. Under "Application and OS Images (Amazon Machine Image)", click "Browse more AMIs", then search for "Amazon ECS-Optimized Amazon Linux 2023 (AL2023) x86_64 AMI".
1. Under "AWS Marketplace AMIs", select "Amazon ECS-Optimized Amazon Linux 2023 (AL2023) x86_64 AMI" by Amazon Web Services. (Make sure you select the AMI published by AWS itself and not some third-party source!)
2. In the popup, select "Subscribe now". As this is a free image, you shouldn't need any special permissions or payment to do this.
6. Select an instance type (this isn't hugely important as Batch will modify the instance types provisioned based on the needs of the workflow; we recommend "m5.8xlarge").
7. Under "Key pair (login)", select "Don't include in launch template".
8. Under "Network settings", select "Create security group" and follow the default settings.
9. Now we come to storage. Configuring this correctly is important to avoid IOPS errors!
1. The key thing to realize is that, since Batch is spinning up and terminating instances as needed, the usual costs of creating large EBS volumes don't really apply. As such, you can be relatively greedy in provisioning storage for these instances, to minimize the risk of IOPS-related problems with your workflow.
2. Our recommendation is as follows: under "Storage (volumes)", expand the default "Volume 1 (AMI Root)" volume, then enter the following configuration values[^iops]:
1. Size: 1000 GiB
2. Volume type: gp3
3. IOPS: 16000 (maximum for gp3)
4. Throughput: 1000
10. Finally, add tags so the cost of your provisioned resources can be tracked more effectively. Add one "Instances" tag (e.g. "YOURNAME-batch-template-instance") and one "Volumes" tag (e.g. "YOURNAME-batch-template-volumes").
11. Leave the other settings as default (mostly "Don't include in launch template") and select "Create launch template".

[^iops]: If you want even more IOPS, you can provision an io2 volume instead of gp3. However, that's beyond the scope of this guide.

If you want to modify your template after creating it, you can do so by navigating to it in the panel and selecting "Actions" > "Modify template (create new version)". Pay attention to which version of the template any dependent resources (e.g. compute environments, see below) are using.

## Step 3: Set up a Batch compute environment

Next, you need to create a compute environment through which jobs can be allocated to instances. To do this on the AWS console, navigate to Batch on the Services menu, then:

1. Select "Compute environments" in the left-hand menu
2. Click the orange "Create" button
3. Under "Compute environment configuration", select "Amazon Elastic Compute Cloud (Amazon EC2)"[^fargate]. Then:
1. Under "Orchestration Type", select "Managed".
2. Enter an environment name (e.g. "YOURNAME-batch-1").
3. Set up roles:
1. Under "Service role" select "AWSServiceRoleForBatch".
2. Under "Instance role" select the instance role set up in Step 1, or another role with appropriate permissions.
3. If these roles don't exist or aren't available in the drop-down menu, contact your administrator about setting them up for you.
4. Under "Tags", navigate to "EC2 tags" and click the "Add tag" button. Then select a tag name that can be used to uniquely identify your use of the workflow (e.g. "mgs-workflow-YOURNAME"). This is important to let your administrator keep track of how much money you and the wider team are spending on Batch (whose resource consumption is otherwise somewhat inscrutable).
5. Click the orange "Next" button. Then, under "Instance configuration":
1. Under "Use EC2 Spot instances", make sure the "Enable using Spot instances" selector is enabled.
2. Under "Maximum % on-demand price", enter a number between 0 and 100. 100 is a good default. Lower numbers will lower costs but increase the chance of unscheduled instance termination, which will require your workflow to re-run jobs.
3. Enter integer values under "Minimum vCPUs", "Desired vCPUs" and "Maximum vCPUs". We recommend 0, 0, and 1024 as good defaults.
4. Under "Allowed instance types", select "optimal" plus whatever other instance families you want to provision. We recommend optimal, m5, and c6a.
5. Under "Additional configuration":
1. Specify your EC2 key pair to allow direct SSH access to Batch instances (you should very rarely need this, so feel free to skip it).
2. Under "Launch template" select the launch template you configured previously.
3. Under "Launch template version", enter "\$Default" or "\$Latest" (your preference).
6. Click the orange "Next" button, then do so again (i.e. accept defaults for "Network configuration").
1. You can configure your own network setup if you like, but that's beyond the scope of this guide.
7. Review your selections, then click the orange "Create compute environment" button.

[^fargate]: In the future, we'll investigate running Batch with Fargate for Nextflow workflows. For now, using EC2 gives us greater control over configuration than Fargate, at the cost of additional setup complexity and occasional startup delays.

## Step 4: Set up a Batch job queue

The last step you need to complete on AWS itself is to set up a job queue that Nextflow can use to submit jobs to Batch. To do this on the AWS console, navigate to Batch on the Services menu, then:

1. Select "Job queues" in the left-hand menu.
2. Click the orange "Create" button.
3. Under "Orchestration Type", again select "Amazon Elastic Compute Cloud (Amazon EC2)".
4. Under "Job queue configuration", enter:
1. A queue name (e.g. "YOURNAME-batch-queue-nf")
2. A job priority (unimportant if you're only using one queue and you're the only one using that queue)
3. A connected compute environment (select the environment you set up previously from the dropdown menu)
5. Click the orange "Create job queue" button.
6. Success!

## Step 5: Run Nextflow with Batch

Finally, you need to use all the infrastructure you've just set up to actually run a Nextflow workflow! We recommend using our test dataset to get started. [Click here to see how to run the pipeline on the test dataset](./installation.md#5-run-the-pipeline-on-test-data).
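
To give a sense of what the wiring looks like, here is a minimal sketch of the Batch-related parts of a launch directory's `nextflow.config`. This is an illustration only: the queue name, region, and bucket below are placeholders, not values from this repository (see [the configuration docs](./config.md) for the pipeline's actual config files).

```groovy
// Minimal sketch: pointing Nextflow at AWS Batch. All names are placeholders.
process {
    executor = 'awsbatch'              // submit pipeline processes as Batch jobs
    queue = 'YOURNAME-batch-queue-nf'  // the job queue created in Step 4
}
aws {
    region = 'us-east-1'               // region containing your compute environment
}
workDir = 's3://your-bucket/nf-work'   // Batch execution requires an S3 work directory
```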
48 changes: 48 additions & 0 deletions docs/config.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Configuration files

Nextflow configuration is controlled by `.config` files, which specify parameters and other options used in executing the pipeline.

All configuration files used in the pipeline are stored in the `configs` directory. To configure a specific pipeline run, copy the appropriate config file for that pipeline mode (e.g. `run.config`) into the launch directory, rename it to `nextflow.config`, and edit it as appropriate. That config file will in turn call other, standard config files included in the `configs` directory.

The rest of this page describes the specific options present in each config file, with a focus on those intended to be copied and edited by users.

## Run workflow configuration (`configs/run.config`)

This configuration file controls the pipeline's main RUN workflow. Its options are as follows:

- `params.mode = "run"` [str]: This instructs the pipeline to execute the [core run workflow](./workflows/run.nf).
- `params.base_dir` [str]: Path to the parent directory for the pipeline working and output directories.
- `params.ref_dir` [str]: Path to the directory containing the outputs of the [`index` workflow](./docs/index.md).
- `params.sample_sheet` [str]: Path to the [sample sheet](./docs/usage.md#11-the-sample-sheet) used for the pipeline run.
- `params.adapters` [str]: Path to the adapter file for adapter trimming (default [`ref/adapters.fasta`](./ref/adapters.fasta)).
- `params.grouping` [bool]: Whether to group samples by the `group` column in the sample sheet.
- `params.n_reads_profile` [int]: The number of reads per sample to run through taxonomic profiling (default 1 million).
- `params.bt2_score_threshold` [float]: The length-normalized Bowtie2 score threshold above which a read is considered a valid hit for a host-infecting virus (typically 15 or 20).
- `params.blast_viral_fraction` [float]: The fraction of putative host-infecting virus reads to validate with BLASTN (0 = don't run BLAST).
- `params.fuzzy_match_alignment_duplicates` [int]: Fuzzy matching the start coordinate of reads for identification of duplicates through alignment (0 = exact matching; options are 0, 1, or 2).
- `params.host_taxon` [str]: The taxon to use for host-infecting virus identification with Kraken2.
- `params.blast_db_prefix` [str]: The prefix for the BLAST database to use for host-infecting virus identification (should match the index workflow's `params.blast_db_name`).
- `process.queue` [str]: The [AWS Batch job queue](./docs/batch.md) to use for this pipeline run.
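
Putting these together, a lightly edited `nextflow.config` for a run might look like the sketch below. All paths and the queue name are placeholders; the numeric values mirror the defaults shown above.

```groovy
// Illustrative run configuration; paths and queue name are placeholders.
params {
    mode = "run"
    base_dir = "s3://your-bucket/analysis"     // parent dir for working and output dirs
    ref_dir = "s3://your-bucket/index/output"  // outputs of a prior index workflow run
    sample_sheet = "samplesheet.csv"           // sample sheet for this run
    adapters = "ref/adapters.fasta"            // adapter file for trimming
    grouping = false                           // don't group samples by `group` column
    n_reads_profile = 1000000                  // reads per sample for taxonomic profiling
    bt2_score_threshold = 20                   // Bowtie2 threshold for viral hit calling
    blast_viral_fraction = 0                   // 0 = don't run BLAST validation
    fuzzy_match_alignment_duplicates = 0       // exact start-coordinate matching
    host_taxon = "vertebrate"
}
process.queue = "YOURNAME-batch-queue-nf"      // AWS Batch job queue (see docs/batch.md)
```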

## Index workflow configuration (`configs/index.config`)

This configuration file controls the pipeline's index workflow. Its options are as follows:

- `params.mode = "index"` [str]: This instructs the pipeline to execute the [index workflow](./workflows/index.nf).
- `params.base_dir` [str]: Path to the parent directory for the pipeline working and output directories.
- `params.taxonomy_url` [str]: URL for the NCBI taxonomy dump to be used in index generation.
- `params.virus_host_db_url` [str]: URL for Virus-Host DB.
- `params.human_url` [str]: URL for downloading the human genome in FASTA format, which is used in index construction for contaminant screening.
- `params.genome_urls` [list(str)]: URLs for downloading other common contaminant genomes.
- `params.ssu_url` [str]: URL for the SILVA SSU reference database, used in ribosomal classification.
- `params.lsu_url` [str]: URL for the SILVA LSU reference database, used in ribosomal classification.
- `params.host_taxon_db` [str]: Path to a TSV mapping host taxon names to taxids (default [`ref/host-taxa.tsv`](./ref/host-taxa.tsv)).
- `params.contaminants` [str]: Path to a local file containing other contaminant genomes to exclude during contaminant filtering (default [`ref/contaminants.fasta.gz`](./ref/contaminants.fasta.gz)).
- `params.adapters` [str]: Path to the adapter file for adapter masking during reference DB generation (default [`ref/adapters.fasta`](./ref/adapters.fasta)).
- `params.genome_patterns_exclude` [str]: Path to a text file specifying string patterns used to hard-exclude genomes during viral genome DB generation (e.g. transgenic sequences) (default [`ref/hv_patterns_exclude.txt`](./ref/hv_patterns_exclude.txt)).
- `params.kraken_db` [str]: Path to a pre-generated Kraken2 reference database (we use the Standard database by default).
- `params.blast_db_name` [str]: The name of the BLAST database to use for optional validation of taxonomic assignments (should match the run workflow's `params.blast_db_prefix`).
- `params.ncbi_viral_params` [str]: Parameters to pass to ncbi-genome-download when generating viral genome DB. Must at a minimum specify `--section genbank` or `--section refseq`.
- `params.virus_taxid` [int]: The NCBI taxid for the Viruses taxon (currently 10239).
- `params.viral_taxids_exclude` [str]: Space-separated string of taxids to hard-exclude from the list of host-infecting viruses. Currently includes phage taxa that Virus-Host DB erroneously classifies as human-infecting.
- `params.host_taxa_screen`: Space-separated list of host taxon names to screen for when building the viral genome database. Should correspond to taxa included in `params.host_taxon_db`.
- `params.kraken_memory`: Placeholder to initialize `run` workflow params to avoid warnings.
- `params.classify_dedup_subset`: Placeholder to initialize `run` workflow params to avoid warnings.
21 changes: 21 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# Index workflow

The index workflow generates static index files from reference data. These reference data and indices don't depend on the reads processed by the run workflow, so a set of indices built by the index workflow can be used by multiple instantiations of the run workflow — no need to rebuild indices for every set of reads. How often to re-run the index workflow is a balance between two considerations: many of these index/reference files are derived from publicly available reference genomes or other resources, and should thus be updated and re-run periodically as new versions become available; however, to keep results comparable across datasets analyzed with the run workflow, this should be done relatively rarely.

Key inputs to the index workflow include:
- A TSV specifying names and taxonomic IDs (taxids) for host taxa for which to search for host-infecting viruses.
- A URL linking to a suitable Kraken2 database for taxonomic profiling (typically the [latest release](https://benlangmead.github.io/aws-indexes/k2) of the `k2_standard` database).
- URLs for up-to-date releases of reference genomes for various common contaminant species that can confound the identification of host-infecting virus (HV) reads (currently [human](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9606), [cow](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9913), [pig](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9823), [carp](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=7962)[^carp], [mouse](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=10090), [*E. coli*](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=562)).
- URLs to sequence databases for small and large ribosomal subunits from [SILVA](https://www.arb-silva.de/download/arb-files/).
- Up-to-date links to [Virus-Host DB](https://www.genome.jp/virushostdb) and [NCBI Taxonomy](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/).

[^carp]: The carp genome is included as a last-ditch attempt to [capture any remaining Illumina adapter sequences](https://dgg32.medium.com/carp-in-the-soil-1168818d2191) before moving on to HV identification. I'm not especially confident this is helpful.

Given these inputs, the index workflow:
- Generates a TSV of viral taxa, incorporating taxonomic information from NCBI, and annotates each taxon with infection status for each host taxon of interest (using Virus-Host DB).
- Makes Bowtie2 indices from (1) all host-infecting viral genomes in GenBank[^genbank], (2) the human genome, and (3) common non-human contaminants, plus BBMap indices for (2) and (3).
- Downloads and extracts local copies of (1) the BLAST nt database, (2) the specified Kraken2 DB, and (3) the SILVA rRNA reference files.

[^genbank]: Transgenic, contaminated, or erroneous sequences are excluded according to a list of sequence ID patterns specified in the config file.

Run the index workflow by setting `mode = "index"` in the relevant config file. For more information, see `workflows/index.nf` and the associated subworkflows and modules.
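
Concretely, the settings that select the index workflow look something like the following sketch (the `base_dir` value is a placeholder; see [the configuration docs](./config.md) for the full list of index options):

```groovy
// Sketch: selecting the index workflow in nextflow.config; base_dir is a placeholder.
params {
    mode = "index"                        // instructs the pipeline to execute workflows/index.nf
    base_dir = "s3://your-bucket/index"   // parent dir for working and output directories
}
```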