Splitting up README.md into several documents, all found in the docs folder #152

Merged Jan 27, 2025 · 38 commits

Commits
ada627d
On the off chance that something happens to this instance, I should h…
harmonbhasin Jan 21, 2025
6cc739e
Updated these slightly.
harmonbhasin Jan 22, 2025
08bc40d
Updated batch docs
willbradshaw Jan 22, 2025
717ca3c
Edited installation docs
willbradshaw Jan 22, 2025
1e83724
Removed TOC from installation doc (not rendering properly, and not re…
willbradshaw Jan 22, 2025
4b8d686
Updated usage and installation docs
willbradshaw Jan 22, 2025
32e3ca9
Minor edit to usage.md
willbradshaw Jan 22, 2025
1cd0986
Updated docs/run.md
willbradshaw Jan 22, 2025
b3054bf
Removed files, will bring them back later
harmonbhasin Jan 23, 2025
49694b9
updated usage document.
harmonbhasin Jan 23, 2025
2a97959
Updated reference to test dataset.
harmonbhasin Jan 23, 2025
c830837
Oops, committed installation, but meant to do this, I'll fix that in a…
harmonbhasin Jan 23, 2025
3c4e3ae
Updates
harmonbhasin Jan 24, 2025
78568ad
Forgot to remove troubleshooting but this is gone now.
harmonbhasin Jan 24, 2025
6a2b279
Updated run workflow.
harmonbhasin Jan 24, 2025
51b2b15
Updated these to give more information on the different parameters.
harmonbhasin Jan 24, 2025
abda8fb
Updating these in the meantime, need to update the output.
harmonbhasin Jan 24, 2025
a0cd96c
Docs
harmonbhasin Jan 24, 2025
e946b8d
removing weird spacing.
harmonbhasin Jan 24, 2025
febbdb6
Fixed spelling mistakes
harmonbhasin Jan 24, 2025
e9eea6e
Remove stale param.
harmonbhasin Jan 24, 2025
3bb7daa
Edited config.md
willbradshaw Jan 24, 2025
8dfaa5b
Merge branch 'harmon_documentation' of github.com:naobservatory/mgs-w…
willbradshaw Jan 24, 2025
e710398
Merge pull request #160 from naobservatory/harmon_remove_param
willbradshaw Jan 24, 2025
9f9e979
Edited installation docs
willbradshaw Jan 24, 2025
c9838e6
Edited troubleshooting.md
willbradshaw Jan 24, 2025
2f65912
removed a folder that was used by FASTP that I forgot to remove.
harmonbhasin Jan 24, 2025
2bfab2b
Moved compute resources to usage. Added the one additional file from …
harmonbhasin Jan 24, 2025
bb92515
Added additional qc file to run workflow doc.
harmonbhasin Jan 24, 2025
0d1d394
Edited usage.md
willbradshaw Jan 24, 2025
9ac3072
Edited README
willbradshaw Jan 24, 2025
429fac4
Minor README edits
willbradshaw Jan 24, 2025
7d42445
Added subset to front of file.
harmonbhasin Jan 27, 2025
efa0f3e
Merge branch 'harmon_documentation' of https://github.com/naobservato…
harmonbhasin Jan 27, 2025
29208a4
Updating documentation to add subset to the qc filenames, and fixing …
harmonbhasin Jan 27, 2025
6a9123e
Updating tests
harmonbhasin Jan 27, 2025
9da7957
Merged dev into here
harmonbhasin Jan 27, 2025
23d1f26
Updated snapshot to have the length stats.
harmonbhasin Jan 27, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@
- Implement masking of viral genome reference in index workflow with MASK_GENOME_FASTA to remove adapter, low-entropy and repeat sequences.
- Remove TRIMMOMATIC and BBMAP from the pipeline.
- Fixed bug in extractUnconcReadID that would cause the pipeline to fail if a read ID contained the string 'YT'.
- Remove `params.quality_encoding` as it was used only by TRIMMOMATIC.

# v2.6.0.0
- Updated version to reflect the new versioning scheme, which is described in `docs/version_schema.md`.
437 changes: 64 additions & 373 deletions README.md

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion configs/run.config
@@ -21,7 +21,6 @@ params {
n_reads_profile = 1000000 // Number of reads per sample to run through taxonomic profiling
bt2_score_threshold = 20 // Normalized score threshold for calling a host-infecting virus read (typically 15 or 20)
blast_viral_fraction = 0 // Fraction of putative host-infecting virus reads to BLAST vs nt (0 = don't run BLAST)
quality_encoding = "phred33" // FASTQ quality encoding (probably phred33, maybe phred64)
fuzzy_match_alignment_duplicates = 0 // Fuzzy matching the start coordinate of reads for identification of duplicates through alignment (0 = exact matching; options are 0, 1, or 2)
host_taxon = "vertebrate"

1 change: 0 additions & 1 deletion configs/run_dev_se.config
@@ -22,7 +22,6 @@ params {
bt2_score_threshold = 20 // Normalized score threshold for HV calling (typically 15 or 20)
blast_hv_fraction = 0 // Fraction of putative HV reads to BLAST vs nt (0 = don't run BLAST)
kraken_memory = "128 GB" // Memory needed to safely load Kraken DB
quality_encoding = "phred33" // FASTQ quality encoding (probably phred33, maybe phred64)
fuzzy_match_alignment_duplicates = 0 // Fuzzy matching the start coordinate of reads for identification of duplicates through alignment (0 = exact matching; options are 0, 1, or 2)
host_taxon = "vertebrate"
}
92 changes: 92 additions & 0 deletions docs/batch.md
@@ -0,0 +1,92 @@
# Running the pipeline on AWS Batch

This page contains recommendations for running the pipeline on [AWS Batch](https://aws.amazon.com/batch/), a service that allows you to run Nextflow workflows in a highly parallelized and automated manner[^notebook].

[^notebook]: The original source of this guide is [this notebook](https://data.securebio.org/wills-public-notebook/notebooks/2024-06-11_batch.html). This version will be updated to reflect changes to the pipeline and online resources.

## Step 1: Check your AWS permissions

The most common failure modes in setting up Batch to work with Nextflow arise from insufficient AWS permissions. If you run into trouble at any point, make sure you have the following:

1. **AWS Batch Permissions:** You need to have appropriate permissions to view, modify and run Batch compute environments and job queues. The simplest way to do this is to have your administrator add you to the `AWSBatchFullAccess` IAM policy.
2. **S3 Permissions:** You need to have appropriate permissions to read, write and view the S3 bucket where your workflow data is stored, including the ability to list, read, write, and delete objects[^s3].
3. **EC2 Permissions:** You need to have appropriate permissions to create and modify EC2 launch templates and compute environments. The simplest way to do this is to have your administrator add you to the `AmazonEC2FullAccess` IAM policy.
4. **Instance Role Permissions:** You will need to assign an Instance Role when setting up your Batch compute environment. This role should have at least the following permissions: `AmazonEC2ContainerServiceRole`, `AmazonEC2ContainerServiceforEC2Role`, and `AmazonS3FullAccess`. Make sure you can either set up such a role yourself, or have your administrator do so and point you to the role name.

[^s3]: In more depth, you need the following actions to be enabled for the bucket in question for your IAM user or role: `s3:ListBucket`, `s3:GetBucketLocation`, `s3:GetObject`, `s3:GetObjectAcl`, `s3:PutObject`, `s3:PutObjectAcl`, `s3:PutObjectTagging`, and `s3:DeleteObject`. If you're using a bucket specific to your user, all this is easier if you first have your administrator enable `s3:GetBucketPolicy` and `s3:PutBucketPolicy` for your user.

## Step 2: Create an EC2 launch template

First, you need to create an EC2 launch template that specifies the configuration for EC2 instances to be set up through Batch. To do this on the AWS console, navigate to EC2 on the Services menu, then:

1. Select "Launch templates" in the left-hand menu.
2. Click the orange "Create launch template" button.
3. Enter a launch template name (e.g. "YOURNAME-batch-template") and optionally a description.
4. Check the box under "Auto scaling guidance".
5. Under "Application and OS Images (Amazon Machine Image)", click "Browse more AMIs", then search for "Amazon ECS-Optimized Amazon Linux 2023 (AL2023) x86_64 AMI".
1. Under "AWS Marketplace AMIs", select "Amazon ECS-Optimized Amazon Linux 2023 (AL2023) x86_64 AMI" by Amazon Web Services. (Make sure you select the AMI published by AWS itself and not some third-party source!)
2. In the popup, select "Subscribe now". As this is a free image, you shouldn't need any special permissions or payment to do this.
6. Select an instance type (this isn't hugely important as Batch will modify the instance types provisioned based on the needs of the workflow; we recommend "m5.8xlarge").
7. Under "Key pair (login)", select "Don't include in launch template".
8. Under "Network settings", select "Create security group" and follow the default settings.
9. Now we come to storage. Configuring this correctly is important to avoid IOPS errors!
1. The key thing to realize is that, since Batch is spinning up and terminating instances as needed, the usual costs of creating large EBS volumes don't really apply. As such, you can be relatively greedy in provisioning storage for these instances, to minimize the risk of IOPS-related problems with your workflow.
2. Our recommendation is as follows: under "Storage (volumes)", expand the default "Volume 1 (AMI Root)" volume, then enter the following configuration values[^iops]:
1. Size: 1000 GiB
2. Volume type: gp3
3. IOPS: 16000 (maximum for gp3)
4. Throughput: 1000
10. Finally, add tags so the cost of your provisioned resources can be tracked more effectively. Add one "Instances" tag (e.g. "YOURNAME-batch-template-instance") and one "Volumes" tag (e.g. "YOURNAME-batch-template-volumes").
11. Leave the other settings as default (mostly "Don't include in launch template") and select "Create launch template".

[^iops]: If you want even more IOPS, you can provision an io2 volume instead of gp3. However, that's beyond the scope of this guide.

If you want to modify your template after creating it, you can do so by navigating to it in the panel and selecting "Actions" > "Modify template (create new version)". Pay attention to which version of the template any dependent resources (e.g. compute environments, see below) are using.

## Step 3: Set up a Batch compute environment

Next, you need to create a compute environment through which jobs can be allocated to instances. To do this on the AWS console, navigate to Batch on the Services menu, then:

1. Select "Compute environments" in the left-hand menu
2. Click the orange "Create" button
3. Under "Compute environment configuration", select "Amazon Elastic Compute Cloud (Amazon EC2)"[^fargate]. Then:
1. Under "Orchestration Type", select "Managed".
2. Enter an environment name (e.g. "YOURNAME-batch-1").
3. Set up roles:
1. Under "Service role" select "AWSServiceRoleForBatch".
2. Under "Instance role" select the instance role set up in Step 1, or another role with appropriate permissions.
3. If these roles don't exist or aren't available in the drop-down menu, contact your administrator about setting them up for you.
4. Under "Tags", navigate to "EC2 tags" and click the "Add tag" button. Then select a tag name that can be used to uniquely identify your use of the workflow (e.g. "mgs-workflow-YOURNAME"). This is important to let your administrator keep track of how much money you and the wider team are spending on Batch (whose resource consumption is otherwise somewhat inscrutable).
5. Click the orange "Next" button. Then, under "Instance configuration":
1. Under "Use EC2 Spot instances", make sure the "Enable using Spot instances" selector is enabled.
2. Under "Maximum % on-demand price", enter a number between 0 and 100. 100 is a good default. Lower numbers will lower costs but increase the chance of unscheduled instance termination, which will require your workflow to re-run jobs.
3. Enter integer values under "Minimum vCPUs", "Desired vCPUs" and "Maximum vCPUs". We recommend 0, 0, and 1024 as good defaults.
4. Under "Allowed instance types", select "optimal" plus whatever other instance families you want to provision. We recommend optimal, m5, and c6a.
5. Under "Additional configuration":
1. Specify your EC2 key pair to allow direct SSH access to Batch instances (you should very rarely need this, so feel free to skip it).
2. Under "Launch template" select the launch template you configured previously.
3. Under "Launch template version", enter "\$Default" or "\$Latest" (your preference).
6. Click the orange "Next" button, then do so again (i.e. accept defaults for "Network configuration").
1. You can configure your own network setup if you like, but that's beyond the scope of this guide.
7. Review your selections, then click the orange "Create compute environment" button.

[^fargate]: In the future, we'll investigate running Batch with Fargate for Nextflow workflows. For now, using EC2 gives us greater control over configuration than Fargate, at the cost of additional setup complexity and occasional startup delays.

## Step 4: Set up a Batch job queue

The last step you need to complete on AWS itself is to set up a job queue that Nextflow can use to submit jobs to Batch. To do this on the AWS console, navigate to Batch on the Services menu, then:

1. Select "Job queues" in the left-hand menu.
2. Click the orange "Create" button.
3. Under "Orchestration Type", again select "Amazon Elastic Compute Cloud (Amazon EC2)".
4. Under "Job queue configuration", enter:
1. A queue name (e.g. "YOURNAME-batch-queue-nf")
2. A job priority (unimportant if you're only using one queue and you're the only one using that queue)
3. A connected compute environment (select the environment you set up previously from the dropdown menu)
5. Click the orange "Create job queue" button.
6. Success!

## Step 5: Run Nextflow with Batch

Finally, you need to use all the infrastructure you've just set up to actually run a Nextflow workflow! We recommend using our test dataset to get started. [Click here to see how to run the pipeline on the test dataset](./installation.md#5-run-the-pipeline-on-test-data).
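
To give a sense of what the wiring looks like, here is a minimal sketch of the Batch-related parts of a launch directory's `nextflow.config`. This is an illustration only: the queue name, region, and bucket below are placeholders, not values from this repository (see [the configuration docs](./config.md) for the pipeline's actual config files).

```groovy
// Minimal sketch: pointing Nextflow at AWS Batch. All names are placeholders.
process {
    executor = 'awsbatch'              // submit pipeline processes as Batch jobs
    queue = 'YOURNAME-batch-queue-nf'  // the job queue created in Step 4
}
aws {
    region = 'us-east-1'               // region containing your compute environment
}
workDir = 's3://your-bucket/nf-work'   // Batch execution requires an S3 work directory
```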
48 changes: 48 additions & 0 deletions docs/config.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Configuration files

Nextflow configuration is controlled by `.config` files, which specify parameters and other options used in executing the pipeline.

All configuration files used in the pipeline are stored in the `configs` directory. To configure a specific pipeline run, copy the appropriate config file for that pipeline mode (e.g. `run.config`) into the launch directory, rename it to `nextflow.config`, and edit it as appropriate. That config file will in turn call other, standard config files included in the `configs` directory.

The rest of this page describes the specific options present in each config file, with a focus on those intended to be copied and edited by users.

## Run workflow configuration (`configs/run.config`)

This configuration file controls the pipeline's main RUN workflow. Its options are as follows:

- `params.mode = "run"` [str]: This instructs the pipeline to execute the [core run workflow](./workflows/run.nf).
- `params.base_dir` [str]: Path to the parent directory for the pipeline working and output directories.
- `params.ref_dir` [str]: Path to the directory containing the outputs of the [`index` workflow](./docs/index.md).
- `params.sample_sheet` [str]: Path to the [sample sheet](./docs/usage.md#11-the-sample-sheet) used for the pipeline run.
- `params.adapters` [str]: Path to the adapter file for adapter trimming (default [`ref/adapters.fasta`](./ref/adapters.fasta)).
- `params.grouping` [bool]: Whether to group samples by the `group` column in the sample sheet.
- `params.n_reads_profile` [int]: The number of reads per sample to run through taxonomic profiling (default 1 million).
- `params.bt2_score_threshold` [float]: The length-normalized Bowtie2 score threshold above which a read is considered a valid hit for a host-infecting virus (typically 15 or 20).
- `params.blast_viral_fraction` [float]: The fraction of putative host-infecting virus reads to validate with BLASTN (0 = don't run BLAST).
- `params.fuzzy_match_alignment_duplicates` [int]: Fuzzy matching the start coordinate of reads for identification of duplicates through alignment (0 = exact matching; options are 0, 1, or 2).
- `params.host_taxon` [str]: The taxon to use for host-infecting virus identification with Kraken2.
- `params.blast_db_prefix` [str]: The prefix for the BLAST database to use for host-infecting virus identification (should match the index workflow's `params.blast_db_name`).
- `process.queue` [str]: The [AWS Batch job queue](./docs/batch.md) to use for this pipeline run.
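
Putting these together, a lightly edited `nextflow.config` for a run might look like the sketch below. All paths and the queue name are placeholders; the numeric values mirror the defaults shown above.

```groovy
// Illustrative run configuration; paths and queue name are placeholders.
params {
    mode = "run"
    base_dir = "s3://your-bucket/analysis"     // parent dir for working and output dirs
    ref_dir = "s3://your-bucket/index/output"  // outputs of a prior index workflow run
    sample_sheet = "samplesheet.csv"           // sample sheet for this run
    adapters = "ref/adapters.fasta"            // adapter file for trimming
    grouping = false                           // don't group samples by `group` column
    n_reads_profile = 1000000                  // reads per sample for taxonomic profiling
    bt2_score_threshold = 20                   // Bowtie2 threshold for viral hit calling
    blast_viral_fraction = 0                   // 0 = don't run BLAST validation
    fuzzy_match_alignment_duplicates = 0       // exact start-coordinate matching
    host_taxon = "vertebrate"
}
process.queue = "YOURNAME-batch-queue-nf"      // AWS Batch job queue (see docs/batch.md)
```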

## Index workflow configuration (`configs/index.config`)

This configuration file controls the pipeline's index workflow. Its options are as follows:

- `params.mode = "index"` [str]: This instructs the pipeline to execute the [index workflow](./workflows/index.nf).
- `params.base_dir` [str]: Path to the parent directory for the pipeline working and output directories.
- `params.taxonomy_url` [str]: URL for the NCBI taxonomy dump to be used in index generation.
- `params.virus_host_db_url` [str]: URL for Virus-Host DB.
- `params.human_url` [str]: URL for downloading the human genome in FASTA format, which is used in index construction for contaminant screening.
- `params.genome_urls` [list(str)]: URLs for downloading other common contaminant genomes.
- `params.ssu_url` [str]: URL for the SILVA SSU reference database, used in ribosomal classification.
- `params.lsu_url` [str]: URL for the SILVA LSU reference database, used in ribosomal classification.
- `params.host_taxon_db` [str]: Path to a TSV mapping host taxon names to taxids (default [`ref/host-taxa.tsv`](./ref/host-taxa.tsv)).
- `params.contaminants` [str]: Path to a local file containing other contaminant genomes to exclude during contaminant filtering (default [`ref/contaminants.fasta.gz`](./ref/contaminants.fasta.gz)).
- `params.adapters` [str]: Path to the adapter file for adapter masking during reference DB generation (default [`ref/adapters.fasta`](./ref/adapters.fasta)).
- `params.genome_patterns_exclude` [str]: Path to a text file specifying string patterns used to hard-exclude genomes during viral genome DB generation (e.g. transgenic sequences) (default [`ref/hv_patterns_exclude.txt`](./ref/hv_patterns_exclude.txt)).
- `params.kraken_db` [str]: Path to a pre-generated Kraken2 reference database (we use the Standard database by default).
- `params.blast_db_name` [str]: The name of the BLAST database to use for optional validation of taxonomic assignments (should match the run workflow's `params.blast_db_prefix`).
- `params.ncbi_viral_params` [str]: Parameters to pass to ncbi-genome-download when generating viral genome DB. Must at a minimum specify `--section genbank` or `--section refseq`.
- `params.virus_taxid` [int]: The NCBI taxid for the Viruses taxon (currently 10239).
- `params.viral_taxids_exclude` [str]: Space-separated string of taxids to hard-exclude from the list of host-infecting viruses. Currently includes phage taxa that Virus-Host DB erroneously classifies as human-infecting.
- `params.host_taxa_screen`: Space-separated list of host taxon names to screen for when building the viral genome database. Should correspond to taxa included in `params.host_taxon_db`.
- `params.kraken_memory`: Placeholder to initialize `run` workflow params to avoid warnings.
- `params.classify_dedup_subset`: Placeholder to initialize `run` workflow params to avoid warnings.
21 changes: 21 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# Index workflow

The index workflow generates static index files from reference data. These reference data and indices don't depend on the reads processed by the run workflow, so a set of indices built by the index workflow can be used by multiple instantiations of the run workflow — no need to rebuild indices for every set of reads. How often to re-run the index workflow is a balance between two considerations: many of these index/reference files are derived from publicly available reference genomes or other resources, and should thus be updated and re-run periodically as new versions become available; however, to keep results comparable across datasets analyzed with the run workflow, this should be done relatively rarely.

Key inputs to the index workflow include:
- A TSV specifying names and taxonomic IDs (taxids) for host taxa for which to search for host-infecting viruses.
- A URL linking to a suitable Kraken2 database for taxonomic profiling (typically the [latest release](https://benlangmead.github.io/aws-indexes/k2) of the `k2_standard` database).
- URLs for up-to-date releases of reference genomes for various common contaminant species that can confound the identification of host-infecting virus (HV) reads (currently [human](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9606), [cow](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9913), [pig](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9823), [carp](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=7962)[^carp], [mouse](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=10090), [*E. coli*](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=562)).
- URLs to sequence databases for small and large ribosomal subunits from [SILVA](https://www.arb-silva.de/download/arb-files/).
- Up-to-date links to [Virus-Host DB](https://www.genome.jp/virushostdb) and [NCBI Taxonomy](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/).

[^carp]: The carp genome is included as a last-ditch attempt to [capture any remaining Illumina adapter sequences](https://dgg32.medium.com/carp-in-the-soil-1168818d2191) before moving on to HV identification. I'm not especially confident this is helpful.

Given these inputs, the index workflow:
- Generates a TSV of viral taxa, incorporating taxonomic information from NCBI, and annotates each taxon with infection status for each host taxon of interest (using Virus-Host DB).
- Makes Bowtie2 indices from (1) all host-infecting viral genomes in GenBank[^genbank], (2) the human genome, and (3) common non-human contaminants, plus BBMap indices for (2) and (3).
- Downloads and extracts local copies of (1) the BLAST nt database, (2) the specified Kraken2 DB, and (3) the SILVA rRNA reference files.

[^genbank]: Transgenic, contaminated, or erroneous sequences are excluded according to a list of sequence ID patterns specified in the config file.

Run the index workflow by setting `mode = "index"` in the relevant config file. For more information, see `workflows/index.nf` and the associated subworkflows and modules.
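
Concretely, the settings that select the index workflow look something like the following sketch (the `base_dir` value is a placeholder; see [the configuration docs](./config.md) for the full list of index options):

```groovy
// Sketch: selecting the index workflow in nextflow.config; base_dir is a placeholder.
params {
    mode = "index"                        // instructs the pipeline to execute workflows/index.nf
    base_dir = "s3://your-bucket/index"   // parent dir for working and output directories
}
```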