A Snakemake workflow and associated scripts for detecting repaired spike NTD deletions in SARS-CoV-2 Omicron BA.1 lineage reads. The workflow processes sequencing data retrieved from the ENA Portal API, performing quality filtering, read mapping, variant calling, and classification of the deletion-repair genotype. This pipeline was developed as part of a larger study; the associated manuscript is currently under review.
Results generated with this pipeline are available via DOI: 10.20350/digitalCSIC/17032. We ran Snakemake v8.25.3 with Python v3.12.7.
This pipeline fetches and processes SARS-CoV-2 "read run" ENA records with a sample collection date between 1 November 2021 and 1 August 2022, filtered for NCBI taxonomy ID 2697049 (SARS-CoV-2) and a Homo sapiens host, excluding the DNBseq, Element, and capillary sequencing platforms as well as RNAseq, transcriptomic, metagenomic, and metatranscriptomic library strategies. The following steps are then run for each resulting record (illustrative sketches of the query and of the per-record commands are given after the step list):
- FASTQ retrieval via the ENA metadata FTP URLs.
- Read preprocessing and quality filtering using `fastp` v0.23.4.
- Read mapping with `minimap2` v2.28 against a BA.1 reference genome (GenBank: OZ070629.1), using the recommended preset for each run's sequencing platform.
- Consensus genome generation with `samtools` v1.20 and `iVar` v1.4.3.
- Lineage assignment using `pangolin` v4.3.
- Variant calling with `iVar` v1.4.3, annotated with `SnpEff` v5.2 and filtered with `SnpSift` v5.2.
- Classification into three "haplotypes", based on the presence or absence of the S gene deletions ΔH69/V70 and ΔV143/Y145. Given the reference genome used, repaired alleles are encoded as insertions in HGVS nomenclature (a sketch of this check is given after the list):
  - `Rep_69_70`: repair of S:ΔH69/V70 (`S:p.Val67_Ile68dup` detected, `S:p.Asp140_His141insValTyrTyr` absent).
  - `Rep_143_145`: repair of S:ΔV143/Y145 (`S:p.Asp140_His141insValTyrTyr` detected, `S:p.Val67_Ile68dup` absent).
  - `Rep_Both`: repair of both deletions (`S:p.Val67_Ile68dup` and `S:p.Asp140_His141insValTyrTyr` detected).
- Data summarization and visualization using `ape` v5.8, `Rsamtools` v2.18.0, `tidyverse` v2.0.0, and `ggpubr` v0.6.0 in R v4.3.3.
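
For reference, a record search like the one described above can be issued against the ENA Portal API. The snippet below is a minimal, illustrative sketch, not the exact query used by `00a_run_search.sh`; the query syntax and field names follow the public Portal API documentation, but the precise filter expression (and how platform and library-strategy exclusions are applied) is an assumption here.

```bash
# Hypothetical sketch of an ENA Portal API "read_run" search similar to the one
# performed by 00a_run_search.sh; the exact filter expression used there may differ.
QUERY='tax_eq(2697049) AND host_scientific_name="Homo sapiens" AND collection_date>=2021-11-01 AND collection_date<=2022-08-01'
curl -sG 'https://www.ebi.ac.uk/ena/portal/api/search' \
     --data-urlencode 'result=read_run' \
     --data-urlencode "query=${QUERY}" \
     --data-urlencode 'fields=run_accession,instrument_platform,library_strategy,fastq_ftp' \
     --data-urlencode 'format=tsv' \
     > ena_read_runs.tsv   # platform/library-strategy exclusions can be applied in the query or downstream
```

The `fastq_ftp` field returned by this search is what provides the FTP URLs used for FASTQ retrieval in the first per-record step.

The per-record steps listed above correspond roughly to the following shell sketch. It is a minimal illustration assuming paired-end Illumina input and placeholder file names (`sample_R1.fastq.gz`, `BA1_ref.fasta`, `BA1_annotation.gff`); thresholds, reference files, and exact options are controlled by the workflow's `config.yaml` and may differ from what is shown.

```bash
# Illustrative per-record commands (placeholder names; actual options live in the workflow).
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz           # quality filtering

minimap2 -ax sr BA1_ref.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
  | samtools sort -o sample.sorted.bam -                      # 'sr' preset for Illumina; e.g. 'map-ont' for Nanopore runs
samtools index sample.sorted.bam

# Consensus genome (iVar consumes a samtools mpileup stream)
samtools mpileup -aa -A -d 0 -B -Q 0 sample.sorted.bam \
  | ivar consensus -p sample_consensus

pangolin sample_consensus.fa                                  # lineage assignment

# Variant calling against the BA.1 reference, with gene annotation for amino-acid calls
samtools mpileup -aa -A -d 0 -B -Q 0 --reference BA1_ref.fasta sample.sorted.bam \
  | ivar variants -p sample_variants -r BA1_ref.fasta -g BA1_annotation.gff
# SnpEff annotation and SnpSift filtering would follow (details omitted here).
```

The haplotype classification itself reduces to a presence/absence check on the two repair alleles in the annotated variant calls. The sketch below is a hypothetical illustration (it assumes the two HGVS protein changes can be grepped from a filtered variant table named `sample_variants.annotated.tsv`); the workflow's own classification logic may be implemented differently.

```bash
# Hypothetical classification by presence/absence of the two repair alleles
# in an annotated, filtered variant table (file name is a placeholder).
variants="sample_variants.annotated.tsv"
rep_69_70=$(grep -c 'p\.Val67_Ile68dup' "$variants" || true)
rep_143_145=$(grep -c 'p\.Asp140_His141insValTyrTyr' "$variants" || true)

if [ "$rep_69_70" -gt 0 ] && [ "$rep_143_145" -gt 0 ]; then
    echo "Rep_Both"        # both deletions repaired
elif [ "$rep_69_70" -gt 0 ]; then
    echo "Rep_69_70"       # only S:ΔH69/V70 repaired
elif [ "$rep_143_145" -gt 0 ]; then
    echo "Rep_143_145"     # only S:ΔV143/Y145 repaired
else
    echo "None"            # neither repair allele detected
fi
```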
This repository contains a Snakemake workflow for processing sequencing data from FASTQ retrieval to classification and result summarization. The pipeline is conceptualized in two main sections: (1) an independent, linear processing pipeline for each record, and (2) summarization tasks that aggregate results and generate reports. Due to the large dataset size, a `LIGHT` configuration flag is available to execute only the first section of the DAG, reducing computational load. This also enables a `batcher` rule that allows execution using Snakemake batches if needed.
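
As an illustration, a light, batched run could be launched roughly as below. This is a hedged sketch: `--config` and `--batch` are standard Snakemake options and `batcher` is the rule named above, but the core count and batch split shown here are arbitrary examples, not the values used by the provided scripts.

```bash
# Illustrative invocation only: run the per-record part of the DAG (LIGHT mode)
# and restrict this run to the first of, say, 10 batches of the 'batcher' rule.
snakemake --cores 8 \
          --config LIGHT=True \
          --batch batcher=1/10
```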
- `00a_run_search.sh`: queries the ENA Portal API to retrieve sequencing records.
- `00b_generate_chunks.sh`: splits the survey results into manageable chunks for processing via SLURM job arrays. The dataset is divided into 16 groups, each containing up to 5000 chunks, with each chunk holding 16 records. This approach addressed Snakemake limitations when handling large DAGs at the time of execution, and the chunk settings were chosen to fit our HPC resource limits (a hypothetical sketch of the chunk/array scheme is given after this list).
- `01_run_haplotypes_array_chunked.sh <group number>`: runs the analysis for a specified chunk group. Each execution launches up to 5000 SLURM jobs through a job array, and the step must be run once per group; for the final manuscript, only a subset of groups was analyzed. This step is executed with `LIGHT=True`. Most parameters can be tweaked via the Snakemake `config.yaml` file.
- `02_run_complete.sh`: executes the full workflow to generate summary tables, visualizations, and reports, and is executed with `LIGHT=False`. Given computational constraints, the final manuscript analysis used `scripts/summarize_results.py` instead, which parses the result files to produce a summary table with key measurements; data visualizations were created manually using the same code integrated into the workflow.
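
To make the chunk/array scheme concrete, the sketch below shows one way such a setup could look: splitting a survey table into 16-record chunks and submitting up to 5000 array tasks for a single group. File names, paths, the `chunk_id` config key, and the sbatch options are assumptions for illustration; the actual scripts in this repository may implement this differently.

```bash
# Hypothetical chunking of the survey table: 16 records per chunk
# (group layout and file names are placeholders, not the repository's own).
mkdir -p chunks
tail -n +2 ena_read_runs.tsv | split -l 16 -d -a 5 - chunks/chunk_

# Hypothetical submission for one chunk group: up to 5000 SLURM array tasks,
# each running the LIGHT part of the workflow on one chunk. The 'chunk_id'
# config key is invented here purely for illustration.
sbatch --array=0-4999 --wrap \
  'snakemake --cores 4 --config LIGHT=True chunk_id=${SLURM_ARRAY_TASK_ID}'
```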