A Snakemake workflow and associated scripts for detecting repaired spike NTD deletions in SARS-CoV-2 Omicron BA.1 lineage reads. The workflow processes sequencing data retrieved from the ENA Portal API, performing quality filtering, read mapping, variant calling, and classification of the deletion-repair genotype. This pipeline was developed as part of a larger study; the associated manuscript is currently under review.
Results generated with this pipeline are available via DOI: 10.20350/digitalCSIC/17032. We ran Snakemake v8.25.3 with Python v3.12.7.
This pipeline fetches and processes SARS-CoV-2 "read run" ENA records with a sample collection date between 1 November 2021 and 1 August 2022, filtered for NCBI taxonomy ID 2697049 (SARS-CoV-2) and a Homo sapiens host, excluding the DNBseq, Element, and capillary sequencing platforms as well as RNAseq, transcriptomic, metagenomic, and metatranscriptomic library strategies. The following steps are then run for each resulting record (illustrative sketches of the query and of the per-record commands are given after the step list):
- FASTQ retrieval via the ENA metadata FTP URLs.
- Read preprocessing and quality filtering using `fastp` v0.23.4.
- Read mapping with `minimap2` v2.28 against a BA.1 reference genome (GenBank: OZ070629.1), using the recommended preset for each run's sequencing platform.
- Consensus genome generation with `samtools` v1.20 and `iVar` v1.4.3.
- Lineage assignment using `pangolin` v4.3.
- Variant calling with `iVar` v1.4.3, annotated with `SnpEff` v5.2 and filtered with `SnpSift` v5.2.
- Classification into three "haplotypes", based on the presence or absence of the S gene deletions ΔH69/V70 and ΔV143/Y145. Given the reference genome used, repaired alleles are encoded as insertions in HGVS nomenclature (a sketch of this check is given after the list):
  - `Rep_69_70`: repair of S:ΔH69/V70 (`S:p.Val67_Ile68dup` detected, `S:p.Asp140_His141insValTyrTyr` absent).
  - `Rep_143_145`: repair of S:ΔV143/Y145 (`S:p.Asp140_His141insValTyrTyr` detected, `S:p.Val67_Ile68dup` absent).
  - `Rep_Both`: repair of both deletions (`S:p.Val67_Ile68dup` and `S:p.Asp140_His141insValTyrTyr` detected).
- Data summarization and visualization using `ape` v5.8, `Rsamtools` v2.18.0, `tidyverse` v2.0.0, and `ggpubr` v0.6.0 in R v4.3.3.
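
For reference, a record search like the one described above can be issued against the ENA Portal API. The snippet below is a minimal, illustrative sketch, not the exact query used by `00a_run_search.sh`; the query syntax and field names follow the public Portal API documentation, but the precise filter expression (and how platform and library-strategy exclusions are applied) is an assumption here.

```bash
# Hypothetical sketch of an ENA Portal API "read_run" search similar to the one
# performed by 00a_run_search.sh; the exact filter expression used there may differ.
QUERY='tax_eq(2697049) AND host_scientific_name="Homo sapiens" AND collection_date>=2021-11-01 AND collection_date<=2022-08-01'
curl -sG 'https://www.ebi.ac.uk/ena/portal/api/search' \
     --data-urlencode 'result=read_run' \
     --data-urlencode "query=${QUERY}" \
     --data-urlencode 'fields=run_accession,instrument_platform,library_strategy,fastq_ftp' \
     --data-urlencode 'format=tsv' \
     > ena_read_runs.tsv   # platform/library-strategy exclusions can be applied in the query or downstream
```

The `fastq_ftp` field returned by this search is what provides the FTP URLs used for FASTQ retrieval in the first per-record step.

The per-record steps listed above correspond roughly to the following shell sketch. It is a minimal illustration assuming paired-end Illumina input and placeholder file names (`sample_R1.fastq.gz`, `BA1_ref.fasta`, `BA1_annotation.gff`); thresholds, reference files, and exact options are controlled by the workflow's `config.yaml` and may differ from what is shown.

```bash
# Illustrative per-record commands (placeholder names; actual options live in the workflow).
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz           # quality filtering

minimap2 -ax sr BA1_ref.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
  | samtools sort -o sample.sorted.bam -                      # 'sr' preset for Illumina; e.g. 'map-ont' for Nanopore runs
samtools index sample.sorted.bam

# Consensus genome (iVar consumes a samtools mpileup stream)
samtools mpileup -aa -A -d 0 -B -Q 0 sample.sorted.bam \
  | ivar consensus -p sample_consensus

pangolin sample_consensus.fa                                  # lineage assignment

# Variant calling against the BA.1 reference, with gene annotation for amino-acid calls
samtools mpileup -aa -A -d 0 -B -Q 0 --reference BA1_ref.fasta sample.sorted.bam \
  | ivar variants -p sample_variants -r BA1_ref.fasta -g BA1_annotation.gff
# SnpEff annotation and SnpSift filtering would follow (details omitted here).
```

The haplotype classification itself reduces to a presence/absence check on the two repair alleles in the annotated variant calls. The sketch below is a hypothetical illustration (it assumes the two HGVS protein changes can be grepped from a filtered variant table named `sample_variants.annotated.tsv`); the workflow's own classification logic may be implemented differently.

```bash
# Hypothetical classification by presence/absence of the two repair alleles
# in an annotated, filtered variant table (file name is a placeholder).
variants="sample_variants.annotated.tsv"
rep_69_70=$(grep -c 'p\.Val67_Ile68dup' "$variants" || true)
rep_143_145=$(grep -c 'p\.Asp140_His141insValTyrTyr' "$variants" || true)

if [ "$rep_69_70" -gt 0 ] && [ "$rep_143_145" -gt 0 ]; then
    echo "Rep_Both"        # both deletions repaired
elif [ "$rep_69_70" -gt 0 ]; then
    echo "Rep_69_70"       # only S:ΔH69/V70 repaired
elif [ "$rep_143_145" -gt 0 ]; then
    echo "Rep_143_145"     # only S:ΔV143/Y145 repaired
else
    echo "None"            # neither repair allele detected
fi
```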
This repository contains a Snakemake workflow for processing sequencing data from FASTQ retrieval to classification and result summarization. The pipeline is conceptualized in two main sections: (1) an independent, linear processing pipeline for each record, and (2) summarization tasks that aggregate results and generate reports. Due to the large dataset size, a `LIGHT` configuration flag is available to execute only the first section of the DAG, reducing computational load. This also enables a `batcher` rule that allows execution using Snakemake batches if needed.
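
As an illustration, a light, batched run could be launched roughly as below. This is a hedged sketch: `--config` and `--batch` are standard Snakemake options and `batcher` is the rule named above, but the core count and batch split shown here are arbitrary examples, not the values used by the provided scripts.

```bash
# Illustrative invocation only: run the per-record part of the DAG (LIGHT mode)
# and restrict this run to the first of, say, 10 batches of the 'batcher' rule.
snakemake --cores 8 \
          --config LIGHT=True \
          --batch batcher=1/10
```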
- `00a_run_search.sh`: queries the ENA Portal API to retrieve sequencing records.
- `00b_generate_chunks.sh`: splits the survey results into manageable chunks for processing via SLURM job arrays. The dataset is divided into 16 groups, each containing up to 5000 chunks, with each chunk holding 16 records. This approach addressed Snakemake limitations when handling large DAGs at the time of execution, and the chunk settings were chosen to fit our HPC resource limits (a hypothetical sketch of the chunk/array scheme is given after this list).
- `01_run_haplotypes_array_chunked.sh <group number>`: runs the analysis for a specified chunk group. Each execution launches up to 5000 SLURM jobs through a job array, and the step must be run once per group; for the final manuscript, only a subset of groups was analyzed. This step is executed with `LIGHT=True`. Most parameters can be tweaked via the Snakemake `config.yaml` file.
- `02_run_complete.sh`: executes the full workflow to generate summary tables, visualizations, and reports, and is executed with `LIGHT=False`. Given computational constraints, the final manuscript analysis used `scripts/summarize_results.py` instead, which parses the result files to produce a summary table with key measurements; data visualizations were created manually using the same code integrated into the workflow.
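
To make the chunk/array scheme concrete, the sketch below shows one way such a setup could look: splitting a survey table into 16-record chunks and submitting up to 5000 array tasks for a single group. File names, paths, the `chunk_id` config key, and the sbatch options are assumptions for illustration; the actual scripts in this repository may implement this differently.

```bash
# Hypothetical chunking of the survey table: 16 records per chunk
# (group layout and file names are placeholders, not the repository's own).
mkdir -p chunks
tail -n +2 ena_read_runs.tsv | split -l 16 -d -a 5 - chunks/chunk_

# Hypothetical submission for one chunk group: up to 5000 SLURM array tasks,
# each running the LIGHT part of the workflow on one chunk. The 'chunk_id'
# config key is invented here purely for illustration.
sbatch --array=0-4999 --wrap \
  'snakemake --cores 4 --config LIGHT=True chunk_id=${SLURM_ARRAY_TASK_ID}'
```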