A Snakemake workflow with associated scripts used for detecting spike NTD repaired deletions in SARS-CoV-2 Omicron BA.1 lineage reads. Manuscript under review.

PathoGenOmics-Lab/ena-spike-ntd-repdel-analysis

ENA spike NTD repaired deletion analysis

A Snakemake workflow with associated scripts used for detecting spike NTD repaired deletions in SARS-CoV-2 Omicron BA.1 lineage reads. The workflow processes sequencing data retrieved from the ENA Portal API, performing quality filtering, read mapping, variant calling, and classification of the deletion repair genotype. This pipeline was developed as part of a larger study. The associated manuscript is currently under review.

Results generated with this pipeline are available via DOI: 10.20350/digitalCSIC/17032. We ran Snakemake v8.25.3 with Python v3.12.7.

Workflow summary

This pipeline fetches and processes SARS-CoV-2 "read run" ENA records with a sample collection date between 1 November 2021 and 1 August 2022. Records are filtered for NCBI taxonomy code 2697049 (SARS-CoV-2) and a Homo sapiens host. Records from the DNBseq, Element, and capillary sequencing platforms are excluded, as are records with RNAseq, transcriptomic, metagenomic, or metatranscriptomic library strategies. The following steps are then run for each resulting record:

  1. FASTQ retrieval via the ENA metadata FTP URLs.
  2. Read preprocessing and quality filtering using fastp v0.23.4.
  3. Read mapping with minimap2 v2.28 against a BA.1 reference genome (GenBank: OZ070629.1) using the recommended presets depending on the run sequencing platform.
  4. Consensus genome generation with samtools v1.20 and iVar v1.4.3.
  5. Lineage assignment using pangolin v4.3.
  6. Variant calling with iVar v1.4.3, annotated with SnpEff v5.2 and filtered with SnpSift v5.2.
  7. Classification into three "haplotypes" based on the presence or absence of the S gene deletions ΔH69/V70 and ΔV143/Y145. Because the BA.1 reference genome carries both deletions, repaired alleles are encoded as insertions in HGVS nomenclature:
    • Rep_69_70: repair of S:ΔH69/V70 (S:p.Val67_Ile68dup detected, S:p.Asp140_His141insValTyrTyr absent).
    • Rep_143_145: repair of S:ΔV143/Y145 (S:p.Asp140_His141insValTyrTyr detected, S:p.Val67_Ile68dup absent).
    • Rep_Both: repair of both deletions (S:p.Val67_Ile68dup and S:p.Asp140_His141insValTyrTyr detected).
  8. Data summarization and visualization using ape v5.8, Rsamtools v2.18.0, tidyverse v2.0.0, and ggpubr v0.6.0 in R v4.3.3.
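The classification rule in step 7 can be sketched in Python as follows. This is a minimal illustration and not the workflow's own code: the function name `classify_haplotype` and the choice to return `None` when neither repair allele is detected are assumptions, since the list above only names the three repaired haplotypes.

```python
from typing import Iterable, Optional

# HGVS protein-level alleles named in step 7 (relative to the BA.1 reference).
REPAIR_69_70 = "p.Val67_Ile68dup"
REPAIR_143_145 = "p.Asp140_His141insValTyrTyr"


def classify_haplotype(s_gene_alleles: Iterable[str]) -> Optional[str]:
    """Return the haplotype label for the S gene alleles detected in a record.

    Returns None when neither repair allele is present. That fallback is an
    assumption of this sketch; the README does not name a label for it.
    """
    alleles = set(s_gene_alleles)
    has_69_70 = REPAIR_69_70 in alleles
    has_143_145 = REPAIR_143_145 in alleles

    if has_69_70 and has_143_145:
        return "Rep_Both"
    if has_69_70:
        return "Rep_69_70"
    if has_143_145:
        return "Rep_143_145"
    return None
```

For example, `classify_haplotype(["p.Val67_Ile68dup"])` yields the `Rep_69_70` label, because the second repair allele is absent.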

Usage

This repository contains a Snakemake workflow that processes sequencing data from FASTQ retrieval through classification and result summarization. The pipeline has two main sections: (1) an independent, linear processing pipeline for each record, and (2) summarization tasks that aggregate results and generate reports. Because the dataset is large, a LIGHT configuration flag is available to execute only the first section of the DAG, reducing computational load. The flag also enables a batcher rule, which allows execution in Snakemake batches if needed.
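The effect of the LIGHT flag on the set of requested outputs can be pictured with the short Python sketch below. Everything in it except the flag name is hypothetical: the `select_targets` helper, the output paths, and the default value are illustrative assumptions and do not reflect the repository's actual rules or file layout.

```python
from typing import Dict, List


def select_targets(config: Dict[str, object], records: List[str]) -> List[str]:
    """Pick the outputs to request, depending on the LIGHT flag.

    All paths are hypothetical placeholders for this sketch.
    """
    # Section 1: one independent result per record (hypothetical path).
    per_record = [f"output/{record}/haplotype.csv" for record in records]

    # Assumption: LIGHT defaults to False when it is not set.
    if config.get("LIGHT", False):
        return per_record

    # Section 2: aggregated summaries and reports (hypothetical paths).
    return per_record + ["output/summary.csv", "output/report.html"]
```

With `{"LIGHT": True}`, only the per-record targets are returned, so the summarization part of the DAG is never requested.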

1. Data retrieval and chunking

  • 00a_run_search.sh: queries the ENA Portal API to retrieve sequencing records.
  • 00b_generate_chunks.sh: splits the search results into manageable chunks for processing via SLURM job arrays. The dataset is divided into 16 groups, each containing up to 5000 chunks, with each chunk holding 16 records. This approach worked around limitations that Snakemake had with large DAGs at the time of execution. Chunk settings were chosen to fit our HPC resource limits.
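The chunk layout described above can be sketched in Python to make the arithmetic concrete. This is an illustration under stated assumptions, not a port of 00b_generate_chunks.sh: the helper names are hypothetical, and filling the groups sequentially is one possible reading of how records are distributed.

```python
from typing import List, Sequence

GROUPS = 16               # number of groups
CHUNKS_PER_GROUP = 5000   # maximum number of chunks per group
RECORDS_PER_CHUNK = 16    # number of records per chunk


def max_records() -> int:
    """Upper bound on the records this layout can hold."""
    return GROUPS * CHUNKS_PER_GROUP * RECORDS_PER_CHUNK


def make_chunks(records: Sequence[str],
                size: int = RECORDS_PER_CHUNK) -> List[List[str]]:
    """Split records into consecutive chunks of at most `size` records."""
    return [list(records[i:i + size]) for i in range(0, len(records), size)]


def make_groups(chunks: Sequence[List[str]],
                per_group: int = CHUNKS_PER_GROUP) -> List[List[List[str]]]:
    """Fill groups sequentially with at most `per_group` chunks each."""
    return [list(chunks[i:i + per_group])
            for i in range(0, len(chunks), per_group)]
```

With these settings the layout holds at most 16 × 5000 × 16 = 1,280,000 records. In this reading, each group would correspond to one SLURM job array and each chunk to one array task.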

2. Haplotype classification

3. Summary and reporting

  • 02_run_complete.sh: executes the full workflow (LIGHT=False) to generate summary tables, visualizations, and reports. Given computational constraints, the final manuscript analysis used scripts/summarize_results.py instead. This script parses result files to produce a summary table with key measurements. Data visualizations were then created manually using the same code that is integrated into the workflow.

Citation

The manuscript is currently under review.
