Ellmen/derived-wastewater-lineages

Reconstructing SARS-CoV-2 lineages from mixed wastewater sequencing data

This repository runs non-negative matrix factorization (NMF) on wastewater sequencing data to infer lineage definitions from mixed samples. Read more in our Scientific Reports paper:

https://www.nature.com/articles/s41598-024-70416-4
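
To make the idea concrete, here is a minimal sketch of NMF applied to a toy mutation frequency matrix. It is an illustration under stated assumptions, not the code in find_lineages.py: the matrix orientation (rows are samples, columns are mutations), the toy lineages, and the function name nmf_multiplicative are all hypothetical, and the solver shown is the classic multiplicative-update rule, which may differ from the one this repository uses.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=500, seed=0, eps=1e-9):
    """Factor V (samples x mutations) into non-negative W @ H.

    W holds per-sample lineage abundances, H holds per-lineage mutation
    profiles. Uses Lee-Seung multiplicative updates for the Frobenius loss.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_mutations = V.shape
    # Strictly positive random start so the updates never divide by zero.
    W = rng.random((n_samples, n_components)) + eps
    H = rng.random((n_components, n_mutations)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Two hypothetical lineages defined over five mutation sites (1 = present).
lineage_a = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
lineage_b = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

# Six mixed samples with a varying share of lineage A (assumed values).
proportions = np.array([0.9, 0.7, 0.5, 0.3, 0.1, 0.0])
V = np.outer(proportions, lineage_a) + np.outer(1 - proportions, lineage_b)

W, H = nmf_multiplicative(V, n_components=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {relative_error:.4f}")
print(np.round(H, 2))  # each row is one learned mutation profile
```

The rows of H are only determined up to scaling and ordering, so in practice they would be normalized or thresholded before being read as lineage definitions.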

Instructions

The steps below describe the workflow for a typical SARS-CoV-2 run (other viruses are similar).

  1. Download or create a run containing Gromstole "coverage" and "mapped" CSV files in the data/sars-cov-2/runs directory. A thin wrapper for running the Gromstole alignment on FASTQ files is provided in the preprocess directory, but it may require some changes.
  2. Set the virus, number of lineages, fasta name, and runs (including the new run) variables in find_lineages.py. On the first run, leave all subsequent steps uncommented. The mutation frequency matrix and learned NMF vectors are saved, so on later runs you can comment out the generation steps if the data is unchanged.
  3. Run python find_lineages.py. This creates the mutation frequency matrix, learns the NMF vectors, and writes a FASTA file to data/sars-cov-2/[fasta_name].fasta, where fasta_name is specified in find_lineages.py.
  4. (Optional) Analyze the output with a tool like pangolin or Nextclade to determine which lineages have been discovered.
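
As a rough illustration of how a learned NMF vector could become a FASTA record in step 3, the sketch below thresholds one lineage vector and applies the selected substitutions to a reference sequence. Everything here is an assumption made for the example: the mutation label format (reference base, 1-based position, alternate base), the 0.5 threshold, the toy reference, and the function names apply_substitutions and lineage_to_fasta are hypothetical and not taken from find_lineages.py. Insertions and deletions are not handled.

```python
import re


def apply_substitutions(reference, mutations):
    """Apply substitution labels such as 'C2T' to a reference sequence."""
    seq = list(reference)
    for label in mutations:
        match = re.fullmatch(r"([ACGT])(\d+)([ACGT])", label)
        if match is None:
            raise ValueError(f"unsupported mutation label: {label}")
        ref_base, pos, alt_base = match.group(1), int(match.group(2)), match.group(3)
        # Positions in the label are 1-based; Python indexing is 0-based.
        if seq[pos - 1] != ref_base:
            raise ValueError(f"{label} does not match the reference at {pos}")
        seq[pos - 1] = alt_base
    return "".join(seq)


def lineage_to_fasta(name, reference, labels, weights, threshold=0.5):
    """Keep mutations whose weight reaches the threshold and emit FASTA text."""
    called = [label for label, w in zip(labels, weights) if w >= threshold]
    sequence = apply_substitutions(reference, called)
    return f">{name}\n{sequence}\n"


# Toy inputs (assumed): a 10-base reference and one learned lineage vector.
reference = "ACGTACGTAC"
labels = ["C2T", "T4G", "A9C"]
weights = [0.95, 0.10, 0.80]

record = lineage_to_fasta("derived_lineage_0", reference, labels, weights)
print(record, end="")
```

A file of such records is the kind of input that step 4 would pass to pangolin or Nextclade.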
