Add a Crimean-Congo hemorrhagic fever virus dataset #265

Open
wants to merge 9 commits into
base: master

anna-parker (Contributor) commented Feb 6, 2025

This PR updates the CCHF dataset started in #200

It does the following:

  1. Adds basic QC.
  2. Updates alignment parameters:
  • Decreases minSeedCover to 10%. The total number of sequences on NCBI Virus that cannot be aligned using this dataset drops from 148 to 48.
  • Decreases the k-mer size to 7 for segments M and S, a further drop to only 34 sequences.
  • Adds retryReverseComplement, which aligns 25 further sequences (a drop to only 9).
  • Increases the gapOpen scores to 18, 19 and 20 for the S and L segments and to 24, 26 and 28 for the M segment (see reasoning below).
  3. Adds clades for the S segment using the annotation defined in
    Serena A. Carroll, Brian H. Bird, Pierre E. Rollin, Stuart T. Nichol,
    Ancient common ancestry of Crimean-Congo hemorrhagic fever virus
    (2010), link. The Nextstrain build used for this can be found here: https://github.com/neherlab/CCHFV
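
For reviewers who want the per-segment settings in one place, here is a rough Python sketch that assembles them into an `alignmentParams`-style dictionary. It is not copied from the dataset files. It assumes that 10% is written as the fraction `0.1`, that the k-mer size corresponds to a `kmerLength` key, and that the three gapOpen values map, in order, to `penaltyGapOpen`, `penaltyGapOpenInFrame` and `penaltyGapOpenOutOfFrame`. The helper name `alignment_params` is made up for illustration.

```python
import json

# Settings shared by all three segments, as described in the list above.
# Assumption: the 10% seed cover is expressed as a fraction.
COMMON = {"minSeedCover": 0.1, "retryReverseComplement": True}

# Assumption: the three gapOpen values map to these keys in this order.
GAP_KEYS = ("penaltyGapOpen", "penaltyGapOpenInFrame", "penaltyGapOpenOutOfFrame")

SEGMENTS = {
    "S": {"kmerLength": 7, "gap_open": (18, 19, 20)},
    "M": {"kmerLength": 7, "gap_open": (24, 26, 28)},
    "L": {"gap_open": (18, 19, 20)},  # k-mer size left at the tool default
}


def alignment_params(segment):
    """Build the alignment settings for one segment (hypothetical helper)."""
    spec = SEGMENTS[segment]
    params = dict(COMMON)
    if "kmerLength" in spec:
        params["kmerLength"] = spec["kmerLength"]
    params.update(zip(GAP_KEYS, spec["gap_open"]))
    return params


if __name__ == "__main__":
    for seg in SEGMENTS:
        print(seg)
        print(json.dumps({"alignmentParams": alignment_params(seg)}, indent=2))
```

The actual values live in each segment's dataset configuration in this branch; the sketch only mirrors the numbers stated in this description.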

More Insertions or Mutations?

I see high divergence right before the termini of all segments, especially the M segment. This is partially in line with previous findings: "Prominent features of the M RNA segment are a high degree of divergence at the first part of the M genome, along with conservation of the middle and last regions and the 10-nt termini, which are conserved in all nairoviruses" (https://pmc.ncbi.nlm.nih.gov/articles/PMC2730268/).

After discussion with @corneliusroemer I increased the gapOpen score in all segments, but especially in the M segment, to make insertions more expensive. For most sequences the number of insertions relative to the parent is now comparable to the number of mutations relative to the parent, whereas before it was approx. double that.
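
To illustrate why a higher gap-open score shifts the balance from insertions towards mutations, here is a toy calculation in Python. It is not the aligner's actual scoring code. The match score (3), mismatch penalty (1), gap-extend cost (0) and function names are assumptions chosen for illustration. The toy model compares two ways of explaining a divergent stretch: aligning it column by column, or skipping it with one insertion plus one deletion of the same length.

```python
def score_as_mismatches(n_cols, n_mismatch, match=3, mismatch=1):
    """Score of aligning a stretch column by column, without gaps."""
    return match * (n_cols - n_mismatch) - mismatch * n_mismatch


def score_as_gap_pair(n_cols, gap_open, gap_extend=0):
    """Score of skipping the stretch with one insertion and one deletion.

    Affine gap model: each gap costs gap_open + gap_extend * length.
    """
    return -2 * (gap_open + gap_extend * n_cols)


def prefers_gaps(n_cols, n_mismatch, gap_open):
    """True if the gapped explanation scores strictly higher."""
    return score_as_gap_pair(n_cols, gap_open) > score_as_mismatches(n_cols, n_mismatch)


if __name__ == "__main__":
    # A 20-column stretch in which 19 columns disagree.
    for gap_open in (6, 18, 24):
        print(gap_open, prefers_gaps(20, 19, gap_open))
```

Under these assumed scores, the 20-column stretch is cheaper to explain with gaps when the gap-open cost is low and cheaper to explain with mutations when it is high, which is the direction of the effect described above.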

M segment alignment, only sequences with high coverage, default gapOpen score:
image

Current M segment alignment:
image

Preview

https://clades.nextstrain.org/?dataset-server=https://raw.githubusercontent.com/anna-parker/nextclade_data/cchfv_updates/data_output

@anna-parker changed the title from "add CCHF" to "WIP: Add CCHF" on Feb 6, 2025
@anna-parker force-pushed the cchfv_updates branch 3 times, most recently from a995211 to f0dae5c, on February 7, 2025 15:07