Add a Crimean-Congo hemorrhagic fever virus dataset #265
This PR updates the CCHF dataset started in #200
It does the following:
- minSeedCover to 10% - this means the total number of sequences on NCBI virus that cannot be aligned using this dataset drops from 148 to 48.
- retryReverseComplement - able to align 25 further sequences (drop to only 9)
- Serena A. Carroll, Brian H. Bird, Pierre E. Rollin, Stuart T. Nichol, Ancient common ancestry of Crimean-Congo hemorrhagic fever virus (2010), link. The nextstrain build used for this can be found here: https://github.com/neherlab/CCHFV
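For readers unfamiliar with these settings, here is a minimal sketch of how the two alignment options above might be merged into a dataset's configuration. It assumes the key names minSeedCover and retryReverseComplement live under an alignmentParams section of pathogen.json, as in Nextclade v3 datasets; the helper merge_alignment_params is hypothetical and not part of this PR.

```python
import json

# Assumption: these keys sit under "alignmentParams" in pathogen.json.
alignment_params = {
    "minSeedCover": 0.1,             # 10%, as stated in this PR
    "retryReverseComplement": True,  # also try the reverse complement of a query
}


def merge_alignment_params(pathogen: dict, params: dict) -> dict:
    """Return a copy of a pathogen.json dict with alignmentParams updated.

    Hypothetical helper for illustration; existing alignment settings are
    kept unless they are overridden by `params`.
    """
    merged = dict(pathogen)
    merged["alignmentParams"] = {**pathogen.get("alignmentParams", {}), **params}
    return merged


if __name__ == "__main__":
    # Start from an empty config here; a real pathogen.json has more fields.
    print(json.dumps(merge_alignment_params({}, alignment_params), indent=2))
```

The same settings would be repeated per segment, since each segment has its own dataset.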
More Insertions or Mutations?
I see high divergence right before the termini of all segments, especially the M segment. This is partially in line with previous findings: "Prominent features of the M RNA segment are a high degree of divergence at the first part of the M genome, along with conservation of the middle and last regions and the 10-nt termini, which are conserved in all nairoviruses" (https://pmc.ncbi.nlm.nih.gov/articles/PMC2730268/).
After discussion with @corneliusroemer, I increased the gapOpen score in all segments, but especially in the M segment, to make insertions more expensive. For most sequences, the number of insertions relative to the parent is now comparable to the number of mutations relative to the parent, whereas before it was approximately double.
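To illustrate why a higher gap-open score shifts the balance from insertions towards mutations, here is a toy affine-gap comparison. It is not Nextclade's scoring code; the function names and all penalty values are hypothetical and chosen only to show the effect.

```python
def gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine cost of one gap: pay gap_open once, plus gap_extend per gapped position."""
    return gap_open + gap_extend * length


def prefers_gap(n_mismatches: int, mismatch_penalty: int,
                gap_length: int, gap_open: int, gap_extend: int) -> bool:
    """True if explaining a region with one gap is strictly cheaper than with mismatches."""
    return gap_cost(gap_length, gap_open, gap_extend) < n_mismatches * mismatch_penalty


if __name__ == "__main__":
    # Hypothetical divergent region: either 8 mismatches or a single 3-nt insertion.
    for gap_open in (6, 12):  # illustrative "default" vs. "increased" values
        choice = "insertion" if prefers_gap(8, 1, 3, gap_open, 0) else "mismatches"
        print(f"gap_open={gap_open}: aligner would favour {choice}")
```

With the lower value the single insertion is cheaper than the mismatches; with the higher value the mismatches win, which is the direction of the change described above.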
M segment alignment, only sequences with high coverage, default gapOpen score:

Current M segment alignment:

Preview
https://clades.nextstrain.org/?dataset-server=https://raw.githubusercontent.com/anna-parker/nextclade_data/cchfv_updates/data_output