Haplotype sampling improvements #4329

jltsiren · 2024-07-04T09:31:27Z

Changelog Entry

To be copied to the draft changelog by merger:

Experimental haploid scoring model for haplotype sampling.

Description

This PR adds an experimental haploid scoring model for haplotype sampling. It differs from the current (diploid) model in two ways: the most common kmer count is always used as the estimate of kmer coverage, and heterozygous kmers are reclassified as homozygous.
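As a rough illustration of the two differences, here is a minimal Python sketch. It is not the vg implementation: the function names, the `absent`/`heterozygous`/`homozygous` labels as strings, the 0.25 and 0.75 thresholds, and the choice to ignore zero counts are all assumptions made for the example.

```python
from collections import Counter


def estimate_coverage(kmer_counts):
    """Estimate kmer coverage as the most common kmer count.

    kmer_counts maps each kmer to the number of times it was seen in the reads.
    Zero counts are ignored (an assumption for this sketch), and ties are
    broken towards the larger count.
    """
    histogram = Counter(c for c in kmer_counts.values() if c > 0)
    if not histogram:
        return 0
    # Pick the count with the highest frequency; prefer larger counts on ties.
    return max(histogram.items(), key=lambda item: (item[1], item[0]))[0]


def classify(count, coverage, haploid=False):
    """Classify a kmer by its count relative to the coverage estimate.

    The thresholds are hypothetical placeholders, not the values used in vg.
    """
    if coverage == 0 or count < 0.25 * coverage:
        label = "absent"
    elif count < 0.75 * coverage:
        label = "heterozygous"
    else:
        label = "homozygous"
    # Haploid model: there are no heterozygous kmers, so treat them as homozygous.
    if haploid and label == "heterozygous":
        label = "homozygous"
    return label


counts = {"A": 10, "B": 10, "C": 5, "D": 1, "E": 10}
coverage = estimate_coverage(counts)
labels = {kmer: classify(c, coverage, haploid=True) for kmer, c in counts.items()}
```

In this toy input, the kmer with count 5 would be heterozygous under the diploid-style classification and becomes homozygous when `haploid=True`.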

There is also an optional non-greedy approach for selecting subchain boundaries. It tries to ensure that each haplotype visits each subchain only once. If a haplotype visits both orientations of a potential end node of a subchain, the subchain is extended until we find a node where this does not happen. This should avoid sequence loss from haplotypes that reverse their orientation, return to a previous subchain, reverse again, and continue forward.
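The boundary rule can be sketched as follows. This is a simplified Python model, not the vg code: haplotypes are represented as lists of hypothetical `(node_id, is_reverse)` visits, and `candidates` is an assumed ordered list of potential end nodes along the chain.

```python
def visits_both_orientations(node, haplotypes):
    """Return True if any haplotype visits `node` in both orientations."""
    for path in haplotypes:
        visited = set(path)
        if (node, False) in visited and (node, True) in visited:
            return True
    return False


def extend_subchain_end(candidates, start, haplotypes):
    """Find the first acceptable end node at or after index `start`.

    Returns the index into `candidates` of the first node that no haplotype
    visits in both orientations, or None if there is no such node.
    """
    for i in range(start, len(candidates)):
        if not visits_both_orientations(candidates[i], haplotypes):
            return i
    return None


# Toy example: one haplotype reverses at node 3, another at node 5.
candidates = [3, 5, 8]
haplotypes = [
    [(1, False), (3, False), (3, True), (3, False), (5, False), (8, False)],
    [(1, False), (3, False), (5, False), (5, True), (5, False), (8, False)],
]
end_index = extend_subchain_end(candidates, 0, haplotypes)
```

Here the subchain is extended past nodes 3 and 5 and ends at node 8, the first candidate that every haplotype crosses in a single orientation.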

The PR also simplifies WFAExtender, which is used in long-read Giraffe, and makes it ignore points that have fallen too far behind.

jltsiren added 9 commits May 4, 2024 21:45

Simplify WFAExtender

db62217

Ignore points that have fallen too far behind in WFAExtender

eb26e32

Non-greedy subchain boundaries

64fe857

Move Haplotypes to Recombinator

9e2ef48

Haploid scoring option for haplotype sampling

0bd1c84

Make non-greedy subchain boundaries optional

65aaf71

More parameter validation in vg gbwt

a614f71

Add presets for haplotype sampling

a018e65

Merge branch 'master' of https://github.com/vgteam/vg

333f269

jltsiren merged commit 7009793 into master Jul 4, 2024
2 checks passed

jltsiren deleted the haplotype-sampling-improvements branch July 4, 2024 12:20
