2 chromosomal fusions from verkko 2.2 assembly not supported by the data #294

Open
tbenavi1 opened this issue Oct 16, 2024 · 11 comments

@tbenavi1

Hello,

I have a recent verkko 2.2 assembly with two chromosomal fusions (chr5-chr17, chr8-chr11). However, these translocations are not supported by the long reads or the hic data, from what I can tell. Would you be able to take a look at the assembly? Let me know what files you need. This sample has 40x hifi, 35x ONT-UL, and 25x hi-c data. One of the fusions looks to be a telomere-telomere fusion, and the other looks to be a centromere-telomere fusion. The breakpoints in question are approximately haplotype2-0000066:84375000 and haplotype1-0000025:133700000.

@skoren
Member

skoren commented Oct 16, 2024

I am not sure what would cause this kind of fusion unless at least some of the reads support it. Can you share the noseq.gfa, paths.tsv, colors.tsv, and scfmap files from this asm?

@tbenavi1
Author

@skoren
Member

skoren commented Oct 16, 2024

The assembly graph is extremely fragmented, much more than I would expect given the coverage. Is this a regular clonal sample? There are over 12k nodes whereas normally we see 2-3k and there is one component connecting most of the chromosomes, also something we don't normally see. I think the joins in question are from Hi-C scaffolding getting confused by the graph structure but the fundamental issue is the very low quality of the graph.
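
The node count and component structure mentioned above can be checked directly from a GFA file. The following is a minimal sketch, not part of verkko: it assumes a standard tab-separated GFA 1 file with `S` (segment) and `L` (link) lines, and the function name `gfa_components` is hypothetical.

```python
from collections import Counter


def gfa_components(lines):
    """Return (node_count, component_sizes) for GFA text given as an iterable of lines.

    Components are computed with union-find over L (link) lines; orientation is
    ignored, so this only measures which nodes are connected at all.
    """
    parent = {}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            parent.setdefault(fields[1], fields[1])
        elif fields[0] == "L":
            a, b = fields[1], fields[3]  # L <from> <orient> <to> <orient> <overlap>
            parent.setdefault(a, a)
            parent.setdefault(b, b)
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

    sizes = Counter(find(node) for node in parent)
    return len(parent), sorted(sizes.values(), reverse=True)
```

Calling `gfa_components(open("assembly.homopolymer-compressed.noseq.gfa"))` (the file name here is an assumption, substitute your own noseq.gfa) returns the total node count and the component sizes, largest first. A single component holding most of the nodes would match the structure described above.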

@tbenavi1
Author

Hello, this sample is prepared from PBMCs purified from whole blood and we assume it to be a regular clonal sample. Specifically, it is a normal control for a tumor-normal pair (but we have not yet generated sequence data for the tumor). We are investigating if there was anything unusual about this sample during the sequencing preps. Is it possible to see which data type is contributing to the connected component in the graph (e.g. HiFi, ONT, and/or Hi-C)?

@skoren
Member

skoren commented Oct 17, 2024

The base graph is built with HiFi and then resolved with ONT. I'd suspect low coverage or short ONT data. What's your N50 of the ONT data and total coverage vs coverage >100kb?
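
For anyone wanting to compute these numbers themselves, here is a rough sketch. It assumes an uncompressed FASTQ with exactly four lines per record, and the genome size is something you supply yourself (for example roughly 3.1e9 for a human haploid genome); the function names are hypothetical and not part of verkko.

```python
def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    lengths = list(lengths)
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def coverage(lengths, genome_size, min_len=0):
    """Fold coverage contributed by reads of at least min_len bases."""
    return sum(length for length in lengths if length >= min_len) / genome_size


def fastq_lengths(path):
    """Yield read lengths from an uncompressed, four-lines-per-record FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the second line of each record
                yield len(line.strip())
```

With `lengths = list(fastq_lengths("ont.fastq"))` (a placeholder path), `n50(lengths)` gives the N50, `coverage(lengths, 3.1e9)` the total coverage, and `coverage(lengths, 3.1e9, min_len=100_000)` the coverage from reads of 100kb or longer.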

@tbenavi1
Author

Hello, the total ONT coverage is 35x, the coverage over 100kb is 8x. The N50 is around 60kbp. It looks like the ONT data was from the ligation sequencing kit SQK-LSK114 and not the ultra long kit.

@skoren
Member

skoren commented Oct 17, 2024

Ah, that is likely the reason for the complex graph. The ligation prep isn't going to have a long UL tail, and 8x of coverage over 100kb (so 4x/haplotype) is likely not enough. The best solution is probably more ONT UL data.

@Dmitry-Antipov is there a scaffold log that would let us know if this join is from Hi-C scaffolding or something else in the graph that you want to take a look at?

@tbenavi1
Author

Yes, I'm very curious. I understand that scaffolding with a fragmented graph is a difficult problem, but I would hope that it wouldn't put two pieces together that shouldn't be put together. Thanks for taking a look.

@Dmitry-Antipov
Contributor

Hi,
Both of those fusions are caused not by Hi-C scaffolding but by the assembly graph structure: excessive joins near coverage breaks leave only one haplotype-preserving path through the assembly graph, which is actually an erroneous fusion.

It seems that in your graph, for some reason, telomeres are often connected to the middle of other chromosomes, and we do not normally see this. Can you also share 8-hicPipeline/unitigs.telo?

@tbenavi1
Author

Hello, I have added unitigs.telo to the folder.

@Dmitry-Antipov
Contributor

Dmitry-Antipov commented Nov 6, 2024

So those fusions are similar.
For example, for the chromosome on the left in the image below, the graph supports connections from one subtelomeric region to the centromere of another haplotype in one of the haplotypes. The other haplotype ends in a telomere (black), as expected.
[image: telomere]

We can further explore why this happens, but this may be tricky.
The first suspect would be ONT gap closing, which (either with chimeric reads or because of a GraphAligner error) erroneously closed a gap in the HiFi data.
To check this we'll need:
6-layoutContigs/combined-nodemap.txt
3-alignTips/alns-ont.gaf
2-processGraph/unitig-unrolled-hifi-resolved.hifi-coverage.csv
2-processGraph/unitig-unrolled-hifi-resolved.ont-coverage.csv
2-processGraph/unitig-unrolled-hifi-resolved.gfa

The last file contains all node sequences, but we need only the graph structure. So I'd suggest replacing all the sequences (the third field in lines starting with S) with * to simplify sharing and to avoid potential privacy problems.
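
One way to do that replacement is sketched below. It assumes a tab-separated GFA 1 file; the function name is hypothetical. As an extra that was not requested above, it appends an `LN:i:` length tag (a standard optional GFA 1 segment tag) when one is not already present, so node lengths survive the stripping. Drop that part if it is not wanted.

```python
def strip_sequences(lines):
    """Yield GFA lines with the sequence field of every S line replaced by '*'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            seq = fields[2]
            has_len = any(f.startswith("LN:i:") for f in fields[3:])
            # Assumption: keep the node length as an LN tag if it would otherwise be lost.
            if seq != "*" and not has_len:
                fields.append(f"LN:i:{len(seq)}")
            fields[2] = "*"
        yield "\t".join(fields)
```

For example, reading `unitig-unrolled-hifi-resolved.gfa` line by line and writing each stripped line plus a newline to a new file (any output name works) gives a structure-only graph that is much smaller to share.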

But possibly the simplest solution would be to just generate more ONT data and reassemble.
