Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trio mode:phasing error #312

Open
cheninouc opened this issue Jan 7, 2025 · 7 comments
Open

Trio mode:phasing error #312

cheninouc opened this issue Jan 7, 2025 · 7 comments

Comments

@cheninouc
Copy link

Hi,
For the species I studied, 0.6Gb, diploid, I used the following procedure for haplotype genome assembly:

meryl count compress k=30 threads=10  memory=100 P48.fq.gz  output P48_compress.k30.meryl
meryl count compress k=30 threads=10  memory=100 P19.fq.gz  output P19_compress.k30.meryl
$MERQURY/trio/hapmers.sh P48_compress.k30.meryl P19_compress.k30.meryl
verkko -d asm --hifi ../data/kw1-1w.all.ccs.fasta.gz --nano ../data/pass.all.fq.gz   --threads 20   --hap-kmers ../data/P19_compress.k30.only.meryl ../data/P48_compress.k30.only.meryl trio

the results:

-rw-r--r-- 1 chenchenguang LSGB 4.8K Jan  6 17:20 verkko.yml
-rwxr-xr-x 1 chenchenguang LSGB  390 Jan  6 17:20 snakemake.sh
-rw-r--r-- 1 chenchenguang LSGB  15G Jan  6 21:12 hifi-corrected.fasta.gz
drwxr-xr-x 6 chenchenguang LSGB 4.0K Jan  6 21:12 0-correction
drwxr-xr-x 2 chenchenguang LSGB 4.0K Jan  7 00:24 1-buildGraph
drwxr-xr-x 3 chenchenguang LSGB 4.0K Jan  7 01:09 3-align
drwxr-xr-x 3 chenchenguang LSGB 4.0K Jan  7 01:20 3-alignTips
drwxr-xr-x 2 chenchenguang LSGB 4.0K Jan  7 01:35 2-processGraph
drwxr-xr-x 2 chenchenguang LSGB 4.0K Jan  7 01:35 4-processONT
drwxr-xr-x 2 chenchenguang LSGB 4.0K Jan  7 01:36 5-untip
drwxr-xr-x 2 chenchenguang LSGB 4.0K Jan  7 01:41 6-rukki
-rw-r--r-- 1 chenchenguang LSGB  52K Jan  7 01:41 assembly.paths.tsv
-rw-r--r-- 1 chenchenguang LSGB  31K Jan  7 01:41 assembly.colors.csv
drwxr-xr-x 2 chenchenguang LSGB 4.0K Jan  7 01:45 6-layoutContigs
-rw-r--r-- 1 chenchenguang LSGB 7.4M Jan  7 02:54 assembly.disconnected.fasta
drwxr-xr-x 3 chenchenguang LSGB 4.0K Jan  7 02:55 7-consensus
-rw-r--r-- 1 chenchenguang LSGB 104M Jan  7 02:56 assembly.haplotype1.fasta
-rw-r--r-- 1 chenchenguang LSGB 1.5M Jan  7 02:56 assembly.unassigned.fasta
-rw-r--r-- 1 chenchenguang LSGB 1.1G Jan  7 02:56 assembly.haplotype2.fasta
-rw-r--r-- 1 chenchenguang LSGB 109K Jan  7 02:58 assembly.homopolymer-compressed.noseq.gfa
-rw-r--r-- 1 chenchenguang LSGB 785M Jan  7 02:58 assembly.homopolymer-compressed.gfa
-rw-r--r-- 1 chenchenguang LSGB  22K Jan  7 02:58 assembly.ont-coverage.csv
-rw-r--r-- 1 chenchenguang LSGB 248M Jan  7 02:58 assembly.homopolymer-compressed.layout
-rw-r--r-- 1 chenchenguang LSGB  22K Jan  7 02:58 assembly.hifi-coverage.csv
-rw-r--r-- 1 chenchenguang LSGB 1.2G Jan  7 02:58 assembly.fasta
-rw-r--r-- 1 chenchenguang LSGB  67K Jan  7 02:58 assembly.scfmap

The size of haplotype 1 and haplotype 2 is inconsistent.
Did one of these steps go wrong?
Thanks for your help!

Best regards,
Chenguang

@skoren
Copy link
Member

skoren commented Jan 7, 2025

I'd suspect there is an issue with your markers/parents, it's possible merqury picked a bad threshold. Is this a true trio? Merqury provides some visual outputs you can use to look at what the parent marker counts look like, can you post those here? Can you also share the noseq.gfa and colors.csv file?

@cheninouc
Copy link
Author

Hi,
When I used only the illumina data of the parent, there was no visual output file, and when I used the hifi data of F1 as a child, I got 3 image files.

meryl count compress k=30 threads=10  memory=100 P48.fq.gz  output P48_compress.k30.meryl
meryl count compress k=30 threads=10  memory=100 P19.fq.gz  output P19_compress.k30.meryl
meryl count compress k=30 threads=10 memory=100 kw1-1w.all.ccs.fasta.gz output child_compress.k30.meryl
$MERQURY/trio/hapmers.sh P48_compress.k30.meryl P19_compress.k30.meryl child_compress.k30.meryl

visual output file:
inherited_hapmers.ln.png:
inherited_hapmers ln
inherited_hapmers.fl.png:
inherited_hapmers fl
inherited_hapmers.st.png:
inherited_hapmers st

assembly.homopolymer-compressed.noseq.gfa:
assembly.homopolymer-compressed.noseq.gfa.txt

assembly.colors:
assembly.colors.csv

@skoren
Copy link
Member

skoren commented Jan 8, 2025

Something is very wrong with the P19 haplotype k-mers. Note how the blue line is flat and there is always a large read-only component which are in the offspring but neither parent. That shouldn't happen. To me it looks like P19 isn't related to the sample you're assembling. If not, then something must be going wrong with the counting of k-mers of that sample. This observation is consistent with the graph where all the nodes only have markers from P48 (your hap2) which is why all the genome is assigned to that sample. If you have Hi-C for this sample I'd recommend using that instead of a trio.

@arangrhie, do you agree with my assessment? Any logs to confirm there isn't another issue? @cheninouc can you post the full Meryl log for P19, the output of meryl print P19_compress.k30.meryl |head and the full merqury log as well.

@cheninouc
Copy link
Author

Meryl log for P19:
meryl_P19_log.txt

meryl print  P19_compress.k30.meryl |head

Found 1 command tree.

PROCESSING TREE #1 using 224 threads.
  opLessThan
    P19_compress.k30.meryl
    print to (stdout)
GCTACACACACACACACACACACACACAGC	1
GCTACACACACACACACACACACACACTAC	2
GCTACACACACACACACACACACACATGAC	3
GCTACACACACACACACACACACACAGCAC	2
GCTACACACACACACACACACACACAGTAC	2
GCTACACACACACACACACACACACGACGC	1
GCTACACACACACACACACACACATCACAC	6
GCTACACACACACACACACACACATGTGAC	20
GCTACACACACACACACACACACTACACAC	1
GCTACACACACACACACACACACTACACGC	1

Also, I tried 2 other assemblies and they all had the same problem:

-rw-r--r-- 1 chenchenguang LSGB  1140678496 Jan  7 09:29 assembly.haplotype1.fasta
-rw-r--r-- 1 chenchenguang LSGB    55853001 Jan  7 09:29 assembly.haplotype2.fasta
-rw-r--r-- 1 chenchenguang LSGB  1102452500 Jan  7 05:10 assembly.haplotype1.fasta
-rw-r--r-- 1 chenchenguang LSGB   115582150 Jan  7 05:10 assembly.haplotype2.fasta

@skoren
Copy link
Member

skoren commented Jan 8, 2025

Running another asm is not going to help, the issue the markers are bad so all assemblies will end up not able to sort the haplotypes since the trio information isn't valid.

It seems that the initial P19 count didn't fail, though I'm suspicious of the high number of 1 and 2 count k-mers and that the first k-mer starts with a G not A (they should be sorted alphabetically). For comparison here's an output of a 35-fold human DB:

meryl print ilmi.meryl |head

Found 1 command tree.

PROCESSING TREE #1 using 8 threads.
  opLessThan
    ilmn.meryl/
    print to (stdout)
ACGACACACACACACACACACACACACACA  11643
ACGACACACACACACACACACACACACACT  447
ACGACACACACACACACACACACACACACG  319
ACGACACACACACACACACACACACACATA  171
ACGACACACACACACACACACACACACATC  143
ACGACACACACACACACACACACACACATG  18
ACGACACACACACACACACACACACACAGA  310
ACGACACACACACACACACACACACACAGC  176
ACGACACACACACACACACACACACACTAC  95
ACGACACACACACACACACACACACACTAT  85

This implies P19 is so low coverage as to be noise.

What is the input datatype for P48 and P19, illumina?
How much coverage do you have for each P48 and P19?
Can you use the output from meryl to make a histogram and run it through genomescope and post the link to the results?
Can you run du --max-depth 1 -L -h on the folder containing all the output from merqury run on the parents + F1 (e.g. inherited*meryl, etc)?

@cheninouc
Copy link
Author

Hi
The input datatype for both P48 and P19 is Illumina sequencing, and the sequencing depth for both is approximately 30X.
P19 genomescope results:http://genomescope.org/genomescope2.0/analysis.php?code=1gFuiMm8y0waEcZTomx5
P48 genomescope results:http://genomescope.org/genomescope2.0/analysis.php?code=S1xsrgmqrc9qwVy0CNbd

du --max-depth 1 -L -h
2.6G	./child_compress.k30.inherited.meryl
1.7M	./child_compress.k30_and_P19_compress.k30.only.meryl
945M	./P48_compress.k30.inherited.gt5.meryl
1.7G	./P48_compress.k30_not_P19_compress.k30.gt3.meryl
2.0M	./P19_compress.k30.only.meryl
1.7G	./shrd.inherited.meryl
4.0G	./P19_compress.k30_not_P48_compress.k30.meryl
1.9G	./shrd.meryl
959M	./child_compress.k30_and_P48_compress.k30.only.meryl
6.2G	./child_compress.k30.meryl
1.6G	./shrd.filt.meryl
5.0G	./P48_compress.k30.meryl
1.6M	./P19_compress.k30.inherited.gt5.meryl
15G	./test
5.7G	./P19_compress.k30.meryl
3.9G	./read.only.meryl
3.3G	./P48_compress.k30_not_P19_compress.k30.meryl

@skoren
Copy link
Member

skoren commented Jan 10, 2025

It looks like both are below the 30x, the P48 looks like 18-ish in genomescope and P19 is lower, probably closer to 12x which is definitely borderline. It also looks like all P19 k-mers are being lost when they are intersected with the offspring. You can see P48_compress.k30.meryl is 5gb similar to P19_compress.k30.meryl at 5.7. But then, the inherited is much smaller. For P48 t's close to 1GB for child_compress.k30_and_P48_compress.k30.only.meryl but child_compress.k30_and_P19_compress.k30.only.meryl is 2 mb.

So I would conclude that adding P19 coverage may help but not sure how much, it really doesn't look like P19 is a parent. I don't think verkko or meryl is doing anything wrong given this data.

How confident are you in the trio?
Do you have another phasing datatype (e.g. Hi-C)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants