Incorrect number of contigs in the assembly #2360

Open
ahmedelgoutbi opened this issue Jan 8, 2025 · 1 comment

Comments

@ahmedelgoutbi

Hi!
I need to assemble more than 30 yeast genomes that have been sequenced
with PacBio Sequel IIe (12.1 Mb each, with roughly 40x coverage).
The reference genome (Saccharomyces cerevisiae S288C) has 16
chromosomes, but I get a much higher number of contigs (70-100)
after assembling with Canu (see below for the full command). Could
this be caused by repeated sequences that the algorithm cannot
resolve? If so, should I adjust the parameters to refine the
assembly? Alternatively, is there a parameter (or set of parameters)
that I can tweak to bring the contig count closer to the number of
chromosomes?

This is the command that I used to run the full assembly:

canu -p canu_assemble -d canu_assembly genomeSize=12m maxThreads=14 correctedErrorRate=0.015 minReadLength=1000 minOverlapLength=500 saveOverlaps=true -pacbio-hifi rawdata_file.fasta.gz

canu version 2.2

OS Ubuntu 22.04.4 LTS

@skoren
Member

skoren commented Jan 8, 2025

You would almost never get the same number of contigs as chromosomes. There are always extrachromosomal sequences (e.g. mito) that can appear in multiple contigs due to high copy number, and there may be low-frequency variation in the sampled population and/or repeats that are larger than HiFi reads can resolve. Are these all haploid as well? Diploid genomes would at least double the number of contigs. I recommend instead comparing the lengths of the longest contigs against the chromosome lengths and seeing how many pieces per chromosome you have.
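
One way to do the comparison suggested above is to tabulate contig lengths from the assembly FASTA. The following is only a minimal sketch, not part of Canu: the function names (read_fasta_lengths, nx, summarize) are hypothetical, and it assumes the contigs are in a plain or gzipped FASTA file. It reports the longest contig, N50, and how many of the largest contigs are needed to cover 90% of the expected genome size.

```python
import gzip


def read_fasta_lengths(path):
    """Return {sequence_name: length} for a FASTA file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first word of the header as the name.
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def nx(lengths, fraction=0.5, total=None):
    """Return (length, count) at which the largest sequences cover
    `fraction` of `total` (default: the sum of all lengths)."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * (total if total is not None else sum(ordered))
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    # The sequences never reach the target (e.g. incomplete assembly).
    return 0, len(ordered)


def summarize(contig_lengths, genome_size):
    """Summary statistics for a {name: length} mapping of contigs."""
    values = list(contig_lengths.values())
    n50, _ = nx(values, 0.5)
    _, contigs_for_90pct = nx(values, 0.9, total=genome_size)
    return {
        "n_contigs": len(values),
        "total_bp": sum(values),
        "longest_bp": max(values, default=0),
        "n50_bp": n50,
        "contigs_for_90pct": contigs_for_90pct,
    }


# Hypothetical usage; the path assumes Canu's <prefix>.contigs.fasta
# naming with the -p/-d values from the command in this issue:
#   lengths = read_fasta_lengths("canu_assembly/canu_assemble.contigs.fasta")
#   print(summarize(lengths, genome_size=12_100_000))
```

If contigs_for_90pct comes out close to 16 while n_contigs is 70-100, most of the extra contigs are likely small pieces (mito copies, unresolved repeats, low-frequency variants) rather than fragmented chromosomes. Running read_fasta_lengths on the S288C reference FASTA as well gives the chromosome lengths to compare against.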

There isn't really a parameter to resolve repeats if the reads are too short or don't contain enough information to resolve them. You could use the defaults, which would just drop the error rate from 1.5% to 1%, but I doubt it would make much difference. Can you post the full asm.report from your run so we can get an idea of the assembly statistics?
