Is your feature request related to a problem? Please describe.
Yes. Many public health laboratories have limited experience working with Candida auris in the lab (identifying the species via wet-lab techniques, MALDI-TOF, colony morphology, etc.), which may lead to sequencing of samples that are not pure cultures/isolates.
One example: a mixed culture of C. auris and C. parapsilosis. We recently looked at a sample that had roughly 36.6% of reads assigned to C. auris and 57.7% of reads assigned to C. parapsilosis. A de novo assembly of the FASTQs from this sample resulted in a genome size of roughly 25 Mb, about twice the roughly 12 Mb expected for C. auris, which is further evidence of a mixed isolate.
MycoSNP currently does not check for read-level contamination, AFAIK. Additionally, with consensus/reference-based assembly, it may be difficult to identify mixed/contaminated samples from the current MycoSNP QC outputs. The above-mentioned sample did fail MycoSNP QC due to a low GC content of 40.3%, but it was not obvious what was going on with the sample.
kraken2 (along with a proper database & fine-tuned parameters) could be used to screen the reads for potential contamination and ensure that the reads going into assembly are indeed from C. auris alone.
Describe the solution you'd like
Add a step early in the workflow that runs the FASTQs through kraken2.
Check the kraken2 report for a high percentage of reads (e.g. >80%) assigned to Candida auris and a low-to-zero percentage of reads (e.g. <10%) assigned to another Candida species or any other contaminating species.
Provide the kraken2 report file as an output from the workflow.
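The report check described above could look something like the sketch below. It is only an illustration, not MycoSNP code: the function names and the example report are hypothetical, the read counts and taxids in the example are placeholders, and the thresholds are the example values suggested above. It assumes the standard tab-separated kraken2 report, where the first column is the clade percentage and the last three columns are rank code, taxid, and name. The species name string depends on the taxonomy used to build the database, so the exact spelling of the target name is also an assumption.

```python
from typing import Dict, Tuple


def species_percentages(report_text: str) -> Dict[str, float]:
    """Map each species-level name (rank code 'S') to its clade read percentage."""
    percentages: Dict[str, float] = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Last three columns: rank code, taxid, indented scientific name.
        if fields[-3] == "S":
            percentages[fields[-1].strip()] = float(fields[0])
    return percentages


def screen_sample(
    report_text: str,
    target: str = "Candida auris",   # assumed spelling; check your database
    min_target_pct: float = 80.0,    # example threshold from this issue
    max_other_pct: float = 10.0,     # example threshold from this issue
) -> Tuple[bool, float, str, float]:
    """Return (passed, target %, most abundant other species, its %)."""
    pcts = species_percentages(report_text)
    target_pct = pcts.get(target, 0.0)
    others = {name: pct for name, pct in pcts.items() if name != target}
    top_other, top_other_pct = max(
        others.items(), key=lambda item: item[1], default=("", 0.0)
    )
    passed = target_pct > min_target_pct and top_other_pct < max_other_pct
    return passed, target_pct, top_other, top_other_pct


# Illustrative report using the percentages from the mixed sample above.
# Read counts and taxids are placeholders, not real values.
EXAMPLE_REPORT = (
    "  5.70\t5700\t5700\tU\t0\tunclassified\n"
    " 94.30\t94300\t0\tR\t1\troot\n"
    " 57.70\t57700\t57700\tS\t100\t  Candida parapsilosis\n"
    " 36.60\t36600\t36600\tS\t200\t  Candida auris\n"
)

if __name__ == "__main__":
    print(screen_sample(EXAMPLE_REPORT))
```

With thresholds like these, the mixed sample above would be flagged before assembly rather than failing later on GC content alone.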
Describe alternatives you've considered
None.
kapsakcj changed the title from "consider checking for read-level contamination with Kraken2" to "[feature request] consider checking for read-level contamination with Kraken2" on Apr 28, 2022
Additional context
I had a bad time using the standard Kraken2 databases built from sequences in RefSeq. They do not seem to include any C. auris assemblies, and few other Candida species are present. Nearly all of my test FASTQs were assigned as unclassified by kraken2.
I had good luck with the pre-built k2 database called "EuPathDB48" for eukaryotic pathogens, found here: https://benlangmead.github.io/aws-indexes/k2#:~:text=.txt-,EuPathDB48,-3
If you visit this link and CTRL+F for "Candida", you can see all Candida species present in the database: https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt
The downside of this database is that it is huge: it required 34 GB of RAM to run and would be cumbersome for users to routinely download and use, so it is not practical for routine screening of FASTQs.
One alternative idea is to create a small custom kraken2 database, potentially hosted on Azure cloud storage, Zenodo, or some other archival service, that is built from high-quality Candida auris (and other Candida spp.) reference genomes and could be used to identify contamination between Candida species as well as other common contaminants (human? others?).
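As a rough sketch of what building such a custom database might involve, the snippet below assembles the usual `kraken2-build` steps (download taxonomy, add each reference FASTA to the library, build). The helper name, database directory, and FASTA file names are hypothetical, and nothing here has been tested against real Candida references. Note that kraken2 needs to link each sequence to a taxid, so the FASTA headers would need either accessions it can look up or explicit `kraken:taxid|...` tags.

```python
import shlex
import subprocess
from typing import List, Sequence


def kraken2_build_commands(
    db_dir: str, fasta_files: Sequence[str], threads: int = 4
) -> List[List[str]]:
    """Return the kraken2-build commands for a small custom database."""
    # Step 1: fetch the NCBI taxonomy into the database directory.
    commands = [["kraken2-build", "--download-taxonomy", "--db", db_dir]]
    # Step 2: add each reference genome to the database library.
    for fasta in fasta_files:
        commands.append(
            ["kraken2-build", "--add-to-library", fasta, "--db", db_dir]
        )
    # Step 3: build the database from the taxonomy and library.
    commands.append(
        ["kraken2-build", "--build", "--threads", str(threads), "--db", db_dir]
    )
    return commands


def build_database(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; only execute it when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; replace with real reference assemblies.
    references = ["c_auris_ref.fasta", "c_parapsilosis_ref.fasta"]
    build_database(kraken2_build_commands("candida_k2_db", references))
```

The dry-run default just prints the commands, which makes it easy to review them before committing to the taxonomy download and build.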