[feature request] consider checking for read-level contamination with Kraken2 #52

kapsakcj · 2022-04-28T19:04:29Z

Is your feature request related to a problem? Please describe.
Yes. Many public health laboratories have limited experience working with Candida auris in the lab (identifying the species via wet-lab techniques, MALDI-TOF, colony morphology, etc.), which may lead to sequencing of samples that are not pure cultures/isolates.

One example: a mixed culture of C. auris and C. parapsilosis. We recently looked at a sample that had roughly 36.6% of reads assigned to C. auris and 57.7% of reads assigned to C. parapsilosis. A de novo assembly of the FASTQs from this sample resulted in a genome size of roughly 25 Mb, which is also evidence of a mixed isolate.

MycoSNP currently does not check for read-level contamination, AFAIK. Additionally, with consensus/reference-based assembly, it may be difficult to identify mixed/contaminated samples given the current QC outputs from MycoSNP. The above-mentioned sample did fail MycoSNP QC due to a low GC content of 40.3%, but it was not obvious what was going on with the sample.

kraken2 (along with a proper database & fine-tuned parameters) could be used to screen the reads for potential contamination and ensure that the reads going into assembly are indeed from C. auris alone.

Describe the solution you'd like

Add a step early in the workflow that runs the FASTQs through kraken2
Check the kraken2 report for a high percentage of reads (e.g. >80%) assigned to Candida auris and a low-to-zero percentage of reads (e.g. <10%) assigned to any other Candida species or other contaminating species
Provide the kraken2 report file as an output from the workflow
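
The check described above could be sketched roughly as follows. This is only an illustration, not part of MycoSNP: the function name `screen_kraken2_report` is hypothetical, the default thresholds are the example values from this request (>80% and <10%), and the default taxid 498019 is the [Candida] auris taxid shown in the reports further down. It assumes the standard tab-delimited kraken2 report layout (percent, clade reads, taxon reads, rank code, taxid, name).

```shell
#!/usr/bin/env bash
# Hypothetical helper: screen a kraken2 report for a dominant target species.
# Usage: screen_kraken2_report <report> [target_taxid] [min_target_pct] [max_other_pct]
screen_kraken2_report() {
  local report="$1"
  local target_taxid="${2:-498019}"  # [Candida] auris, taken from the reports in this issue
  local min_target="${3:-80}"        # example threshold from this request
  local max_other="${4:-10}"         # example threshold from this request

  awk -F'\t' -v target="$target_taxid" -v min_t="$min_target" -v max_o="$max_other" '
    {
      pct   = $1 + 0                 # percent of reads in this clade
      rank  = $4; gsub(/ /, "", rank)
      taxid = $5 + 0
      name  = $6; sub(/^ +/, "", name)   # strip the indentation kraken2 adds
    }
    # Percent of reads assigned to the target species
    taxid == target + 0 { target_pct = pct }
    # Track the most abundant species-level (rank code "S") taxon other than the target
    rank == "S" && taxid != target + 0 && pct > other_pct { other_pct = pct; other_name = name }
    END {
      verdict = (target_pct + 0 > min_t + 0 && other_pct + 0 < max_o + 0) ? "PASS" : "FAIL"
      printf "%s\ttarget=%.2f%%\ttop_other=%.2f%%\t%s\n", verdict, target_pct, other_pct, other_name
    }
  ' "$report"
}
```

Run as `screen_kraken2_report mixed-sample.EuPathDB48.report.out`, this should flag the mixed sample below as FAIL, since only 36.59% of its reads are assigned to C. auris while 57.68% go to C. parapsilosis. Only species-level rows are compared, so strain rows (S1) are not counted twice.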

Describe alternatives you've considered
None.

Additional context
I had poor results with the standard Kraken2 databases built from sequences in RefSeq. They do not appear to include any C. auris assemblies, and few other Candida species are present. Nearly all of the reads in my test FASTQs were reported as unclassified by kraken2.

I had good luck with the pre-built k2 database called "EuPathDB48" for Eukaryotic pathogens found here: https://benlangmead.github.io/aws-indexes/k2#:~:text=.txt-,EuPathDB48,-3

If you visit this link and CTRL+F for "Candida", you can see all Candida species present in the database: https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt

The downside of this database is its size: it required 34 GB of RAM to run and would be cumbersome for users to download and use routinely, so it is not practical for routine screening of FASTQs.

One alternative is to create a small custom kraken2 database, potentially hosted on Azure cloud storage, Zenodo, or some other archival service, built from high-quality Candida auris (and other Candida spp.) reference genomes. It could be used to identify contamination between Candida species as well as other common contaminants (human? others?).
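
A rough sketch of how such a custom database might be built with `kraken2-build` is below. The helper names (`tag_fasta_taxid`, `build_custom_k2_db`), the FASTA file names, and the database directory are all hypothetical; the taxids are the ones shown in the reports further down (498019 for [Candida] auris, 5480 for Candida parapsilosis). Tagging each FASTA header with `kraken:taxid|<taxid>` is the mechanism kraken2 documents for adding sequences that lack a recognised accession. Note that `--download-taxonomy` fetches the full NCBI taxonomy, which is itself a sizeable download.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "|kraken:taxid|<taxid>" to every FASTA header ID
# so kraken2-build can place the sequences in the taxonomy.
tag_fasta_taxid() {
  local fasta="$1" taxid="$2"
  sed -E "/^>/ s/^(>[^[:space:]]+)/\1|kraken:taxid|${taxid}/" "$fasta"
}

# Hypothetical helper: build a small kraken2 database from tagged FASTA files.
# Requires kraken2-build on PATH; it is defined here but not called.
build_custom_k2_db() {
  local db="$1"; shift
  kraken2-build --download-taxonomy --db "$db"
  local fasta
  for fasta in "$@"; do
    kraken2-build --add-to-library "$fasta" --db "$db"
  done
  kraken2-build --build --threads 8 --db "$db"
}

# Example usage (file names are placeholders, not real files):
#   tag_fasta_taxid Cauris_ref.fasta 498019 > Cauris_ref.tagged.fasta
#   tag_fasta_taxid Cparapsilosis_ref.fasta 5480 > Cparapsilosis_ref.tagged.fasta
#   build_custom_k2_db k2-db-candida Cauris_ref.tagged.fasta Cparapsilosis_ref.tagged.fasta
```

Which reference genomes to include, and whether a human reference belongs in the database, is left open here, as in the proposal above.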

Example usage & results

$ ls k2-db-EuPathDB48/
EuPathDB48_Contents.txt       database150mers.kmer_distrib  database250mers.kmer_distrib  database50mers.kmer_distrib  hash.k2d     k2_eupathdb48_20201113.tar.gz  seqid2taxid.map
database100mers.kmer_distrib  database200mers.kmer_distrib  database300mers.kmer_distrib  database75mers.kmer_distrib  inspect.txt  opts.k2d                       taxo.k2d

# launch staphb kraken2 v2.1.2 docker image; fastq files and EuPathDB database files are in PWD
$ docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data -ti staphb/kraken2:2.1.2-no-db

# mostly standard parameters
$ kraken2 --db k2-db-EuPathDB48/ --threads 8 --gzip-compressed \
  --paired mixed-sample*.fastq.gz \
  --output mixed-sample.k2-EuPathDB48.out \
  --report mixed-sample.EuPathDB48.report.out

$ kraken2 --db k2-db-EuPathDB48/ --threads 8 --gzip-compressed \
  --paired good-Cauris-sample*.fastq.gz \
  --output good-Cauris-sample.k2-EuPathDB48.out \
  --report good-Cauris-sample.EuPathDB48.report.out

$ head -n 30 mixed-sample.EuPathDB48.report.out
  4.11  119260  119260  U       0       unclassified
 95.89  2785280 0       R       1       root
 95.89  2785280 0       R1      131567    cellular organisms
 95.89  2785280 1935    D       2759        Eukaryota
 95.72  2780203 0       D1      33154         Opisthokonta
 95.72  2780203 386     K       4751            Fungi
 95.70  2779695 282     K1      451864            Dikarya
 95.65  2778276 5       P       4890                Ascomycota
 95.64  2777974 25      P1      716545                saccharomyceta
 95.38  2770257 0       P2      147537                  Saccharomycotina
 95.38  2770257 0       C       4891                      Saccharomycetes
 95.38  2770257 119     O       4892                        Saccharomycetales
 57.73  1676934 0       F       766764                        Debaryomycetaceae
 57.73  1676934 0       F1      1535325                         Candida/Lodderomyces clade
 57.73  1676934 41      G       1535326                           Candida
 57.68  1675417 0       S       5480                                Candida parapsilosis
 57.68  1675417 1675417 S1      578454                                Candida parapsilosis CDC317
  0.04  1294    35      S       5476                                Candida albicans
  0.04  1163    1163    S1      237561                                Candida albicans SC5314
  0.00  96      96      S1      294748                                Candida albicans WO-1
  0.01  182     0       S       5482                                Candida tropicalis
  0.01  182     182     S1      294747                                Candida tropicalis MYA-3404
 37.63  1093076 0       F       27319                         Metschnikowiaceae
 37.63  1093076 517     G       36910                           Clavispora
 37.43  1087135 1289    G1      1540022                           Clavispora/Candida clade
 36.59  1062774 1062774 S       498019                              [Candida] auris
  0.48  13932   13932   S       45357                               [Candida] haemulonis
  0.31  9140    9140    S       1231522                             [Candida] duobushaemulonis
  0.19  5424    0       S       36911                             Clavispora lusitaniae
  0.19  5424    5424    S1      306902                              Clavispora lusitaniae ATCC 42720

$ head -n 30 good-Cauris-sample.EuPathDB48.report.out
  7.12  107209  107209  U       0       unclassified
 92.88  1398000 0       R       1       root
 92.88  1398000 0       R1      131567    cellular organisms
 92.88  1398000 884     D       2759        Eukaryota
 92.71  1395407 0       D1      33154         Opisthokonta
 92.71  1395407 0       K       4751            Fungi
 92.70  1395389 1       K1      451864            Dikarya
 92.70  1395371 3       P       4890                Ascomycota
 92.69  1395139 26      P1      716545                saccharomyceta
 92.52  1392677 0       P2      147537                  Saccharomycotina
 92.52  1392677 0       C       4891                      Saccharomycetes
 92.52  1392677 90      O       4892                        Saccharomycetales
 92.46  1391641 0       F       27319                         Metschnikowiaceae
 92.46  1391641 556     G       36910                           Clavispora
 92.13  1386684 1478    G1      1540022                           Clavispora/Candida clade
 90.84  1367342 1367342 S       498019                              [Candida] auris
  0.73  10992   10992   S       45357                               [Candida] haemulonis
  0.46  6872    6872    S       1231522                             [Candida] duobushaemulonis
  0.29  4401    0       S       36911                             Clavispora lusitaniae
  0.29  4401    4401    S1      306902                              Clavispora lusitaniae ATCC 42720
  0.06  872     0       F       766764                        Debaryomycetaceae
  0.06  872     0       F1      1535325                         Candida/Lodderomyces clade
  0.06  872     1       G       1535326                           Candida
  0.03  499     7       S       5476                                Candida albicans
  0.03  492     492     S1      237561                                Candida albicans SC5314
  0.02  348     0       S       5480                                Candida parapsilosis
  0.02  348     348     S1      578454                                Candida parapsilosis CDC317
  0.00  24      0       S       5482                                Candida tropicalis
  0.00  24      24      S1      294747                                Candida tropicalis MYA-3404
  0.00  58      0       F       4893                          Saccharomycetaceae

kapsakcj changed the title ~~consider checking for read-level contamination with Kraken2~~ [feature request] consider checking for read-level contamination with Kraken2 Apr 28, 2022

sateeshperi added the enhancement New feature or request label May 11, 2022
