add bbmap tools and allow for clumpify-based dedup of unaligned reads #968

tomkinsc · 2019-06-28T02:40:12Z

In preliminary testing, clumpify is much faster than mvicuna but does not seem to remove as many reads with the settings I tried (subs=5, to match mvicuna, and passes=4). On a bam file that takes mvicuna ~4 minutes, clumpify takes 14s. For the particular test file I used, the read count went from 954858 to 707870 with clumpify (~26% of reads removed as dups) versus 671022 with mvicuna (~30% removed).
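For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the removal fractions from the read counts quoted above. The helper name `fraction_removed` is made up for illustration and is not part of viral-ngs.

```python
def fraction_removed(reads_in: int, reads_out: int) -> float:
    """Fraction of input reads dropped by a dedup step."""
    if reads_in <= 0:
        raise ValueError("reads_in must be positive")
    return (reads_in - reads_out) / reads_in


# Read counts quoted in the comment above.
TOTAL_READS = 954858
clumpify_frac = fraction_removed(TOTAL_READS, 707870)
mvicuna_frac = fraction_removed(TOTAL_READS, 671022)

print(f"clumpify removed {clumpify_frac:.1%}")  # clumpify removed 25.9%
print(f"mvicuna removed {mvicuna_frac:.1%}")    # mvicuna removed 29.7%
```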

see: https://www.biostars.org/p/225338/
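As a rough sketch, the settings mentioned above could be assembled into a `clumpify.sh` invocation like the one below. The file names are placeholders, `dedupe=t` is my assumption about how dedup mode was switched on (the comment only states `subs` and `passes`), and the test above started from a bam, so any format conversion around this step is left out. The function only builds the argument list; it does not run anything.

```python
from typing import List


def clumpify_dedup_cmd(in_path: str, out_path: str,
                       subs: int = 5, passes: int = 4) -> List[str]:
    """Build (but do not run) a clumpify.sh dedup command line.

    subs=5 and passes=4 mirror the settings quoted in the thread;
    dedupe=t is an assumption about how dedup mode was enabled.
    """
    return [
        "clumpify.sh",
        f"in={in_path}",
        f"out={out_path}",
        "dedupe=t",
        f"subs={subs}",
        f"passes={passes}",
    ]


# Placeholder file names, for illustration only.
cmd = clumpify_dedup_cmd("reads.fq.gz", "reads.dedup.fq.gz")
print(" ".join(cmd))
# clumpify.sh in=reads.fq.gz out=reads.dedup.fq.gz dedupe=t subs=5 passes=4
```

The list form can be passed as-is to `subprocess.run(cmd, check=True)` if bbmap is on the `PATH`.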

dpark01 · 2019-06-28T13:28:52Z

Two thoughts:

- It's probably fine if clumpify, or most other alternative tools, removes fewer reads. It's not clear that mvicuna's aggressiveness is the right way to go anyway.
- The current sequence of events (buried in the python code, not the pipeline code) is not methodologically great; it was built that way historically purely for speed, because mvicuna was much slower than bmtagger but much faster than blastn. We do some human depletion (bmtagger/bwa), then mvicuna dedup, then more depletion (blastn). A far more methodologically defensible approach would be to put the dedup before all depletion and run the depletion steps together (especially if the new tools are significantly faster anyway), and then be sure to emit the raw-dedup-nondepleted bam as a useful output that folks probably want to see (including a fastqc plot on that bam).
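To make the proposed ordering concrete, here is a toy sketch: dedup first, then all depletion steps together, with the deduplicated-but-not-depleted reads kept as their own output. Everything in it is hypothetical (the function names, the string "reads", the prefix-based depleters); it stands in for the real bmtagger/bwa/blastn and dedup tools and is not the viral-ngs implementation.

```python
from typing import Callable, Dict, List

Reads = List[str]
Step = Callable[[Reads], Reads]


def dedup_then_deplete(raw: Reads, dedup: Step,
                       depleters: List[Step]) -> Dict[str, Reads]:
    """Run dedup before any depletion and keep both outputs."""
    deduped = dedup(raw)
    depleted = deduped
    for deplete in depleters:  # all depletion steps run back to back
        depleted = deplete(depleted)
    return {
        "raw_dedup_nondepleted": deduped,  # the extra output proposed above
        "dedup_depleted": depleted,
    }


def toy_dedup(reads: Reads) -> Reads:
    """Stand-in for clumpify/mvicuna: drop exact duplicates, keep order."""
    return list(dict.fromkeys(reads))


def toy_depleter(tag: str) -> Step:
    """Stand-in for a depletion tool: drop reads labeled with `tag`."""
    def deplete(reads: Reads) -> Reads:
        return [r for r in reads if not r.startswith(tag)]
    return deplete


raw_reads = ["human:ACGT", "virus:GGCC", "virus:GGCC", "human:ACGT", "phix:TTAA"]
outputs = dedup_then_deplete(
    raw_reads, toy_dedup, [toy_depleter("human:"), toy_depleter("phix:")]
)
print(outputs["raw_dedup_nondepleted"])  # ['human:ACGT', 'virus:GGCC', 'phix:TTAA']
print(outputs["dedup_depleted"])         # ['virus:GGCC']
```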

notestaff · 2019-06-28T16:04:02Z

I've added bbmap on an old branch ( https://github.com/broadinstitute/viral-ngs/tree/is-1808081312-add-bbmap ); I'll update it and make a PR.

tomkinsc · 2019-06-28T19:53:54Z

@dpark01, maybe we want to de-dup before metagenomics classification as well, or at least have it as an option. @notestaff: Great! I'll hold off on integrating clumpify until after your PR has been merged in.
