
Project Abstract

chorltsd edited this page Oct 15, 2018 · 2 revisions

Filtering unwanted sequences from nucleic acid sequencing data is an important step in many analyses. It has been used to remove technical artefacts (e.g. PhiX), discover known and novel pathogens, isolate nucleic acid types (e.g. rRNA), and remove noise in metagenomic studies. This step significantly improves the speed and quality of subsequent analyses. Commonly used approaches include mapping reads to a reference set, or filtering via Bloom trees or k-mer hashes. K-mers have been shown to accurately differentiate reads between species in a fraction of the time required by traditional read-mapping approaches; however, currently implemented tools such as BBDuk, Kontaminant and Cookiecutter are limited in memory usage, parallelization and other practical features. Here, we develop REUSE, a program to Rapidly Eliminate Useless SEquences. REUSE uses a minimal perfect hash function to build a reference index with limited RAM and time. The index is searched with the complete k-mer set from each read, and reads that contain a pre-specified number of k-mers found in the index can be discarded or retained, depending on user preference. In comparisons against other tools on simulated and real data, REUSE is consistently faster and uses less RAM. REUSE demonstrates similar accuracy to traditional read mappers and produces results identical to those of other k-mer based tools. REUSE is publicly available at https://github.com/chorltsd/REUSE.
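
The filtering logic described above can be illustrated with a minimal Python sketch. This is not REUSE's code: a plain Python `set` stands in for the minimal perfect hash index, the function and parameter names (`build_index`, `filter_reads`, `min_hits`, `keep_matches`) are hypothetical, and the sketch assumes that every k-mer position in a read is counted and that reverse complements are ignored.

```python
from typing import Iterable, Iterator, List, Set


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq (nothing if seq is shorter than k)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references: Iterable[str], k: int) -> Set[str]:
    """Collect all reference k-mers. A set is an assumption made for
    illustration; it stands in for a minimal perfect hash index."""
    index: Set[str] = set()
    for ref in references:
        index.update(kmers(ref.upper(), k))
    return index


def count_hits(read: str, index: Set[str], k: int) -> int:
    """Count the k-mer positions in the read whose k-mer is in the index."""
    return sum(1 for km in kmers(read.upper(), k) if km in index)


def filter_reads(reads: Iterable[str], index: Set[str], k: int,
                 min_hits: int = 1, keep_matches: bool = False) -> List[str]:
    """Discard (default) or retain reads with at least min_hits indexed k-mers.

    min_hits and keep_matches are hypothetical names for the pre-specified
    k-mer threshold and the discard/retain preference."""
    kept = []
    for read in reads:
        matched = count_hits(read, index, k) >= min_hits
        if matched == keep_matches:
            kept.append(read)
    return kept


# Toy example: one short reference and two reads.
index = build_index(["ACGTACGTAC"], k=4)
reads = ["TTACGTAA", "GGGGGGGG"]
cleaned = filter_reads(reads, index, k=4, min_hits=2)
matching = filter_reads(reads, index, k=4, min_hits=2, keep_matches=True)
```

In this toy example the first read shares several 4-mers with the reference, so it is removed in discard mode and is the only read returned in retain mode. A set stores every k-mer string explicitly, which is exactly the memory cost that a minimal perfect hash index is meant to reduce.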
