
Continuously stream input/output data #65

Open
nick297 opened this issue May 12, 2017 · 7 comments

@nick297

nick297 commented May 12, 2017

This is more of a feature request.

Is it possible to intermittently stream data via a pipe to Centrifuge, so that the index does not need to be reloaded and it can therefore be set up as a server?

@joshualoving

Also very interested in this feature. I was under the impression that centrifuge was able to continuously stream data and today realized that it in fact waits for all data to be loaded before beginning to process and output results.

@apredeus

Any updates on this? I need to run Centrifuge (or Kraken) on some 10k bacterial genomes as part of QC, and I cannot find any options that would help minimize the I/O (since I'm using the NT database, loading it takes quite a while).

I would be thankful for any suggestions.

@mourisl
Collaborator

mourisl commented Mar 21, 2018

The access pattern over the Centrifuge index is essentially random, so the whole index needs to be loaded into memory before processing the data.

If you mean you want to process multiple data sets while loading the index only once, there are two ways.

  1. Put multiple fastq files in the -1/-2 options (comma-separated). You can then split the classification output by looking at the readID column (the 1st column by default); see the sketch after this list.
  2. In Support for handling multiple samples #95, Mr. Mapleson provides a way to run multiple samples. I haven't had a chance to test and merge that pull request yet.
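As a minimal sketch of option 1 (the file and sample names are illustrative, and it assumes the read IDs in each sample's fastq files carry a sample-specific prefix such as "sampleA."):

    # Load the index once and classify two samples in a single run
    centrifuge -x nt -p 8 \
        -1 sampleA_1.fq,sampleB_1.fq \
        -2 sampleA_2.fq,sampleB_2.fq \
        -S combined.tsv

    # Split the combined output on the readID column (1st column)
    awk -F'\t' '$1 ~ /^sampleA\./ {print > "sampleA.tsv"}
                $1 ~ /^sampleB\./ {print > "sampleB.tsv"}' combined.tsv

If the read IDs carry no such distinguishing prefix, the per-sample boundaries cannot be recovered from the combined output, and option 2 is the safer route.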

Is this what you mean?

@apredeus

Ah, the second option is exactly what I needed. Thank you very much for pointing this out.

@homeveg

homeveg commented Mar 22, 2018

We are using a different solution for this issue:

  • create a big enough RAM disk and save the Centrifuge index database there;
  • use the --mm option;
  • use -p (multi-threading).

With the ramdisk, the database has to be copied into memory only once after each system start. When a process accesses the corresponding files, the operating system serves them directly from memory.
Here is a link about setting up ramdisks in Linux: https://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux
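As a minimal sketch of this setup on Linux (the mount point, the 80G size, and the nt index prefix are all illustrative; tmpfs is one common way to create a ramdisk):

    # Create a ramdisk large enough to hold the index
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=80G tmpfs /mnt/ramdisk

    # Copy the Centrifuge index files there once per boot
    cp /data/centrifuge/nt.*.cf /mnt/ramdisk/

    # Memory-map the index (--mm) and classify with multiple threads (-p)
    centrifuge --mm -p 16 -x /mnt/ramdisk/nt \
        -1 sample_1.fq -2 sample_2.fq -S sample.tsv

With --mm, successive runs attach to the same in-memory copy of the index instead of each reading it into their own heap, so only the first copy to the ramdisk after boot pays the loading cost.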

@apredeus

Thank you for the suggestion. Unfortunately I am stuck with the cluster environment we use for all bioinformatic processing - and I don't think I have the permissions to set up things like that. I could use a standalone Unix machine, but processing a few thousand genomes would take forever.

P.S. The link is broken - it references the wrong URL.

@homeveg

homeveg commented Mar 22, 2018

Sorry for the broken URL. Fixed it.
Well, you might still try asking the cluster admins. It definitely speeds up all repeated reads of the database significantly. So, if many people use it, it could reduce computation time and overall cluster load, and the admins responsible for maintenance might find that attractive :)
