
Continuously stream input/output data #65

Open
nick297 opened this issue May 12, 2017 · 7 comments

@nick297

nick297 commented May 12, 2017

This is more of a feature request.

Is it possible to intermittently stream data via a pipe to Centrifuge, so that the index does not need to be reloaded and it can therefore be set up as a server?

@joshualoving

Also very interested in this feature. I was under the impression that centrifuge was able to continuously stream data and today realized that it in fact waits for all data to be loaded before beginning to process and output results.

@apredeus

Any updates on this? I need to run Centrifuge (or Kraken) on some 10k bacterial genomes as part of QC, and I cannot find any options that would help minimize the I/O (since I'm using the NT database, loading it takes quite a while).

I would be thankful for any suggestions.

@mourisl
Collaborator

mourisl commented Mar 21, 2018

The access pattern over the Centrifuge index is essentially random, so the whole index needs to be loaded into memory before processing the data.

If you mean you want to process multiple data sets while loading the index only once, there are two ways.

  1. Put multiple fastq files in the -1/-2 options (comma-separated). You can then split the classification output by looking at the readID column (the 1st column by default); see the sketch after this list.
  2. In Support for handling multiple samples #95, Mr. Mapleson provides a way to run multiple samples. I haven't had a chance to test and merge that pull request yet.
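As a minimal sketch of option 1 (the file and sample names are illustrative, and it assumes the read IDs in each sample's fastq files carry a sample-specific prefix such as "sampleA."):

    # Load the index once and classify two samples in a single run
    centrifuge -x nt -p 8 \
        -1 sampleA_1.fq,sampleB_1.fq \
        -2 sampleA_2.fq,sampleB_2.fq \
        -S combined.tsv

    # Split the combined output on the readID column (1st column)
    awk -F'\t' '$1 ~ /^sampleA\./ {print > "sampleA.tsv"}
                $1 ~ /^sampleB\./ {print > "sampleB.tsv"}' combined.tsv

If the read IDs carry no such distinguishing prefix, the per-sample boundaries cannot be recovered from the combined output, and option 2 is the safer route.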

Is this what you mean?

@apredeus

Ah, the second option is exactly what I needed. Thank you very much for pointing this out.

@homeveg

homeveg commented Mar 22, 2018

We are using a different solution for this issue:

  • create a big enough RAM disk and save the Centrifuge index database there;
  • use the --mm option;
  • use -p (multi-threading).

With the ramdisk, the database has to be copied into memory only once after each system start. When a process accesses the corresponding files, the operating system serves them directly from memory.
Here is a link about setting up ramdisks in Linux: https://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux
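As a minimal sketch of this setup on Linux (the mount point, the 80G size, and the nt index prefix are all illustrative; tmpfs is one common way to create a ramdisk):

    # Create a ramdisk large enough to hold the index
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=80G tmpfs /mnt/ramdisk

    # Copy the Centrifuge index files there once per boot
    cp /data/centrifuge/nt.*.cf /mnt/ramdisk/

    # Memory-map the index (--mm) and classify with multiple threads (-p)
    centrifuge --mm -p 16 -x /mnt/ramdisk/nt \
        -1 sample_1.fq -2 sample_2.fq -S sample.tsv

With --mm, successive runs attach to the same in-memory copy of the index instead of each reading it into their own heap, so only the first copy to the ramdisk after boot pays the loading cost.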

@apredeus

Thank you for the suggestion. Unfortunately I am stuck with the cluster environment we use for all bioinformatic processing - and I don't think I have the permissions to set up things like that. I could use a standalone Unix machine, but processing a few thousand genomes would take forever.

P.S. The link is broken - it references the wrong URL.

@homeveg

homeveg commented Mar 22, 2018

Sorry for the broken URL. Fixed it.
Well, you might still try asking the cluster admins. It definitely speeds up all repeated reads of the database significantly. So, if many people use it, it could reduce computation time and overall cluster load, and the admins responsible for maintenance might find that attractive :)
