
For very large databases, creation of TST is slow and memory intensive #29

Open · 3 tasks done
dkoslicki opened this issue Mar 24, 2020 · 6 comments

@dkoslicki (Owner) commented Mar 24, 2020

Will need to:

  • modularize the content of MakeStreamingDNADatabase.py
  • create methods to make the TST the old way
  • create a method to make the TST without reading the whole training database into memory, yielding the results to mt.Trie() (see the sketch below)

Work will happen in the branch ModularizeMake
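
A minimal sketch of the third item above, assuming the training database is an HDF5 file; the group/dataset names and key format here are hypothetical stand-ins, not CMash's actual layout:

import h5py
import marisa_trie as mt


def yield_trie_items(database_file):
    # Stream keys out of the HDF5 file one sketch at a time instead of
    # loading the entire training database into memory first.
    with h5py.File(database_file, "r") as fid:
        grp = fid["CountEstimators"]  # hypothetical group name
        for i, name in enumerate(grp):
            for kmer in grp[name]["kmers"][...]:  # hypothetical dataset name
                yield kmer.decode() + "x" + str(i)  # hypothetical key format


# mt.Trie() accepts any iterable of unicode keys, so the generator is consumed
# lazily: the HDF5 contents never sit in memory all at once, although the trie
# itself is still built in memory.
tree = mt.Trie(yield_trie_items("training_db.h5"))
tree.save("training_db.tst")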

@dkoslicki (Owner Author) commented:

Will concentrate first on the TST before tackling the CountEstimators modularization

dkoslicki added a commit that referenced this issue Mar 24, 2020
dkoslicki added a commit that referenced this issue Mar 24, 2020
dkoslicki added a commit that referenced this issue Mar 24, 2020
@dkoslicki (Owner Author) commented:

On a database with 1000 genomes:
New timing: 31.37 s
Old timing: 36.79 s
Old timing (including import time): 39.00 s

For memory, profiled with psrecord (note: pidof returns every matching PID, so this assumes a single running python process)

pip install psrecord
python Make.py & psrecord $(pidof python) --interval 1 --plot <out_file>.png

New way: [memory usage plot]
Old way: [memory usage plot]

So that's a win for the new method of TST creation.

dkoslicki added a commit that referenced this issue Mar 24, 2020
…on't need to keep the entire thing in memory. MH.export_multiple_to_single_hdf5 appears to already be set up to handle this. #29
dkoslicki added a commit that referenced this issue Mar 24, 2020
…for #29 to test in production environment (server)
@dkoslicki (Owner Author) commented:

Note to self: piping to the C++ marisa-trie implementation works as well.
LocalTest2.py:

import sys


def main():
    sys.path.append("/home/dkoslicki/Desktop/CMash/")
    from CMash.Make import MakeTSTNew
    small_database_file = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_1000.h5"
    TST_export_file_new = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_new.tst"
    M = MakeTSTNew(small_database_file, TST_export_file_new)
    # stream each trie key to stdout so it can be piped into marisa-build
    for entry in M.yield_trie_items_to_insert_no_import(small_database_file):
        print(entry)


if __name__ == "__main__":
    main()

Then

 python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst

results in the same md5sum as both the old and new methods. Timing is 35.4 s.
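
A quick way to check that the outputs really are byte-identical (the file names here are placeholders):

import hashlib


def md5sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so large .tst files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


assert md5sum("cmash_db_n5000_k60_new.tst") == md5sum("cmash_db_n5000_k60_old.tst")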

For the C++ marisa-trie installation, note:

$ git clone https://github.com/s-yata/marisa-trie.git
$ cd marisa-trie
$ sudo apt-get install autoconf
$ sudo apt-get install libtool
$ autoreconf -i
$ ./configure --enable-native-code
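$ make  # standard autotools step; builds the tools/marisa-build binary used above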

@dkoslicki (Owner Author) commented:

And the C++ implementation results in the following memory usage, measured via:

python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst & sleep 1;  psrecord $(pidof marisa-build) --interval 1 --plot marisa.png

[memory usage plot for marisa-build]

So memory usage is a touch better. Will need to experiment with the C++ CLI options.
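
For example (the -n/--num-tries and -o/--output-file flags below are from my reading of marisa-build --help; verify locally, and measure their effect on build memory rather than assuming it):

python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build -n 1 -o cmash_db_n5000_k60_new.tst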

@dkoslicki (Owner Author) commented:

Problem: MakeTSTNew is still rather memory intensive when creating large databases (~200K genomes) with many hashes (-n 5000).
The C++ marisa-trie overcomes some of this by streaming its input in, but it may not be able to get around the fact that the entire TST must be held in memory while it is constructed (each additional key needs access to the whole TST to know where it should be placed).
May need to carefully balance how things are streamed into the TST, so the pipe doesn't have to buffer a ton of data while the TST slowly constructs itself; see the sketch below.
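
A sketch of that balancing idea: let the OS pipe provide the backpressure, so the producer blocks whenever marisa-build falls behind and at most one pipe buffer (typically ~64 KiB on Linux) of keys is in flight. The key generator and binary path are placeholders:

import subprocess


def keys():
    # Placeholder for yield_trie_items_to_insert_no_import from CMash.Make.
    for i in range(1_000_000):
        yield "ACGT" + str(i) + "x0"


with open("db.tst", "wb") as out:
    proc = subprocess.Popen(
        ["marisa-build"],  # path to the binary is an assumption
        stdin=subprocess.PIPE,
        stdout=out,
    )
    for k in keys():
        proc.stdin.write((k + "\n").encode())  # blocks when the pipe buffer is full
    proc.stdin.close()
    proc.wait()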

Alternatives:
Search for a different TST option.

Note: this is mainly an issue for very large databases with many sketches, so it is not a massive problem for applications like Metalign at the moment.

@dkoslicki (Owner Author) commented Apr 7, 2020

Modularization is complete and appears to be working, so merging to master for now, as it does give a speedup.

@dkoslicki NOTE: make sure that nothing funky is happening with the file-name order in the switch from pool.map to pool.imap.
edit: According to the docs, imap should yield results in the same order as the input, but I might consider a chunksize larger than 1 (maybe length of iterable / core count?); see the sketch below.
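
A minimal sketch of both points (the worker function and file names are stand-ins): pool.imap yields results in input order, unlike imap_unordered, and a chunksize of roughly len(iterable) / core count cuts per-item IPC overhead:

from multiprocessing import Pool, cpu_count


def process(file_name):
    return file_name.upper()  # stand-in for the real per-file work


if __name__ == "__main__":
    file_names = ["genome_%d.fna" % i for i in range(10)]
    chunk = max(1, len(file_names) // cpu_count())
    with Pool() as pool:
        results = list(pool.imap(process, file_names, chunksize=chunk))
    # imap preserves the input order, so file_names and results line up.
    assert results == [f.upper() for f in file_names]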
