
For very large databases, creation of TST is slow and memory intensive #29

Open · 3 tasks done
dkoslicki opened this issue Mar 24, 2020 · 6 comments

@dkoslicki (Owner) commented Mar 24, 2020

Will need to:

  • modularize the content of MakeStreamingDNADatabase.py
  • create methods to make the TST the old way
  • create a method to make the TST without reading the whole training database into memory, yielding the results to mt.Trie() (see the sketch below)

Work will happen in the branch ModularizeMake
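
A minimal sketch of the third item above, assuming the training database is an HDF5 file; the group/dataset names and key format here are hypothetical stand-ins, not CMash's actual layout:

import h5py
import marisa_trie as mt


def yield_trie_items(database_file):
    # Stream keys out of the HDF5 file one sketch at a time instead of
    # loading the entire training database into memory first.
    with h5py.File(database_file, "r") as fid:
        grp = fid["CountEstimators"]  # hypothetical group name
        for i, name in enumerate(grp):
            for kmer in grp[name]["kmers"][...]:  # hypothetical dataset name
                yield kmer.decode() + "x" + str(i)  # hypothetical key format


# mt.Trie() accepts any iterable of unicode keys, so the generator is consumed
# lazily: the HDF5 contents never sit in memory all at once, although the trie
# itself is still built in memory.
tree = mt.Trie(yield_trie_items("training_db.h5"))
tree.save("training_db.tst")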

@dkoslicki (Owner Author) commented:

Will concentrate first on the TST before tackling the CountEstimators modularization

dkoslicki added a commit that referenced this issue Mar 24, 2020
dkoslicki added a commit that referenced this issue Mar 24, 2020
dkoslicki added a commit that referenced this issue Mar 24, 2020
@dkoslicki (Owner Author) commented:

On a database with 1000 genomes:
New timing: 31.37 s
Old timing: 36.79 s
Old timing (including import time): 39.00 s

For memory, profiled with psrecord (note: pidof returns every matching PID, so this assumes a single running python process)

pip install psrecord
python Make.py & psrecord $(pidof python) --interval 1 --plot <out_file>.png

New way: [memory usage plot]
Old way: [memory usage plot]

So that's a win for the new method of TST creation.

dkoslicki added a commit that referenced this issue Mar 24, 2020
…on't need to keep the entire thing in memory. MH.export_multiple_to_single_hdf5 appears to already be set up to handle this. #29
dkoslicki added a commit that referenced this issue Mar 24, 2020
…for #29 to test in production environment (server)
@dkoslicki (Owner Author) commented:

Note to self: piping to the C++ marisa-trie implementation works as well.
LocalTest2.py:

import sys


def main():
    sys.path.append("/home/dkoslicki/Desktop/CMash/")
    from CMash.Make import MakeTSTNew
    small_database_file = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_1000.h5"
    TST_export_file_new = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_new.tst"
    M = MakeTSTNew(small_database_file, TST_export_file_new)
    # stream each trie key to stdout so it can be piped into marisa-build
    for entry in M.yield_trie_items_to_insert_no_import(small_database_file):
        print(entry)


if __name__ == "__main__":
    main()

Then

 python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst

results in the same md5sum as both the old and new methods. Timing is 35.4 s.
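
A quick way to check that the outputs really are byte-identical (the file names here are placeholders):

import hashlib


def md5sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so large .tst files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


assert md5sum("cmash_db_n5000_k60_new.tst") == md5sum("cmash_db_n5000_k60_old.tst")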

For the C++ marisa-trie installation, note:

$ git clone https://github.com/s-yata/marisa-trie.git
$ cd marisa-trie
$ sudo apt-get install autoconf
$ sudo apt-get install libtool
$ autoreconf -i
$ ./configure --enable-native-code
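$ make  # standard autotools step; builds the tools/marisa-build binary used above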

@dkoslicki (Owner Author) commented:

And the C++ implementation results in the following memory usage, measured via:

python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst & sleep 1;  psrecord $(pidof marisa-build) --interval 1 --plot marisa.png

[memory usage plot for marisa-build]

So memory usage is a touch better. Will need to experiment with the C++ CLI options.
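
For example (the -n/--num-tries and -o/--output-file flags below are from my reading of marisa-build --help; verify locally, and measure their effect on build memory rather than assuming it):

python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build -n 1 -o cmash_db_n5000_k60_new.tst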

@dkoslicki (Owner Author) commented:

Problem: MakeTSTNew is still rather memory intensive when creating large databases (~200K genomes) with many hashes (-n 5000).
The C++ marisa-trie overcomes some of this by streaming its input in, but it may not be able to get around the fact that the entire TST must be held in memory while it is constructed (each additional key needs access to the whole TST to know where it should be placed).
May need to carefully balance how things are streamed into the TST, so the pipe doesn't have to buffer a ton of data while the TST slowly constructs itself; see the sketch below.
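
A sketch of that balancing idea: let the OS pipe provide the backpressure, so the producer blocks whenever marisa-build falls behind and at most one pipe buffer (typically ~64 KiB on Linux) of keys is in flight. The key generator and binary path are placeholders:

import subprocess


def keys():
    # Placeholder for yield_trie_items_to_insert_no_import from CMash.Make.
    for i in range(1_000_000):
        yield "ACGT" + str(i) + "x0"


with open("db.tst", "wb") as out:
    proc = subprocess.Popen(
        ["marisa-build"],  # path to the binary is an assumption
        stdin=subprocess.PIPE,
        stdout=out,
    )
    for k in keys():
        proc.stdin.write((k + "\n").encode())  # blocks when the pipe buffer is full
    proc.stdin.close()
    proc.wait()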

Alternatives:
Search for a different TST option.

Note: this is mainly an issue for very large databases with many sketches, so it is not a massive problem for applications like Metalign at the moment.

@dkoslicki (Owner Author) commented Apr 7, 2020

Modularization is complete and appears to be working, so merging to master for now, as it does give a speedup.

@dkoslicki NOTE: make sure that nothing funky is happening with the file-name order in the switch from pool.map to pool.imap.
edit: According to the docs, imap should yield results in the same order as the input, but I might consider a chunksize larger than 1 (maybe length of iterable / core count?); see the sketch below.
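
A minimal sketch of both points (the worker function and file names are stand-ins): pool.imap yields results in input order, unlike imap_unordered, and a chunksize of roughly len(iterable) / core count cuts per-item IPC overhead:

from multiprocessing import Pool, cpu_count


def process(file_name):
    return file_name.upper()  # stand-in for the real per-file work


if __name__ == "__main__":
    file_names = ["genome_%d.fna" % i for i in range(10)]
    chunk = max(1, len(file_names) // cpu_count())
    with Pool() as pool:
        results = list(pool.imap(process, file_names, chunksize=chunk))
    # imap preserves the input order, so file_names and results line up.
    assert results == [f.upper() for f in file_names]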
