For very large databases, creation of TST is slow and memory intensive #29
Comments
Will concentrate first on the TST before modularizing the CountEstimators.
On a database with 1000 genomes: For memory
So that's a win for the new method of TST creation. |
…on't need to keep the entire thing in memory. MH.export_multiple_to_single_hdf5 appears to already be set up to handle this. #29
…for #29 to test in production environment (server)
Note to self, piping from the C++ marisa-trie implementation works as well:

    import sys


    def main():
        # Make the local CMash checkout importable
        sys.path.append("/home/dkoslicki/Desktop/CMash/")
        from CMash.Make import MakeTSTNew
        small_database_file = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_1000.h5"
        TST_export_file_new = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_new.tst"
        M = MakeTSTNew(small_database_file, TST_export_file_new)
        # Print each trie entry on its own line so the output can be piped to marisa-build
        for entry in M.yield_trie_items_to_insert_no_import(small_database_file):
            print(entry)


    if __name__ == "__main__":
        main()

Then

    python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst

results in the same md5sum as the old and new methods. Timing is 35.4 sec.

For marisa-trie C++ installation, note:

    $ git clone https://github.com/s-yata/marisa-trie.git
    $ cd marisa-trie
    $ sudo apt-get install autoconf
    $ sudo apt-get install libtool
    $ autoreconf -i
    $ ./configure --enable-native-code
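The shell pipeline in this note (script output piped into `marisa-build`, then an md5sum comparison) could also be driven from Python. The following is only a rough sketch: `pipe_keys_to_command` and `md5_of_file` are hypothetical helper names, not part of CMash, and it assumes the builder reads newline-separated keys on stdin and writes the trie to stdout, as the pipeline above suggests.

```python
import hashlib
import subprocess
import sys


def pipe_keys_to_command(keys, cmd, out_path):
    """Stream keys (one per line) to cmd's stdin; save cmd's stdout to out_path.

    Keys are written as they are produced, so the full key set never has to
    be held in memory on the Python side.
    """
    with open(out_path, "wb") as out:
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=out)
        for key in keys:
            proc.stdin.write(key.encode("utf-8") + b"\n")
        proc.stdin.close()  # signal end of input to the child
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the setup from the note, `cmd` would be something like the path to the built `marisa-build` binary and `keys` the generator returned by `yield_trie_items_to_insert_no_import`; comparing `md5_of_file` on the old and new `.tst` files then mirrors the md5sum check.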
And the C++ implementation results in the following memory usage via:

    python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst & sleep 1; psrecord $(pidof marisa-build) --interval 1 --plot marisa.png

So a touch better memory usage. Will need to experiment with C++ CLI opts.
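If only the peak memory figure is needed rather than the time series that psrecord plots, a standard-library alternative is to read the children's resource usage after the command finishes. This is a sketch of a different technique than the one used above, not a replacement for the psrecord plot; `peak_child_rss` is a hypothetical helper name, and the `resource` module is Unix-only.

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd):
    """Run cmd to completion and return ru_maxrss for waited-for children.

    Caveats: the value is the high-water mark over all child processes this
    process has waited for so far, not only cmd. Units are kilobytes on
    Linux and bytes on macOS.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Stand-in child that allocates about 50 MB; swap in the real build command.
    print(peak_child_rss([sys.executable, "-c", "x = bytearray(50_000_000)"]))
```

Because the figure is cumulative over all children, it is most meaningful when the build command is the first or the largest child process that the measuring script runs.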
Problem: Alternatives: Note: this is mainly for very large databases with lots of sketches, so not a massive issue for applications like Metalign atm. |
Modularization is complete and appears to be working, so merging to master for now as it does give a speedup. @dkoslicki NOTE: make sure that nothing funky is happening with the file name order in the switch from |
Will need to:
MakeStreamingDNADatabase.py
mt.Trie()
Work will happen in the branch ModularizeMake.