Advice on choosing bloom filter parameters #38
Comments
Oh wait. Does this data structure only work with DNA? haha
The input k-mers are hashed with ntHash. So, they should be nucleotide sequences, which are strings of A, C, G, T, and U.
The Bloom filter can technically be used by itself, since BloomFilter.hpp doesn't include anything from ntHash. The only odd thing is that we store the k-mer size when the data structure is serialized, but it can be set to anything and left unused. BloomFilter.hpp by itself only takes hash values directly when inserting/querying (it doesn't come with its own hash function). To parameterize the Bloom filter, I would first pick a false positive rate (FPR) and then use this function to get the optimal number of hash functions (optimal relative to memory usage): btl_bloomfilter/BloomFilter.hpp Line 419 in 0f19567
Then use this constructor once you have the recommended number of hash functions: btl_bloomfilter/BloomFilter.hpp Line 83 in 0f19567
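To avoid re-initializing the filter just to compare configurations, the sizing can be estimated on paper first. Below is a rough sketch using the standard textbook Bloom filter formulas, not the library's own code: the function names are made up for illustration, and the function referenced at Line 419 may round differently, so treat the numbers as a starting point only.

```python
import math

# Key count taken from the question in this thread.
NUM_KEYS = 863_049_256


def optimal_hash_count(fpr: float) -> int:
    """Hash count that minimizes memory for a target FPR: k = -log2(p).

    Hypothetical helper; rounding to nearest is an assumption.
    """
    return max(1, round(-math.log2(fpr)))


def optimal_filter_bits(num_keys: int, fpr: float) -> int:
    """Filter size in bits for an optimally tuned filter: m = -n ln(p) / (ln 2)^2."""
    return math.ceil(-num_keys * math.log(fpr) / (math.log(2) ** 2))


def expected_fpr(num_keys: int, num_bits: int, num_hashes: int) -> float:
    """Approximate FPR after inserting num_keys items: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


if __name__ == "__main__":
    for target in (0.01, 0.001):
        k = optimal_hash_count(target)
        m = optimal_filter_bits(NUM_KEYS, target)
        print(
            f"target FPR {target}: {k} hashes, "
            f"{m / NUM_KEYS:.2f} bits/key, {m / 8 / 1e9:.2f} GB, "
            f"expected FPR {expected_fpr(NUM_KEYS, m, k):.5f}"
        )
```

With these formulas, a 1% FPR costs about 9.6 bits per key (roughly 1 GB for this key count) and a 0.1% FPR about 14.4 bits per key (roughly 1.5 GB). Since the keys here are alphanumeric rather than nucleotide sequences, the hash values would have to come from a general-purpose string hash instead of ntHash.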
I have 863,049,256 keys, each a 34-character alphanumeric string:
Do you have any recommendations on bloom filter parameters?
In particular these:
It takes a long time for me to initialize the filter, so experimenting with different configs takes time.
Curious if you have any insight into a good starting configuration.