Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advice on choosing bloom filter parameters #38

Open
luxe opened this issue Sep 16, 2021 · 3 comments
Open

Advice on choosing bloom filter parameters #38

luxe opened this issue Sep 16, 2021 · 3 comments

Comments

@luxe
Copy link

luxe commented Sep 16, 2021

I have 863,049,256 keys whose format is 34 character alphanumeric:

Z6DauUBtzw8p77MLxy7VWYCKN92JfKiCK
AJe8BJYnJHp9DDUrvGgFwmn5oBjhUSgowr
114VNKGsr9M4ogvwjz6ESNUqYdroGyht7r
etc..

Do you have any recommendations on bloom filter parameters?
In particular these:

/* k-mer size */
const unsigned k = //?;

/* number of Bloom filter hash functions */
const unsigned numHashes = //?;

/* size of Bloom filter (in bits) */
const unsigned size = //?;

It takes a long time for me to initialize the filter, so exerpimenting with different configs takes time.
Curious if you have any insight into a good starting configuration.

@luxe
Copy link
Author

luxe commented Sep 16, 2021

Oh wait. Does this data structure only work with DNA? haha

@kmnip
Copy link
Collaborator

kmnip commented Sep 16, 2021

The input k-mers are hashed with ntHash. So, they should be nucleotide sequences, which are strings of A, C, G, T, and U.

@JustinChu
Copy link
Collaborator

JustinChu commented Sep 16, 2021

The Bloom filter can technically be used by itself as BloomFilter.hpp doesn't actually include anything from ntHash. The only weird thing is that we store the k-mer size when the datastructure is serialized but it can technically just be set to anything and not used. The BloomFilter.hpp by itself only uses hash values directly when inserting/querying (doesn't come with its own hash function).

To parameterize the bloom filter I would use this function after you pick a false positive rate (FPR) to get the optimal number of hashes (optimal relative to memory usage).

static unsigned calcOptiHashNum(double fpr) { return unsigned(-log(fpr) / log(2)); }

And this constructor once you get the recommend number of hash functions.
BloomFilter(size_t expectedElemNum, double fpr, unsigned hashNum, unsigned kmerSize)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants