Advice on choosing bloom filter parameters #38

luxe · 2021-09-16T17:42:30Z

I have 863,049,256 keys, each a 34-character alphanumeric string:

Z6DauUBtzw8p77MLxy7VWYCKN92JfKiCK
AJe8BJYnJHp9DDUrvGgFwmn5oBjhUSgowr
114VNKGsr9M4ogvwjz6ESNUqYdroGyht7r
etc..

Do you have any recommendations on bloom filter parameters?
In particular these:

/* k-mer size */
const unsigned k = //?;

/* number of Bloom filter hash functions */
const unsigned numHashes = //?;

/* size of Bloom filter (in bits) */
const unsigned size = //?;

It takes a long time for me to initialize the filter, so experimenting with different configs is slow.
I'm curious whether you have any insight into a good starting configuration.


luxe · 2021-09-16T18:04:17Z

Oh wait. Does this data structure only work with DNA? haha

kmnip · 2021-09-16T18:51:53Z

The input k-mers are hashed with ntHash. So, they should be nucleotide sequences, which are strings of A, C, G, T, and U.

JustinChu · 2021-09-16T21:43:34Z

The Bloom filter can technically be used on its own, since BloomFilter.hpp doesn't include anything from ntHash. The only odd thing is that the k-mer size is stored when the data structure is serialized, but it can be set to any value and left unused. BloomFilter.hpp itself only takes hash values directly when inserting or querying (it doesn't come with its own hash function).
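Since the header expects precomputed hash values, non-DNA keys like the ones in the question need a hash function supplied by the caller. The following is only a rough sketch of one common approach (double hashing on top of 64-bit FNV-1a), not something taken from this repository. The names `fnv1a` and `hashKey` are hypothetical, and the second starting value is an arbitrary constant chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over the bytes of `key`, starting from `basis`.
std::uint64_t fnv1a(const std::string& key, std::uint64_t basis) {
    std::uint64_t h = basis;
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV 64-bit prime
    }
    return h;
}

// Hypothetical helper (not part of BloomFilter.hpp): derive numHashes
// values from one key as h1 + i * h2, i.e. double hashing.
std::vector<std::uint64_t> hashKey(const std::string& key, unsigned numHashes) {
    // Standard FNV-1a 64-bit offset basis.
    const std::uint64_t h1 = fnv1a(key, 14695981039346656037ULL);
    // Second hash: same function with an arbitrary different starting value,
    // forced odd so the stride between successive values is never zero.
    const std::uint64_t h2 = fnv1a(key, 0x9E3779B97F4A7C15ULL) | 1ULL;

    std::vector<std::uint64_t> out;
    out.reserve(numHashes);
    for (unsigned i = 0; i < numHashes; ++i) {
        out.push_back(h1 + i * h2);  // unsigned arithmetic wraps modulo 2^64
    }
    return out;
}
```

The resulting values would then be handed to the filter when inserting or querying. I have not reproduced the exact insert/query signatures here, so check BloomFilter.hpp for those.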

To parameterize the Bloom filter, pick a false positive rate (FPR) and then use this function to get the optimal number of hashes (optimal relative to memory usage).

btl_bloomfilter/BloomFilter.hpp

Line 419 in 0f19567

    
           static unsigned calcOptiHashNum(double fpr) { return unsigned(-log(fpr) / log(2)); }
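To get a feel for the numbers before building anything, here is a small standalone sketch. `optimalHashNum` uses the same expression as `calcOptiHashNum` above. `textbookSizeBits` is the usual textbook estimate m = ceil(-n ln(fpr) / (ln 2)^2), which is an assumption on my part and may not match exactly how the library sizes the filter. Both function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Same expression as calcOptiHashNum: floor(-ln(fpr) / ln 2).
unsigned optimalHashNum(double fpr) {
    assert(fpr > 0.0 && fpr < 1.0);
    return unsigned(-std::log(fpr) / std::log(2.0));
}

// Textbook Bloom filter size in bits for n elements at the given FPR.
// Assumption: the library may round or size the filter differently.
std::uint64_t textbookSizeBits(std::uint64_t n, double fpr) {
    assert(fpr > 0.0 && fpr < 1.0);
    const double ln2 = std::log(2.0);
    const double bits = -static_cast<double>(n) * std::log(fpr) / (ln2 * ln2);
    return static_cast<std::uint64_t>(std::ceil(bits));
}
```

For the 863,049,256 keys in the question and an FPR of 0.01, this gives 6 hash functions and a size on the order of 8 billion bits (about 1 GB). That is more than a 32-bit `unsigned` can hold, so the `size` variable in the question's snippet would need a wider type such as `size_t` on a 64-bit platform.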

And this constructor once you have the recommended number of hash functions.

btl_bloomfilter/BloomFilter.hpp

Line 83 in 0f19567

    
           BloomFilter(size_t expectedElemNum, double fpr, unsigned hashNum, unsigned kmerSize)
