feat: store event bloom filter in DB #473
base: main
Conversation
crates/madara/client/db/src/lib.rs
#[cfg(test)]
#[test]
fn test_column_all() {
    assert_eq!(Column::ALL.len(), Column::NUM_COLUMNS);
}
Is this really useful, since NUM_COLUMNS is already defined as Self::ALL.len()?
Just in case of a future update.
crates/madara/client/rpc/src/versions/user/v0_7_1/methods/read/get_events.rs
crates/madara/primitives/bloom_filter/benches/bloom_benchmark.rs
fn bench_bloom_filters(c: &mut Criterion) {
    let test_data = generate_test_data(TEST_DATA_SIZE);

    // Test different hash functions
    let hashers = [
        get_hasher_benchmarks::<DefaultHasher>("DefaultHasher"),
        get_hasher_benchmarks::<AHasher>("AHasher"),
        get_hasher_benchmarks::<XxHash64>("XxHash64"),
    ];

    for hasher in &hashers {
        (hasher.sequential_insertion)(c, &test_data, hasher.name);
        (hasher.parallel_insertion)(c, &test_data, hasher.name);
        (hasher.lookup)(c, &test_data, hasher.name);
        (hasher.parallel_lookup)(c, &test_data, hasher.name);
    }
}
Really cool testing structure!
/// where:
/// - k is the number of hash functions
/// - p is the desired false positive rate (0.01)
/// Reference: [Bloom, B. H. (1970). "Space/Time Trade-offs in Hash Coding with Allowable Errors"](https://dl.acm.org/doi/10.1145/362686.362692)
Thanks!
pub const HASH_COUNT: u8 = 7;

// Number of bits per element in the Bloom filter.
// The value 9.6 is optimal for a false positive rate of 1% (0.01).
const BITS_PER_ELEMENT: f64 = 9.6;

pub const FALSE_POSITIF_RATE: f64 = 0.01;
These constants appear several times in the codebase, maybe we want to define them only once and re-use them?
I agree. I’d also like to keep testing and production separate. Each should have its own constants.
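For reference, a minimal sketch (not part of the PR) of how these constants follow from the standard Bloom filter sizing formulas; the function and variable names below are illustrative only.

fn bloom_parameters(false_positive_rate: f64) -> (f64, u8) {
    // Standard sizing formulas for a Bloom filter:
    //   bits per element  m/n = -ln(p) / (ln 2)^2
    //   hash count        k   = -log2(p)
    let ln2 = std::f64::consts::LN_2;
    let bits_per_element = -false_positive_rate.ln() / (ln2 * ln2);
    let hash_count = (-false_positive_rate.log2()).ceil() as u8;
    (bits_per_element, hash_count)
}

fn main() {
    // For p = 0.01 this gives roughly 9.6 bits per element and 7 hash functions,
    // matching BITS_PER_ELEMENT and HASH_COUNT above.
    let (bits, k) = bloom_parameters(0.01);
    println!("bits/element = {bits:.2}, hash count = {k}");
}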
Hey there! I made a first quick pass on the PR and I have two initial questions:
Yes, here is a graph from our benchmarks. If you are interested, you can find the tool I developed for this bench here.
Why not? This gives us much more control over our codebase and its performance characteristics. In particular, if I remember correctly @jbcaron, the goal was to create a bloom filter with a constant error rate independent of the block size. That and other optimizations might not have been possible had this not been implemented by hand.
It is also worth noting that we would not be the only team developing our own bloom filter; this is already the case for Pathfinder.
That's exactly right. In some tests, I observed response times that were up to three orders of magnitude faster. The Bloom filter implementation allows the filter size to depend on the number of keys being inserted, ensuring a constant false positive rate regardless of block size. Additionally, this implementation enables precomputation of hashes for a set of keys, so that when scanning the filters of each block, we only compute the filter indices to check for the keys, without needing to recompute the hashes every time. Lastly, the key pattern is also specifically designed for event filtering, optimizing lookup efficiency further.
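As a rough illustration of the precomputation idea described above (the struct and method names here are invented for the example, not the PR's actual API): each search key is hashed once up front, and per-block filter checks then only derive bit indices from the stored hashes.

use std::hash::{Hash, Hasher};

// Pre-computed hashes for one search key (illustrative only).
struct PreHashedKey {
    h1: u64,
    h2: u64,
}

impl PreHashedKey {
    // Hash the key once; the per-filter work later is only arithmetic.
    fn new<H: Hasher + Default, T: Hash>(key: &T) -> Self {
        let mut hasher = H::default();
        key.hash(&mut hasher);
        let h1 = hasher.finish();
        // Same double-hashing derivation as the filter itself.
        const PHI: u64 = 0x9e37_79b9_7f4a_7c15;
        let h2 = h1.rotate_left(17).wrapping_mul(PHI);
        Self { h1, h2 }
    }

    // Derive the bit indices to probe in a filter of `bit_size` bits.
    fn indices(&self, hash_count: u8, bit_size: u64) -> impl Iterator<Item = usize> {
        let (h1, h2) = (self.h1, self.h2);
        (0..hash_count).map(move |i| (h1.wrapping_add(h2.wrapping_mul(i as u64)) % bit_size) as usize)
    }
}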
Incredible work here @jbcaron! 🚀
I just requested some changes for:
- adding docs and tests in bloom_filter/src/events.rs
- just general discussion on some topics.
I'm also curious whether we have some full e2e test that checks the starknet_getEvents RPC call.
let aligned_size = size_in_u64s << 6;

Self {
    storage: AtomicBitStore::new(size_in_u64s as _),
Why not be explicit with the type? usize
Quick debate: does it make sense for size to be a u64? If it does, it's also worth noting that this will panic if Madara runs on a 32-bit architecture (actually almost all of this impl would fail).
Of course it could be argued that Madara will never run on 32 bits, but happy to hear what you guys think.
I mean if ever someone tries to run Madara on a 32 bit architecture they have bigger problems than just the bloom filter since half the codebase is going to explode/implode/both 😅
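Purely for illustration, if we did want to be defensive about 32-bit targets, a fallible conversion could look like the following (the PR itself keeps the `as _` cast; the function name is invented):

fn to_word_count(size_in_u64s: u64) -> usize {
    // Fail loudly if the filter size does not fit in usize (e.g. on a 32-bit target)
    // instead of silently truncating with an `as` cast.
    usize::try_from(size_in_u64s).expect("bloom filter size exceeds usize on this target")
}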
/// * `item`: The item to add, which must implement the Hash trait
pub fn add<T: Hash>(&self, item: &T) {
    calculate_hashes::<H, T>(self.hash_count, item)
        .for_each(|hash| self.storage.set_bit((hash % self.bit_size) as usize));
ditto
fn calculate_hashes<H: Hasher + Default, T: Hash>(hash_count: u8, item: &T) -> impl Iterator<Item = u64> + '_ {
    // Hash the item once, then derive the remaining hashes via double hashing
    // (h_i = h1 + i * h2), so only one real hash computation is needed per item.
    let mut hasher1 = H::default();
    item.hash(&mut hasher1);
    let h1 = hasher1.finish();

    const PHI: u64 = 0x9e37_79b9_7f4a_7c15; // 2^64/φ
    let h2 = h1.rotate_left(17).wrapping_mul(PHI);

    (0..hash_count).map(move |i| h1.wrapping_add(h2.wrapping_mul(i as u64)))
}
Nice stuff!
use crate::*;
Great test suite 🚀
use criterion::{black_box, criterion_group, criterion_main, Criterion};
This is great, as it gives us a tool to prevent introducing regressions in the bloom filter perf (e.g. we add this as part of CI and raise alerts if performance goes down).
Regarding the different hash functions used: are there three of them so we can decide which one to use?
It would be even better if we could have something like this, but with RPC calls (basically the reason for this PR). My understanding is that @Trantorian1 built some benchmarks that check that.
Would it make sense to add them to CI? (I would also love to see those benches going down with this PR 🚀)
The test suite was made more for local exploration and requires fully synced Juno, Pathfinder, and Madara nodes running simultaneously. Because of this, it is rather tricky to have it running in the CI, as we would need to keep these nodes up to date with the tip of the chain and manage downtime :/
There probably is a better way to do this. I'm thinking of a local devnet to which we publish some sample data for each node to sync from (we would essentially be mocking a chain), but I haven't explored this and, to be honest, I'm not very familiar with publishing state to Starknet (never done Cairo).
So basically, while the benches are handy for detecting broad performance trends between the nodes, they are pretty hard to run atm, which makes it difficult to integrate them into the CI.
/// Maps each event to the 33-byte keys inserted into the block's bloom filter:
/// one key for the emitting contract address (prefix byte 0) and one per event key
/// (prefix byte 1..=n, the key's position), so searches can match on the address
/// and on each key position independently.
fn events_to_bloom_keys<'a, I>(events: I) -> impl Iterator<Item = [u8; 33]> + 'a
where
    I: Iterator<Item = &'a Event> + 'a,
{
    events.flat_map(Self::event_to_bloom_keys)
}

fn event_to_bloom_keys(event: &Event) -> impl Iterator<Item = [u8; 33]> + '_ {
    let from_address_key = create_bloom_key(0, &event.from_address);

    let keys = event.keys.iter().enumerate().map(|(i, key)| create_bloom_key(i as u8 + 1, key));

    std::iter::once(from_address_key).chain(keys)
}
}
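For context, a hedged guess at what create_bloom_key could look like given the 33-byte key type above (one prefix byte plus the 32-byte big-endian felt); this is not necessarily the PR's exact implementation:

fn create_bloom_key(prefix: u8, felt: &Felt) -> [u8; 33] {
    let mut key = [0u8; 33];
    // Byte 0 encodes what the key refers to: 0 = emitting address, 1..=n = nth event key.
    key[0] = prefix;
    // The remaining 32 bytes are the felt itself.
    key[1..].copy_from_slice(&felt.to_bytes_be());
    key
}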
@jbcaron please add documentation on this logic (how we go from events to storing them in the bloom filter).
Also, it would be great to have some basic tests with events that also check the "might" and "definitely not" cases (they would also serve as docs).
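Something along these lines could work as a sketch (assuming the single_event fixture emits from address 1, as the other tests here suggest; fixture and API details may differ in the actual code):

#[rstest]
fn test_might_and_definitely_not(single_event: Event) {
    let writer = EventBloomWriter::from_events(std::iter::once(&single_event));
    let serialized = serde_json::to_string(&writer).unwrap();
    let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

    // "Might" case: filtering on the emitting address reports a possible hit.
    let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), None);
    assert!(searcher.search(&reader), "matching address should report a possible hit");

    // "Definitely not" case: an address that was never inserted is rejected.
    let searcher = EventBloomSearcher::new(Some(&Felt::from(99)), None);
    assert!(!searcher.search(&reader), "non-matching address should be rejected");
}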
Awesome! @Trantorian1 could we run the benchmarks to check that we are indeed gaining performance?
@jbcaron thank you so much for this awesome work!
I just left some comments, but we are good to go after discussing them!
@Trantorian1 although not strictly needed for merging this PR, it would be COOL if we could see those benchmarks going down in this PR. It would be awesome.
// Should match either key
let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3)]]));
@jbcaron What's the difference with line 282?
}

#[rstest]
fn test_single_event_exact_match(single_event: Event) {
Shouldn't an exact match be:
EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(2)], vec![Felt::from(3)]]));
// Empty vec array should be ignored
let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3)]]));
assert!(searcher.search(&reader), "Search should succeed when empty arrays are ignored");

// Only empty arrays should work
let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![]]));
assert!(searcher.search(&reader), "Search with only empty arrays should work");
@jbcaron I don't understand the comments. Should these two cases be true?
let serialized = serde_json::to_string(&writer).unwrap();
let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

// Should match either key
If you are trying to test the (OR) behavior, shouldn't this be something like:
// Should match either key
let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3), Felt::from(4)]]));
assert!(searcher.search(&reader));
fn test_multiple_events_negative_case(multiple_events: Vec<Event>) {
    let writer = EventBloomWriter::from_events(multiple_events.iter());
    let serialized = serde_json::to_string(&writer).unwrap();
    let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();
Maybe we could add a case where some keys match but one does not:
let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(5)], vec![Felt::from(2)], vec![Felt::from(3)]]));
assert!(!searcher.search(&reader));
Pull Request type
Please add the labels corresponding to the type of changes your PR introduces:
What is the current behavior?
Currently, the event pagination mechanism does not correctly handle the continuation token.
Additionally, event filtering does not leverage a bloom filter, which could improve performance by reducing unnecessary block lookups.
Resolves: #NA
What is the new behavior?
- Integration of mp-bloom-filter for event filtering: uses the mp-bloom-filter crate to improve filtering efficiency.
- Optimized event retrieval: uses take(chunk_size + 1) in the filtering process to determine whether the current block has more events (see the sketch under Other information below).
Does this introduce a breaking change?
Yes
Other information
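A minimal sketch of the take(chunk_size + 1) idea referenced above (illustrative names only, not the PR's actual code): fetch one more event than the page size so the presence of an extra element tells us whether a continuation token is needed.

fn paginate(events: impl Iterator<Item = u64>, chunk_size: usize) -> (Vec<u64>, bool) {
    // Take one extra element: if it exists, the block still has more matching
    // events and a continuation token should be returned for the next page.
    let mut page: Vec<u64> = events.take(chunk_size + 1).collect();
    let has_more = page.len() > chunk_size;
    page.truncate(chunk_size);
    (page, has_more)
}

fn main() {
    let (page, has_more) = paginate(0u64..10, 3);
    assert_eq!(page, vec![0, 1, 2]);
    assert!(has_more);
}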