
feat: store event bloom filter in DB #473

Open · jbcaron wants to merge 9 commits into base: main
Conversation

jbcaron
Member

@jbcaron jbcaron commented Jan 23, 2025

Pull Request type

Please add the labels corresponding to the type of changes your PR introduces:

  • Feature

What is the current behavior?

Currently, the event pagination mechanism does not correctly handle the continuation token.

Additionally, event filtering does not leverage a bloom filter, which could improve performance by reducing unnecessary block lookups.

Resolves: #NA

What is the new behavior?

  • Integration of mp-bloom-filter for event filtering:

    • Added the mp-bloom-filter crate to improve filtering efficiency.
    • Uses a bloom filter to quickly check if a block might contain relevant events before fetching it.
    • This optimization reduces the number of blocks read, improving query performance.
  • Optimized event retrieval:

    • Introduced take(chunk_size + 1) in the filtering process to determine whether the current block has more events (see the sketch after this list).
    • Ensures efficient pagination by accurately detecting when a block is fully processed.
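
For illustration, a minimal sketch of the take(chunk_size + 1) idiom, with made-up names (take_page is not the PR's actual function):

// Illustrative only: request one extra event so we can tell whether the
// current block still has more matches beyond this page.
fn take_page<E>(matching_events: impl Iterator<Item = E>, chunk_size: usize) -> (Vec<E>, bool) {
    let mut page: Vec<E> = matching_events.take(chunk_size + 1).collect();
    let block_has_more = page.len() > chunk_size; // the extra item is only a probe
    page.truncate(chunk_size);                    // never return more than chunk_size
    (page, block_has_more)
}

// If block_has_more is true, the continuation token should point back into this
// block; otherwise pagination can advance to the next candidate block (e.g. the
// next bloom-filter hit).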

Does this introduce a breaking change?

Yes

  • The continuation token format has changed.
  • A new DB column stores the bloom filter.

Other information

@jbcaron jbcaron added feature Request for new feature or enhancement node Related to the full node implementation db-migration Requires database schema changes or migration labels Jan 23, 2025
@jbcaron jbcaron self-assigned this Jan 23, 2025
@jbcaron jbcaron requested a review from Trantorian1 February 3, 2025 08:28
@jbcaron jbcaron marked this pull request as ready for review February 3, 2025 08:28
@jbcaron jbcaron added the performance Performance improvements or optimizations label Feb 3, 2025
#[cfg(test)]
#[test]
fn test_column_all() {
    assert_eq!(Column::ALL.len(), Column::NUM_COLUMNS);
}
Collaborator

Is this really useful since NUM_COLUMNS is already defined as Self::ALL.len();?

Member Author

Just in case the definition is updated later.
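
For context, a tiny self-contained illustration of the safeguard meant here, with made-up column names (this is not the real Column enum):

// Hypothetical: if NUM_COLUMNS were ever changed from Self::ALL.len() to a
// hard-coded value, the assert above would catch a forgotten update to ALL.
enum Column {
    BlockInfo,
    Events,
    EventBloom,
}

impl Column {
    const ALL: &'static [Column] = &[Column::BlockInfo, Column::Events, Column::EventBloom];
    const NUM_COLUMNS: usize = 3; // hard-coded: now the test actually guards something
}

#[test]
fn test_column_all() {
    assert_eq!(Column::ALL.len(), Column::NUM_COLUMNS);
}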

Comment on lines 166 to 182
fn bench_bloom_filters(c: &mut Criterion) {
    let test_data = generate_test_data(TEST_DATA_SIZE);

    // Test different hash functions
    let hashers = [
        get_hasher_benchmarks::<DefaultHasher>("DefaultHasher"),
        get_hasher_benchmarks::<AHasher>("AHasher"),
        get_hasher_benchmarks::<XxHash64>("XxHash64"),
    ];

    for hasher in &hashers {
        (hasher.sequential_insertion)(c, &test_data, hasher.name);
        (hasher.parallel_insertion)(c, &test_data, hasher.name);
        (hasher.lookup)(c, &test_data, hasher.name);
        (hasher.parallel_lookup)(c, &test_data, hasher.name);
    }
}
Collaborator

Really cool testing structure!

/// where:
/// - k is the number of hash functions
/// - p is the desired false positive rate (0.01)
/// Reference: [Bloom, B. H. (1970). "Space/Time Trade-offs in Hash Coding with Allowable Errors"](https://dl.acm.org/doi/10.1145/362686.362692)
Collaborator

Thanks!

Comment on lines 8 to 14
pub const HASH_COUNT: u8 = 7;

// Number of bits per element in the Bloom filter.
// The value 9.6 is optimal for a false positive rate of 1% (0.01).
const BITS_PER_ELEMENT: f64 = 9.6;

pub const FALSE_POSITIF_RATE: f64 = 0.01;
Collaborator

These constants appear several times in the codebase; maybe we should define them only once and reuse them?

Member Author

I agree. I’d also like to keep testing and production separate. Each should have its own constants.
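
For reference, the classic Bloom filter sizing formulas reproduce the constants above; a quick standalone check (not the crate's code):

fn main() {
    let p: f64 = 0.01; // target false positive rate

    // Optimal sizing: bits per element m/n = -ln(p) / (ln 2)^2,
    // optimal hash count k = (m/n) * ln 2 = -log2(p).
    let bits_per_element = -p.ln() / (2f64.ln() * 2f64.ln()); // ≈ 9.585, rounded up to 9.6
    let hash_count = (bits_per_element * 2f64.ln()).ceil();   // ≈ 6.64, rounded up to 7

    println!("bits/element ≈ {bits_per_element:.2}, hash functions = {hash_count}");
}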

@HermanObst

Hey there! I made a first quick pass on the PR and I have two initial questions:

  1. Why do we need to introduce a bloom filter? Were we facing low performance in the get_events RPC?
  2. Why are we implementing the bloom filter from scratch instead of using something like this?

@Trantorian1
Collaborator

Trantorian1 commented Feb 17, 2025

Why do we need to introduce a bloom filter? Were we facing low performance in the get_events RPC?

Yes, getEvents was by far our lowest performing RPC method, and was lagging behind other clients' implementations by several orders of magnitude.

Here is a graph of our getEvents performance compared to other nodes (this was part of a perf test I made earlier in 2024, no other changes have been made to getEvents since then).
[image: getEvents performance comparison graph]

If you're interested, you can find the tool I developed for this bench here.

@Trantorian1
Collaborator

Why are we implementing the bloom filter from scratch instead of using something like this?

Why not? This gives us much more control over our codebase and its performance characteristics. In particular, if I remember correctly @jbcaron, the goal was to create a bloom filter with a constant error rate independent of the block size. That and other optimizations might not have been possible had this not been implemented by hand.

@Trantorian1
Collaborator

It is also worth noting that we would not be the only team developing our own bloom filter; this is already the case for Pathfinder.

@jbcaron
Member Author

jbcaron commented Feb 17, 2025

Hey there! I made a first quick pass on the PR and I have two initial questions:

  1. Why do we need to introduce a bloom filter? Were we facing low performance in the get_events RPC?
  2. Why are we implementing the bloom filter from scratch instead of using something like this?

That's exactly right. In some tests, I observed response times that were up to three orders of magnitude faster. The Bloom filter implementation allows the filter size to depend on the number of keys being inserted, ensuring a constant false positive rate regardless of block size.

Additionally, this implementation enables precomputation of hashes for a set of keys, so that when scanning the filters of each block, we only compute the filter indices to check for the keys, without needing to recompute the hashes every time.

Lastly, the key pattern is also specifically designed for event filtering, optimizing lookup efficiency further.
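
A rough sketch of that precomputation pattern, mirroring the double hashing used by calculate_hashes further down (PreHashedKey is an illustrative name, not the crate's actual API):

use std::hash::{Hash, Hasher};

// Hash each query key once up front; scanning a block's filter then only needs
// cheap wrapping arithmetic and a modulo against that filter's bit size.
struct PreHashedKey {
    h1: u64,
    h2: u64,
}

const PHI: u64 = 0x9e37_79b9_7f4a_7c15; // 2^64/φ, same constant as in calculate_hashes

impl PreHashedKey {
    fn new<H: Hasher + Default, T: Hash>(item: &T) -> Self {
        let mut hasher = H::default();
        item.hash(&mut hasher);
        let h1 = hasher.finish();
        let h2 = h1.rotate_left(17).wrapping_mul(PHI);
        Self { h1, h2 }
    }

    // Bit indices to probe in a filter of `bit_size` bits, without re-hashing the key.
    fn bit_indices(&self, hash_count: u8, bit_size: u64) -> impl Iterator<Item = u64> {
        let (h1, h2) = (self.h1, self.h2);
        (0..hash_count).map(move |i| h1.wrapping_add(h2.wrapping_mul(i as u64)) % bit_size)
    }
}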

@HermanObst HermanObst left a comment

Incredible work here @jbcaron! 🚀

I just requested some changes for:

  • adding docs and tests in bloom_filter/src/events.rs
  • general discussion on some topics.

I'm also curious whether we have a full e2e test that checks the starknet_getEvents RPC call.

        let aligned_size = size_in_u64s << 6;

        Self {
            storage: AtomicBitStore::new(size_in_u64s as _),

Why not be explicit with the type here (usize)?

Quick debate:

  • Does it make sense for the size to be a u64?

If it does, it's also worth noting that this will panic if Madara runs on a 32-bit architecture (actually almost all of this impl would fail).

Of course, it could be argued that Madara will never run on 32 bits, but I'm happy to hear what you think.

Collaborator

@Trantorian1 Trantorian1 Feb 18, 2025

I mean, if someone ever tries to run Madara on a 32-bit architecture they have bigger problems than just the bloom filter, since half the codebase is going to explode/implode/both 😅
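
For what it's worth, a small standalone sketch of how the narrowing cast could be made explicit so a 32-bit target fails loudly instead of truncating (illustrative only, not the PR's code):

// size_in_u64s mirrors the variable in the snippet above.
fn words_for(size_in_u64s: u64) -> usize {
    usize::try_from(size_in_u64s).expect("bloom filter size exceeds this target's address space")
}

fn main() {
    let size_in_u64s: u64 = 128;
    let aligned_size_in_bits = size_in_u64s << 6; // 64 bits per u64 word
    println!("{} words, {} bits", words_for(size_in_u64s), aligned_size_in_bits);
}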

    /// * `item`: The item to add, which must implement the Hash trait
    pub fn add<T: Hash>(&self, item: &T) {
        calculate_hashes::<H, T>(self.hash_count, item)
            .for_each(|hash| self.storage.set_bit((hash % self.bit_size) as usize));

ditto

Comment on lines 271 to 280
fn calculate_hashes<H: Hasher + Default, T: Hash>(hash_count: u8, item: &T) -> impl Iterator<Item = u64> + '_ {
    let mut hasher1 = H::default();
    item.hash(&mut hasher1);
    let h1 = hasher1.finish();

    const PHI: u64 = 0x9e37_79b9_7f4a_7c15; // 2^64/φ
    let h2 = h1.rotate_left(17).wrapping_mul(PHI);

    // Double hashing: derive all hash_count probe values from (h1, h2) with a single real hash pass.
    (0..hash_count).map(move |i| h1.wrapping_add(h2.wrapping_mul(i as u64)))
}

Nice stuff!

@@ -0,0 +1,191 @@
use crate::*;

Great test suite 🚀

@@ -0,0 +1,181 @@
use criterion::{black_box, criterion_group, criterion_main, Criterion};

This is great, as it gives us a tool to prevent regressions in bloom filter performance (e.g. we could add this to CI and raise alerts if performance goes down).

Regarding the different hash functions used: are there three of them so we can decide which one to use?

It would be even better if we could have something like this, but with RPC calls (basically the reason for this PR).

My understanding is that @Trantorian1 built some benchmarks that check that.

Would it make sense to add them to CI? (I would also love to see those benchmarks going down with this PR 🚀)

Collaborator

The test suite was made more for local exploration and requires fully synced Juno, Pathfinder, and Madara nodes running simultaneously. Because of this it is rather tricky to run in CI, as we would need to keep these nodes up to date with the tip of the chain and manage downtime :/

There is probably a better way to do this; I'm thinking of a local devnet to which we publish some sample data for each node to sync from (we would essentially be mocking a chain), but I haven't explored this, and to be honest I'm not very familiar with publishing state to Starknet (I've never done Cairo).

So basically, while the benches are handy for detecting broad performance trends across nodes, they are pretty hard to run at the moment, which makes it difficult to integrate them into CI.

Comment on lines 74 to 88
    fn events_to_bloom_keys<'a, I>(events: I) -> impl Iterator<Item = [u8; 33]> + 'a
    where
        I: Iterator<Item = &'a Event> + 'a,
    {
        events.flat_map(Self::event_to_bloom_keys)
    }

    fn event_to_bloom_keys(event: &Event) -> impl Iterator<Item = [u8; 33]> + '_ {
        let from_address_key = create_bloom_key(0, &event.from_address);

        let keys = event.keys.iter().enumerate().map(|(i, key)| create_bloom_key(i as u8 + 1, key));

        std::iter::once(from_address_key).chain(keys)
    }
}

@jbcaron please add documentation on this logic (how we go from events to the keys stored in the bloom filter).

Also, it would be great to have some basic tests with events that also check the "might contain" and "definitely does not contain" cases (these would also serve as docs).
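
As a starting point, a hedged sketch of what such a might / definitely-not test could look like, reusing the types and fixtures already exercised in this PR's tests (the fixture values are assumptions, see the comment):

// Sketch only (not part of the PR): spells out the mapping used above —
// bloom key = 1 prefix byte + 32 felt bytes, prefix 0 for from_address and
// prefix i + 1 for the i-th event key — and exercises the two answers a bloom
// filter can give. Assumes the `single_event` fixture emits from_address = 1
// with keys [2, 3], as the exact-match test in this file suggests.
#[rstest]
fn test_might_and_definitely_not(single_event: Event) {
    let writer = EventBloomWriter::from_events(std::iter::once(&single_event));
    let serialized = serde_json::to_string(&writer).unwrap();
    let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

    // Keys that were inserted: the filter must answer "might contain"
    // (a bloom filter never yields false negatives).
    let hit = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(2)]]));
    assert!(hit.search(&reader), "inserted address/key must report 'might contain'");

    // A key that was never inserted at position 0: "definitely does not contain".
    let miss = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(999)]]));
    assert!(!miss.search(&reader), "absent key should be filtered out");
}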

@jbcaron jbcaron requested a review from HermanObst February 24, 2025 15:20
@jbcaron jbcaron added the Skip-Changelog Skip orchestrator changelog check label Feb 26, 2025
@HermanObst

Awesome! @Trantorian1 could we run the benchmarks to check that we are indeed gaining performance?

@HermanObst HermanObst left a comment

@jbcaron thank you so much for this awesome work!
I just left some comments, but we are good to go after discussing them!

@Trantorian1 although not strictly needed for merging this PR, it would be COOL if we could see those benchmarks going down with this PR. It would be awesome.

Comment on lines 319 to 320
        // Should match either key
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3)]]));

@jbcaron What's the difference from line 282?

}

#[rstest]
fn test_single_event_exact_match(single_event: Event) {
@HermanObst HermanObst Feb 27, 2025

Shouldn't an exact match be:

EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(2)], vec![Felt::from(3)]]));

Comment on lines 282 to 288
        // Empty vec array should be ignored
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3)]]));
        assert!(searcher.search(&reader), "Search should succeed when empty arrays are ignored");

        // Only empty arrays should work
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![]]));
        assert!(searcher.search(&reader), "Search with only empty arrays should work");

@jbcaron I don't understand the comments. Should these two cases be true?

        let serialized = serde_json::to_string(&writer).unwrap();
        let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

        // Should match either key

If you are trying to test the (OR) behavior, shouldn't this be something like:

        // Should match either key
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3),Felt::from(4)]]));
        assert!(searcher.search(&reader));
    }

fn test_multiple_events_negative_case(multiple_events: Vec<Event>) {
    let writer = EventBloomWriter::from_events(multiple_events.iter());
    let serialized = serde_json::to_string(&writer).unwrap();
    let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

Maybe we could add a case where some keys match but one does not:

        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(5)],vec![Felt::from(2)], vec![Felt::from(3)]]));
        assert!(!searcher.search(&reader));
