
feat: store event bloom filter in DB #473

Open · jbcaron wants to merge 9 commits into base: main
Conversation

jbcaron
Member

@jbcaron jbcaron commented Jan 23, 2025

Pull Request type

Please add the labels corresponding to the type of changes your PR introduces:

  • Feature

What is the current behavior?

Currently, the event pagination mechanism does not correctly handle the continuation token.

Additionally, event filtering does not leverage a bloom filter, which could improve performance by reducing unnecessary block lookups.

Resolves: #NA

What is the new behavior?

  • Integration of mp-bloom-filter for event filtering:

    • Added the mp-bloom-filter crate to improve filtering efficiency.
    • Uses a bloom filter to quickly check if a block might contain relevant events before fetching it.
    • This optimization reduces the number of blocks read, improving query performance.
  • Optimized event retrieval:

    • Introduced take(chunk_size + 1) in the filtering process to determine whether the current block has more events (see the sketch after this list).
    • Ensures efficient pagination by accurately detecting when a block is fully processed.
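
For illustration, a minimal sketch of the take(chunk_size + 1) idiom, with made-up names (take_page is not the PR's actual function):

// Illustrative only: request one extra event so we can tell whether the
// current block still has more matches beyond this page.
fn take_page<E>(matching_events: impl Iterator<Item = E>, chunk_size: usize) -> (Vec<E>, bool) {
    let mut page: Vec<E> = matching_events.take(chunk_size + 1).collect();
    let block_has_more = page.len() > chunk_size; // the extra item is only a probe
    page.truncate(chunk_size);                    // never return more than chunk_size
    (page, block_has_more)
}

// If block_has_more is true, the continuation token should point back into this
// block; otherwise pagination can advance to the next candidate block (e.g. the
// next bloom-filter hit).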

Does this introduce a breaking change?

Yes

  • The continuation token format has changed.
  • A new DB column stores the bloom filter.

Other information

@jbcaron jbcaron added feature Request for new feature or enhancement node Related to the full node implementation db-migration Requires database schema changes or migration labels Jan 23, 2025
@jbcaron jbcaron self-assigned this Jan 23, 2025
@jbcaron jbcaron requested a review from Trantorian1 February 3, 2025 08:28
@jbcaron jbcaron marked this pull request as ready for review February 3, 2025 08:28
@jbcaron jbcaron added the performance Performance improvements or optimizations label Feb 3, 2025
#[cfg(test)]
#[test]
fn test_column_all() {
    assert_eq!(Column::ALL.len(), Column::NUM_COLUMNS);
}
Collaborator

Is this really useful since NUM_COLUMNS is already defined as Self::ALL.len();?

Member Author

Just in case the definition is updated later.
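
For context, a tiny self-contained illustration of the safeguard meant here, with made-up column names (this is not the real Column enum):

// Hypothetical: if NUM_COLUMNS were ever changed from Self::ALL.len() to a
// hard-coded value, the assert above would catch a forgotten update to ALL.
enum Column {
    BlockInfo,
    Events,
    EventBloom,
}

impl Column {
    const ALL: &'static [Column] = &[Column::BlockInfo, Column::Events, Column::EventBloom];
    const NUM_COLUMNS: usize = 3; // hard-coded: now the test actually guards something
}

#[test]
fn test_column_all() {
    assert_eq!(Column::ALL.len(), Column::NUM_COLUMNS);
}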

Comment on lines 166 to 182
fn bench_bloom_filters(c: &mut Criterion) {
    let test_data = generate_test_data(TEST_DATA_SIZE);

    // Test different hash functions
    let hashers = [
        get_hasher_benchmarks::<DefaultHasher>("DefaultHasher"),
        get_hasher_benchmarks::<AHasher>("AHasher"),
        get_hasher_benchmarks::<XxHash64>("XxHash64"),
    ];

    for hasher in &hashers {
        (hasher.sequential_insertion)(c, &test_data, hasher.name);
        (hasher.parallel_insertion)(c, &test_data, hasher.name);
        (hasher.lookup)(c, &test_data, hasher.name);
        (hasher.parallel_lookup)(c, &test_data, hasher.name);
    }
}
Collaborator

Really cool testing structure!

/// where:
/// - k is the number of hash functions
/// - p is the desired false positive rate (0.01)
/// Reference: [Bloom, B. H. (1970). "Space/Time Trade-offs in Hash Coding with Allowable Errors"](https://dl.acm.org/doi/10.1145/362686.362692)
Collaborator

Thanks!

Comment on lines 8 to 14
pub const HASH_COUNT: u8 = 7;

// Number of bits per element in the Bloom filter.
// The value 9.6 is optimal for a false positive rate of 1% (0.01).
const BITS_PER_ELEMENT: f64 = 9.6;

pub const FALSE_POSITIF_RATE: f64 = 0.01;
Collaborator

These constants appear several times in the codebase; maybe we should define them only once and reuse them?

Member Author

I agree. I’d also like to keep testing and production separate. Each should have its own constants.
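
For reference, the classic Bloom filter sizing formulas reproduce the constants above; a quick standalone check (not the crate's code):

fn main() {
    let p: f64 = 0.01; // target false positive rate

    // Optimal sizing: bits per element m/n = -ln(p) / (ln 2)^2,
    // optimal hash count k = (m/n) * ln 2 = -log2(p).
    let bits_per_element = -p.ln() / (2f64.ln() * 2f64.ln()); // ≈ 9.585, rounded up to 9.6
    let hash_count = (bits_per_element * 2f64.ln()).ceil();   // ≈ 6.64, rounded up to 7

    println!("bits/element ≈ {bits_per_element:.2}, hash functions = {hash_count}");
}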

@HermanObst

Hey there! I made a first quick pass on the PR and I have two initial questions:

  1. Why do we need to introduce a bloom filter? Were we facing low performance in the get_events RPC?
  2. Why are we implementing the bloom filter from scratch instead of using something like this?

@Trantorian1
Collaborator

Trantorian1 commented Feb 17, 2025

Why do we need to introduce a bloom filter? Were we facing low performance in the get_events RPC?

Yes, getEvents was by far our lowest performing RPC method, and was lagging behind other clients' implementations by several orders of magnitude.

Here is a graph of our getEvents performance compared to other nodes (this was part of a perf test I made earlier in 2024, no other changes have been made to getEvents since then).
[image: getEvents performance comparison graph]

If you're interested, you can find the tool I developed for this bench here.

@Trantorian1
Collaborator

Why are we implementing the bloom filter from scratch instead of using something like this?

Why not? This gives us much more control over our codebase and its performance characteristics. In particular, if I remember correctly @jbcaron, the goal was to create a bloom filter with a constant error rate independent of the block size. That and other optimizations might not have been possible had this not been implemented by hand.

@Trantorian1
Collaborator

It is also worth noting that we would not be the only team developing our own bloom filter; this is already the case for Pathfinder.

@jbcaron
Member Author

jbcaron commented Feb 17, 2025

Hey there! I made a first quick pass on the PR and I have two initial questions:

  1. Why do we need to introduce a bloom filter? Were we facing low performance in the get_events RPC?
  2. Why are we implementing the bloom filter from scratch instead of using something like this?

That's exactly right. In some tests, I observed response times that were up to three orders of magnitude faster. The Bloom filter implementation allows the filter size to depend on the number of keys being inserted, ensuring a constant false positive rate regardless of block size.

Additionally, this implementation enables precomputation of hashes for a set of keys, so that when scanning the filters of each block, we only compute the filter indices to check for the keys, without needing to recompute the hashes every time.

Lastly, the key pattern is also specifically designed for event filtering, optimizing lookup efficiency further.
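
A rough sketch of that precomputation pattern, mirroring the double hashing used by calculate_hashes further down (PreHashedKey is an illustrative name, not the crate's actual API):

use std::hash::{Hash, Hasher};

// Hash each query key once up front; scanning a block's filter then only needs
// cheap wrapping arithmetic and a modulo against that filter's bit size.
struct PreHashedKey {
    h1: u64,
    h2: u64,
}

const PHI: u64 = 0x9e37_79b9_7f4a_7c15; // 2^64/φ, same constant as in calculate_hashes

impl PreHashedKey {
    fn new<H: Hasher + Default, T: Hash>(item: &T) -> Self {
        let mut hasher = H::default();
        item.hash(&mut hasher);
        let h1 = hasher.finish();
        let h2 = h1.rotate_left(17).wrapping_mul(PHI);
        Self { h1, h2 }
    }

    // Bit indices to probe in a filter of `bit_size` bits, without re-hashing the key.
    fn bit_indices(&self, hash_count: u8, bit_size: u64) -> impl Iterator<Item = u64> {
        let (h1, h2) = (self.h1, self.h2);
        (0..hash_count).map(move |i| h1.wrapping_add(h2.wrapping_mul(i as u64)) % bit_size)
    }
}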

@HermanObst HermanObst left a comment

Incredible work here @jbcaron! 🚀

I just requested some changes for:

  • adding docs and tests in bloom_filter/src/events.rs
  • general discussion on some topics.

I'm also curious whether we have a full e2e test that checks the starknet_getEvents RPC call.

        let aligned_size = size_in_u64s << 6;

        Self {
            storage: AtomicBitStore::new(size_in_u64s as _),

Why not be explicit with the type here (usize)?

Quick debate:

  • Does it make sense for the size to be a u64?

If it does, it's also worth noting that this will panic if Madara runs on a 32-bit architecture (actually almost all of this impl would fail).

Of course, it could be argued that Madara will never run on 32 bits, but I'm happy to hear what you think.

Collaborator

@Trantorian1 Trantorian1 Feb 18, 2025

I mean, if someone ever tries to run Madara on a 32-bit architecture they have bigger problems than just the bloom filter, since half the codebase is going to explode/implode/both 😅
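
For what it's worth, a small standalone sketch of how the narrowing cast could be made explicit so a 32-bit target fails loudly instead of truncating (illustrative only, not the PR's code):

// size_in_u64s mirrors the variable in the snippet above.
fn words_for(size_in_u64s: u64) -> usize {
    usize::try_from(size_in_u64s).expect("bloom filter size exceeds this target's address space")
}

fn main() {
    let size_in_u64s: u64 = 128;
    let aligned_size_in_bits = size_in_u64s << 6; // 64 bits per u64 word
    println!("{} words, {} bits", words_for(size_in_u64s), aligned_size_in_bits);
}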

    /// * `item`: The item to add, which must implement the Hash trait
    pub fn add<T: Hash>(&self, item: &T) {
        calculate_hashes::<H, T>(self.hash_count, item)
            .for_each(|hash| self.storage.set_bit((hash % self.bit_size) as usize));

ditto

Comment on lines 271 to 280
fn calculate_hashes<H: Hasher + Default, T: Hash>(hash_count: u8, item: &T) -> impl Iterator<Item = u64> + '_ {
    let mut hasher1 = H::default();
    item.hash(&mut hasher1);
    let h1 = hasher1.finish();

    const PHI: u64 = 0x9e37_79b9_7f4a_7c15; // 2^64/φ
    let h2 = h1.rotate_left(17).wrapping_mul(PHI);

    // Double hashing: derive all hash_count probe values from (h1, h2) with a single real hash pass.
    (0..hash_count).map(move |i| h1.wrapping_add(h2.wrapping_mul(i as u64)))
}

Nice stuff!

@@ -0,0 +1,191 @@
use crate::*;

Great test suite 🚀

@@ -0,0 +1,181 @@
use criterion::{black_box, criterion_group, criterion_main, Criterion};

This is great, as it gives us a tool to prevent regressions in bloom filter performance (e.g. we could add this to CI and raise alerts if performance goes down).

Regarding the different hash functions used: are there three of them so we can decide which one to use?

It would be even better if we could have something like this, but with RPC calls (basically the reason for this PR).

My understanding is that @Trantorian1 built some benchmarks that check that.

Would it make sense to add them to CI? (I would also love to see those benchmarks going down with this PR 🚀)

Collaborator

The test suite was made more for local exploration and requires fully synced Juno, Pathfinder, and Madara nodes running simultaneously. Because of this it is rather tricky to run in CI, as we would need to keep these nodes up to date with the tip of the chain and manage downtime :/

There is probably a better way to do this; I'm thinking of a local devnet to which we publish some sample data for each node to sync from (we would essentially be mocking a chain), but I haven't explored this, and to be honest I'm not very familiar with publishing state to Starknet (I've never done Cairo).

So basically, while the benches are handy for detecting broad performance trends across nodes, they are pretty hard to run at the moment, which makes it difficult to integrate them into CI.

Comment on lines 74 to 88
    fn events_to_bloom_keys<'a, I>(events: I) -> impl Iterator<Item = [u8; 33]> + 'a
    where
        I: Iterator<Item = &'a Event> + 'a,
    {
        events.flat_map(Self::event_to_bloom_keys)
    }

    fn event_to_bloom_keys(event: &Event) -> impl Iterator<Item = [u8; 33]> + '_ {
        let from_address_key = create_bloom_key(0, &event.from_address);

        let keys = event.keys.iter().enumerate().map(|(i, key)| create_bloom_key(i as u8 + 1, key));

        std::iter::once(from_address_key).chain(keys)
    }
}

@jbcaron please add documentation on this logic (how we go from events to the keys stored in the bloom filter).

Also, it would be great to have some basic tests with events that also check the "might contain" and "definitely does not contain" cases (these would also serve as docs).
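
As a starting point, a hedged sketch of what such a might / definitely-not test could look like, reusing the types and fixtures already exercised in this PR's tests (the fixture values are assumptions, see the comment):

// Sketch only (not part of the PR): spells out the mapping used above —
// bloom key = 1 prefix byte + 32 felt bytes, prefix 0 for from_address and
// prefix i + 1 for the i-th event key — and exercises the two answers a bloom
// filter can give. Assumes the `single_event` fixture emits from_address = 1
// with keys [2, 3], as the exact-match test in this file suggests.
#[rstest]
fn test_might_and_definitely_not(single_event: Event) {
    let writer = EventBloomWriter::from_events(std::iter::once(&single_event));
    let serialized = serde_json::to_string(&writer).unwrap();
    let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

    // Keys that were inserted: the filter must answer "might contain"
    // (a bloom filter never yields false negatives).
    let hit = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(2)]]));
    assert!(hit.search(&reader), "inserted address/key must report 'might contain'");

    // A key that was never inserted at position 0: "definitely does not contain".
    let miss = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(999)]]));
    assert!(!miss.search(&reader), "absent key should be filtered out");
}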

@jbcaron jbcaron requested a review from HermanObst February 24, 2025 15:20
@jbcaron jbcaron added the Skip-Changelog Skip orchestrator changelog check label Feb 26, 2025
@HermanObst

Awesome! @Trantorian1 could we run the benchmarks to check that we are indeed gaining performance?

@HermanObst HermanObst left a comment

@jbcaron thank you so much for this awesome work!
I just left some comments, but we are good to go after discussing them!

@Trantorian1 although not strictly needed for merging this PR, it would be COOL if we could see those benchmarks going down with this PR. It would be awesome.

Comment on lines 319 to 320
        // Should match either key
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3)]]));

@jbcaron What's the difference from line 282?

}

#[rstest]
fn test_single_event_exact_match(single_event: Event) {
@HermanObst HermanObst Feb 27, 2025

Shouldn't an exact match be:

EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(2)], vec![Felt::from(3)]]));

Comment on lines 282 to 288
        // Empty vec array should be ignored
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3)]]));
        assert!(searcher.search(&reader), "Search should succeed when empty arrays are ignored");

        // Only empty arrays should work
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![]]));
        assert!(searcher.search(&reader), "Search with only empty arrays should work");

@jbcaron I don't understand the comments. Should these two cases be true?

        let serialized = serde_json::to_string(&writer).unwrap();
        let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

        // Should match either key

If you are trying to test the (OR) behavior, shouldn't this be something like:

        // Should match either key
        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![], vec![Felt::from(3),Felt::from(4)]]));
        assert!(searcher.search(&reader));
    }

fn test_multiple_events_negative_case(multiple_events: Vec<Event>) {
    let writer = EventBloomWriter::from_events(multiple_events.iter());
    let serialized = serde_json::to_string(&writer).unwrap();
    let reader: EventBloomReader = serde_json::from_str(&serialized).unwrap();

Maybe we could add a case where some keys match but one does not:

        let searcher = EventBloomSearcher::new(Some(&Felt::from(1)), Some(&[vec![Felt::from(5)],vec![Felt::from(2)], vec![Felt::from(3)]]));
        assert!(!searcher.search(&reader));
