
Add cuco::bloom_filter #573

Merged · 58 commits · Oct 2, 2024

Conversation

@sleeepyjack (Collaborator) commented Aug 8, 2024

Supersedes #101

Implementation of a GPU "Blocked Bloom Filter".

This PR is an updated/optimized version of #101 and features the following improvements:

  • Incorporate the new library design
  • Improve performance by computing the key's bit pattern based on a single hash value instead of using a double hashing derivative

@sleeepyjack added the "type: feature request", "helps: rapids", and "In Progress" labels Aug 8, 2024
@sleeepyjack self-assigned this Aug 8, 2024
@sleeepyjack (Collaborator, Author)

CC @kkraus14

@sleeepyjack mentioned this pull request Aug 8, 2024
@sleeepyjack added the "Needs Review" label Aug 8, 2024
@kkraus14 commented Aug 8, 2024

Thanks for throwing this up @sleeepyjack! I've pinged some folks on my side to take a pass at reviewing the host and device APIs as well as general functionality here in order to provide some high level feedback as a starting point.

@felipeblazing

Thanks for working on this. I know that we are really excited at the prospect of being able to use this to accelerate some of our workloads. I want to describe some of the ways we were hoping to use this.

  1. Test if sets are disjoint in a distributed context.

We would like to be able to serialize and deserialize the underlying data of the Bloom filter so we can send it across the network, preferably in a way that doesn't require a copy of the structure, since we use libraries like UCX that can send directly from the GPU. Not being able to do this would probably make this unusable for our distributed use case.

Another thing worth considering, in particular for testing whether sets are disjoint, is partitioned Bloom filters, as described in the papers "Understanding Bloom Filter Intersection for Lazy Address-Set Disambiguation" and "A Case for Partitioned Bloom Filters".

You might also consider a contains API that doesn't write output to an iterator but rather returns a boolean indicating whether or not the Bloom filter contains an element from the input being offered, and an API that can take two Bloom filters and test if the sets that produced them are disjoint, something like `bloom_filter.contains(other_bloom_filter)`.

  2. Use bloom filters on large distributed joins to filter out rows from the tables being joined

Say you want to join two tables. One is 1 TB, the other is 10 GB. We would like to be able to build Bloom filters from subsets of the 10 GB table and then use those filters to test whether subsets of the 1 TB table are disjoint with those subsets of the 10 GB table.

There are two features we could benefit from to help enable this work.

One is that we build Bloom filters for all subsets of the tables (these might be distributed across many nodes) and then shuffle them around to see which subset combinations can be ruled out for joining. So in this case it's the same API I mentioned above, where you can see if the inputs of two Bloom filters were disjoint.

The other is that when they are possibly not disjoint we could then apply the bloom filter row by row with APIs that seem to be already available in this PR.

  3. Be able to make bloom filters from multiple columns

Say someone is joining two tables on two columns, e.g. a.x = b.x and a.y = b.y.

The Bloom filter we would want to build in this case is one where the input is a combination of x and y, rather than having to make two Bloom filters, one for each column. This would do a better job of lowering the false positive rate than testing them separately.

@sleeepyjack (Collaborator, Author)

Thanks for the valuable insights, @felipeblazing!

I can address some of the points right away:

We would like to be able to serialize and deserialize the underlying data of the bloom filter to be able to send it across the network, preferably something that doesn't require a copy of the structure

This should be doable. We did a similar thing for our distinct_count_estimator aka HyperLogLog++ implementation. tl;dr we provide access to the underlying memory span so it can be exported and serialized. Deserialization can be done by constructing a new _ref object over the raw byte vector. Here is an example of how this works with our HLL implementation.
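The span-based scheme described above can be sketched generically on the host (all names here are hypothetical illustrations, not cuco's actual API): expose the filter's word storage, ship the raw bytes, and rebuild a non-owning ref over them on the receiving side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical non-owning view over a filter's word storage.
struct filter_ref {
  std::uint32_t* words;  // not owned; lifetime managed by the caller
  std::size_t    size;   // number of words
};

// Export the underlying storage as raw bytes (e.g. for network transfer).
std::vector<std::uint8_t> serialize(filter_ref f) {
  std::vector<std::uint8_t> bytes(f.size * sizeof(std::uint32_t));
  std::memcpy(bytes.data(), f.words, bytes.size());
  return bytes;
}

// Reconstruct a ref over a received byte buffer without copying the words.
filter_ref deserialize(std::vector<std::uint8_t>& bytes) {
  return {reinterpret_cast<std::uint32_t*>(bytes.data()),
          bytes.size() / sizeof(std::uint32_t)};
}
```

In the GPU case the byte span would live in device memory and could be handed to UCX directly; the principle is the same.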

Another thing that might be considered is that in particular for testing if sets are disjoint is considering using partitioned bloom filters

Thanks for sharing those papers. I don't have access permissions so I requested them. From my rough understanding, a partitioned Bloom filter stores the signature for each key in k different memory locations that are likely to be far apart from each other. I have some concerns if this scales well on the GPU. With the current blocked Bloom filter approach we aim to minimize the number of sector loads/stores as much as possible (in the best case the entire fingerprint of a key falls into a single sector or even word), since the data structure is strictly bottlenecked by random access performance of the memory system. I will dive through the papers to see if there's something we can do to allow for distributed workloads.

You might also consider a contains API that doesn't write output to an iterator but rather returns a boolean indicating whether or not the Bloom filter contains an element from the input being offered, and an API that can take two Bloom filters and test if the sets that produced them are disjoint, something like `bloom_filter.contains(other_bloom_filter)`.

Yep, this one should be easy to implement. Naming-wise I would go with something like contains_any(…) but that's just an idea.
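A conservative filter-vs-filter check along these lines is straightforward in principle: a key added to both filters (built with identical parameters) sets the same bits in both, so an all-zero bitwise AND of the two bit arrays proves the underlying sets are disjoint. A minimal host-side toy sketch (hypothetical names and parameters, not the cuco API):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// splitmix64 finalizer, used here as a stand-in hash function.
constexpr std::uint64_t mix(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ull;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
  return x ^ (x >> 31);
}

// Toy blocked filter: one 64-bit word per block, two pattern bits per key.
struct toy_filter {
  std::array<std::uint64_t, 64> words{};

  void add(std::uint64_t key) {
    auto const h = mix(key);
    auto& w = words[h % words.size()];
    w |= 1ull << (h >> 6 & 63);
    w |= 1ull << (h >> 12 & 63);
  }

  // Conservative disjointness test: a key added to both filters sets the
  // same bits in both, so an all-zero AND of the bit arrays proves the
  // underlying sets are disjoint; a non-zero AND means "possibly not".
  bool maybe_intersects(toy_filter const& other) const {
    for (std::size_t i = 0; i < words.size(); ++i)
      if (words[i] & other.words[i]) return true;
    return false;
  }
};
```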

Be able to make bloom filters from multiple columns

I think this would already work by customizing the hash function similar to what cudf does with their row_hasher. You could combine the hash values for each individual column like so: h_a = hash(a.x)^hash(a.y). If a.x = b.x and a.y = b.y then h_a = h_b. Although I think cudf's row hasher does something smarter than that.

@bdice (Contributor) commented Aug 12, 2024

Although I think cudf's row hasher does something smarter than that.

cuDF has 32-bit and 64-bit hash_combine implementations. Usually, the crucial thing here is that you need a non-commutative function so that combine(left, right) and combine(right, left) give different results.
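For illustration, a Boost-style 32-bit combiner (not necessarily cuDF's exact mixer) shows the property: the running seed enters the expression asymmetrically, so operand order matters, unlike a plain XOR.

```cpp
#include <cassert>
#include <cstdint>

// Boost-style hash_combine: the seed is both XORed and fed back shifted,
// so combine(a, b) != combine(b, a) in general.
constexpr std::uint32_t hash_combine(std::uint32_t seed, std::uint32_t h) {
  return seed ^ (h + 0x9e3779b9u + (seed << 6) + (seed >> 2));
}
```

A row hash over columns x and y could then be `hash_combine(hash_combine(0, hash(x)), hash(y))`, keeping column order significant.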

@jbapple commented Aug 13, 2024

Another thing that might be considered is that in particular for testing if sets are disjoint is considering using partitioned bloom filters as described in these papers Understanding Bloom Filter Intersection for Lazy Address-Set Disambiguation A Case for Partitioned Bloom Filters

Hi @felipeblazing! This looks to me, based on the code as well as the PR message, like it is based on Apache Impala's Bloom filters, which in turn infiltrated Apache Arrow, Apache Parquet, and so on - at least as long as pattern_bits_ is no more than window_size. I mention this because the Impala filters actually are partitioned Bloom filters - but partitioned within a small contiguous block, with each partition being the size of a single word. (See also Section 3 of Performance-Optimal Filtering: Bloom Overtakes Cuckoo at High Throughput, which calls this partitioning within a block "sectorization").

On the other hand, when pattern_bits_ is greater than window_size (which looks like the default, since window_size defaults to 1), this is only partially sectorized/partitioned - multiple bits will be set in some or all partitions. For a window_size of 1, this effectively means no partitioning.

For SIMD on CPU, sectorized/partitioned with pattern_bits_ == window_size == 8 was the right choice for Impala's performance, as it then accessed only 256 bits (thus fitting within one cache line), but I'm guessing that window_size == 1 has better performance for Nvidia GPUs, based on much smaller (32-bit?) cache lines (or the GPU equivalent)? I don't know enough about GPU hardware yet to know that. However, if more cache locality is available, the extent to which the filter is sectorized/partitioned can be increased by increasing window_size while keeping pattern_bits_ the same. This would also reduce the false positive probability without increasing the total filter size.
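The relationship between pattern_bits_ and window_size can be illustrated with a small host-side sketch (parameter names mirror this discussion, not cuco's internals): when pattern_bits == window_size each word of the block receives exactly one bit (fully sectorized), while window_size == 1 collapses all pattern bits into a single word (no partitioning).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Distribute `pattern_bits` bits of a key's hash over a block of
// `window_size` 32-bit words, round-robin across the words.
std::vector<std::uint32_t> block_pattern(std::uint64_t hash,
                                         int pattern_bits,
                                         int window_size) {
  std::vector<std::uint32_t> block(window_size, 0);
  for (int i = 0; i < pattern_bits; ++i) {
    block[i % window_size] |= 1u << (hash & 31);  // pick a bit in [0, 32)
    hash = hash >> 5 | hash << 59;                // consume 5 hash bits
  }
  return block;
}
```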

@sleeepyjack (Collaborator, Author)

Impala filters actually are partitioned Bloom filters - but partitioned within a small contiguous block, with each partition being the size of a single word.

Very interesting point! Yes, you could apply the same approach to the blocked (or sectorized) Bloom filter in this PR.

For SIMD on CPU, sectorized/partitioned with a pattern_bits_ == window_size == 8 was the right choice for Impala's performance, as it accessed then only 256 bits (thus fitting within one cache line), but I'm guessing that window_size == 1 has better performance for Nvidia GPUs, based on much smaller (32 bit?) cache lines (or the the GPU equivalent)?

GPUs follow a similar principle, where the GPU's cache line size is either 32 bytes (a sector) or 4×32 bytes, aka an L2 slice, depending on how you look at it. In 9332c9a I was able to fix some performance issues for when window_size > 1. I owe you folks some new benchmarks, but my initial observation was that with this fix, performance stays more or less constant (close to SOL GUPS throughput) up to the point where window_size exceeds the size of a single sector. So yeah, having a partitioned/sectorized filter with good performance is totally feasible with the current design.

@jbapple commented Aug 14, 2024

…performance stays more or less constant (close to SOL GUPS throughput) up to the point where the window_size exceeds the size of a single sector.

Would it make sense, then, for window_size to be min(8, pattern_bits_) by default as long as pattern_bits_ is a power of 2 (since sizeof(word_type) == 4 and a sector is 8 * 4 = 32 bytes?), rather than 1 by default?

With some other small changes, this could also support using the Bloom filters from Parquet, KVRocks, and Impala (and forks Doris and StarRocks) without having to read the input keys and re-encode a new Bloom filter. I can imagine this might be of use mostly in the Parquet case.

PointKernel added a commit that referenced this pull request Sep 18, 2024

Address #573 (comment)

This PR updates the existing code to use `cuda::std::byte` as the
default device data type instead of `std::byte`. This change addresses
potential issues where `cuda::std::` utilities cannot be applied to
`std::byte` when relaxed constexprs are disabled.
```cpp
class Extent = cuco::extent<std::size_t>,
cuda::thread_scope Scope = cuda::thread_scope_device,
class Policy =
  cuco::bloom_filter_policy<cuco::xxhash_64<Key>, cuda::std::array<std::uint32_t, 8>>,
```
Member

probably worth adding a strong type instead of using plain cuda::std::array

Collaborator Author

Hmm, I also thought about this. The information needed here is the word type and block size (determined by typename Block::value_type and cuda::std::tuple_size_v<Block>, i.e., standardized facilities to describe a container's value type and static size). So technically any container that follows this concept can be used to describe the filter block parameters.

This type has a single use in cuco and since the Block type itself is actually never used, i.e., we need it only to extract the block parameters, I would rather vote for replacing the Block tparam in the policy with bloom_filter_policy<class Hash, class Word, uint32_t WordsPerBlock>. WDYT?
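The proposed policy shape might look roughly like this (a sketch of the suggestion only, not cuco's final API):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <type_traits>

// Sketch: the Block container tparam is replaced by the two parameters it
// was only ever used to extract, the word type and the block size.
template <class Hash, class Word, std::uint32_t WordsPerBlock>
struct bloom_filter_policy {
  static_assert(std::is_unsigned_v<Word>, "Word must be an unsigned integer type");
  using hasher    = Hash;
  using word_type = Word;
  static constexpr std::uint32_t words_per_block = WordsPerBlock;
};
```

This keeps the same information as `cuda::std::array<std::uint32_t, 8>` (word type and static size) without ever instantiating a container.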

@PointKernel changed the title from "cuco::bloom_filter (new)" to "Add cuco::bloom_filter" Oct 1, 2024
@PointKernel (Member) left a comment

Great work, @sleeepyjack. Thank you!

@sleeepyjack merged commit 8c8590c into NVIDIA:dev Oct 2, 2024
19 checks passed
@sleeepyjack deleted the bloom-filter-new branch October 2, 2024 00:37
@sleeepyjack (Collaborator, Author) commented Oct 2, 2024

Forgot to post some final benchmark results:

# Benchmark Results

## bloom_filter_add_unique_size

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            1 |                 1 |   3.69% | 100.00% |    0.00% |     0.00% |    86.45% |      3x |     20x | 26.092 ms | 0.09% | 26.086 ms | 0.09% | 15.334G |     21x | 26.085 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            2 |                 1 |   3.69% | 100.00% |    0.00% |     0.00% |    86.45% |      3x |     20x | 26.090 ms | 0.05% | 26.085 ms | 0.04% | 15.335G |     21x | 26.085 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            4 |                 1 |   3.69% | 100.00% |    0.00% |     0.00% |    86.45% |      3x |     20x | 26.097 ms | 0.07% | 26.091 ms | 0.06% | 15.331G |     21x | 26.078 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            8 |                 1 |   3.70% | 100.00% |    0.00% |     0.00% |    86.43% |      3x |     20x | 26.097 ms | 0.09% | 26.091 ms | 0.09% | 15.331G |     21x | 26.079 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           16 |                 1 |   3.71% | 100.00% |    0.00% |     0.00% |    86.41% |      3x |     20x | 26.103 ms | 0.07% | 26.097 ms | 0.07% | 15.327G |     21x | 26.079 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           32 |           0.99997 |  12.81% | 100.00% |    0.00% |     0.00% |    85.57% |      3x |     19x | 26.896 ms | 0.05% | 26.890 ms | 0.05% | 14.875G |     20x | 26.876 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           64 |           0.99228 |  25.18% | 100.00% |    0.00% |     0.00% |    55.47% |      3x |     17x | 29.564 ms | 0.07% | 29.558 ms | 0.07% | 13.533G |     18x | 29.548 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          128 |           0.85593 |  29.53% | 100.00% |    0.00% |     0.00% |    41.13% |      3x |     16x | 31.391 ms | 0.04% | 31.385 ms | 0.04% | 12.745G |     17x | 31.375 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          256 |           0.47041 |  30.19% | 100.00% |    0.00% |     0.00% |    35.93% |      3x |     16x | 32.932 ms | 0.02% | 32.927 ms | 0.01% | 12.148G |     17x | 32.921 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          512 |           0.15322 |  30.11% | 100.00% |    0.00% |     0.00% |    33.68% |      3x |     15x | 33.999 ms | 0.03% | 33.994 ms | 0.02% | 11.767G |     16x | 33.986 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         1024 |          0.035256 |  29.98% | 100.00% |    0.00% |     0.00% |    32.62% |      3x |     15x | 34.610 ms | 0.02% | 34.604 ms | 0.02% | 11.559G |     16x | 34.601 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2048 |          0.012553 |  29.88% | 100.00% |    0.00% |     0.00% |    32.11% |      3x |     15x | 34.930 ms | 0.02% | 34.924 ms | 0.01% | 11.453G |     16x | 34.919 ms |

## bloom_filter_add_unique_hash

### [0] NVIDIA H100 80GB HBM3

| Key |     Hash      | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|---------------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | identity_hash |  U32 |             8 |       UNIQUE | 400000000 |         2000 |                 0 |  30.84% | 100.00% |    0.00% |     0.00% |    29.35% |      3x |     15x | 33.856 ms | 0.02% | 33.851 ms | 0.01% | 11.817G |     16x | 33.847 ms |
| I64 |     xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  29.89% | 100.00% |    0.00% |     0.00% |    32.12% |      3x |     15x | 34.927 ms | 0.02% | 34.921 ms | 0.02% | 11.454G |     16x | 34.915 ms |

## bloom_filter_add_unique_block_dim

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             1 |       UNIQUE | 400000000 |         2000 |           0.53373 |  33.84% |  50.00% |    0.00% |    16.59% |    27.51% |      3x |     17x | 30.846 ms | 0.02% | 30.841 ms | 0.01% | 12.970G |     18x | 30.837 ms |
| I64 | xxhash_64 |  U32 |             2 |       UNIQUE | 400000000 |         2000 |           0.78258 |  33.78% | 100.00% |    0.00% |     0.00% |    27.39% |      3x |     17x | 30.886 ms | 0.02% | 30.881 ms | 0.01% | 12.953G |     18x | 30.880 ms |
| I64 | xxhash_64 |  U32 |             4 |       UNIQUE | 400000000 |         2000 |           0.95274 |  33.51% | 100.00% |    0.00% |     0.00% |    27.23% |      3x |     17x | 31.140 ms | 0.02% | 31.135 ms | 0.01% | 12.847G |     18x | 31.134 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  29.89% | 100.00% |    0.00% |     0.00% |    32.13% |      3x |     15x | 34.920 ms | 0.02% | 34.914 ms | 0.02% | 11.457G |     16x | 34.910 ms |
| I64 | xxhash_64 |  U64 |             1 |       UNIQUE | 400000000 |         2000 |           0.78258 |  33.83% |  50.00% |    0.00% |    16.58% |    27.51% |      3x |     17x | 30.843 ms | 0.02% | 30.838 ms | 0.01% | 12.971G |     18x | 30.834 ms |
| I64 | xxhash_64 |  U64 |             2 |       UNIQUE | 400000000 |         2000 |           0.95274 |  33.79% | 100.00% |    0.00% |     0.00% |    27.39% |      3x |     17x | 30.891 ms | 0.02% | 30.885 ms | 0.01% | 12.951G |     18x | 30.884 ms |
| I64 | xxhash_64 |  U64 |             4 |       UNIQUE | 400000000 |         2000 |           0.17364 |  33.52% | 100.00% |    0.00% |     0.00% |    27.22% |      3x |     17x | 31.138 ms | 0.02% | 31.132 ms | 0.01% | 12.848G |     18x | 31.131 ms |
| I64 | xxhash_64 |  U64 |             8 |       UNIQUE | 400000000 |         2000 |        0.00027879 |  41.92% | 100.00% |    0.00% |     0.00% |    32.92% |      3x |     14x | 38.398 ms | 0.02% | 38.392 ms | 0.01% | 10.419G |     15x | 38.387 ms |

## bloom_filter_contains_unique_size

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            1 |                 1 |  21.05% |  55.56% |  100.00% |    47.39% |    72.33% |      3x |     98x |  5.130 ms | 0.19% |  5.125 ms | 0.15% | 78.056G |    102x |  5.120 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            2 |                 1 |  21.05% |  55.56% |  100.00% |    45.97% |    73.02% |      3x |     98x |  5.129 ms | 0.12% |  5.123 ms | 0.06% | 78.073G |    102x |  5.120 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            4 |                 1 |  21.04% |  55.56% |  100.00% |    45.33% |    73.32% |      3x |     98x |  5.130 ms | 0.12% |  5.124 ms | 0.05% | 78.058G |    102x |  5.121 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            8 |                 1 |  21.08% |  55.56% |  100.00% |    45.03% |    73.45% |      3x |     98x |  5.132 ms | 0.13% |  5.126 ms | 0.07% | 78.033G |    102x |  5.120 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           16 |                 1 |  22.74% |  55.56% |  100.00% |    44.88% |    72.28% |      3x |     98x |  5.141 ms | 0.13% |  5.135 ms | 0.07% | 77.890G |    102x |  5.127 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           32 |           0.99997 |  65.42% |  55.56% |  100.00% |    44.81% |    42.71% |      3x |     91x |  5.514 ms | 0.11% |  5.509 ms | 0.05% | 72.610G |     95x |  5.555 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           64 |           0.99228 |  70.30% |  55.56% |  100.00% |    44.72% |    18.46% |      3x |     55x |  9.176 ms | 0.07% |  9.170 ms | 0.03% | 43.620G |     57x |  9.159 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          128 |           0.85593 |  66.46% |  55.56% |  100.00% |    44.69% |     9.75% |      3x |     44x | 11.554 ms | 0.05% | 11.548 ms | 0.02% | 34.638G |     45x | 11.537 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          256 |           0.47041 |  64.21% |  55.56% |  100.00% |    44.67% |     9.20% |      3x |     40x | 12.800 ms | 0.05% | 12.794 ms | 0.01% | 31.264G |     41x | 12.785 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          512 |           0.15322 |  63.04% |  55.56% |  100.00% |    44.66% |    10.32% |      3x |     38x | 13.436 ms | 0.04% | 13.430 ms | 0.02% | 29.784G |     39x | 13.421 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         1024 |          0.035256 |  62.46% |  55.56% |  100.00% |    44.66% |    11.13% |      3x |     37x | 13.756 ms | 0.05% | 13.750 ms | 0.02% | 29.091G |     38x | 13.741 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2048 |          0.012553 |  62.19% |  55.56% |  100.00% |    44.66% |    11.59% |      3x |     36x | 13.916 ms | 0.04% | 13.911 ms | 0.02% | 28.755G |     37x | 13.903 ms |

## bloom_filter_contains_unique_hash

### [0] NVIDIA H100 80GB HBM3

| Key |     Hash      | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|---------------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | identity_hash |  U32 |             8 |       UNIQUE | 400000000 |         2000 |                 0 |  62.18% |  55.56% |  100.00% |    44.64% |    11.53% |      3x |     36x | 13.920 ms | 0.04% | 13.914 ms | 0.02% | 28.747G |     37x | 13.905 ms |
| I64 |     xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  62.15% |  55.56% |  100.00% |    44.64% |    11.51% |      3x |     36x | 13.913 ms | 0.04% | 13.907 ms | 0.02% | 28.762G |     37x | 13.898 ms |

## bloom_filter_contains_unique_block_dim

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             1 |       UNIQUE | 400000000 |         2000 |           0.53373 |  62.17% |  30.00% |  100.00% |     1.73% |    11.62% |      3x |     36x | 13.916 ms | 0.05% | 13.911 ms | 0.02% | 28.755G |     37x | 13.902 ms |
| I64 | xxhash_64 |  U32 |             2 |       UNIQUE | 400000000 |         2000 |           0.78258 |  62.17% |  40.00% |  100.00% |     1.74% |    11.61% |      3x |     36x | 13.913 ms | 0.05% | 13.908 ms | 0.02% | 28.761G |     37x | 13.899 ms |
| I64 | xxhash_64 |  U32 |             4 |       UNIQUE | 400000000 |         2000 |           0.95274 |  62.14% |  60.00% |  100.00% |     1.74% |    11.54% |      3x |     36x | 13.915 ms | 0.05% | 13.909 ms | 0.02% | 28.758G |     37x | 13.900 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  62.17% |  55.56% |  100.00% |    44.65% |    11.52% |      3x |     36x | 13.912 ms | 0.04% | 13.906 ms | 0.01% | 28.764G |     37x | 13.898 ms |
| I64 | xxhash_64 |  U64 |             1 |       UNIQUE | 400000000 |         2000 |           0.78258 |  62.18% |  40.00% |  100.00% |     1.74% |    11.62% |      3x |     36x | 13.916 ms | 0.05% | 13.910 ms | 0.03% | 28.756G |     37x | 13.903 ms |
| I64 | xxhash_64 |  U64 |             2 |       UNIQUE | 400000000 |         2000 |           0.95274 |  62.20% |  60.00% |  100.00% |     1.75% |    11.66% |      3x |     36x | 13.913 ms | 0.04% | 13.908 ms | 0.02% | 28.761G |     37x | 13.897 ms |
| I64 | xxhash_64 |  U64 |             4 |       UNIQUE | 400000000 |         2000 |           0.17364 |  62.14% |  55.56% |  100.00% |    44.63% |    11.64% |      3x |     36x | 13.913 ms | 0.04% | 13.907 ms | 0.01% | 28.762G |     37x | 13.898 ms |
| I64 | xxhash_64 |  U64 |             8 |       UNIQUE | 400000000 |         2000 |        0.00027879 |  60.65% |  52.94% |   50.00% |    47.42% |     8.38% |      3x |     36x | 14.264 ms | 0.04% | 14.258 ms | 0.01% | 28.054G |     37x | 14.249 ms |

Attached: cuco_bf_h100.csv (includes changes from #609)
