
Add cuco::bloom_filter #573

Merged · 58 commits · Oct 2, 2024

Conversation

@sleeepyjack (Collaborator) commented Aug 8, 2024

Supersedes #101

Implementation of a GPU "Blocked Bloom Filter".

This PR is an updated/optimized version of #101 and features the following improvements:

  • Incorporate the new library design
  • Improve performance by computing the key's bit pattern based on a single hash value instead of using a double hashing derivative

@sleeepyjack added the "type: feature request", "helps: rapids", and "In Progress" labels Aug 8, 2024
@sleeepyjack self-assigned this Aug 8, 2024
@sleeepyjack (Collaborator, Author)

CC @kkraus14

@sleeepyjack mentioned this pull request Aug 8, 2024
@sleeepyjack added the "Needs Review" label Aug 8, 2024
@kkraus14 commented Aug 8, 2024

Thanks for throwing this up @sleeepyjack! I've pinged some folks on my side to take a pass at reviewing the host and device APIs as well as general functionality here in order to provide some high level feedback as a starting point.

@felipeblazing

Thanks for working on this. I know that we are really excited at the prospect of being able to use this to accelerate some of our workloads. I want to describe some of the ways we were hoping to use this.

  1. Test if sets are disjoint in a distributed context.

We would like to be able to serialize and deserialize the underlying data of the Bloom filter so we can send it across the network, preferably in a way that doesn't require a copy of the structure, since we use libraries like UCX that can send directly from the GPU. Not being able to do this would probably make this unusable for our distributed use case.

Another thing worth considering, in particular for testing whether sets are disjoint, is partitioned Bloom filters, as described in the papers "Understanding Bloom Filter Intersection for Lazy Address-Set Disambiguation" and "A Case for Partitioned Bloom Filters".

You might also consider a contains API that doesn't write output to an iterator but rather returns a boolean indicating whether or not the Bloom filter contains an element from the input being offered, and an API that can take two Bloom filters and test if the sets that produced them are disjoint, something like `bloom_filter.contains(other_bloom_filter)`.

  2. Use bloom filters on large distributed joins to filter out rows from the tables being joined

Say you want to join two tables. One is 1 TB, the other is 10 GB. We would like to be able to build Bloom filters from subsets of the 10 GB table and then use those filters to test whether subsets of the 1 TB table are disjoint with those subsets of the 10 GB table.

There are two features we could benefit from to help enable this work.

One is that we build Bloom filters for all subsets of the tables (these might be distributed across many nodes) and then shuffle them around to see which subset combinations can be ruled out for joining. So in this case it's the same API I mentioned above, where you can see if the inputs of two Bloom filters were disjoint.

The other is that when they are possibly not disjoint we could then apply the bloom filter row by row with APIs that seem to be already available in this PR.

  3. Be able to make bloom filters from multiple columns

Say someone is joining two tables on two columns, e.g. a.x = b.x and a.y = b.y.

The Bloom filter we would want to build in this case is one where the input is a combination of x and y, rather than having to make two Bloom filters, one for each column. This would do a better job of lowering the false positive rate than testing them separately.

@sleeepyjack (Collaborator, Author)

Thanks for the valuable insights, @felipeblazing!

I can address some of the points right away:

We would like to be able to serialize and deserialize the underlying data of the bloom filter to be able to send it across the network, preferably something that doesn't require a copy of the structure

This should be doable. We did a similar thing for our distinct_count_estimator aka HyperLogLog++ implementation. tl;dr we provide access to the underlying memory span so it can be exported and serialized. Deserialization can be done by constructing a new _ref object over the raw byte vector. Here is an example of how this works with our HLL implementation.
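The span-based scheme described above can be sketched generically on the host (all names here are hypothetical illustrations, not cuco's actual API): expose the filter's word storage, ship the raw bytes, and rebuild a non-owning ref over them on the receiving side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical non-owning view over a filter's word storage.
struct filter_ref {
  std::uint32_t* words;  // not owned; lifetime managed by the caller
  std::size_t    size;   // number of words
};

// Export the underlying storage as raw bytes (e.g. for network transfer).
std::vector<std::uint8_t> serialize(filter_ref f) {
  std::vector<std::uint8_t> bytes(f.size * sizeof(std::uint32_t));
  std::memcpy(bytes.data(), f.words, bytes.size());
  return bytes;
}

// Reconstruct a ref over a received byte buffer without copying the words.
filter_ref deserialize(std::vector<std::uint8_t>& bytes) {
  return {reinterpret_cast<std::uint32_t*>(bytes.data()),
          bytes.size() / sizeof(std::uint32_t)};
}
```

In the GPU case the byte span would live in device memory and could be handed to UCX directly; the principle is the same.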

Another thing that might be considered is that in particular for testing if sets are disjoint is considering using partitioned bloom filters

Thanks for sharing those papers. I don't have access permissions so I requested them. From my rough understanding, a partitioned Bloom filter stores the signature for each key in k different memory locations that are likely to be far apart from each other. I have some concerns if this scales well on the GPU. With the current blocked Bloom filter approach we aim to minimize the number of sector loads/stores as much as possible (in the best case the entire fingerprint of a key falls into a single sector or even word), since the data structure is strictly bottlenecked by random access performance of the memory system. I will dive through the papers to see if there's something we can do to allow for distributed workloads.

You might also consider a contains API that doesn't write output to an iterator but rather returns a boolean indicating whether or not the Bloom filter contains an element from the input being offered, and an API that can take two Bloom filters and test if the sets that produced them are disjoint, something like `bloom_filter.contains(other_bloom_filter)`.

Yep, this one should be easy to implement. Naming-wise I would go with something like contains_any(…) but that's just an idea.
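A conservative filter-vs-filter check along these lines is straightforward in principle: a key added to both filters (built with identical parameters) sets the same bits in both, so an all-zero bitwise AND of the two bit arrays proves the underlying sets are disjoint. A minimal host-side toy sketch (hypothetical names and parameters, not the cuco API):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// splitmix64 finalizer, used here as a stand-in hash function.
constexpr std::uint64_t mix(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ull;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
  return x ^ (x >> 31);
}

// Toy blocked filter: one 64-bit word per block, two pattern bits per key.
struct toy_filter {
  std::array<std::uint64_t, 64> words{};

  void add(std::uint64_t key) {
    auto const h = mix(key);
    auto& w = words[h % words.size()];
    w |= 1ull << (h >> 6 & 63);
    w |= 1ull << (h >> 12 & 63);
  }

  // Conservative disjointness test: a key added to both filters sets the
  // same bits in both, so an all-zero AND of the bit arrays proves the
  // underlying sets are disjoint; a non-zero AND means "possibly not".
  bool maybe_intersects(toy_filter const& other) const {
    for (std::size_t i = 0; i < words.size(); ++i)
      if (words[i] & other.words[i]) return true;
    return false;
  }
};
```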

Be able to make bloom filters from multiple columns

I think this would already work by customizing the hash function similar to what cudf does with their row_hasher. You could combine the hash values for each individual column like so: h_a = hash(a.x)^hash(a.y). If a.x = b.x and a.y = b.y then h_a = h_b. Although I think cudf's row hasher does something smarter than that.

@bdice (Contributor) commented Aug 12, 2024

Although I think cudf's row hasher does something smarter than that.

cuDF has 32-bit and 64-bit hash_combine implementations. Usually, the crucial thing here is that you need a non-commutative function so that combine(left, right) and combine(right, left) give different results.
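For illustration, a Boost-style 32-bit combiner (not necessarily cuDF's exact mixer) shows the property: the running seed enters the expression asymmetrically, so operand order matters, unlike a plain XOR.

```cpp
#include <cassert>
#include <cstdint>

// Boost-style hash_combine: the seed is both XORed and fed back shifted,
// so combine(a, b) != combine(b, a) in general.
constexpr std::uint32_t hash_combine(std::uint32_t seed, std::uint32_t h) {
  return seed ^ (h + 0x9e3779b9u + (seed << 6) + (seed >> 2));
}
```

A row hash over columns x and y could then be `hash_combine(hash_combine(0, hash(x)), hash(y))`, keeping column order significant.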

@jbapple commented Aug 13, 2024

Another thing that might be considered is that in particular for testing if sets are disjoint is considering using partitioned bloom filters as described in these papers Understanding Bloom Filter Intersection for Lazy Address-Set Disambiguation A Case for Partitioned Bloom Filters

Hi @felipeblazing! This looks to me, based on the code as well as the PR message, like it is based on Apache Impala's Bloom filters, which in turn infiltrated Apache Arrow, Apache Parquet, and so on - at least as long as pattern_bits_ is no more than window_size. I mention this because the Impala filters actually are partitioned Bloom filters - but partitioned within a small contiguous block, with each partition being the size of a single word. (See also Section 3 of Performance-Optimal Filtering: Bloom Overtakes Cuckoo at High Throughput, which calls this partitioning within a block "sectorization").

On the other hand, when pattern_bits_ is greater than window_size (which looks like the default, since window_size defaults to 1), this is only partially sectorized/partitioned - multiple bits will be set in some or all partitions. For a window_size of 1, this effectively means no partitioning.

For SIMD on CPU, sectorized/partitioned with pattern_bits_ == window_size == 8 was the right choice for Impala's performance, as it then accessed only 256 bits (thus fitting within one cache line), but I'm guessing that window_size == 1 has better performance for Nvidia GPUs, based on much smaller (32-bit?) cache lines (or the GPU equivalent)? I don't know enough about GPU hardware yet to know that. However, if more cache locality is available, the extent to which the filter is sectorized/partitioned can be increased by increasing window_size while keeping pattern_bits_ the same. This would also reduce the false positive probability without increasing the total filter size.
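The relationship between pattern_bits_ and window_size can be illustrated with a small host-side sketch (parameter names mirror this discussion, not cuco's internals): when pattern_bits == window_size each word of the block receives exactly one bit (fully sectorized), while window_size == 1 collapses all pattern bits into a single word (no partitioning).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Distribute `pattern_bits` bits of a key's hash over a block of
// `window_size` 32-bit words, round-robin across the words.
std::vector<std::uint32_t> block_pattern(std::uint64_t hash,
                                         int pattern_bits,
                                         int window_size) {
  std::vector<std::uint32_t> block(window_size, 0);
  for (int i = 0; i < pattern_bits; ++i) {
    block[i % window_size] |= 1u << (hash & 31);  // pick a bit in [0, 32)
    hash = hash >> 5 | hash << 59;                // consume 5 hash bits
  }
  return block;
}
```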

@sleeepyjack (Collaborator, Author)

Impala filters actually are partitioned Bloom filters - but partitioned within a small contiguous block, with each partition being the size of a single word.

Very interesting point! Yes, you could apply the same approach to the blocked (or sectorized) Bloom filter in this PR.

For SIMD on CPU, sectorized/partitioned with a pattern_bits_ == window_size == 8 was the right choice for Impala's performance, as it accessed then only 256 bits (thus fitting within one cache line), but I'm guessing that window_size == 1 has better performance for Nvidia GPUs, based on much smaller (32 bit?) cache lines (or the the GPU equivalent)?

GPUs follow a similar principle, where the GPU's cache line size is either 32 bytes (a sector) or 4×32 bytes, aka an L2 slice, depending on how you look at it. In 9332c9a I was able to fix some performance issues for when window_size > 1. I owe you folks some new benchmarks, but my initial observation was that with this fix, performance stays more or less constant (close to SOL GUPS throughput) up to the point where window_size exceeds the size of a single sector. So yeah, having a partitioned/sectorized filter with good performance is totally feasible with the current design.

@jbapple commented Aug 14, 2024

…performance stays more or less constant (close to SOL GUPS throughput) up to the point where the window_size exceeds the size of a single sector.

Would it make sense, then, for window_size to be min(8, pattern_bits_) by default as long as pattern_bits_ is a power of 2 (since sizeof(word_type) == 4 and a sector is 8 * 4 = 32 bytes?), rather than 1 by default?

With some other small changes, this could also support using the Bloom filters from Parquet, KVRocks, and Impala (and forks Doris and StarRocks) without having to read the input keys and re-encode a new Bloom filter. I can imagine this might be of use mostly in the Parquet case.

PointKernel added a commit that referenced this pull request Sep 18, 2024

Address #573 (comment)

This PR updates the existing code to use `cuda::std::byte` as the
default device data type instead of `std::byte`. This change addresses
potential issues where `cuda::std::` utilities cannot be applied to
`std::byte` when relaxed constexprs are disabled.
```cpp
class Extent = cuco::extent<std::size_t>,
cuda::thread_scope Scope = cuda::thread_scope_device,
class Policy =
  cuco::bloom_filter_policy<cuco::xxhash_64<Key>, cuda::std::array<std::uint32_t, 8>>,
```
Member

probably worth adding a strong type instead of using plain cuda::std::array

Collaborator Author

Hmm, I also thought about this. The information needed here is the word type and block size (determined by typename Block::value_type and cuda::std::tuple_size_v<Block>, i.e., standardized facilities to describe a container's value type and static size). So technically any container that follows this concept can be used to describe the filter block parameters.

This type has a single use in cuco and since the Block type itself is actually never used, i.e., we need it only to extract the block parameters, I would rather vote for replacing the Block tparam in the policy with bloom_filter_policy<class Hash, class Word, uint32_t WordsPerBlock>. WDYT?
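The proposed policy shape might look roughly like this (a sketch of the suggestion only, not cuco's final API):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <type_traits>

// Sketch: the Block container tparam is replaced by the two parameters it
// was only ever used to extract, the word type and the block size.
template <class Hash, class Word, std::uint32_t WordsPerBlock>
struct bloom_filter_policy {
  static_assert(std::is_unsigned_v<Word>, "Word must be an unsigned integer type");
  using hasher    = Hash;
  using word_type = Word;
  static constexpr std::uint32_t words_per_block = WordsPerBlock;
};
```

This keeps the same information as `cuda::std::array<std::uint32_t, 8>` (word type and static size) without ever instantiating a container.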

@PointKernel changed the title from "cuco::bloom_filter (new)" to "Add cuco::bloom_filter" Oct 1, 2024
@PointKernel (Member) left a comment

Great work, @sleeepyjack. Thank you!

@sleeepyjack merged commit 8c8590c into NVIDIA:dev Oct 2, 2024
19 checks passed
@sleeepyjack deleted the bloom-filter-new branch October 2, 2024 00:37
@sleeepyjack (Collaborator, Author) commented Oct 2, 2024

Forgot to post some final benchmark results:

# Benchmark Results

## bloom_filter_add_unique_size

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            1 |                 1 |   3.69% | 100.00% |    0.00% |     0.00% |    86.45% |      3x |     20x | 26.092 ms | 0.09% | 26.086 ms | 0.09% | 15.334G |     21x | 26.085 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            2 |                 1 |   3.69% | 100.00% |    0.00% |     0.00% |    86.45% |      3x |     20x | 26.090 ms | 0.05% | 26.085 ms | 0.04% | 15.335G |     21x | 26.085 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            4 |                 1 |   3.69% | 100.00% |    0.00% |     0.00% |    86.45% |      3x |     20x | 26.097 ms | 0.07% | 26.091 ms | 0.06% | 15.331G |     21x | 26.078 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            8 |                 1 |   3.70% | 100.00% |    0.00% |     0.00% |    86.43% |      3x |     20x | 26.097 ms | 0.09% | 26.091 ms | 0.09% | 15.331G |     21x | 26.079 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           16 |                 1 |   3.71% | 100.00% |    0.00% |     0.00% |    86.41% |      3x |     20x | 26.103 ms | 0.07% | 26.097 ms | 0.07% | 15.327G |     21x | 26.079 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           32 |           0.99997 |  12.81% | 100.00% |    0.00% |     0.00% |    85.57% |      3x |     19x | 26.896 ms | 0.05% | 26.890 ms | 0.05% | 14.875G |     20x | 26.876 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           64 |           0.99228 |  25.18% | 100.00% |    0.00% |     0.00% |    55.47% |      3x |     17x | 29.564 ms | 0.07% | 29.558 ms | 0.07% | 13.533G |     18x | 29.548 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          128 |           0.85593 |  29.53% | 100.00% |    0.00% |     0.00% |    41.13% |      3x |     16x | 31.391 ms | 0.04% | 31.385 ms | 0.04% | 12.745G |     17x | 31.375 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          256 |           0.47041 |  30.19% | 100.00% |    0.00% |     0.00% |    35.93% |      3x |     16x | 32.932 ms | 0.02% | 32.927 ms | 0.01% | 12.148G |     17x | 32.921 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          512 |           0.15322 |  30.11% | 100.00% |    0.00% |     0.00% |    33.68% |      3x |     15x | 33.999 ms | 0.03% | 33.994 ms | 0.02% | 11.767G |     16x | 33.986 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         1024 |          0.035256 |  29.98% | 100.00% |    0.00% |     0.00% |    32.62% |      3x |     15x | 34.610 ms | 0.02% | 34.604 ms | 0.02% | 11.559G |     16x | 34.601 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2048 |          0.012553 |  29.88% | 100.00% |    0.00% |     0.00% |    32.11% |      3x |     15x | 34.930 ms | 0.02% | 34.924 ms | 0.01% | 11.453G |     16x | 34.919 ms |

## bloom_filter_add_unique_hash

### [0] NVIDIA H100 80GB HBM3

| Key |     Hash      | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|---------------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | identity_hash |  U32 |             8 |       UNIQUE | 400000000 |         2000 |                 0 |  30.84% | 100.00% |    0.00% |     0.00% |    29.35% |      3x |     15x | 33.856 ms | 0.02% | 33.851 ms | 0.01% | 11.817G |     16x | 33.847 ms |
| I64 |     xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  29.89% | 100.00% |    0.00% |     0.00% |    32.12% |      3x |     15x | 34.927 ms | 0.02% | 34.921 ms | 0.02% | 11.454G |     16x | 34.915 ms |

## bloom_filter_add_unique_block_dim

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             1 |       UNIQUE | 400000000 |         2000 |           0.53373 |  33.84% |  50.00% |    0.00% |    16.59% |    27.51% |      3x |     17x | 30.846 ms | 0.02% | 30.841 ms | 0.01% | 12.970G |     18x | 30.837 ms |
| I64 | xxhash_64 |  U32 |             2 |       UNIQUE | 400000000 |         2000 |           0.78258 |  33.78% | 100.00% |    0.00% |     0.00% |    27.39% |      3x |     17x | 30.886 ms | 0.02% | 30.881 ms | 0.01% | 12.953G |     18x | 30.880 ms |
| I64 | xxhash_64 |  U32 |             4 |       UNIQUE | 400000000 |         2000 |           0.95274 |  33.51% | 100.00% |    0.00% |     0.00% |    27.23% |      3x |     17x | 31.140 ms | 0.02% | 31.135 ms | 0.01% | 12.847G |     18x | 31.134 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  29.89% | 100.00% |    0.00% |     0.00% |    32.13% |      3x |     15x | 34.920 ms | 0.02% | 34.914 ms | 0.02% | 11.457G |     16x | 34.910 ms |
| I64 | xxhash_64 |  U64 |             1 |       UNIQUE | 400000000 |         2000 |           0.78258 |  33.83% |  50.00% |    0.00% |    16.58% |    27.51% |      3x |     17x | 30.843 ms | 0.02% | 30.838 ms | 0.01% | 12.971G |     18x | 30.834 ms |
| I64 | xxhash_64 |  U64 |             2 |       UNIQUE | 400000000 |         2000 |           0.95274 |  33.79% | 100.00% |    0.00% |     0.00% |    27.39% |      3x |     17x | 30.891 ms | 0.02% | 30.885 ms | 0.01% | 12.951G |     18x | 30.884 ms |
| I64 | xxhash_64 |  U64 |             4 |       UNIQUE | 400000000 |         2000 |           0.17364 |  33.52% | 100.00% |    0.00% |     0.00% |    27.22% |      3x |     17x | 31.138 ms | 0.02% | 31.132 ms | 0.01% | 12.848G |     18x | 31.131 ms |
| I64 | xxhash_64 |  U64 |             8 |       UNIQUE | 400000000 |         2000 |        0.00027879 |  41.92% | 100.00% |    0.00% |     0.00% |    32.92% |      3x |     14x | 38.398 ms | 0.02% | 38.392 ms | 0.01% | 10.419G |     15x | 38.387 ms |

## bloom_filter_contains_unique_size

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            1 |                 1 |  21.05% |  55.56% |  100.00% |    47.39% |    72.33% |      3x |     98x |  5.130 ms | 0.19% |  5.125 ms | 0.15% | 78.056G |    102x |  5.120 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            2 |                 1 |  21.05% |  55.56% |  100.00% |    45.97% |    73.02% |      3x |     98x |  5.129 ms | 0.12% |  5.123 ms | 0.06% | 78.073G |    102x |  5.120 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            4 |                 1 |  21.04% |  55.56% |  100.00% |    45.33% |    73.32% |      3x |     98x |  5.130 ms | 0.12% |  5.124 ms | 0.05% | 78.058G |    102x |  5.121 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |            8 |                 1 |  21.08% |  55.56% |  100.00% |    45.03% |    73.45% |      3x |     98x |  5.132 ms | 0.13% |  5.126 ms | 0.07% | 78.033G |    102x |  5.120 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           16 |                 1 |  22.74% |  55.56% |  100.00% |    44.88% |    72.28% |      3x |     98x |  5.141 ms | 0.13% |  5.135 ms | 0.07% | 77.890G |    102x |  5.127 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           32 |           0.99997 |  65.42% |  55.56% |  100.00% |    44.81% |    42.71% |      3x |     91x |  5.514 ms | 0.11% |  5.509 ms | 0.05% | 72.610G |     95x |  5.555 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |           64 |           0.99228 |  70.30% |  55.56% |  100.00% |    44.72% |    18.46% |      3x |     55x |  9.176 ms | 0.07% |  9.170 ms | 0.03% | 43.620G |     57x |  9.159 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          128 |           0.85593 |  66.46% |  55.56% |  100.00% |    44.69% |     9.75% |      3x |     44x | 11.554 ms | 0.05% | 11.548 ms | 0.02% | 34.638G |     45x | 11.537 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          256 |           0.47041 |  64.21% |  55.56% |  100.00% |    44.67% |     9.20% |      3x |     40x | 12.800 ms | 0.05% | 12.794 ms | 0.01% | 31.264G |     41x | 12.785 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |          512 |           0.15322 |  63.04% |  55.56% |  100.00% |    44.66% |    10.32% |      3x |     38x | 13.436 ms | 0.04% | 13.430 ms | 0.02% | 29.784G |     39x | 13.421 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         1024 |          0.035256 |  62.46% |  55.56% |  100.00% |    44.66% |    11.13% |      3x |     37x | 13.756 ms | 0.05% | 13.750 ms | 0.02% | 29.091G |     38x | 13.741 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2048 |          0.012553 |  62.19% |  55.56% |  100.00% |    44.66% |    11.59% |      3x |     36x | 13.916 ms | 0.04% | 13.911 ms | 0.02% | 28.755G |     37x | 13.903 ms |

## bloom_filter_contains_unique_hash

### [0] NVIDIA H100 80GB HBM3

| Key |     Hash      | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|---------------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | identity_hash |  U32 |             8 |       UNIQUE | 400000000 |         2000 |                 0 |  62.18% |  55.56% |  100.00% |    44.64% |    11.53% |      3x |     36x | 13.920 ms | 0.04% | 13.914 ms | 0.02% | 28.747G |     37x | 13.905 ms |
| I64 |     xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  62.15% |  55.56% |  100.00% |    44.64% |    11.51% |      3x |     36x | 13.913 ms | 0.04% | 13.907 ms | 0.02% | 28.762G |     37x | 13.898 ms |

## bloom_filter_contains_unique_block_dim

### [0] NVIDIA H100 80GB HBM3

| Key |   Hash    | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 |  U32 |             1 |       UNIQUE | 400000000 |         2000 |           0.53373 |  62.17% |  30.00% |  100.00% |     1.73% |    11.62% |      3x |     36x | 13.916 ms | 0.05% | 13.911 ms | 0.02% | 28.755G |     37x | 13.902 ms |
| I64 | xxhash_64 |  U32 |             2 |       UNIQUE | 400000000 |         2000 |           0.78258 |  62.17% |  40.00% |  100.00% |     1.74% |    11.61% |      3x |     36x | 13.913 ms | 0.05% | 13.908 ms | 0.02% | 28.761G |     37x | 13.899 ms |
| I64 | xxhash_64 |  U32 |             4 |       UNIQUE | 400000000 |         2000 |           0.95274 |  62.14% |  60.00% |  100.00% |     1.74% |    11.54% |      3x |     36x | 13.915 ms | 0.05% | 13.909 ms | 0.02% | 28.758G |     37x | 13.900 ms |
| I64 | xxhash_64 |  U32 |             8 |       UNIQUE | 400000000 |         2000 |         0.0017157 |  62.17% |  55.56% |  100.00% |    44.65% |    11.52% |      3x |     36x | 13.912 ms | 0.04% | 13.906 ms | 0.01% | 28.764G |     37x | 13.898 ms |
| I64 | xxhash_64 |  U64 |             1 |       UNIQUE | 400000000 |         2000 |           0.78258 |  62.18% |  40.00% |  100.00% |     1.74% |    11.62% |      3x |     36x | 13.916 ms | 0.05% | 13.910 ms | 0.03% | 28.756G |     37x | 13.903 ms |
| I64 | xxhash_64 |  U64 |             2 |       UNIQUE | 400000000 |         2000 |           0.95274 |  62.20% |  60.00% |  100.00% |     1.75% |    11.66% |      3x |     36x | 13.913 ms | 0.04% | 13.908 ms | 0.02% | 28.761G |     37x | 13.897 ms |
| I64 | xxhash_64 |  U64 |             4 |       UNIQUE | 400000000 |         2000 |           0.17364 |  62.14% |  55.56% |  100.00% |    44.63% |    11.64% |      3x |     36x | 13.913 ms | 0.04% | 13.907 ms | 0.01% | 28.762G |     37x | 13.898 ms |
| I64 | xxhash_64 |  U64 |             8 |       UNIQUE | 400000000 |         2000 |        0.00027879 |  60.65% |  52.94% |   50.00% |    47.42% |     8.38% |      3x |     36x | 14.264 ms | 0.04% | 14.258 ms | 0.01% | 28.054G |     37x | 14.249 ms |

Attached: cuco_bf_h100.csv (includes changes from #609)
