Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cuco::bloom_filter #573

Merged
merged 58 commits into from
Oct 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
58 commits
Select commit Hold shift + click to select a range
3d11031
First draft of new version
sleeepyjack Aug 8, 2024
b5c4570
Merge remote-tracking branch 'upstream/dev' into bloom-filter-new
sleeepyjack Aug 8, 2024
ad43a1f
Fix merge conflict
sleeepyjack Aug 8, 2024
3cb68fa
Use cub::Device::ForEachCopyN in add_async
sleeepyjack Aug 8, 2024
7d836b9
Fix formatting
sleeepyjack Aug 8, 2024
a483bf0
Fix contains_if_n kernel
sleeepyjack Aug 8, 2024
2728d25
Add unit test
sleeepyjack Aug 8, 2024
d0359ee
Make helper functions private
sleeepyjack Aug 8, 2024
dc97ae3
Fix sub_filter_idx
sleeepyjack Aug 9, 2024
9332c9a
Add cooperative add(...)
sleeepyjack Aug 10, 2024
21fa5b0
Use scalar add(...) for window_size=1
sleeepyjack Aug 10, 2024
694ea0c
Apply PIMPL
sleeepyjack Aug 27, 2024
8db2ee1
Add word type to benchmark type axes
sleeepyjack Aug 27, 2024
36b1532
Add public accessors
sleeepyjack Aug 27, 2024
26ab9f4
Enable single-threaded device add
sleeepyjack Aug 27, 2024
1af54b3
Remove unused code section
sleeepyjack Aug 27, 2024
1c36b9f
Refine benchmarks
sleeepyjack Aug 28, 2024
cb7fef6
Change default block dim and refine benchmarks v2
sleeepyjack Aug 28, 2024
eb9b17a
Adjust benchmark input size
sleeepyjack Aug 28, 2024
fad4f93
Use block comment
sleeepyjack Aug 28, 2024
2e873c3
Add bloom_filter docs
sleeepyjack Aug 29, 2024
0542937
Add bloom_filter_ref docs
sleeepyjack Aug 29, 2024
8c69bb0
Expose pattern_bits accessor
sleeepyjack Aug 29, 2024
c7cd218
Fix some more missing docs
sleeepyjack Aug 29, 2024
2713345
Adjust benchmark input size
sleeepyjack Aug 29, 2024
a87cbd2
Merge remote-tracking branch 'upstream/dev' into bloom-filter-new
sleeepyjack Aug 29, 2024
5b06a01
Use array-like type to specify filter block config
sleeepyjack Aug 29, 2024
07504db
Merge remote-tracking branch 'upstream/dev' into bloom-filter-new
sleeepyjack Aug 29, 2024
14e0bc3
Doc fixup
sleeepyjack Aug 29, 2024
fa1bf3f
Fix docs
sleeepyjack Aug 30, 2024
4c8177c
More doc fixes
sleeepyjack Aug 30, 2024
0689921
Use constexpr vars in example
sleeepyjack Sep 16, 2024
a7cc389
Use cuda::std::byte
sleeepyjack Sep 16, 2024
af97e2c
Rename test->contains
sleeepyjack Sep 16, 2024
5b3fe40
Implement policy concept and address review comments
sleeepyjack Sep 18, 2024
f53cfad
Merge remote-tracking branch 'upstream/dev' into bloom-filter-new
sleeepyjack Sep 18, 2024
89fe182
Rename bloom_filter_policy -> default_filter_policy
sleeepyjack Sep 18, 2024
8527e01
Throw error on invalid policy configuration
sleeepyjack Sep 18, 2024
6d85a53
Clarify docs
sleeepyjack Sep 18, 2024
936f8fe
Use explicit return type
sleeepyjack Sep 18, 2024
d2f04ff
Set maximum required alignment to 16 bytes
sleeepyjack Sep 18, 2024
21c374a
Add adaptive CG add/contains
sleeepyjack Sep 26, 2024
695490e
Improve error handling
sleeepyjack Sep 26, 2024
24fcaab
constexpr all the things (where possible)
sleeepyjack Sep 26, 2024
40dd683
Rename default_filter_policy -> bloom_filter_policy
sleeepyjack Sep 26, 2024
8fc4b00
Add bloom_filter section to README
sleeepyjack Sep 26, 2024
591df5a
Merge remote-tracking branch 'upstream/dev' into bloom-filter-new
sleeepyjack Sep 26, 2024
446e6bb
Fix docs
sleeepyjack Sep 26, 2024
d8c21fa
Fix docs
sleeepyjack Sep 30, 2024
8770a42
Split benchmarks into multiple TUs
sleeepyjack Sep 30, 2024
1f6223f
Fix docs
sleeepyjack Sep 30, 2024
251711d
Refactor design to specify filter block params
sleeepyjack Oct 1, 2024
b6ec260
Revise test includes
sleeepyjack Oct 1, 2024
70de880
Make tile_size_v device-accessible only
sleeepyjack Oct 1, 2024
d11c050
Merge remote-tracking branch 'upstream/dev' into bloom-filter-new
sleeepyjack Oct 1, 2024
85ff60b
Clarify docs regarding heterogeneous lookup
sleeepyjack Oct 1, 2024
95a0221
Fix docs regarding stream description
sleeepyjack Oct 1, 2024
a99e7bc
Remove unused include
sleeepyjack Oct 1, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -242,4 +242,11 @@ We plan to add many GPU-accelerated, concurrent data structures to `cuCollection
- [Host-bulk APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/hyperloglog/host_bulk_example.cu) (see [live example in godbolt](https://godbolt.org/z/G4qdcTezE))
- [Device-ref APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/hyperloglog/device_ref_example.cu) (see [live example in godbolt](https://godbolt.org/z/n88713o4n))

### `bloom_filter`

`cuco::bloom_filter` implements a Blocked Bloom Filter for approximate set membership queries.

#### Examples:
- [Host-bulk APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/bloom_filter/host_bulk_example.cu) (see [live example in godbolt](https://godbolt.org/z/EY7T5v5aE))


6 changes: 6 additions & 0 deletions benchmarks/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -98,3 +98,9 @@ ConfigureBench(HASH_FUNCTION_BENCH
# - hyperloglog benchmarks -----------------------------------------------------------
ConfigureBench(HYPERLOGLOG_BENCH
hyperloglog/hyperloglog_bench.cu)

###################################################################################################
# - bloom_filter benchmarks -----------------------------------------------------------------------
ConfigureBench(BLOOM_FILTER_BENCH
bloom_filter/add_bench.cu
bloom_filter/contains_bench.cu)
8 changes: 8 additions & 0 deletions benchmarks/benchmark_defaults.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@

#pragma once

#include <cuco/hash_functions.cuh>

#include <nvbench/nvbench.cuh>

#include <cstdint>
Expand All @@ -25,6 +27,12 @@ namespace cuco::benchmark::defaults {

using KEY_TYPE_RANGE = nvbench::type_list<nvbench::int32_t, nvbench::int64_t>;
using VALUE_TYPE_RANGE = nvbench::type_list<nvbench::int32_t, nvbench::int64_t>;
using HASH_RANGE = nvbench::type_list<cuco::identity_hash<char>,
cuco::xxhash_32<char>,
cuco::xxhash_64<char>,
cuco::murmurhash3_32<char>>; //,
// cuco::murmurhash3_x86_128<char>,
// cuco::murmurhash3_x64_128<char>>; // TODO handle tuple-like hash value
sleeepyjack marked this conversation as resolved.
Show resolved Hide resolved

auto constexpr N = 100'000'000;
auto constexpr OCCUPANCY = 0.5;
Expand Down
11 changes: 11 additions & 0 deletions benchmarks/benchmark_utils.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,17 @@ auto dist_from_state(nvbench::state const& state)
}
}

template <typename T, typename NewType>
struct rebind_hasher;

template <template <typename> class Template, typename OldType, typename NewType>
struct rebind_hasher<Template<OldType>, NewType> {
using type = Template<NewType>;
};

template <typename T, typename NewType>
using rebind_hasher_t = typename rebind_hasher<T, NewType>::type;

} // namespace cuco::benchmark

NVBENCH_DECLARE_TYPE_STRINGS(cuco::utility::distribution::unique, "UNIQUE", "distribution::unique");
Expand Down
117 changes: 117 additions & 0 deletions benchmarks/bloom_filter/add_bench.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "defaults.hpp"
#include "utils.hpp"

#include <benchmark_defaults.hpp>
#include <benchmark_utils.hpp>

#include <cuco/bloom_filter.cuh>
#include <cuco/utility/key_generator.cuh>

#include <nvbench/nvbench.cuh>

#include <cuda/std/limits>
#include <thrust/device_vector.h>

#include <cstdint>
#include <exception>

using namespace cuco::benchmark; // defaults, dist_from_state, rebind_hasher_t, add_fpr_summary
using namespace cuco::utility; // key_generator, distribution

/**
* @brief A benchmark evaluating `cuco::bloom_filter::add_async` performance
*/
template <typename Key, typename Hash, typename Word, nvbench::int32_t WordsPerBlock, typename Dist>
void bloom_filter_add(nvbench::state& state,
nvbench::type_list<Key, Hash, Word, nvbench::enum_type<WordsPerBlock>, Dist>)
{
using policy_type = cuco::bloom_filter_policy<rebind_hasher_t<Hash, Key>,
Word,
static_cast<std::uint32_t>(WordsPerBlock)>;
using filter_type =
cuco::bloom_filter<Key, cuco::extent<size_t>, cuda::thread_scope_device, policy_type>;

auto const num_keys = state.get_int64("NumInputs");
auto const filter_size_mb = state.get_int64("FilterSizeMB");
auto const pattern_bits = state.get_int64("PatternBits");

try {
auto const policy = policy_type{static_cast<uint32_t>(pattern_bits)};
} catch (std::exception const& e) {
state.skip(e.what()); // skip invalid configurations
}

std::size_t const num_sub_filters =
(filter_size_mb * 1024 * 1024) /
(sizeof(typename filter_type::word_type) * filter_type::words_per_block);

thrust::device_vector<Key> keys(num_keys);

key_generator gen;
gen.generate(dist_from_state<Dist>(state), keys.begin(), keys.end());

state.add_element_count(num_keys);

filter_type filter{num_sub_filters, {}, {static_cast<uint32_t>(pattern_bits)}};

add_fpr_summary(state, filter);

state.exec([&](nvbench::launch& launch) {
filter.add_async(keys.begin(), keys.end(), {launch.get_stream()});
});
}

NVBENCH_BENCH_TYPES(bloom_filter_add,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
nvbench::type_list<defaults::BF_HASH>,
nvbench::type_list<defaults::BF_WORD>,
nvbench::enum_type_list<defaults::BF_WORDS_PER_BLOCK>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_add_unique_size")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.set_max_noise(defaults::MAX_NOISE)
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE)
.add_int64_axis("PatternBits", {defaults::BF_PATTERN_BITS});

NVBENCH_BENCH_TYPES(bloom_filter_add,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
defaults::HASH_RANGE,
nvbench::type_list<defaults::BF_WORD>,
nvbench::enum_type_list<defaults::BF_WORDS_PER_BLOCK>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_add_unique_hash")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.set_max_noise(defaults::MAX_NOISE)
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", {defaults::BF_SIZE_MB})
.add_int64_axis("PatternBits", {defaults::BF_PATTERN_BITS});

NVBENCH_BENCH_TYPES(bloom_filter_add,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
nvbench::type_list<defaults::BF_HASH>,
nvbench::type_list<nvbench::uint32_t, nvbench::uint64_t>,
nvbench::enum_type_list<1, 2, 4, 8>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_add_unique_block_dim")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.set_max_noise(defaults::MAX_NOISE)
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", {defaults::BF_SIZE_MB})
.add_int64_axis("PatternBits", {defaults::BF_PATTERN_BITS});
122 changes: 122 additions & 0 deletions benchmarks/bloom_filter/contains_bench.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "defaults.hpp"
#include "utils.hpp"

#include <benchmark_defaults.hpp>
#include <benchmark_utils.hpp>

#include <cuco/bloom_filter.cuh>
#include <cuco/utility/key_generator.cuh>

#include <nvbench/nvbench.cuh>

#include <cuda/std/limits>
#include <thrust/device_vector.h>

#include <exception>

using namespace cuco::benchmark; // defaults, dist_from_state, rebind_hasher_t, add_fpr_summary
using namespace cuco::utility; // key_generator, distribution

/**
* @brief A benchmark evaluating `cuco::bloom_filter::contains_async` performance
*/
template <typename Key, typename Hash, typename Word, nvbench::int32_t WordsPerBlock, typename Dist>
void bloom_filter_contains(
nvbench::state& state,
nvbench::type_list<Key, Hash, Word, nvbench::enum_type<WordsPerBlock>, Dist>)
{
// cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, 32); // slightly improves peformance if
// filter block fits into a 32B sector
using policy_type = cuco::bloom_filter_policy<rebind_hasher_t<Hash, Key>,
Word,
static_cast<std::uint32_t>(WordsPerBlock)>;
using filter_type =
cuco::bloom_filter<Key, cuco::extent<size_t>, cuda::thread_scope_device, policy_type>;

auto const num_keys = state.get_int64("NumInputs");
auto const filter_size_mb = state.get_int64("FilterSizeMB");
auto const pattern_bits = state.get_int64("PatternBits");

try {
auto const policy = policy_type{static_cast<uint32_t>(pattern_bits)};
} catch (std::exception const& e) {
state.skip(e.what()); // skip invalid configurations
}

std::size_t const num_sub_filters =
(filter_size_mb * 1024 * 1024) /
(sizeof(typename filter_type::word_type) * filter_type::words_per_block);

thrust::device_vector<Key> keys(num_keys);
thrust::device_vector<bool> result(num_keys, false);

key_generator gen;
gen.generate(dist_from_state<Dist>(state), keys.begin(), keys.end());

state.add_element_count(num_keys);

filter_type filter{num_sub_filters, {}, {static_cast<uint32_t>(pattern_bits)}};

add_fpr_summary(state, filter);

filter.add(keys.begin(), keys.end());

state.exec([&](nvbench::launch& launch) {
filter.contains_async(keys.begin(), keys.end(), result.begin(), {launch.get_stream()});
});
}

NVBENCH_BENCH_TYPES(bloom_filter_contains,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
nvbench::type_list<defaults::BF_HASH>,
nvbench::type_list<defaults::BF_WORD>,
nvbench::enum_type_list<defaults::BF_WORDS_PER_BLOCK>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_contains_unique_size")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.set_max_noise(defaults::MAX_NOISE)
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", defaults::BF_SIZE_MB_RANGE_CACHE)
.add_int64_axis("PatternBits", {defaults::BF_PATTERN_BITS});

NVBENCH_BENCH_TYPES(bloom_filter_contains,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
defaults::HASH_RANGE,
nvbench::type_list<defaults::BF_WORD>,
nvbench::enum_type_list<defaults::BF_WORDS_PER_BLOCK>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_contains_unique_hash")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.set_max_noise(defaults::MAX_NOISE)
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", {defaults::BF_SIZE_MB})
.add_int64_axis("PatternBits", {defaults::BF_PATTERN_BITS});

NVBENCH_BENCH_TYPES(bloom_filter_contains,
NVBENCH_TYPE_AXES(nvbench::type_list<defaults::BF_KEY>,
nvbench::type_list<defaults::BF_HASH>,
nvbench::type_list<nvbench::uint32_t, nvbench::uint64_t>,
nvbench::enum_type_list<1, 2, 4, 8>,
nvbench::type_list<distribution::unique>))
.set_name("bloom_filter_contains_unique_block_dim")
.set_type_axes_names({"Key", "Hash", "Word", "WordsPerBlock", "Distribution"})
.set_max_noise(defaults::MAX_NOISE)
.add_int64_axis("NumInputs", {defaults::BF_N})
.add_int64_axis("FilterSizeMB", {defaults::BF_SIZE_MB})
.add_int64_axis("PatternBits", {defaults::BF_PATTERN_BITS});
42 changes: 42 additions & 0 deletions benchmarks/bloom_filter/defaults.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cuco/hash_functions.cuh>

#include <nvbench/nvbench.cuh>

#include <cuda/std/array>

#include <vector>

namespace cuco::benchmark::defaults {

using BF_KEY = nvbench::int64_t;
using BF_HASH = cuco::xxhash_64<char>;
using BF_WORD = nvbench::uint32_t;

static constexpr auto BF_N = 400'000'000;
static constexpr auto BF_SIZE_MB = 2'000;
static constexpr auto BF_WORDS_PER_BLOCK = 8;
static constexpr auto BF_PATTERN_BITS = BF_WORDS_PER_BLOCK;

auto const BF_SIZE_MB_RANGE_CACHE =
std::vector<nvbench::int64_t>{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048};
auto const BF_PATTERN_BITS_RANGE = std::vector<nvbench::int64_t>{1, 2, 4, 6, 8, 16};
PointKernel marked this conversation as resolved.
Show resolved Hide resolved

} // namespace cuco::benchmark::defaults
Loading
Loading