-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat] CAGRA filtering with BFKNN when sparsity matching threshold #378
Conversation
@@ -140,6 +142,61 @@ void search_main(raft::resources const& res, | |||
raft::device_matrix_view<DistanceT, int64_t, raft::row_major> distances, | |||
CagraSampleFilterT sample_filter = CagraSampleFilterT()) | |||
{ | |||
if constexpr (!std::is_same_v<CagraSampleFilterT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you pull this out into a separate function that can be invoked here please? This search_main
function is gettig pretty massive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@rhdong it looks like the recall rates for CAGRA filtered search are all pretty terrible (and I can recall some other experiments you did recently that seemed to have shown much better recall for the higher filter rates... but i could be mistaken). Was 0.9 just chosen because I mentioned it's a common case or did you choose it based on your benchmarks above? Also, I would do L2 in addition to inner product. I don't think nn-descent actually supports inner product atm. Did you use ivf-pq here to build the graph or do we have a bug where CAGRA is not throwing an error when inner product is used with nn-descent? |
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to to add an extra member to the cagra index to avoid (1) complicating it even more than it is right now, (2) changing the abi.
cuvs::neighbors::brute_force::build(res, *brute_force_dataset, index.metric()); | ||
|
||
auto brute_force_queries = queries; | ||
auto padding_queries = raft::make_device_matrix<T, int64_t>(res, 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto padding_queries = raft::make_device_matrix<T, int64_t>(res, 0, 0); | |
// Allocate the padded queries in the workspace resource | |
auto padding_queries = raft::make_device_mdarray<T, int64_t>( | |
res, | |
raft::resource::get_workspace_resource(res), | |
raft::make_extents<int64_t>(n_queries, dataset_view.extent(1))); | |
// Copy the queries and fill the padded elements with zeros | |
raft::linalg::map_offset(res, | |
padding_queries.view(), | |
[queries, stride = dataset_view.extent(1)] __device__(int64_t i) { | |
auto row_ix = i / stride; | |
auto el_ix = i % stride; | |
return el_ix < queries.extent(1) ? queries(row_ix, el_ix) : T{0}; | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion! I found a way to allocate the queries in the workspace while keeping the reuse of copy_with_padding
, which could help with clean code. If I missed something, please feel free to point it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to stick to raft primitives and eventually remove copy_with_padding
, because the latter is not a commonly used primitive and thus one requires to go read its code to understand what does it do exactly.
But I don't mind to keep it here for now.
Hi @achirkin @cjnolet @lowener, per the benchmark result on deep-image-96-inner with the 9990000 rows data on A6000 32GB, the The benchmark of sparsity( ----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
PopcBenchI64/0/manual_time 0.005 ms 0.041 ms 131086 2#131072#0.4
PopcBenchI64/1/manual_time 0.009 ms 0.045 ms 77488 8#131072#0.5
PopcBenchI64/2/manual_time 0.014 ms 0.050 ms 49516 16#131072#0.2
PopcBenchI64/3/manual_time 0.004 ms 0.040 ms 182993 2#8192#0.4
PopcBenchI64/4/manual_time 0.004 ms 0.040 ms 157537 16#8192#0.5
PopcBenchI64/5/manual_time 0.009 ms 0.045 ms 76827 128#8192#0.2
PopcBenchI64/6/manual_time 0.012 ms 0.047 ms 56139 1024#8192#0.1
PopcBenchI64/7/manual_time 0.013 ms 0.048 ms 55972 1024#8192#0.1
PopcBenchI64/8/manual_time 0.013 ms 0.048 ms 55916 1024#8192#0.1
PopcBenchI64/9/manual_time 0.013 ms 0.048 ms 55888 1024#8192#0.1
PopcBenchI64/10/manual_time 0.013 ms 0.048 ms 55995 1024#8192#0.1
PopcBenchI64/11/manual_time 0.013 ms 0.048 ms 55764 1024#8192#0.1
PopcBenchI64/12/manual_time 0.013 ms 0.048 ms 55728 1024#8192#0.1
PopcBenchI64/13/manual_time 0.013 ms 0.048 ms 55718 1024#8192#0.1
PopcBenchI64/14/manual_time 0.013 ms 0.048 ms 55757 1024#8192#0.4
PopcBenchI64/15/manual_time 0.013 ms 0.048 ms 55731 1024#8192#0.5
PopcBenchI64/16/manual_time 0.013 ms 0.048 ms 55689 1024#8192#0.2
PopcBenchI64/17/manual_time 0.013 ms 0.048 ms 55713 1024#8192#0.4
PopcBenchI64/18/manual_time 0.013 ms 0.048 ms 55787 1024#8192#0.5
PopcBenchI64/19/manual_time 0.013 ms 0.048 ms 55507 1024#8192#0.2
PopcBenchI64/20/manual_time 0.013 ms 0.048 ms 55536 1024#8192#0.4
PopcBenchI64/21/manual_time 0.013 ms 0.048 ms 55734 1024#8192#0.5
PopcBenchI64/22/manual_time 0.013 ms 0.048 ms 55760 1024#8192#0.2
PopcBenchI64/23/manual_time 0.013 ms 0.048 ms 55594 1024#8192#0.4
PopcBenchI64/24/manual_time 0.013 ms 0.048 ms 55671 1024#8192#0.5
PopcBenchI64/25/manual_time 0.013 ms 0.048 ms 55660 1024#8192#0.2
PopcBenchI64/26/manual_time 0.013 ms 0.048 ms 55663 1024#8192#0.5
PopcBenchI64/27/manual_time 0.013 ms 0.048 ms 55699 1024#8192#0.2
PopcBenchI64/28/manual_time 0.013 ms 0.048 ms 55718 1024#8192#0.4
PopcBenchI64/29/manual_time 0.013 ms 0.048 ms 55729 1024#8192#0.5
PopcBenchI64/30/manual_time 0.013 ms 0.048 ms 55663 1024#8192#0.2
PopcBenchI64/31/manual_time 0.013 ms 0.048 ms 55674 1024#8192#0.4
PopcBenchI64/32/manual_time 0.013 ms 0.048 ms 55485 1024#8192#0.5
PopcBenchI64/33/manual_time 0.013 ms 0.048 ms 55638 1024#8192#0.2
PopcBenchI64/34/manual_time 0.206 ms 0.239 ms 3393 1#1073741824#0.5
PopcBenchI64/35/manual_time 0.206 ms 0.239 ms 3394 1#1073741824#0.2
PopcBenchI64/36/manual_time 0.206 ms 0.239 ms 3394 1#1073741824#0.1
PopcBenchI64/37/manual_time 0.206 ms 0.239 ms 3402 1#1073741824#0.01
name: cuvs_cagra
constraints:
build: cuvs_bench.config.algos.constraints.cuvs_cagra_build
search: cuvs_bench.config.algos.constraints.cuvs_cagra_search
groups:
base:
build:
graph_degree: [64]
intermediate_graph_degree: [96]
graph_build_algo: ["NN_DESCENT"]
search:
itopk: [256]
search_width: [1] dataset: deep-image-96-inner
dim: 96
distance: euclidean
gpu_driver_version: 12.4
gpu_gpuDirectRDMASupported: 1
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 384
gpu_mem_freq: 8001000000.000000
gpu_mem_global_size: 51033931776
gpu_mem_shared_size: 102400
gpu_name: NVIDIA RTX A6000
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.5
gpu_sm_count: 84
gpu_sm_freq: 1800000000.000000
host_cores_used: 16
host_cpu_freq_max: 5881000000
host_cpu_freq_min: 400000000
host_pagesize: 4096
host_processors_sysconf: 32
host_processors_used: 32
host_total_ram_size: 134152171520
host_total_swap_size: 66571988992
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries search_width total_queries
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# filter by cagra / 0.89 / **no sparsity** computing / (baseline 24.12)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 7.29 ms 7.29 ms 96 7.28165m 7.28635m 0.108229 0.699489 137.243/s 256 100 1 1 96
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 18.8 ms 18.8 ms 37 0.0188214 0.0188264 0.108476 0.696578 53.1169k/s 256 100 1000 1 37k
# filter by cagra / 0.89 / **with sparsity** computing (PR #378)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 7.32 ms 7.32 ms 96 7.31196m 7.31662m 0.111042 0.702396 136.675/s 256 100 1 1 96
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 18.7 ms 18.7 ms 37 0.0186631 0.0186691 0.109044 0.690756 53.5646k/s 256 100 1000 1 37k
#filter by **brute force** / 0.90 / with sparsity computing (PR #378)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 17.9 ms 17.9 ms 39 0.0178896 0.017895 0.102308 0.697904 55.8817/s 256 100 1 1 39
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 409 ms 409 ms 2 0.409333 0.409342 0.100875 0.818684 2.44295k/s 256 100 1000 1 2k
# No filter(cuvs::neighbors::filtering::none_sample_filter)/ no sparsity computing / by cagra (baseline 24.12)
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 0.167 ms 0.167 ms 4180 162.573u 166.937u 0.968471 0.697795 5.99031k/s 256 100 1 1 4.18k
cuvs_cagra.graph_degree64.intermediate_graph_degree96.graph_build_algoNN_DESCENT/process_time/real_time/threads:1 7.86 ms 7.86 ms 89 7.85858m 7.86357m 0.992259 0.699858 127.169k/s 256 100 1000 1 89k
|
Hi @rhdong thanks for the initial benchmarks, but we'd need something more conclusive. First of all, please check what happens to the recall (value 0.1 suggests there's a bug). Let's aim at some realistic levels, such as e.g. 0.95. Please, choose an appropriate build/search config. I'd start looking with the common default graph degrees 48/32 and the search itopk of 256. Second, with this benchmark we aim to find potential cases where the introduced change can hurt performance most. I figure such a case could be a larger dataset (50M-100M records) where the linear sparsity check takes longer while CAGRA is still fast. I think, the ideal candidate for this is a larger subset of the deep dataset you used. The most relevant search setting here is Third, the benchmark setup. I think, there's no need to use the Finally, to exclude the filtering-vs-no-filtering overheads from the benchmark comparison, I'd suggest to benchmark with filtering only. Compare the two executables: before and after the introduced change. I think, it makes sense to test it at a few sparsity levels as you did in the beginning (so that both cagra and bruteforce codepaths are benchmarked). |
Hi @achirkin , thank you so very much for your suggestions, as we discussed in the last night meeting, I have made the benchmark(the results) as you request on the deep-image-96, and I'm trying to do it on the deep-100M, there is only one different: these two arguments are not supported in the cuvs_bench |
Hi @achirkin , @cjnolet , @lowener , here are the benchmark results on the deep-100M. First, the conclusion is still : there is no negative effect on the performance, in the worst situation, the perf. decreased by about 1% dataset: deep-100M
dim: 96
distance: euclidean
gpu_driver_version: 12.7
gpu_gpuDirectRDMASupported: 1
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 5120
gpu_mem_freq: 1512000000.000000
gpu_mem_global_size: 85097971712
gpu_mem_shared_size: 167936
gpu_name: NVIDIA A100 80GB PCIe
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.5
gpu_sm_count: 108
gpu_sm_freq: 1410000000.000000
host_cores_used: 64
host_cpu_freq_max: 2250000000
host_cpu_freq_min: 1500000000
host_pagesize: 4096
host_processors_sysconf: 256
host_processors_used: 256
host_total_ram_size: 1081985519616
host_total_swap_size: 0
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Sparsity = 0.89--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# filter by cagra / 0.89 / **no sparsity** computing / (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time 0.999 ms 0.998 ms 13919 986.065u 999.468u 0.10397 13.9116 1000.53/s 128 10 1 13.919k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 2.21 ms 2.21 ms 6339 2.19542m 2.20986m 0.106152 14.0083 452.518/s 256 10 1 6.339k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 2.27 ms 2.27 ms 6155 2.25913m 2.27269m 0.10265 13.9884 440.007k/s 128 10 1000 6.155M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 4.89 ms 4.89 ms 2858 4.87807m 4.89017m 0.10583 13.9761 204.492k/s 256 10 1000 2.858M algo="single_cta"
# filter by cagra / 0.89 / **with sparsity** computing (PR #378), ----in worst situation, the perf. decreased by (2 / 999) = 0.2%
raft_cagra.dim32/0/0/process_time/real_time 1.01 ms 1.01 ms 13864 995.968u 1007.31u 0.10311 13.9654 992.74/s 128 10 1 13.864k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 2.20 ms 2.20 ms 6372 2.18463m 2.1961m 0.105728 13.9936 455.352/s 256 10 1 6.372k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 2.31 ms 2.31 ms 6043 2.30312m 2.31494m 0.10375 13.9892 431.977k/s 128 10 1000 6.043M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 4.94 ms 4.94 ms 2835 4.9261m 4.9381m 0.10522 13.9995 202.507k/s 256 10 1000 2.835M algo="single_cta"
Sparsity = 0.5--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# filter by cagra / 0.50 / **no sparsity** computing / (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time 0.967 ms 0.967 ms 14480 955.493u 966.77u 0.46733 13.9988 1034.37/s 128 10 1 14.48k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 2.08 ms 2.08 ms 6720 2.07125m 2.08272m 0.455119 13.9959 480.142/s 256 10 1 6.72k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 2.22 ms 2.22 ms 6298 2.20983m 2.22182m 0.46974 13.993 450.082k/s 128 10 1000 6.298M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 4.90 ms 4.90 ms 2863 4.88828m 4.9002m 0.45572 14.0293 204.073k/s 256 10 1000 2.863M algo="single_cta"
# filter by cagra / 0.50 / **with sparsity** computing (PR #378), ---- in worst situation, the perf. decreased by (9 / 967) = 1%
raft_cagra.dim32/0/0/process_time/real_time 0.976 ms 0.976 ms 14336 965.031u 976.492u 0.47207 13.999 1024.07/s 128 10 1 14.336k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 2.07 ms 2.07 ms 6756 2.05985m 2.07132m 0.452191 13.9938 482.784/s 256 10 1 6.756k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 2.26 ms 2.26 ms 6175 2.25309m 2.26489m 0.46716 13.9857 441.524k/s 128 10 1000 6.175M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 4.94 ms 4.94 ms 2835 4.93155m 4.94368m 0.45646 14.0153 202.279k/s 256 10 1000 2.835M algo="single_cta"
Sparsity = 0.1--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# filter by cagra / 0.10 / **no sparsity** computing / (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time 0.934 ms 0.934 ms 14990 922.6u 933.953u 0.84729 14 1070.72/s 128 10 1 14.99k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 1.95 ms 1.95 ms 7188 1.93549m 1.94687m 0.748637 13.9941 513.646/s 256 10 1 7.188k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 2.16 ms 2.16 ms 6472 2.14989m 2.16168m 0.84653 13.9904 462.603k/s 128 10 1000 6.472M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 4.39 ms 4.39 ms 3189 4.38259m 4.39459m 0.7505 14.0144 227.552k/s 256 10 1000 3.189M algo="single_cta"
# filter by cagra / 0.10 / **with sparsity** computing (PR #378), ---- in worst situation, perf. decreased by (9 / 934) = 1%
raft_cagra.dim32/0/0/process_time/real_time 0.943 ms 0.943 ms 14845 931.751u 943.14u 0.84642 14.0009 1060.29/s 128 10 1 14.845k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 1.93 ms 1.93 ms 7229 1.92334m 1.93483m 0.74843 13.9869 516.84/s 256 10 1 7.229k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 2.21 ms 2.21 ms 6340 2.19379m 2.20577m 0.84641 13.9846 453.357k/s 128 10 1000 6.34M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 4.44 ms 4.44 ms 3153 4.42647m 4.43852m 0.74817 13.9946 225.301k/s 256 10 1000 3.153M algo="single_cta"
Sparsity = 0.9001 (to Brute force with prefilter and sparsity)--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# filter by **brute force** / 0.9001 / with sparsity computing (PR #378)
raft_cagra.dim32/0/0/process_time/real_time 59.6 ms 59.6 ms 236 0.059615 0.0596262 0.113559 14.0718 16.7712/s 128 10 1 236 algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 60.0 ms 60.0 ms 234 0.0600018 0.0600131 0.0987179 14.0431 16.663/s 256 10 1 234 algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 2813 ms 2813 ms 5 2.81277 2.81281 0.10188 14.0641 355.516/s 128 10 1000 5k algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 2816 ms 2815 ms 5 2.8155 2.81554 0.09832 14.0777 355.172/s 256 10 1000 5k algo="single_cta"
No prefilter & no sparsity computing, CAGRA--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# No filter(cuvs::neighbors::filtering::none_sample_filter)/ no sparsity computing / by cagra (baseline 24.12)
raft_cagra.dim32/0/0/process_time/real_time 0.815 ms 0.815 ms 17169 804.125u 815.174u 0.93833 13.9957 1.22673k/s 128 10 1 17.169k algo="single_cta"
raft_cagra.dim32/1/0/process_time/real_time 1.76 ms 1.76 ms 7963 1.74589m 1.75703m 0.973038 13.9912 569.144/s 256 10 1 7.963k algo="single_cta"
raft_cagra.dim32/0/1/process_time/real_time 1.77 ms 1.77 ms 7899 1.75715m 1.7687m 0.93833 13.9709 565.389k/s 128 10 1000 7.899M algo="single_cta"
raft_cagra.dim32/1/1/process_time/real_time 3.70 ms 3.70 ms 3788 3.69044m 3.70208m 0.97275 14.0235 270.118k/s 256 10 1000 3.788M algo="single_cta" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rhdong for the benchmarks, I think the observed 1-2% are a reasonable compromise.
I have one more request though: make this a user-controlled feature. I think we'd better keep it off by default to avoid any surprises. It also may have some compatibility issues with other features (since the sparsity check runs a kernel, it's likely to interfere with the persistent kernel feature of single-cta algorithm).
raft::device_matrix_view<InternalIdxT, int64_t, raft::row_major> neighbors, | ||
raft::device_matrix_view<DistanceT, int64_t, raft::row_major> distances, | ||
CagraSampleFilterT& sample_filter, | ||
double threshold_to_bf = 0.9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please make it possible to disable/enable this feature by the user (see #252 (comment) for the reasoning):
- Expose
threshold_to_bf
as CAGRA search parameter and set it to 1.0 by default there - Add a check here: if
threshold_to_bf >= 1.0
then disable further checks and proceed with CAGRA search immediately (i.e. no need to run the sparsity check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most users with a filter are going to be specifying the filter in batch, and will know the sparsity of the filter. I suggest instead of turning this feature off by default, we allow the user specified filter to know its own nnz unless updated.
Turning this off by default undermines the fundamental benefits of this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most users are not specifying a filter, and when they do, it's expected the filter is going to be heavy. This should not impact all users.
@artem any user specifying a filter is opting in to this feature. It's already off by default. The users specifying a filter are going to be specifying a heavy filter most of the time. I've asked James to build this feature because it's needed. I don't think it's a bad idea to expose the threshold by which it crosses over the brute force, but we will be setting it to 99% by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! Please move the new search argument to the search_params
structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's perhaps remove this from the benchmark wrapper for now?
The reason I'm suggesting this, is that filter creation should probably be a part of common benchmarking infrastructure rather than specific for CAGRA and, therefore, is a little out of the scope of this PR.
This reverts commit b5dcc02.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! LGTM on cagra_search.cuh
side
auto brute_force_dataset = raft::make_device_matrix_view<const T, int64_t, raft::row_major>( | ||
strided_dataset.view().data_handle(), strided_dataset.n_rows(), strided_dataset.stride()); | ||
|
||
auto brute_force_idx = cuvs::neighbors::brute_force::build(res, brute_force_dataset, metric); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is being called each and every time a user performs a search? There's overhead in this call, and this should cache off the built brute-force index because for many common distances this computes a set of norms.
auto n_dataset = strided_dataset.n_rows(); | ||
|
||
auto bitset_filter_view = sample_filter.bitset_view_; | ||
auto sparsity = bitset_filter_view.sparsity(res); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the number of positive bits in the bitmap also needed to compute this? But then we compute n_elements
below. We should cache off the n_elements
. We should consider making the bitmap immutable. That way the sparsity and the number of positive elements can be safely cached. This isn't something users are going to be updating, like ever.
auto n_queries = queries.extent(0); | ||
auto n_dataset = strided_dataset.n_rows(); | ||
|
||
auto bitset_filter_view = sample_filter.bitset_view_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here if the 2d bitmap isn't able to be converted to a 1d bitet without losing information?
float32
orhalf
distance type