[BUG] Fix CAGRA filter #489
Conversation
Can you add a test that would prevent regression?
Thanks for the PR, @enp1s0!
I'm a little bit confused with the description. Do I understand it right that this PR contains two fixes: (1) make the bitonic sort array always a power-of-two, (2) move filtered elements to the end of the topk buffer?
The big chunk of the PR addresses (2), but that should be irrelevant for #472, because in that bug no elements are filtered out.
Therefore, I think, it would be really beneficial to construct a reproducer for #472 as a test case in this PR and make sure it's fixed with the introduced change.
Also, (1) did you have a chance to check if this affects the QPS? (2) do we need a similar fix for multi-cta and multi-kernel versions of CAGRA?
@achirkin, thank you for your comment, and I'm sorry for the bad PR description. I updated it.
No, this PR changes the filtering process so that the bitonic sort is not used to move the invalid elements to the end of the buffer. In the current search implementation, the bitonic sort is used to move the invalid elements in:
cuvs/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh, lines 758 to 763 (at 5062594)
cuvs/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh, lines 644 to 649 (at 5062594)
The problem is that the (max) array length (= MAX_ITOPK + MAX_CANDIDATES) is not always a power of two. Although, as you mentioned, making the bitonic sort array always a power of two is an alternative way to fix this issue, I didn't do it because 1) the array elements other than the filtered-out nodes are already sorted, and 2) padding the bitonic sort array to a power of two would require additional registers that would otherwise go unused. Also, this bug is the cause of a problem in the CAGRA filtering unit test:
cuvs/cpp/test/neighbors/ann_cagra.cuh, line 762 (at 5062594)
When the itopk size is not specified, the default value, 64, is used. The graph degree is also 64. Therefore, MAX_ITOPK (64) + MAX_CANDIDATES (64) equals 128, which is a power of two, so the bitonic sort works correctly in this case. However, if the itopk size is set to another value, the bitonic sort does not work.
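(To make the power-of-two requirement concrete, here is a tiny standalone check; it is illustrative only, not cuvs code, and the itopk of 96 below is just an example of a non-default value, not one taken from the PR.)

```cpp
#include <cstdint>

// True iff n is a power of two -- the length a bitonic sorting network expects.
constexpr bool is_power_of_two(std::uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

static_assert(is_power_of_two(64 + 64),  "default: MAX_ITOPK (64) + MAX_CANDIDATES (64) = 128 = 2^7");
static_assert(!is_power_of_two(96 + 64), "e.g. an itopk of 96 gives 160, which is not a power of two");
```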
Yes, so I re-enabled the test in this PR by changing the following lines to set the itopk size correctly (@lowener):
cuvs/cpp/test/neighbors/ann_cagra.cuh, lines 762 to 765 (at 5062594)
I measured the performance of a search with no filtering (the same situation as #472).
No.
Thanks @enp1s0 for the PR and the comprehensive answer. Now that I understand the logic of the change, it looks good to me overall.
Nevertheless, I have a few questions about the design below.
(Resolved review thread on cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh; outdated)
```cuda
for (unsigned i = 0; i < search_width; i++) {
  move_invalid_to_end_of_list(
    result_indices_buffer, result_distances_buffer, internal_topk);
}
```
Do I understand that right, that this algorithm moves one index at a time, and repeats this for each candidate in the list? That's O(search_width * (parent_list_buffer + search_width)) complexity?
Maybe we'd better do a prefix scan (to get the indices of the valid items) followed by a shift of all elements (e.g. we can use cub::WarpScan or cub::BlockScan for that)? Or do you think the performance difference would be negligible?
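For reference, a minimal sketch of the scan-then-shift compaction suggested here, using cub::BlockScan. The buffer names, the INVALID sentinel, and the single-block / n <= BLOCK_THREADS restriction are illustrative assumptions, not the PR's code:

```cuda
#include <cub/block/block_scan.cuh>
#include <cstdint>

constexpr int BLOCK_THREADS        = 128;          // assumes n <= BLOCK_THREADS
constexpr std::uint32_t INVALID    = 0xFFFFFFFFu;  // illustrative sentinel

// Compact valid (index, distance) pairs to the front of the buffers.
__global__ void compact_valid(std::uint32_t* indices, float* distances, int n)
{
  using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
  __shared__ typename BlockScan::TempStorage temp_storage;

  const int i             = threadIdx.x;
  const bool valid        = (i < n) && (indices[i] != INVALID);
  const std::uint32_t idx = valid ? indices[i] : INVALID;  // stage reads in registers
  const float dist        = valid ? distances[i] : 0.0f;

  int dst = 0;
  BlockScan(temp_storage).ExclusiveSum(valid ? 1 : 0, dst);  // dst = #valid items before i
  __syncthreads();  // all reads are staged before anyone overwrites the buffers

  if (valid) {
    indices[dst]   = idx;
    distances[dst] = dist;
  }
}
```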
The reason I didn't use the scan method is that the maximum number of filtered-out elements here is search_width, which is typically 1. In our experience, increasing search_width does not help improve the search performance much, so it is likely to be a small number. So I use the specialized function for the single-element case here and run it multiple times if search_width is not 1.
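A rough sketch of the kind of single-element helper described here (the name, signature, and single-thread parallelization are hypothetical; the PR's actual move_invalid_to_end_of_list may differ):

```cuda
#include <cfloat>
#include <cstdint>

// Move the first invalid entry to the end of the (otherwise sorted) list by
// shifting the elements behind it forward by one slot.
__device__ void move_one_invalid_to_end(std::uint32_t* indices,
                                        float* distances,
                                        std::uint32_t len,
                                        std::uint32_t invalid_index)
{
  if (threadIdx.x != 0) { return; }  // single-thread version for clarity
  for (std::uint32_t i = 0; i < len; ++i) {
    if (indices[i] != invalid_index) { continue; }
    for (std::uint32_t j = i; j + 1 < len; ++j) {  // shift the valid tail left by one
      indices[j]   = indices[j + 1];
      distances[j] = distances[j + 1];
    }
    indices[len - 1]   = invalid_index;  // park the invalid entry at the end
    distances[len - 1] = FLT_MAX;
    break;                               // this helper moves exactly one element
  }
}
```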
Makes sense, thanks for the explanation. For some reason I thought we could have more than search_width new candidates each iteration.
(Resolved review thread on cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh; outdated)
```cuda
constexpr std::uint32_t warp_size = 32;
if (threadIdx.x < warp_size) {
  std::uint32_t num_found_valid = 0;
  for (std::uint32_t buffer_offset = 0; buffer_offset < internal_topk;
       buffer_offset += warp_size) {
    // Calculate the new buffer index
    const auto src_position = buffer_offset + threadIdx.x;
    const std::uint32_t is_valid_index =
      (result_indices_buffer[src_position] & (~index_msb_1_mask)) == invalid_index ? 0 : 1;
    std::uint32_t new_position;
    scan_op_t(temp_storage).InclusiveSum(is_valid_index, new_position);
    if (is_valid_index) {
      const auto dst_position = num_found_valid + (new_position - 1);
      result_indices_buffer[dst_position]   = result_indices_buffer[src_position];
      result_distances_buffer[dst_position] = result_distances_buffer[src_position];
    }

    // Calculate the largest valid position within a warp and broadcast it for the next iteration
    num_found_valid += new_position;
    for (std::uint32_t offset = (warp_size >> 1); offset > 0; offset >>= 1) {
      const auto v = __shfl_xor_sync(~0u, num_found_valid, offset);
      if ((threadIdx.x & offset) == 0) { num_found_valid = v; }
    }

    // If enough items have been found, terminate early
    if (num_found_valid >= top_k) { break; }
  }

  if (num_found_valid < top_k) {
    // Fill the remaining buffer with invalid values so that `topk_by_bitonic_sort` is usable in
    // the next step
    for (std::uint32_t i = num_found_valid + threadIdx.x; i < internal_topk; i += warp_size) {
      result_indices_buffer[i]   = invalid_index;
      result_distances_buffer[i] = utils::get_max_value<DISTANCE_T>();
    }
  }
}
```
I've just realized you do here exactly what I wanted to suggest in the comment above. Could you please put this in a separate function for better readability? And then consider if it makes sense to re-use that in place of the move_invalid_to_end_of_list loop above?
I use this code here because the number of filtered-out nodes is unknown, while the maximum number is known in the place above. If the maximum number of filtered-out nodes is known and small, we can use simpler code that does not require the scan operation, as above (although I didn't compare the performance experimentally).
```cuda
// If a sufficient number of valid indices are not in the internal topk, pick them up from the
// candidate list.
if (top_k > internal_topk || result_indices_buffer[top_k - 1] == invalid_index) {
```
Do I understand it right, that we need this because we filter the result indices buffer after we move candidates from the internal workspace, which is larger?
If so, why shouldn't we first filter the internal workspace and only then copy the results instead?
What the code here does is 1) pick up valid nodes (= nodes that are not filtered out) from the candidate list by the bitonic sort, and 2) concatenate the resulting list with the itopk valid-node list. This operation is needed when enough valid indices for the topk cannot be obtained from the itopk valid-node list alone.
I think the function name topk_by_bitonic_sort is problematic because it is not a simple sort: it takes a sorted list A and an unsorted list B and outputs merge_sort(A, bitonic_sort(B)). I'll change the name.
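As an aside, the shape of the renamed operation can be sketched on the host as follows (the function names and the assumption that A and B have equal power-of-two lengths are illustrative, not the PR's code): sort B descending as a stand-in for bitonic_sort(B), append it to the ascending A so the whole sequence is bitonic, then run one bitonic merge.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// One bitonic merge over a bitonic sequence of power-of-two length n.
void bitonic_merge(std::vector<float>& v, std::size_t lo, std::size_t n)
{
  if (n < 2) { return; }
  const std::size_t half = n / 2;
  for (std::size_t i = lo; i < lo + half; ++i) {
    if (v[i] > v[i + half]) { std::swap(v[i], v[i + half]); }
  }
  bitonic_merge(v, lo, half);  // both halves are bitonic after the pass
  bitonic_merge(v, lo + half, half);
}

// merge_sort(A, bitonic_sort(B)): A is already sorted ascending, B is unsorted.
std::vector<float> sort_and_merge(const std::vector<float>& a, std::vector<float> b)
{
  std::sort(b.begin(), b.end(), std::greater<float>());  // stand-in for bitonic_sort(B)
  std::vector<float> out(a);                             // ascending ++ descending = bitonic
  out.insert(out.end(), b.begin(), b.end());
  bitonic_merge(out, 0, out.size());
  return out;
}
```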
Co-authored-by: Artem M. Chirkin <[email protected]>
@achirkin Thank you for your review! I fixed the code, so can you check it again?
Thank you for the updates! LGTM
```cuda
std::uint32_t num_found_valid = 0;
for (std::uint32_t buffer_offset = 0; buffer_offset < internal_topk;
     buffer_offset += warp_size) {
  // Calculate the new buffer index
  const auto src_position = buffer_offset + threadIdx.x;
  const std::uint32_t is_valid_index =
    (result_indices_buffer[src_position] & (~index_msb_1_mask)) == invalid_index ? 0 : 1;
  std::uint32_t new_position;
  scan_op_t(temp_storage).InclusiveSum(is_valid_index, new_position);
  if (is_valid_index) {
    const auto dst_position = num_found_valid + (new_position - 1);
    result_indices_buffer[dst_position]   = result_indices_buffer[src_position];
    result_distances_buffer[dst_position] = result_distances_buffer[src_position];
  }

  // Calculate the largest valid position within a warp and broadcast it for the next iteration
  num_found_valid += new_position;
  for (std::uint32_t offset = (warp_size >> 1); offset > 0; offset >>= 1) {
    const auto v = raft::shfl_xor(num_found_valid, offset);
    if ((threadIdx.x & offset) == 0) { num_found_valid = v; }
  }

  // If enough items have been found, terminate early
  if (num_found_valid >= top_k) { break; }
}

if (num_found_valid < top_k) {
  // Fill the remaining buffer with invalid values so that `topk_by_bitonic_sort_and_merge` is
  // usable in the next step
  for (std::uint32_t i = num_found_valid + threadIdx.x; i < internal_topk; i += warp_size) {
    result_indices_buffer[i]   = invalid_index;
    result_distances_buffer[i] = utils::get_max_value<DISTANCE_T>();
  }
}
}
```
Even if we use this routine only once, I still think it would be nice to move it out as a separate function alongside move_invalid_to_end_of_list (so that we'd have two: move_first_invalid_to_end_of_list and move_all_invalid_to_end_of_list).
But given that we're very close to the code freeze and this PR is very important, I'd say we can postpone this grooming to the next release.
/merge
Ref: #472

The cause of the bug

The bitonic sort was used on an array whose length was not a power of two. In the current search implementation, the bitonic sort is used to move the invalid elements to the end of the buffer in:

cuvs/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh, lines 758 to 763 (at 5062594)
cuvs/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh, lines 644 to 649 (at 5062594)

The problem is that the (max) array length (= MAX_ITOPK + MAX_CANDIDATES) is not always a power of two. These bitonic sorts are called even if no elements are filtered out, unless cuvs::neighbors::filtering::none_sample_filter is specified as the filter, so #472 occurs.

Fix

This PR changes the filtering process so that the bitonic sort is not used to move the invalid elements to the end of the buffer.
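To see the failure mode in isolation, here is a standalone demonstration (a textbook bitonic network, not the cuvs implementation) that such a network silently mis-sorts when the array length is not a power of two:

```cpp
#include <cstdio>
#include <vector>

// Textbook bitonic sorting network; only correct when v.size() is a power of two.
void bitonic_sort_network(std::vector<int>& v)
{
  const std::size_t n = v.size();
  for (std::size_t k = 2; k <= n; k <<= 1) {    // with n = 6 the k = 8 stage never runs
    for (std::size_t j = k >> 1; j > 0; j >>= 1) {
      for (std::size_t i = 0; i < n; ++i) {
        const std::size_t l = i ^ j;
        if (l <= i || l >= n) { continue; }     // out-of-range partners are silently skipped
        const bool ascending = ((i & k) == 0);
        if ((v[i] > v[l]) == ascending) { std::swap(v[i], v[l]); }
      }
    }
  }
}

int main()
{
  std::vector<int> v{5, 4, 3, 2, 1, 0};  // length 6: not a power of two
  bitonic_sort_network(v);
  for (int x : v) { std::printf("%d ", x); }  // prints "2 3 4 5 1 0" -- not sorted
  std::printf("\n");
  return 0;
}
```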