-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve multi-CTA algorithm #492
base: branch-25.02
Are you sure you want to change the base?
Conversation
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @anaruse for this PR, it is great to see these improvements. Overall the changes look good, and the benchmarks that you have shared offline look very encouraging. I just have a few questions below.
@anaruse there are some unsigned commits, that blocks CI from testing the changes automatically. To fix this issue, could you rebase the PR? |
/ok to test |
…when the number of results is large Fix some issues Fix lower recall issue with new multi-cta algo Removing redundant code and changing some parameters Update cpp/src/neighbors/detail/cagra/search_plan.cuh Co-authored-by: Tamas Bela Feher <[email protected]> Remove an unnecessary line and satisfy clang-format
3dce160
to
6223fd2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Akira for the updates, the PR looks good to me.
/merge |
/ok to test |
/ok to test |
/merge |
/ok to test |
/merge |
We are on the brink of missing code freeze for this PR. Please anyone reading this, don't click the "update" button. It inserts a merge commit, which reruns CI in its entirety and this is not needed to merge the PR. We can re-run individual flaky tests that fail without having to rerun the entire CI (the former takes minutes and the latter can take several hours). |
@anaruse @tfeher CI seems to be running successfully for other PRs but the gtests seem to be consistently timing out for this PR. As far as I can tell, there's no updates to any of the tests, in this PR, but the timeouts don't seem flaky, they seem isolated to these changes, somehow. We are pushing back code freeze by 1 day. Do you guys think we can still make this in time for 24.12? |
Handle the case when the search result contains invalid indices when building the updated graph in add_nodes. For debugging purposes, fail if any invalid indices found; in future, we can replace RAFT_FAIL with RAFT_LOG_WARN to make the add_nodes routine more robust.
I took the liberty to add a workaround to add_nodes, which handles the case when CAGRA search doesn't return enough valid indices. With this, the tests should fail with a descriptive message in place of the segfault. |
/ok to test |
The cause has not yet been identified, but it seems that this problem only occurs when the dimensionality of the dataset is 1. |
The problem of invalid results being included in the search results when the number of dimensions in the data set is small has been improved in the case of the new multi-CTA algorithm. |
If the search results for add_nodes include invalid results, a message is output using RAFT_LOG_WARN, and if there are too many invalid results to execute add_nodes without error, it is terminated using RAFT_FAIL. |
@enp1s0 , Could you please check the changes I made to add_nodes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anaruse Thank you for adding invalid node handling to the node addition function. I have a minor comment, but the changes look good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the error in add_nodes is not observed prior to this PR (without the safeguard RAFT_FAIL, the tests segfault in this PR, but not in the main branch). This leads me to believe that the problem in the search may be related (either introduced or just surfaced) by the new changes.
In any case, I don't see a reason why the cagra search algorithms have to return invalid values under the small dimensionality condition, and thus I believe we should fix the search rather throwing the error at the use-site. Where you able to pinpoint why the search returns the invalid values in the small dimensionality cases?
It is not easy to explain why the new multi-CTA algo sometimes produces invalid search results, but it is mainly because it allows multiple CTAs to use the same node as a search path, and it allows multiple CTAs to have the same node in their local itopk buffer. If the number of dimensions is small, the paths to reach the query are rather limited, so it is likely that many nodes will be in the local itopk buffers of multiple CTAs at the same time. Even the, duplications are eventually removed using hash tables, but the result of that de-duplication is that the specified number of valid results may not be output. So far, we know that the problem only occurs when the dimensionality of the dataset is 1. And a dataset with dimension 1 is not originally a target for vector search. Although not ideal, one way to deal with this is to stop testing on datasets with a dimension of 1. |
5a3519a
to
b61126a
Compare
I devised various ways to use the temporary buffer that holds the intermediate results of a search, taking into account the characteristics of the new multi CTA algorithm. As a result, at least in CI testing, the search results no longer contain any incorrect results. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anaruse thank you for investigating this.
It's not new that our ANN algorithms may return invalid values under some circumstances. One example of this is IVF-PQ with a small number of probes (especially with filtering), so CAGRA multi-cta implementation won't be the first. I think it's reasonable to add a GTEST_SKIP()
with an explanation comment for the case of multi-cta and dim = 1 and get this PR merged.
} else if ((traversed_hash_ptr != nullptr) && | ||
hashmap::search<IndexT, 1>( | ||
traversed_hash_ptr, traversed_hash_bitlen, best_index_team_local)) { | ||
// Deactivate this entry as it has been already used by otehrs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Deactivate this entry as it has been already used by otehrs. | |
// Deactivate this entry as it has been already used by others. |
// Check search results | ||
constexpr int max_warnings = 3; | ||
int num_warnings = 0; | ||
for (std::size_t vec_i = 0; vec_i < batch.size(); vec_i++) { | ||
std::uint32_t invalid_edges = 0; | ||
for (std::uint32_t i = 0; i < base_degree; i++) { | ||
if (host_neighbor_indices(vec_i, i) >= old_size) { invalid_edges++; } | ||
} | ||
if (invalid_edges > 0) { | ||
if (num_warnings < max_warnings) { | ||
RAFT_LOG_WARN( | ||
"Invalid edges found in search results " | ||
"(vec_i:%lu, invalid_edges:%lu, degree:%lu, base_degree:%lu)", | ||
(uint64_t)vec_i, | ||
(uint64_t)invalid_edges, | ||
(uint64_t)degree, | ||
(uint64_t)base_degree); | ||
} | ||
num_warnings += 1; | ||
} | ||
} | ||
if (num_warnings > max_warnings) { | ||
RAFT_LOG_WARN("The number of queries that contain invalid search results: %d", num_warnings); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this check is easily understandable now, but are you sure it's ok to add it here in terms of performance? Taking into account that the same complexity loop below (step 2: rank-based reordering) apparently has been worth parallelizing using OMP?
It has been reported that when the number of search results is large, for example 100, using the multi-CTA algorithm can cause a decrease in recall. This PR is intended to alleviate this low recall issue.
close #208