Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Add Support for Index merge in CAGRA #618

Merged
merged 22 commits into from
Feb 6, 2025

Conversation

rhdong
Copy link
Member

@rhdong rhdong commented Jan 27, 2025

No description provided.

@rhdong rhdong requested review from cjnolet and achirkin January 27, 2025 07:04
@rhdong rhdong requested review from a team as code owners January 27, 2025 07:04
auto merged_index =
cagra::build(handle, params, raft::make_const_mdspan(device_updated_dataset_view));

if (static_cast<std::size_t>(stride) == dim) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cjnolet @achirkin , I know these codes are odd, but without them, datasets will be changed after calling cagra::detail::search_main_core, which will cause the test failure. I do not know how the dataset format, matrix ownership, cagra::search interact behind it. Could you have comments here? Many thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it looks like no one owns the host_updated_dataset or device_updated_dataset beyond the scope of this function, so the data gets destroyed unless the owning update_dataset is called under the if branch here.
Hence, I think, you should call update_dataset unconditionally here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it looks like no one owns the host_updated_dataset or device_updated_dataset beyond the scope of this function, so the data gets destroyed unless the owning update_dataset is called under the if branch here. Hence, I think, you should call update_dataset unconditionally here.

Thank you, very helpful!

Copy link
Member Author

@rhdong rhdong Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @achirkin , I got the reason; the update_dataset is called in cagra::build, actually. This issue I mentioned is because a non-owning dataset is returned while stride==dim; pls refer to here. So, I unintentionally did the right thing. 😅

@rhdong rhdong added feature request New feature or request non-breaking Introduces a non-breaking change labels Jan 27, 2025
}

// Allocate the new dataset on device
auto device_updated_dataset =
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a great API that can be used to fit the situation that device memory is not enough cuvs::neighbors::nn_descent::has_enough_device_memory. I will make it in the next commit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*
* @return A new CAGRA index containing the merged indices, graph, and dataset.
*/
auto merge(raft::resources const& res,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @chatman, I'm working on cagra::merge. Could you review the API design when you have a moment? Any suggestions would be greatly appreciated. Thanks!

@rhdong
Copy link
Member Author

rhdong commented Jan 30, 2025

Hi @achirkin @cjnolet , could you take a look at the API design? Any feedback is appreciated. Thanks a lot!

@achirkin
Copy link
Contributor

could you take a look at the API design? Any feedback is appreciated. Thanks a lot!

The API of cagra::merge looks good to me, but I do have couple questions:

  1. I'm thinking maybe it's not necessary to make the merge_params a child of index_params? index_params contains members that are not necessary for merging and can be a burden for a user.
  2. Why using vector of pointers to pass indices? I don't strongly oppose it, but there are other options, which look not much worse at the first glance: as a variable-length template (using references to indices), as a non-template vararg, as a plain array ((const index<>&)[]), or as a vector of references (to avoid checking for nullptr).

On a more general note, I wonder whether the merging may be problematic on the user side due to the absence of index (vector id) remapping in CAGRA? The new index ordering depends on the order in which one puts the merged indices, so it may be difficult to map these back if the need arises.

@rhdong
Copy link
Member Author

rhdong commented Jan 31, 2025

could you take a look at the API design? Any feedback is appreciated. Thanks a lot!

The API of cagra::merge looks good to me, but I do have couple questions:

  1. I'm thinking maybe it's not necessary to make the merge_params a child of index_params? index_params contains members that are not necessary for merging and can be a burden for a user.
  2. Why using vector of pointers to pass indices? I don't strongly oppose it, but there are other options, which look not much worse at the first glance: as a variable-length template (using references to indices), as a non-template vararg, as a plain array ((const index<>&)[]), or as a vector of references (to avoid checking for nullptr).

On a more general note, I wonder whether the merging may be problematic on the user side due to the absence of index (vector id) remapping in CAGRA? The new index ordering depends on the order in which one puts the merged indices, so it may be difficult to map these back if the need arises.

Hi @achirkin, Thank you for pointing these out! Let me explain a bit:

  1. Currently, the initial implementation of merge is based on rebuilding a new index from scratch, so index_param is required. Regarding how to pass this parameter, I’m not in favor of a definition like this, as it’s relatively inconvenient for the caller:

    struct merge_params {
        index_params index_params;
    };

    In fact, some parameters from the index might be needed for merging in the future (in the ultimate optimized version), such as intermediate_graph_degree, attach_dataset_on_build, graph_degree, and graph_build_params. That’s why I implemented it this way. Would you prefer defining the members individually, or do you have another approach in mind?

  2. Yeah, you're right—I generally agree with you. It’s ultimately a trade-off decision, mainly because std::vector doesn’t support references. My concern with variable-length templates or non-template varargs is that they seem to require the caller to manually fill in all the indices, which could be cumbersome when handling a large number of indices. (we—or the caller—need an additional mechanism like std::initializer_list to make it.)

    I found that std::vector<std::reference_wrapper<const index>> can resolve the issue, but it still feels like one problem is being replaced by another. So, I’d prefer to keep it simple and just check for nullptr (which I’ll add in a new commit)-- Anyway, we need to check many other things, such as whether the index dataset is attached. That said, if you’d still like me to change it, please let me know, and I’ll follow your suggestion. :)

@rhdong
Copy link
Member Author

rhdong commented Jan 31, 2025

On a more general note, I wonder whether the merging may be problematic on the user side due to the absence of index (vector id) remapping in CAGRA? The new index ordering depends on the order in which one puts the merged indices, so it may be difficult to map these back if the need arises.

Sorry for missing the last concern, Hi @cjnolet, may you confirm if it is a problem mentioned by @achirkin: do we need to take care of the sequence of indices when merging them?

@rhdong rhdong requested a review from a team as a code owner January 31, 2025 20:18
@rhdong rhdong requested a review from jameslamb January 31, 2025 20:18
@@ -105,4 +105,4 @@ select = [
]

# detect when package size grows significantly
max_allowed_size_compressed = '1.1G'
max_allowed_size_compressed = '1.2G'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jameslamb , for conservative consideration, I increase it to 1.2GB. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with me

@rhdong
Copy link
Member Author

rhdong commented Jan 31, 2025

/ok to test

@rhdong
Copy link
Member Author

rhdong commented Feb 2, 2025

could you take a look at the API design? Any feedback is appreciated. Thanks a lot!

The API of cagra::merge looks good to me, but I do have couple questions:

  1. I'm thinking maybe it's not necessary to make the merge_params a child of index_params? index_params contains members that are not necessary for merging and can be a burden for a user.
  2. Why using vector of pointers to pass indices? I don't strongly oppose it, but there are other options, which look not much worse at the first glance: as a variable-length template (using references to indices), as a non-template vararg, as a plain array ((const index<>&)[]), or as a vector of references (to avoid checking for nullptr).

On a more general note, I wonder whether the merging may be problematic on the user side due to the absence of index (vector id) remapping in CAGRA? The new index ordering depends on the order in which one puts the merged indices, so it may be difficult to map these back if the need arises.

Hi @achirkin, Thank you for pointing these out! Let me explain a bit:

  1. Currently, the initial implementation of merge is based on rebuilding a new index from scratch, so index_param is required. Regarding how to pass this parameter, I’m not in favor of a definition like this, as it’s relatively inconvenient for the caller:

    struct merge_params {
        index_params index_params;
    };

    In fact, some parameters from the index might be needed for merging in the future (in the ultimate optimized version), such as intermediate_graph_degree, attach_dataset_on_build, graph_degree, and graph_build_params. That’s why I implemented it this way. Would you prefer defining the members individually, or do you have another approach in mind?

  2. Yeah, you're right—I generally agree with you. It’s ultimately a trade-off decision, mainly because std::vector doesn’t support references. My concern with variable-length templates or non-template varargs is that they seem to require the caller to manually fill in all the indices, which could be cumbersome when handling a large number of indices. (we—or the caller—need an additional mechanism like std::initializer_list to make it.)
    I found that std::vector<std::reference_wrapper<const index>> can resolve the issue, but it still feels like one problem is being replaced by another. So, I’d prefer to keep it simple and just check for nullptr (which I’ll add in a new commit)-- Anyway, we need to check many other things, such as whether the index dataset is attached. That said, if you’d still like me to change it, please let me know, and I’ll follow your suggestion. :)

Hi @achirkin , I thought about the 1st issue, and I realized the coupling with index_params is a bad design. I changed it by a new commit in which the strategy will be decoupled with index_params and make it an independent internal member variable, keeping no change on the caller's semantics

@@ -309,7 +325,7 @@ struct index : cuvs::neighbors::index {
return data_rows > 0 ? data_rows : graph_view_.extent(0);
}

/** Dimensionality of the data. */
/** dimension of the data. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems a little unecessary. Dimensionality seems like the right word (and case)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert it back

explicit merge_params(const cagra::index_params& params) : output_index_params(params) {}

// Parameters for creating the output index
cagra::index_params output_index_params;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an algorithmic perspective, this could be really really challenging. For example, depending upon the merge method used, I'm not sure if these can always be used. I don't think we should hold up the PR over this, but can you create a Github issue just to expend more thought into how we might be able to utilize this efficiently (if at all) with different merge strategies?

There are at least 3 different merge strategies that I can think of off the top of my head:

  1. Logical- simply wraps a new index structure around existing CAGRA graphs and broadcasts the query to each of the existing cagra graphs. This will be a fast merge but take a small hit in search latency. (This might be preferred for fewer larger CAGRA graphs.
  2. Physical- builds a new cagra grpah from the union of dataset points in existing cagra graphs. This will be expensive to build but not impact search latency/quality. This might be preferred for many smaller cagra graphs.
  3. Smart- overlaps dataset vectors across cagra graphs and merges the graphs into a single graph. This might be prefferred for many larger cagra graphs.

Maybe you could create the "MergeKind" enum now and just add "Physical" as the only option (and document accordingly). We will next need to implement the logical merge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you creatre a GIthub issue to capture the other merge strategies. For the logical merge, we will also need a composite_index or logically_merged_index that can act like a CAGRA (or other) index but it's really broadcasting the queries to the inner indexes.

Copy link
Member Author

@rhdong rhdong Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! (Naming is MergeStrategy)

Copy link
Member Author

@rhdong rhdong Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you creatre a GIthub issue to capture the other merge strategies. For the logical merge, we will also need a composite_index or logically_merged_index that can act like a CAGRA (or other) index but it's really broadcasting the queries to the inner indexes.

Sorry for missing this, I guess the composite index can be a feature for the search API instead of merging?
-- The issue was created: #663

@rhdong rhdong requested a review from cjnolet February 5, 2025 22:10
@cjnolet
Copy link
Member

cjnolet commented Feb 5, 2025

Sorry for missing this, I guess the composite index can be a feature for the search API instead of merging?

Essentially, the merge would return a "composite_index" instead of a typical cagra::index (though a "composite_index" would implement cagra::index) so the user can still interact with the index in the same way they would a typical cagra::index but when they perform search, it'll automatically broadcast the query vector to all the "logically merged" subindexes. DOes that make sense? It'd be a similar API experience to our single node multi-gpu "indexes" where the user has a handle to an index and they don't care what kind of index it is, they just know they can call the same functions on it and it'll act appropriately according to its type.

@rhdong
Copy link
Member Author

rhdong commented Feb 5, 2025

Sorry for missing this, I guess the composite index can be a feature for the search API instead of merging?

Essentially, the merge would return a "composite_index" instead of a typical cagra::index (though a "composite_index" would implement cagra::index) so the user can still interact with the index in the same way they would a typical cagra::index but when they perform search, it'll automatically broadcast the query vector to all the "logically merged" subindexes. DOes that make sense? It'd be a similar API experience to our single node multi-gpu "indexes" where the user has a handle to an index and they don't care what kind of index it is, they just know they can call the same functions on it and it'll act appropriately according to its type.

That's a great idea! Sounds like composite_index can inherit from the regular index.

*/
auto merge(raft::resources const& res,
const cuvs::neighbors::cagra::merge_params& params,
std::vector<cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with Artem that the vector of points is not the prettiest thing, but I don't think variadic templates are the way to fix that (and they overall make things very challenging to work with). I think we can stick with pointers for now and udpate the API later if needed. Initially, this will be needed for Lucene, which will use it through our Java API so at least this public API is localized at the moment.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointers are fine, from the perspective of the Java API. We can work best with memory addresses, since we'll be mmapp'ing the index data from files on disk.

Copy link
Contributor

@chatman chatman Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector<index> is fine from java's perspective.

*/
auto merge(raft::resources const& res,
const cuvs::neighbors::cagra::merge_params& params,
std::vector<cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices)
Copy link
Contributor

@chatman chatman Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector<index> is fine from java's perspective.

@cjnolet
Copy link
Member

cjnolet commented Feb 6, 2025

/merge

@rapids-bot rapids-bot bot merged commit 52dd92c into rapidsai:branch-25.02 Feb 6, 2025
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp feature request New feature or request non-breaking Introduces a non-breaking change Python
Projects
Development

Successfully merging this pull request may close these issues.

6 participants