[Feat] Add Support for Index merge in CAGRA #618
auto merged_index =
  cagra::build(handle, params, raft::make_const_mdspan(device_updated_dataset_view));

if (static_cast<std::size_t>(stride) == dim) {
Hi @cjnolet @achirkin, I know this code looks odd, but without it the dataset is changed after calling cagra::detail::search_main_core, which causes the test to fail. I do not understand how the dataset format, matrix ownership, and cagra::search interact behind the scenes. Could you comment here? Many thanks!
To me, it looks like no one owns the host_updated_dataset or device_updated_dataset beyond the scope of this function, so the data gets destroyed unless the owning update_dataset is called under the if branch here. Hence, I think, you should call update_dataset unconditionally here.
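A minimal sketch of the ownership point, in plain C++ rather than the real cuVS types. The names toy_index and build_merged are hypothetical stand-ins, and the only assumption taken from the thread is that an owning update_dataset moves the buffer into the index:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an index that can own its dataset.
struct toy_index {
  std::vector<float> owned_dataset;

  // Owning update: the index takes over the buffer, so it outlives the caller's scope.
  void update_dataset(std::vector<float>&& data) { owned_dataset = std::move(data); }

  float const* data() const { return owned_dataset.data(); }
  std::size_t size() const { return owned_dataset.size(); }
};

// Hypothetical merge helper: concatenates two datasets into a local buffer.
inline toy_index build_merged(std::vector<float> const& a, std::vector<float> const& b)
{
  std::vector<float> merged(a);
  merged.insert(merged.end(), b.begin(), b.end());

  toy_index index;
  // Called unconditionally: otherwise `merged` would be destroyed at the end of
  // this function and the index would have no valid dataset to refer to.
  index.update_dataset(std::move(merged));
  return index;
}
```

If the index only kept a view of `merged`, the view would dangle as soon as build_merged returned, which would match the symptom of the dataset appearing to change during search.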
> To me, it looks like no one owns the host_updated_dataset or device_updated_dataset beyond the scope of this function, so the data gets destroyed unless the owning update_dataset is called under the if branch here. Hence, I think, you should call update_dataset unconditionally here.
Thank you, very helpful!
}

// Allocate the new dataset on device
auto device_updated_dataset =
I found an API that fits the case where device memory is not sufficient: cuvs::neighbors::nn_descent::has_enough_device_memory. I will use it in the next commit.
Done
 *
 * @return A new CAGRA index containing the merged indices, graph, and dataset.
 */
auto merge(raft::resources const& res,
Hey @chatman, I'm working on cagra::merge. Could you review the API design when you have a moment? Any suggestions would be greatly appreciated. Thanks!
The API of
On a more general note, I wonder whether merging may be problematic on the user side due to the absence of index (vector id) remapping in CAGRA. The new index ordering depends on the order in which one passes the indices to merge, so it may be difficult to map the ids back if the need arises.
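To make the remapping concern concrete, here is a minimal sketch that assumes the merged dataset is a plain concatenation of the inputs in the order they were passed. The helper names merged_offsets and to_local are hypothetical and are not part of cuVS:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Start offset of each input index in the merged ordering
// (exclusive prefix sum over the input sizes).
inline std::vector<std::uint32_t> merged_offsets(std::vector<std::uint32_t> const& sizes)
{
  std::vector<std::uint32_t> offsets(sizes.size());
  std::uint32_t running = 0;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    offsets[i] = running;
    running += sizes[i];
  }
  return offsets;
}

// Maps a merged id back to (position of the input index, id local to that index).
// The caller must pass a merged_id smaller than the total number of vectors.
inline std::pair<std::size_t, std::uint32_t> to_local(std::vector<std::uint32_t> const& offsets,
                                                      std::uint32_t merged_id)
{
  assert(!offsets.empty());
  // First offset strictly greater than merged_id; the owning input is the one before it.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), merged_id);
  std::size_t part = static_cast<std::size_t>(it - offsets.begin()) - 1;
  return {part, merged_id - offsets[part]};
}
```

With this layout the caller has to remember the input sizes and their order, which is exactly the bookkeeping the comment warns about.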
Hi @achirkin, Thank you for pointing these out! Let me explain a bit:
Sorry for missing the last concern. Hi @cjnolet, could you confirm whether the issue @achirkin mentioned is a problem: do we need to take care of the order of the indices when merging them?
@@ -105,4 +105,4 @@ select = [
 ]

 # detect when package size grows significantly
-max_allowed_size_compressed = '1.1G'
+max_allowed_size_compressed = '1.2G'
Hi @jameslamb, to be conservative, I increased it to 1.2GB. Thanks!
ok with me
/ok to test
Hi @achirkin, I thought about the 1st issue, and I realized the coupling with
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
@@ -309,7 +325,7 @@ struct index : cuvs::neighbors::index {
     return data_rows > 0 ? data_rows : graph_view_.extent(0);
   }

-  /** Dimensionality of the data. */
+  /** dimension of the data. */
This change seems a little unnecessary. Dimensionality seems like the right word (and case).
Reverted.
explicit merge_params(const cagra::index_params& params) : output_index_params(params) {}

// Parameters for creating the output index
cagra::index_params output_index_params;
From an algorithmic perspective, this could be really challenging. For example, depending upon the merge method used, I'm not sure these parameters can always be used. I don't think we should hold up the PR over this, but can you create a GitHub issue so we can give more thought to how we might be able to utilize this efficiently (if at all) with different merge strategies?
There are at least 3 different merge strategies that I can think of off the top of my head:
- Logical: simply wraps a new index structure around existing CAGRA graphs and broadcasts the query to each of the existing CAGRA graphs. This will be a fast merge but take a small hit in search latency. (This might be preferred for fewer, larger CAGRA graphs.)
- Physical: builds a new CAGRA graph from the union of dataset points in existing CAGRA graphs. This will be expensive to build but will not impact search latency or quality. This might be preferred for many smaller CAGRA graphs.
- Smart: overlaps dataset vectors across CAGRA graphs and merges the graphs into a single graph. This might be preferred for many larger CAGRA graphs.
Maybe you could create the "MergeKind" enum now and just add "Physical" as the only option (and document accordingly). We will next need to implement the logical merge.
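A rough sketch of what that enum could look like, with the physical merge as the only option for now. The names merge_kind, merge_params_sketch, and describe are hypothetical illustrations and are not the identifiers that ended up in cuVS:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical enum: only the physical merge exists for now.
// Logical and smart merges would be added as further enumerators later.
enum class merge_kind { physical };

// Hypothetical stand-in for the merge parameters discussed in this thread.
struct merge_params_sketch {
  merge_kind kind = merge_kind::physical;
};

// Returns a short description of the selected strategy.
inline std::string describe(merge_params_sketch const& params)
{
  switch (params.kind) {
    case merge_kind::physical:
      return "physical: build one new graph from the union of all dataset points";
  }
  // Reached only if an enumerator is added without updating the switch.
  throw std::invalid_argument("unsupported merge kind");
}
```

Keeping the strategy in the parameters struct means new strategies can be added later without changing the signature of merge.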
Can you create a GitHub issue to capture the other merge strategies? For the logical merge, we will also need a composite_index or logically_merged_index that can act like a CAGRA (or other) index but is really broadcasting the queries to the inner indexes.
Done! (Naming is MergeStrategy)
> Can you create a GitHub issue to capture the other merge strategies? For the logical merge, we will also need a composite_index or logically_merged_index that can act like a CAGRA (or other) index but is really broadcasting the queries to the inner indexes.
Sorry for missing this. I guess the composite index could be a feature of the search API instead of merging?
-- The issue was created: #663
Essentially, the merge would return a "composite_index" instead of a typical cagra::index (though a "composite_index" would implement cagra::index), so the user can still interact with the index in the same way they would with a typical cagra::index, but when they perform a search, it will automatically broadcast the query vector to all the "logically merged" subindexes. Does that make sense? It would be a similar API experience to our single-node multi-GPU "indexes", where the user has a handle to an index and doesn't care what kind of index it is; they just know they can call the same functions on it and it will act appropriately according to its type.
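The broadcast idea can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: toy_index is a hypothetical brute-force stand-in for a real sub-index, composite_index_sketch is a hypothetical name, and global ids are assumed to be assigned by concatenating the sub-indexes in order:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using hit = std::pair<float, std::uint32_t>;  // (squared distance, id)

// Hypothetical stand-in for a searchable sub-index holding 1-D points.
struct toy_index {
  std::vector<float> points;

  // Brute-force top-k: hits sorted by increasing distance, with local ids.
  std::vector<hit> search(float query, std::size_t k) const
  {
    std::vector<hit> hits;
    for (std::size_t i = 0; i < points.size(); ++i) {
      float diff = points[i] - query;
      hits.emplace_back(diff * diff, static_cast<std::uint32_t>(i));
    }
    std::sort(hits.begin(), hits.end());
    if (hits.size() > k) { hits.resize(k); }
    return hits;
  }
};

// Hypothetical composite: keeps non-owning pointers to the sub-indexes,
// broadcasts each query to all of them, and keeps the overall top-k.
struct composite_index_sketch {
  std::vector<toy_index const*> parts;

  std::vector<hit> search(float query, std::size_t k) const
  {
    std::vector<hit> merged;
    std::uint32_t offset = 0;
    for (toy_index const* part : parts) {
      for (hit const& h : part->search(query, k)) {
        merged.emplace_back(h.first, offset + h.second);  // local id -> global id
      }
      offset += static_cast<std::uint32_t>(part->points.size());
    }
    std::sort(merged.begin(), merged.end());
    if (merged.size() > k) { merged.resize(k); }
    return merged;
  }
};
```

The merge step is cheap because no graph is rebuilt, while each search pays for one query per sub-index plus a final top-k selection, which is the latency trade-off described for the logical strategy.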
That's a great idea! Sounds like
 */
auto merge(raft::resources const& res,
           const cuvs::neighbors::cagra::merge_params& params,
           std::vector<cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices)
I do agree with Artem that the vector of pointers is not the prettiest thing, but I don't think variadic templates are the way to fix that (and they make things very challenging to work with overall). I think we can stick with pointers for now and update the API later if needed. Initially, this will be needed for Lucene, which will use it through our Java API, so at least this public API is localized at the moment.
Pointers are fine from the perspective of the Java API. We work best with memory addresses, since we'll be mmap'ing the index data from files on disk.
vector<index> is fine from Java's perspective.
/merge