[RELEASE] cuvs v24.08 #274

Merged
merged 53 commits into from
Aug 8, 2024


raydouglass
Member

❄️ Code freeze for branch-24.08 and v24.08 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.08 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-24.08 into main for the release

raydouglass and others added 30 commits May 20, 2024 17:41
Forward-merge branch-24.06 into branch-24.08
Forward-merge branch-24.06 into branch-24.08
Forward-merge branch-24.06 into branch-24.08
Forward-merge branch-24.06 into branch-24.08
Contributes to rapidsai/build-planning#31.

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

URL: #145
Forward-merge branch-24.06 into branch-24.08
Forward-merge branch-24.06 into branch-24.08
This change allows serializing to a `std::ostream` and deserializing from a `std::istream`. It also fixes some minor docstring issues in the C++ serialization APIs.
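
As a rough illustration of the stream-based pattern (not the cuVS API itself), the sketch below round-trips a toy structure through any `std::ostream` / `std::istream`. `ToyIndex`, `serialize`, and `deserialize` are hypothetical names invented for this example, and the binary layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an index: a row-major matrix of floats.
struct ToyIndex {
  std::uint32_t rows = 0;
  std::uint32_t dim  = 0;
  std::vector<float> data;  // rows * dim values
};

// Write a small header followed by the raw payload to any output stream.
inline void serialize(std::ostream& os, const ToyIndex& index)
{
  assert(index.data.size() == static_cast<std::size_t>(index.rows) * index.dim);
  os.write(reinterpret_cast<const char*>(&index.rows), sizeof(index.rows));
  os.write(reinterpret_cast<const char*>(&index.dim), sizeof(index.dim));
  os.write(reinterpret_cast<const char*>(index.data.data()),
           static_cast<std::streamsize>(index.data.size() * sizeof(float)));
  if (!os) { throw std::runtime_error("serialize: write failed"); }
}

// Read the header, size the buffer from it, then read the payload.
inline ToyIndex deserialize(std::istream& is)
{
  ToyIndex index;
  is.read(reinterpret_cast<char*>(&index.rows), sizeof(index.rows));
  is.read(reinterpret_cast<char*>(&index.dim), sizeof(index.dim));
  if (!is) { throw std::runtime_error("deserialize: truncated header"); }
  index.data.resize(static_cast<std::size_t>(index.rows) * index.dim);
  is.read(reinterpret_cast<char*>(index.data.data()),
          static_cast<std::streamsize>(index.data.size() * sizeof(float)));
  if (!is) { throw std::runtime_error("deserialize: truncated payload"); }
  return index;
}
```

Because the functions only depend on the stream interfaces, the same code works with a `std::stringstream` in tests and a `std::fstream` on disk.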

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: #173
Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #175
This PR overhauls how `ops-codeowners` reviews are handled.

`ops-codeowners` is replaced by `ci-codeowners` &
`packaging-codeowners`. The coverage of files is expanded as well.

Additionally, the process will change: reviews will be assigned to a
member of the teams instead of a manual request to `ops-codeowners`.

---------

Co-authored-by: Bradley Dice <[email protected]>
This PR removes text builds of the documentation, which we do not currently use for anything. Contributes to rapidsai/build-planning#71.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Jake Awe (https://github.com/AyodeAwe)

URL: #180
Use raft's large workspace resource for large temporary allocations during ANN index build.
This is a port of rapidsai/raft#2194, which didn't make it into raft before the algorithms were ported to cuVS.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #181
…wup (#185)

Contributes to rapidsai/build-planning#31
Contributes to rapidsai/dependency-file-generator#89

Since #145 was merged, we've made some small adjustments to the approach for `rapids-build-backend`. This catches `cuvs` up with those changes:

* consolidates version-handling in `ci/` scripts
* uses `--file-key` instead of `--file_key` in `rapids-dependency-file-generator` calls

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #185
This renames the `build_index` method to `build` in the Python API for CAGRA. All of the other Python APIs use a `build` method for building the index, as do both the C++ and Rust APIs.

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #187
There is a bug in the current CAGRA graph rank-based neighbor reordering process. Low recall or an illegal memory access can occur if there are many detourable nodes from a node to its neighbors, e.g. when there is a small subgraph in the initial kNN graph. This PR fixes this.

Authors:
  - tsuki (https://github.com/enp1s0)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #192
The Python library is distributed as conda package `cuvs`, but the installation docs say `pycuvs`. This fixes that.

Looked for other uses like this:

```shell
git grep -i pycuvs
```

Didn't find any.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #193
Porting the ANN benchmarks from RAFT.
- [x] Make it build

Sanity check that benchmarks work (runs and gives reasonable recall for Deep-1M dataset)
- [x] cuVS brute force kNN 
- [x] cuVS IVF-Flat
- [x] cuVS IVF-PQ (+ refinement)
- [x] cuVS CAGRA
- [x] cuVS CAGRA-Q (+refinement)
- [x] Faiss GPU/CPU IVF-Flat & IVF-PQ
- [x] HNSW
- [x] CAGRA + HNSW 
- [x] GGNN

NB: the indices built using the old ANN_BENCH in raft tend to crash in cuvs search benchmarks during index deserialization - don't forget to build the indexes anew when testing.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Malte Förster (https://github.com/mfoerste4)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - James Lamb (https://github.com/jameslamb)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #130
Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: #203
Add an example project using the cuvs bindings uploaded to crates.io, as well as some basic instructions on how to compile it.

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #206
Some minor fixes that are required to publish our rust bindings to crates.io:

 * Using relative paths in the cuvs-sys CMake files didn't work; this is worked around by symlinking the required files instead
 * An actual version needs to be specified for the cuvs-sys and ndarray-rand packages in the rust/cuvs/Cargo.toml file

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Ray Douglass (https://github.com/raydouglass)

URL: #207
With the deployment of rapids-build-backend, we need to make sure our dependencies have alpha specs.

Contributes to rapidsai/build-planning#31

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #209
Contributes to rapidsai/build-planning#80

Adds constraints to avoid pulling in CMake 3.30.0, for the reasons described in that issue.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Bradley Dice (https://github.com/bdice)

URL: #214
Usage of the CUDA math libraries is independent of the CUDA runtime. Make their static/shared status separately controllable.

Contributes to rapidsai/build-planning#35

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Ben Frederickson (https://github.com/benfred)

URL: #216
This PR introduces the new vector addition feature to CAGRA.

Rel: rapidsai/raft#1775
Original PR: rapidsai/raft#2157

CAGRA-Q is not supported

## Usage
```cpp
// `res` is the raft::resources handle used for both calls
auto additional_dataset = raft::make_host_matrix<float, int64_t>(res, updated_dataset_size, dim);
cuvs::neighbors::cagra::extend(res, raft::make_const_mdspan(additional_dataset.view()), cagra_index);
```

## Algorithm

Graph degree: d

The algorithm consists of two stages: rank-based reordering and reverse edge addition.
1. Rank-based reordering
1-1. Obtain d' (=2d) nearest neighbor vectors (V) of a given new vector using the CAGRA search
1-2. Count the number of detourable edges using the result of step 1 and the neighbor list of the input index. Then we prune (3*d/2) edges in the same way as the CAGRA graph optimization. Through this operation, we decide d/2 neighbors.
2. Reverse edge addition
2-1. Count the number of incoming edges for all nodes.
2-2. Add d/2 reverse edges from the nodes added to the neighbor list in Step 1 by replacing a node with the new node. To prevent the connection to the replaced node from being lost, we add the replaced node to the neighbor list of the new node. This allows us to make a detour connection. The replaced node is the one with the largest number of incoming edges among the last d/2 nodes of the neighbor list, skipping nodes that are already in the neighbor list.
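
A minimal CPU sketch of the detourable-edge counting and pruning in step 1-2, for illustration only. It assumes each neighbor list is sorted by ascending distance and uses the rank rule "x→z→y is a detour for x→y when both hops have a smaller rank than x→y"; that rule is taken from the general description of CAGRA graph optimization and is an assumption here, not code from this PR. `Graph`, `count_detourable`, and `prune` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// graph[x] is the neighbor list of x, sorted by ascending distance (rank 0 is closest).
using Graph = std::vector<std::vector<int>>;

// For each neighbor y of x at rank ry, count nodes z that sit earlier in x's
// list (rank rz < ry) and that list y within their own first ry entries.
inline std::vector<int> count_detourable(const Graph& g, int x)
{
  const auto& nx = g[static_cast<std::size_t>(x)];
  std::vector<int> counts(nx.size(), 0);
  for (std::size_t ry = 0; ry < nx.size(); ++ry) {
    for (std::size_t rz = 0; rz < ry; ++rz) {
      const auto& nz   = g[static_cast<std::size_t>(nx[rz])];
      const auto limit = static_cast<std::ptrdiff_t>(std::min(ry, nz.size()));
      const auto last  = nz.begin() + limit;
      if (std::find(nz.begin(), last, nx[ry]) != last) { ++counts[ry]; }
    }
  }
  return counts;
}

// Keep the `keep` neighbors with the fewest detourable routes; ties are
// broken by the original rank (stable sort).
inline std::vector<int> prune(const Graph& g, int x, std::size_t keep)
{
  const auto counts = count_detourable(g, x);
  std::vector<std::size_t> order(counts.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return counts[a] < counts[b]; });
  order.resize(std::min(keep, order.size()));
  std::vector<int> kept;
  for (std::size_t r : order) { kept.push_back(g[static_cast<std::size_t>(x)][r]); }
  return kept;
}
```

In the extension setting described above, the list for the new vector would hold d' = 2d candidates and `keep` would be d/2.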

## Performance
In this experiment, we first split the dataset into two parts: the initial and the additional part. Then, we extend the CAGRA index built by the initial part to include the additional part.
![search-eval](https://github.com/rapidsai/raft/assets/12711693/0fbae9e5-defc-4263-9d34-176667fb3359)


The recall drop relative to the baseline grows as the number of added vectors increases.
Therefore, rebuilding the CAGRA index is recommended when adding a large number of vectors.

Authors:
  - tsuki (https://github.com/enp1s0)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #151
KyleFromNVIDIA and others added 17 commits July 19, 2024 17:23
After updating everything to CUDA 12.5.1, use `[email protected]` again.

Contributes to rapidsai/build-planning#73

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #234
Attempting to pin the version of raft to a custom fork wasn't working, and it was still using the version installed by conda. Fix by mirroring the `CUML_RAFT_CLONE_ON_PIN` logic found in the cuml cmake files.

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: #235
A small change that reduces the number of arguments in one of the wrapper layers in the detail namespace of CAGRA. The goal is twofold:
  1) Simplify the overly long signature of `select_and_run` (which has many instances)
  2) Give access to all search parameters for future upgrades of the search kernel

This is to simplify the integration (and review) of the persistent kernel (#215).
No performance or functional changes expected.
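
To illustrate the shape of this kind of refactoring (not the actual CAGRA code), the sketch below contrasts a wrapper that forwards each value individually with one that takes the whole parameter struct. `SearchParams`, its fields, and both functions are hypothetical, and the iteration logic is purely illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical search parameters; names and defaults are illustrative only.
struct SearchParams {
  std::size_t itopk_size     = 64;
  std::size_t search_width   = 1;
  std::size_t max_iterations = 0;  // 0 means "choose automatically" in this sketch
};

// Before: every wrapper layer lists (and forwards) each value separately.
inline std::size_t plan_iterations_long(std::size_t itopk_size,
                                        std::size_t search_width,
                                        std::size_t max_iterations)
{
  assert(search_width > 0);
  return max_iterations != 0 ? max_iterations : itopk_size / search_width;
}

// After: one argument carries everything, so adding a new field later does
// not change any intermediate signature.
inline std::size_t plan_iterations(const SearchParams& params)
{
  assert(params.search_width > 0);
  return params.max_iterations != 0 ? params.max_iterations
                                    : params.itopk_size / params.search_width;
}
```

Both functions return the same result; only the signature changes, which matches the "no performance or functional changes" expectation above.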

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #227
rapidsai/raft#2346 introduced a breaking change in the API. This PR fixes the API usage.

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #249
Contributes to rapidsai/build-planning#31

In short, RAPIDS DLFW builds want to produce wheels with unsuffixed dependencies, e.g. `cudf` depending on `rmm`, not `rmm-cu12`.

This PR is part of a series across all of RAPIDS to try to support that type of build by setting up CUDA-suffixed and CUDA-unsuffixed dependency lists in `dependencies.yaml`.

For more details, see:
* rapidsai/build-planning#31 (comment)
* rapidsai/cudf#16183

## Notes for Reviewers

### Why target 24.08?

This is targeting 24.08 because:

1. it should be very low-risk
2. getting these changes into 24.08 prevents the need to carry around patches for every library in DLFW builds using RAPIDS 24.08

Authors:
  - James Lamb (https://github.com/jameslamb)
  - Paul Taylor (https://github.com/trxcllnt)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #247
Port rapidsai/raft#2323 PR from RAFT
[Cleans up a collection of anti-patterns in the cuvs CMake code while also enabling building faiss from latest main]

Authors:
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Paul Taylor (https://github.com/trxcllnt)
  - Ray Douglass (https://github.com/raydouglass)

URL: #241
Add extra information to benchmark context for better reproducibility and performance analysis:

  1. Full command line used to call the executable (so you can copy-paste and run again).
  2. More CUDA device information: whether HMM, ATS, or host atomics are available (how GPU can efficiently communicate with CPU).
  3. Host information: min/max frequencies, used virtual processors and cores, available physical memory and swap (does the benchmark segfault due to not enough host memory? is SMT enabled? etc).

Addresses parts of #160

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #248
Utils / Helpers to enable FAISS migration to cuVS from RAFT.

Authors:
  - Tarang Jain (https://github.com/tarang-jain)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #213
iteraton -> iteration

Authors:
  - Ikko Eltociear Ashimine (https://github.com/eltociear)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #232
)

This PR allows us to guarantee the connectivity of the CAGRA search graph using approximate MST.

It has been empirically shown that the graph indexes generated by CAGRA for search provide comparable search accuracy to other libraries, but reachability from any node to all nodes is not guaranteed. In fact, it has been confirmed that the number of strongly connected components (SCC) of graph indexes created by CAGRA is not 1 in some 100M scale datasets.

This problem can be alleviated by increasing the number of degrees in the search graph, but this would increase the size of the graph index. It is desirable to address this problem without increasing the number of degrees of the search graph.

Prior study has shown that this can be solved by using a Minimum Spanning Tree (MST)-like approach, but in general, MST calculation takes a long time. However, what is needed here is not an exact MST, but, for example, an approximate MST in which the total number of edges is not necessarily minimum. Such an approximate MST could be computed quickly on GPUs.

This PR contains an implementation that creates an approximate MST on the GPU at high speed based on the above policy and uses it to guarantee the connectivity of the search graph.

This functionality is not always required, so it is considered an opt-in feature. A member variable named `guarantee_connectivity` is added to `index_params`, so set this variable to `true` if you wish to use this feature.

> cuvs::neighbors::cagra::index_params index_params;
> index_params.guarantee_connectivity = true;
> auto index = cuvs::neighbors::cagra::build(res, index_params, dataset_view);
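
For context, a small CPU sketch of how one could check the property in question on a directed search graph: the graph has exactly one strongly connected component if every node is reachable from node 0 both in the graph and in its reverse. This is only a verification aid under that standard argument, not the approximate-MST construction from this PR. `Graph`, `count_reachable`, `reversed`, and `is_strongly_connected` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// graph[u] lists the nodes that u has outgoing edges to.
using Graph = std::vector<std::vector<int>>;

// Breadth-first search: number of nodes reachable from `start` (including it).
inline std::size_t count_reachable(const Graph& g, int start)
{
  std::vector<bool> seen(g.size(), false);
  std::queue<int> frontier;
  seen[static_cast<std::size_t>(start)] = true;
  frontier.push(start);
  std::size_t visited = 0;
  while (!frontier.empty()) {
    const int u = frontier.front();
    frontier.pop();
    ++visited;
    for (int v : g[static_cast<std::size_t>(u)]) {
      if (!seen[static_cast<std::size_t>(v)]) {
        seen[static_cast<std::size_t>(v)] = true;
        frontier.push(v);
      }
    }
  }
  return visited;
}

// Build the graph with every edge direction flipped.
inline Graph reversed(const Graph& g)
{
  Graph r(g.size());
  for (std::size_t u = 0; u < g.size(); ++u) {
    for (int v : g[u]) { r[static_cast<std::size_t>(v)].push_back(static_cast<int>(u)); }
  }
  return r;
}

// One SCC <=> node 0 reaches everyone and everyone reaches node 0.
inline bool is_strongly_connected(const Graph& g)
{
  if (g.empty()) { return true; }
  return count_reachable(g, 0) == g.size() && count_reachable(reversed(g), 0) == g.size();
}
```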

Authors:
  - Akira Naruse (https://github.com/anaruse)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #237
Currently, in IVF index building (both IVF-Flat and IVF-PQ), a large dataset is usually in pageable host memory or an mmap-ed file. In both cases, after the cluster centers are trained, the entire dataset needs to be copied to the GPU twice: once for assigning vectors to clusters, and once for copying vectors to the corresponding clusters. Both copies are done using `batch_load_iterator` in a chunk-by-chunk fashion. Since the source buffer is in pageable memory, the current `batch_load_iterator` implementation doesn't support overlapping kernels with memory copies. This PR adds support for prefetching with `cudaMemcpyAsync` on pageable memory. We achieve kernel/copy overlap by launching the kernel first, followed by the prefetch of the next chunk.
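
The overlap idea can be sketched on the host with `std::async` standing in for the asynchronous copy. This is an analogy under stated assumptions, not the `batch_load_iterator` implementation: `load_chunk` plays the role of the host-to-device copy, `process_chunk` plays the role of the kernel, and all names are hypothetical. Because `process_chunk` blocks the calling thread (unlike a kernel launch, which returns immediately), the sketch starts the next load just before processing the current chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for copying one chunk of the dataset to the device.
inline std::vector<float> load_chunk(const std::vector<float>& dataset,
                                     std::size_t offset,
                                     std::size_t chunk_size)
{
  const std::size_t end = std::min(offset + chunk_size, dataset.size());
  return std::vector<float>(dataset.begin() + static_cast<std::ptrdiff_t>(offset),
                            dataset.begin() + static_cast<std::ptrdiff_t>(end));
}

// Stand-in for the kernel that consumes a chunk.
inline double process_chunk(const std::vector<float>& chunk)
{
  return std::accumulate(chunk.begin(), chunk.end(), 0.0);
}

// Process the dataset chunk by chunk while the next chunk loads in the background.
inline double process_with_prefetch(const std::vector<float>& dataset, std::size_t chunk_size)
{
  assert(chunk_size > 0);
  double total = 0.0;
  if (dataset.empty()) { return total; }

  std::future<std::vector<float>> next = std::async(
    std::launch::async, [&dataset, chunk_size] { return load_chunk(dataset, 0, chunk_size); });

  for (std::size_t offset = 0; offset < dataset.size(); offset += chunk_size) {
    std::vector<float> current    = next.get();  // wait for the prefetched chunk
    const std::size_t next_offset = offset + chunk_size;
    if (next_offset < dataset.size()) {
      // Start loading the following chunk before working on the current one.
      next = std::async(std::launch::async, [&dataset, next_offset, chunk_size] {
        return load_chunk(dataset, next_offset, chunk_size);
      });
    }
    total += process_chunk(current);
  }
  return total;
}
```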

We benchmarked the change on an L40S. The results show a 3%-21% speedup in index building. Search recall is not affected: it differs by about 1-2%, similar to run-to-run variance.

| algo | dataset | model | with prefetching (s) | without prefetching (s) | speedup |
| -- | -- | -- | -- | -- | -- |
| IVF-PQ | deep-100M | d64b5n50K | 97.3547 | 100.36 | 1.03 |
| IVF-PQ | wiki-all-10M | d64-nlist16K | 14.9763 | 18.1602 | 1.21 |
| IVF-Flat | deep-100M | nlist50K | 78.8188 | 81.4461 | 1.03 |

This PR is related to the issue submitted to RAFT: rapidsai/raft#2106

Authors:
  - Rui Lan (https://github.com/abc99lr)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #230
`libcuvs.so` contains fp16 kernels that are not accessible (missing headers and missing public entry points). This PR removes the unused kernels.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #268
Random sampling of training set for IVF methods was reverted in rapidsai/raft#2144 due to the large memory usage of the subsample method.

Since then, PR rapidsai/raft#2155 has implemented a new random sampling method with improved memory utilization. With that, we can now re-enable random sampling for IVF methods (rapidsai/raft#2052 and rapidsai/raft#2077).

Random subsampling has measurable overhead for IVF-Flat, therefore it is only enabled for IVF-PQ.
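
For readers unfamiliar with memory-frugal subsampling, here is one well-known technique, selection sampling (Knuth's Algorithm S), which picks row indices without replacement using memory proportional to the number of samples. It is shown only as an illustration and is not claimed to be the method implemented in rapidsai/raft#2155. `sample_rows` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Pick `n_samples` distinct row indices out of `n_rows`, in increasing order.
// Only the output vector is allocated, so memory is O(n_samples).
inline std::vector<std::size_t> sample_rows(std::size_t n_rows,
                                            std::size_t n_samples,
                                            std::uint32_t seed)
{
  n_samples = std::min(n_samples, n_rows);
  std::vector<std::size_t> picked;
  picked.reserve(n_samples);
  std::mt19937 rng(seed);
  for (std::size_t i = 0; i < n_rows && picked.size() < n_samples; ++i) {
    // Select row i with probability (samples still needed) / (rows remaining).
    std::uniform_int_distribution<std::size_t> dist(0, n_rows - i - 1);
    if (dist(rng) < n_samples - picked.size()) { picked.push_back(i); }
  }
  return picked;
}
```

The returned indices could then be used to gather the training subset for k-means, for example the rows used to train IVF-PQ cluster centers.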

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #122
@raydouglass raydouglass requested review from a team as code owners August 1, 2024 17:27

copy-pr-bot bot commented Aug 1, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@raydouglass raydouglass merged commit 0be69fd into main Aug 8, 2024
5 checks passed