[BUG] C++ testing failing on arm64: NEIGHBORS_ANN_NN_DESCENT_TEST #2450

Open
jameslamb opened this issue Sep 24, 2024 · 0 comments · May be fixed by rapidsai/cuvs#424
Labels
bug Something isn't working

Comments

@jameslamb
Member

jameslamb commented Sep 24, 2024

Describe the bug

Over at least the last day I've seen the NEIGHBORS_ANN_NN_DESCENT_TEST failing reproducibly and consistently in this CI job:

conda-cpp-tests / tests (arm64, 3.11, 12.0.1, ubuntu20.04, a100, latest, latest)

Like this:

/opt/conda/conda-bld/work/cpp/test/neighbors/ann_nn_descent/../ann_nn_descent.cuh:274: Failure
Value of: eval_neighbours(indices_naive, indices_NNDescent, distances_naive, distances_NNDescent, ps.n_rows, ps.graph_degree, 0.01, min_recall, true, static_cast<size_t>(ps.graph_degree * 0.1))
  Actual: false (Duplicated index 1780 at k 30 for query 194! )
Expected: true
[  FAILED  ] AnnNNDescentBatchTest/AnnNNDescentBatchTestF_U32.AnnNNDescentBatch/5, where GetParam() = dataset shape=4000x512, graph_degree=32, metric=0, host, clusters=3
 (477 ms)
...
[----------] Global test environment tear-down
[==========] 344 tests from 4 test suites ran. (153504 ms total)
[  PASSED  ] 343 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AnnNNDescentBatchTest/AnnNNDescentBatchTestF_U32.AnnNNDescentBatch/5, where GetParam() = dataset shape=4000x512, graph_degree=32, metric=0, host, clusters=3

 1 FAILED TEST
CMake Error at run_gpu_test.cmake:34 (execute_process):
  execute_process failed command indexes:

    1: "Child return code: 1"

96% tests passed, 1 tests failed out of 24

Total Test time (real) = 4868.12 sec

The following tests FAILED:
	 22 - NEIGHBORS_ANN_NN_DESCENT_TEST (Failed)

All C++ tests appear to pass in other conda-cpp-tests jobs (which are all x86_64).
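
The failure message above ("Duplicated index 1780 at k 30 for query 194") reads as one row of the NN Descent graph containing the same neighbor index more than once. For anyone trying to reproduce this outside of CI, here is a minimal sketch of that kind of check. It is not the `eval_neighbours` helper from the RAFT test utilities: `DuplicateHit` and `find_first_duplicate` are hypothetical names, and the sketch assumes the indices have already been copied to the host as a row-major `n_rows x degree` array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper for illustration only; not part of RAFT or cuVS.
struct DuplicateHit {
  bool found;           // true if some row repeats an index
  std::size_t query;    // row (query) in which the repeat was found
  std::size_t k;        // column of the repeated occurrence
  std::uint32_t index;  // the neighbor index that appears twice
};

// Scans a row-major n_rows x degree neighbor graph and reports the first
// position at which a row contains an index it has already listed.
inline DuplicateHit find_first_duplicate(const std::vector<std::uint32_t>& indices,
                                         std::size_t n_rows,
                                         std::size_t degree)
{
  assert(indices.size() == n_rows * degree);
  std::unordered_set<std::uint32_t> seen;
  for (std::size_t q = 0; q < n_rows; ++q) {
    seen.clear();  // duplicates only matter within a single row
    for (std::size_t k = 0; k < degree; ++k) {
      const std::uint32_t idx = indices[q * degree + k];
      if (!seen.insert(idx).second) { return DuplicateHit{true, q, k, idx}; }
    }
  }
  return DuplicateHit{false, 0, 0, 0};
}
```

Running something like this on the `indices_NNDescent` output for the failing parameters (4000x512, `graph_degree=32`) on an arm64 machine should show whether the duplicate sits at the same query and position on every run or moves around.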

At https://github.com/rapidsai/raft/actions/workflows/pr.yaml, it looks like the most recent fully-passing run of the pr workflow that included the conda-cpp-tests jobs was 19 hours ago (build link).

Manual re-runs have not resolved this, so I don't think it's a flaky test. I think something has changed, and CI will be blocked until it's fixed.

Steps/Code to reproduce bug

Builds where I've seen that fail:

Most recent successful run:

Expected behavior

N/A

Environment details (please complete the following information):

N/A

Additional context

We did very recently update the version of fmt / spdlog across RAPIDS (#2433), but I don't have any evidence that it's the root cause.

@jameslamb jameslamb added the bug Something isn't working label Sep 24, 2024
rapids-bot bot pushed a commit that referenced this issue Sep 26, 2024
Linked issue #2450

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2453
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this issue Nov 8, 2024
This PR is an amalgamation of the diff of 3 PRs in RAFT:

1. rapidsai/raft#2345
2. rapidsai/raft#2380
3. rapidsai/raft#2403

This PR also addresses parts 1 and 2 of #419, closes #391, and makes CAGRA use the compiled headers of NN Descent, which seems to have been a pending TODO: https://github.com/rapidsai/cuvs/blob/009bb8de03ce9708d4d797166187250f77a59a36/cpp/src/neighbors/detail/cagra/cagra_build.cuh#L36-L37

Also, batch tests are disabled in this PR due to issue rapidsai/raft#2450. PR #424 will attempt to re-enable them.

Authors:
  - Divye Gala (https://github.com/divyegala)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #421