Optimize euclidian distance in raft refine phase #2574

Closed
wants to merge 2 commits into from

anstellaire
Initial issue

The original code (below) compiled to a serial reduction that used strictly ordered fadda instructions (at least with GCC and Clang), which resulted in suboptimal performance.

for (size_t k = 0; k < dim; k++) {
  distance += DC::template eval<DistanceT>(query[k], row[k]);
}
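
For context on why the compilers keep this loop serial: floating-point addition is not associative, so without flags such as -ffast-math (or GCC's -fassociative-math) the additions into a single accumulator may not be reordered. A minimal illustration, using hypothetical helper names that are not part of RAFT:

```cpp
// Hypothetical helpers, for illustration only (not RAFT code).
// Both add the same three operands; only the grouping differs.
float sum_left_to_right(float x, float y, float z) { return (x + y) + z; }
float sum_right_to_left(float x, float y, float z) { return x + (y + z); }
```

With IEEE-754 single precision and the inputs 1e20f, -1e20f, 1.0f, the first grouping yields 1 while the second yields 0, because 1.0f is absorbed when added to -1e20f. A compiler that must preserve results bit for bit therefore cannot regroup the sum on its own.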

Proposed solution

This PR provides an optimized generic reduction that accumulates partial vector sums (below), which helps vectorization but gives up the strict summation order.

template <typename DC, typename DistanceT, typename DataT>
DistanceT euclidean_distance_squared_generic(DataT const* a, DataT const* b, size_t n) {
  // vector register capacity in elements
  size_t constexpr vreg_len = (128 / 8) / sizeof(DistanceT);
  // unroll factor = vector register capacity * number of ports;
  size_t constexpr unroll_factor = vreg_len * 4;

  // unroll factor is a power of two, so masking rounds n down to a multiple of it
  size_t n_rounded = n & ~(unroll_factor - 1);
  DistanceT distance[unroll_factor] = {0};

  for (size_t i = 0; i < n_rounded; i += unroll_factor) {
    for (size_t j = 0; j < unroll_factor; ++j) {
      distance[j] += DC::template eval<DistanceT>(a[i + j], b[i + j]);
    }
  }

  // tail: fewer than unroll_factor elements remain, so i - n_rounded stays in bounds
  for (size_t i = n_rounded; i < n; ++i) {
    distance[i - n_rounded] += DC::template eval<DistanceT>(a[i], b[i]);
  }

  for (size_t i = 1; i < unroll_factor; ++i) {
    distance[0] += distance[i];
  }

  return distance[0];
}
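
To try the partial-sum idea outside of RAFT, here is a small standalone sketch. `SquaredDiff` is a hypothetical stand-in for the `DC` functor (the real one is not shown in this PR description), `lanes` is an assumed accumulator count, and `distance_serial` mirrors the original loop so the two can be compared.

```cpp
#include <cstddef>

// Hypothetical stand-in for the DC functor: squared difference of two elements.
struct SquaredDiff {
  template <typename DistanceT, typename DataT>
  static DistanceT eval(DataT const& a, DataT const& b) {
    DistanceT d = static_cast<DistanceT>(a) - static_cast<DistanceT>(b);
    return d * d;
  }
};

// Reference: one accumulator, additions happen strictly in index order.
template <typename DC, typename DistanceT, typename DataT>
DistanceT distance_serial(DataT const* a, DataT const* b, std::size_t n) {
  DistanceT distance = 0;
  for (std::size_t k = 0; k < n; ++k) {
    distance += DC::template eval<DistanceT>(a[k], b[k]);
  }
  return distance;
}

// Sketch: `lanes` independent accumulators that the compiler can map to
// vector registers, followed by a scalar tail and a final reduction.
template <typename DC, typename DistanceT, typename DataT>
DistanceT distance_partial_sums(DataT const* a, DataT const* b, std::size_t n) {
  constexpr std::size_t lanes = 16;  // assumed value, must be a power of two
  static_assert((lanes & (lanes - 1)) == 0, "lanes must be a power of two");

  std::size_t const n_rounded = n & ~(lanes - 1);  // largest multiple of lanes <= n
  DistanceT acc[lanes] = {};

  for (std::size_t i = 0; i < n_rounded; i += lanes) {
    for (std::size_t j = 0; j < lanes; ++j) {
      acc[j] += DC::template eval<DistanceT>(a[i + j], b[i + j]);
    }
  }
  // Tail: fewer than `lanes` elements remain, so the index stays in bounds.
  for (std::size_t i = n_rounded; i < n; ++i) {
    acc[i - n_rounded] += DC::template eval<DistanceT>(a[i], b[i]);
  }

  DistanceT total = 0;
  for (std::size_t j = 0; j < lanes; ++j) {
    total += acc[j];
  }
  return total;
}
```

Because the additions are grouped differently, the two functions can differ in the last bits for general floating-point inputs; they agree exactly when every intermediate sum is exactly representable (for example, small integer-valued floats).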

It also adds a NEON implementation using intrinsics, which provided a further speedup on certain test cases (it can be removed if arch-specific code is undesired).
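
The NEON code itself is not shown in this description. As a rough idea of what such a kernel might look like for float inputs, here is a sketch under stated assumptions: the function name `l2_squared_f32` is hypothetical, the two-accumulator unrolling is an arbitrary choice, and the intrinsics path is only compiled on AArch64 (other targets fall through to the scalar loop).

```cpp
#include <cstddef>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper, not the PR's actual implementation.
// Returns the sum over i of (a[i] - b[i])^2 for float inputs.
float l2_squared_f32(float const* a, float const* b, std::size_t n) {
  std::size_t i = 0;
  float total = 0.0f;
#if defined(__aarch64__)
  // Two independent 4-lane accumulators to keep more FMAs in flight.
  float32x4_t acc0 = vdupq_n_f32(0.0f);
  float32x4_t acc1 = vdupq_n_f32(0.0f);
  for (; i + 8 <= n; i += 8) {
    float32x4_t d0 = vsubq_f32(vld1q_f32(a + i), vld1q_f32(b + i));
    float32x4_t d1 = vsubq_f32(vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
    acc0 = vfmaq_f32(acc0, d0, d0);  // acc0 += d0 * d0
    acc1 = vfmaq_f32(acc1, d1, d1);  // acc1 += d1 * d1
  }
  // Horizontal add across the combined lanes.
  total = vaddvq_f32(vaddq_f32(acc0, acc1));
#endif
  // Scalar tail (or the whole input on non-AArch64 targets).
  for (; i < n; ++i) {
    float d = a[i] - b[i];
    total += d * d;
  }
  return total;
}
```

Keeping the intrinsics behind a preprocessor guard with a scalar fallback is one way to address the concern about arch-specific code: the generic path stays the default and the NEON path can be dropped without touching callers.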

Results

image

anstellaire requested a review from a team as a code owner February 6, 2025 12:17
copy-pr-bot bot commented Feb 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot added the cpp label Feb 6, 2025
anstellaire changed the base branch from branch-25.02 to branch-25.04 February 6, 2025 12:21
anstellaire marked this pull request as draft February 6, 2025 15:32
wphicks added the enhancement (New feature or request) and non-breaking (Non-breaking change) labels Feb 7, 2025
@cjnolet
Member

cjnolet commented Feb 12, 2025

@anstellaire thank you for opening a PR to improve the refinement APIs, but these have all been deprecated in RAFT, and any further updates should be made in cuVS. Are you able to move your changes to the cuVS library instead?

@anstellaire
Author

@cjnolet sure, will do.

@anstellaire
Author

The PR was opened in the cuVS repository instead: rapidsai/cuvs#689.
