Native Rotation Intrinsic Optimization #37

Open
NCGThompson wants to merge 2 commits into main

Conversation

NCGThompson
Contributor

I removed some redundancies in rol3. On generic x86_64, the rotate-left short benchmark saw a 44% time reduction, to 4.25ns, and rotate-left long saw a 34% improvement, to 4.6ns. On the same computer with target-cpu=skylake, both benchmark times were approximately halved, to 3.45ns. Hopefully native rol3 is called more than 2 * 10^13 times in the future so that my time spent was worthwhile.
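
For context, the usual way to rotate a wide integer is to rotate whole limbs first and then shift the remaining bits across limb boundaries. The sketch below illustrates that decomposition for a 256-bit value stored as four little-endian 64-bit limbs; the function name, limb count, and layout here are assumptions for illustration, not the actual rol3 signature:

```rust
/// Rotate a 256-bit value left by `n` bits (hypothetical sketch).
/// Limbs are little-endian: `x[0]` is the least significant.
fn rotate_left_256(x: [u64; 4], n: u32) -> [u64; 4] {
    let n = n % 256;
    let limb_shift = (n / 64) as usize; // whole limbs to rotate by
    let bit_shift = n % 64;             // remaining bits to rotate by

    // Step 1: rotate whole limbs.
    let mut r = [0u64; 4];
    for (i, limb) in x.iter().enumerate() {
        r[(i + limb_shift) % 4] = *limb;
    }

    // Step 2: shift bits across limb boundaries, wrapping the top bits around.
    if bit_shift == 0 {
        return r;
    }
    let mut out = [0u64; 4];
    for i in 0..4 {
        out[i] = (r[i] << bit_shift) | (r[(i + 3) % 4] >> (64 - bit_shift));
    }
    out
}
```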

I marked it draft because two things need to happen:

  • ror3 needs to be optimized as well. For now, I set it to simply negate b and call rol3 (see the sketch after this list). This is still an improvement over the status quo, but on generic x86_64, rotate right gets considerably worse benchmark scores than rotate left. My only hypothesis is that the negation tips it over the edge of an inlining threshold. I'll give it its own mirror of rol3.
  • The new function needs to be benchmarked on other architectures. This is particularly important because I explicitly wrote the new version with 64-bit operations in mind, so it might be a pessimization on 32-bit architectures. We may need to use 32-bit integers instead, or perhaps go in the opposite direction to 128-bit. There is also some endianness-specific code. It is pretty simple, but it should be checked for correctness to be safe. I only have access to x86_64 and 64-bit ARM, so I need someone else to test this for me.
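
Here is a minimal sketch of the negate-and-delegate approach from the first bullet, reusing the illustrative rotate_left_256 above (again, the names and 256-bit limb layout are assumptions, not the crate's actual code):

```rust
/// Rotate right by delegating to rotate-left with the negated shift amount
/// (illustrative only; the real ror3 may look different).
fn rotate_right_256(x: [u64; 4], n: u32) -> [u64; 4] {
    // For a 256-bit value, ror(x, n) == rol(x, 256 - n) == rol(x, n.wrapping_neg() % 256),
    // which holds because 256 divides 2^32.
    rotate_left_256(x, n.wrapping_neg() % 256)
}
```

The extra negation is exactly the kind of small added cost that could push the function past an inlining threshold.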

@NCGThompson
Contributor Author

On an only somewhat related note: it's worth considering getting rid of the LLVM intrinsics for shl and shr entirely. The native implementations use branching rather than cmov. That said, even after the $CARGO_ENCODED_RUSTFLAGS patch, I'm still getting worse benchmark results from the LLVM intrinsics than from the native versions for many other functions as well, so I don't want to jump to conclusions.
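
To make the branching-versus-cmov point concrete, here is one way a double-limb shift can be written so the case selection compiles to a conditional move rather than a branch. This is only a sketch of the pattern on a hypothetical two-limb 128-bit value, not the crate's actual shl:

```rust
/// Branch-free 128-bit shift-left over two 64-bit limbs (cmov-friendly sketch).
fn shl_128(lo: u64, hi: u64, n: u32) -> (u64, u64) {
    debug_assert!(n < 128);
    let s = n & 63;
    // Case n < 64: shift both limbs, carrying the top bits of `lo` into `hi`.
    // `lo >> 1 >> (63 - s)` sidesteps the invalid `lo >> 64` when s == 0.
    let lo_small = lo << s;
    let hi_small = (hi << s) | (lo >> 1 >> (63 - s));
    // Case n >= 64: the low limb empties into the high limb (s == n - 64 here).
    let hi_big = lo << s;
    // Boolean-to-mask select, which LLVM can lower without a branch.
    let mask = ((n >= 64) as u64).wrapping_neg(); // 0 or all ones
    (lo_small & !mask, (hi_small & !mask) | (hi_big & mask))
}
```

Whether this actually beats a well-predicted branch would need benchmarking, of course.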

@NCGThompson
Contributor Author

My last commit made it slower in some circumstances but faster in others. However, I think it is overall a bit better than the 64-bit rotation version.

@NCGThompson NCGThompson marked this pull request as ready for review December 10, 2023 01:33