Full Compile-Time Processor Feature Detection #160
Comments
I think it would be better to try to do this without feature flags. We can detect at compile time whether, for example, the avx2 target feature is enabled. |
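As a minimal sketch of what compile-time detection can look like: the `cfg!(target_feature = "...")` macro evaluates to a constant `true` or `false` depending on the features the compiler was told to assume, for example via `RUSTFLAGS="-C target-feature=+avx2"` or `-C target-cpu=native`. The helper name `compiled_simd_level` is hypothetical and is not part of memchr.

```rust
/// Reports the widest x86 vector extension that was enabled at compile time.
/// Hypothetical helper for illustration only; not part of the memchr crate.
pub const fn compiled_simd_level() -> &'static str {
    // Each `cfg!` expands to a boolean literal, so the untaken branches
    // are trivially removed by the compiler.
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "sse2") {
        "sse2"
    } else {
        "scalar"
    }
}
```

Because the result is known at build time, no feature flag in Cargo.toml is needed for this kind of check.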
I wrote this when I was tired (still am). |
I haven't examined the code here yet, but it seems just setting rustflags to |
https://godbolt.org/z/h7av5rx3n

    extern crate memchr;
    use memchr::memchr;

    #[no_mangle]
    fn memchr_1(data: &[u8]) -> usize {
        memchr(0, &data).unwrap()
    }

I believe these let fun = FN.load(Ordering::Relaxed); The commit you linked does add compile-time checks, but it still stores a function pointer. |
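For readers unfamiliar with the pattern being discussed, here is a simplified sketch of run-time dispatch through a function pointer held in a static. It is not memchr's actual code: `find`, `find_scalar`, `find_fast`, `pick` and `detect` are hypothetical names, and `find_fast` is only a stand-in for a vectorized routine.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

type FindFn = fn(u8, &[u8]) -> Option<usize>;

// Portable fallback.
fn find_scalar(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

// Stand-in for a vectorized routine (assumption: a real one would use SIMD).
fn find_fast(needle: u8, haystack: &[u8]) -> Option<usize> {
    find_scalar(needle, haystack)
}

// Chooses an implementation based on what the running CPU supports.
fn pick() -> FindFn {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            return find_fast;
        }
    }
    find_scalar
}

// The static starts out pointing at `detect`, which replaces itself on first use.
static FN: AtomicPtr<()> = AtomicPtr::new(detect as *mut ());

fn detect(needle: u8, haystack: &[u8]) -> Option<usize> {
    let chosen = pick();
    FN.store(chosen as *mut (), Ordering::Relaxed);
    chosen(needle, haystack)
}

pub fn find(needle: u8, haystack: &[u8]) -> Option<usize> {
    // Every call pays for this load plus an indirect call.
    let raw = FN.load(Ordering::Relaxed);
    // SAFETY: only values cast from `FindFn` are ever stored in `FN`.
    let f: FindFn = unsafe { std::mem::transmute::<*mut (), FindFn>(raw) };
    f(needle, haystack)
}
```

The load and the indirect call in `find` remain even when the target features are fixed at build time, and an indirect call cannot be inlined at the call site.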
I see, seems the compiler doesn't automatically inline the call, sad |
so I have an issue that ties into this one, and solving this issue can help solve that. The |
Would need to dig a bit more into it, but it sounds interesting at the very least. |
Also @BurntSushi I love your work and just started sponsoring you through my company so I hope you read this :) |
I really do think this is an entirely separate issue. I addressed it as a Discussion question here: #164
Thank you! :-) |
@Sewer56 Since it seems like you are trying to micro-optimize here, I think there are two important questions to answer:
|
Ok that's just what i was looking for, thank you, and OP can also just do this directly as well |
I'm porting an open source archive format I wrote during work hours to Rust in my own time, while giving it a small upgrade: https://sewer56.dev/sewer56-archives-nx/Specification/Table-Of-Contents/#string-pool Reading names from the pool usually involves reading strings, usually average 40-ish characters, before hitting null terminator. I did recently write a benchmark around this; and while it is true that in my use case the actual memchr operation takes very little time compared to decompression of the pool itself; I nonetheless strive to make the code as good as possible. In my specific case, I have inlining to gain from this; as it would be the only use case of How much the improvement actually amounts to, I'm not concerned. I'd guess something in the range of 0.1-0.5% and a hundred or two bytes of assembly code less. To me any improvement is an improvement.
I could use those directly, but it's more about improving the whole ecosystem. Why just use a workaround for myself when I could instead eventually contribute and make an improvement for everyone? It'd also avoid a death-by-1000-cuts situation. Code depends on other code; if you make the lowest level code more efficient, you make the system as a whole more efficient. That means there's a chance for others' stuff to run faster too, and everyone, myself included, gets to benefit in tandem. |
So, I can see two improvements here:
|
it looks like |
ok just did it in the code, all tests are passing, now time to figure out how to run the benchmarks |
(Bear in mind we're still talking about a separate issue here) My idea when I originally opened the issue was making I may be able to pick this up not long from now, but the main question really is how to do it in a way that doesn't make it harder to read. Essentially you want a block like this: memchr/src/arch/x86_64/memchr.rs Lines 114 to 137 in a26dd5b
At the beginning of For example, if someone is building for a non-AVX2 processor, and they do have memchr/src/arch/x86_64/avx2/memchr.rs Lines 86 to 109 in a26dd5b
even if someone is explicitly building for a target without As an extra note. I'm also wondering if there's a way to just make it better out of the box if the user sets a specific target while running And if you can do that, you can then call whichever implementation you need directly; skipping the whole load/store mechanism. |
Bear in mind the last paragraph in previous post is an extra question/discussion point of sorts that just came to mind, as it's semi-relevant. The actual issue I opened is that there isn't an automatic way to call the required implementation directly; even if you build for a specific arch by using This matters if you, for example want to use an iterator with I think an extra block of code for |
I am overall very skeptical of doing improvements just for improvement's sake, especially if they cost something. For this in particular, I'm willing to suffer a little implementation complexity internally to make this work. So if y'all can come up with something where if the relevant target features are enabled at compile time and inlining is possible, and it isn't too crazy, I might be open to that. But based on what you've said, I don't see a good justification for any new APIs. Otherwise, I think you're on the right track in terms of how to do this internally and automatically.
But that's why those lower level APIs are there! They are there precisely for cases like this where the higher level implementation makes choices that are undesirable for more specific or niche use cases. The bottom line is that inlining a huge AVX2 substring search routine (it's a lot of code) is usually not the right thing to do and usually not going to improve overall performance. Even you admit that, for your use case, it won't really help move the needle on overall perf in any meaningful way. That's what I care about.
It's not. Or at least, I've seen it go either way enough times that I don't think you can really make the claim that dynamic dispatch is or isn't generally slower than using an enum. In this specific case, I did absolutely use an enum at first. That's the obvious choice. But I found that using dynamic dispatch led to tighter codegen and overall improved performance. Since I've re-litigated this experiment in I'm always happy to have the choice re-litigated and would switch given evidence that a simple enum is better. But it's not like I just didn't try it. It's the obvious choice here because it doesn't require any |
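To make the two options in this exchange concrete, here is a rough sketch of enum-based dispatch next to function-pointer dispatch. All names (`Kind`, `EnumSearcher`, `PtrSearcher`, `find_scalar`, `find_chunked`) are hypothetical, and the sketch makes no claim about which variant is faster; as noted above, that has to be measured.

```rust
type FindFn = fn(u8, &[u8]) -> Option<usize>;

fn find_scalar(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

// Second implementation so there is something to choose between:
// scans the haystack in 8-byte chunks.
fn find_chunked(needle: u8, haystack: &[u8]) -> Option<usize> {
    let mut offset = 0;
    for chunk in haystack.chunks(8) {
        if let Some(i) = chunk.iter().position(|&b| b == needle) {
            return Some(offset + i);
        }
        offset += chunk.len();
    }
    None
}

#[derive(Clone, Copy)]
enum Kind {
    Scalar,
    Chunked,
}

// Enum dispatch: a branch on the discriminant, then a direct call.
struct EnumSearcher {
    kind: Kind,
}

impl EnumSearcher {
    fn find(&self, needle: u8, haystack: &[u8]) -> Option<usize> {
        match self.kind {
            Kind::Scalar => find_scalar(needle, haystack),
            Kind::Chunked => find_chunked(needle, haystack),
        }
    }
}

// Function-pointer dispatch: no branch, but an indirect call.
struct PtrSearcher {
    imp: FindFn,
}

impl PtrSearcher {
    fn find(&self, needle: u8, haystack: &[u8]) -> Option<usize> {
        (self.imp)(needle, haystack)
    }
}
```

Both searchers return identical results; the difference is only in the generated code at the dispatch point.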
I am irrelevant. I don't matter.
No API requests here were made. I'm pretty sure I can make this work with a small change to
You are correct, however even if the code is not inlined, it would still be an improvement.
Because no load is required from a static. So we can use Whether the code will actually be inlined can be left to the compiler. The compiler will decide based on metrics such as the number of calls to the final target. In most big programs it may not be, if it's used in a lot of places, but if an application or library only uses |
Yeah sorry, I guess I was responding to this comment. I know I directed it at you, but that comment wasn't by you, so apologies.
Again, this isn't really how I operate. I try hard not to do things for the sake of doing them. I would much rather have concrete, real-world and specific motivations for improvements or changes.
I generally don't base improvements on instruction count reductions. I think we have different philosophies here and I think it's probably not productive to keep going back-and-forth on that. It seems like we are both aligned on being open to a small internal change that automatically avoids the indirect function call when it can. |
cargo.toml
Above command, with patch and
Massive, microsecond improvement; especially since memchr here is around half the load. Example change: I put it as a PR for easy reading. I want to show in that PR what I've had in mind. For the bench, I took the first 4000 file paths from I then used the |
That's interesting. Like @Sewer56 mentioned, function pointer derefs cause pipeline stalls, and well so does branching with a It's possible neither the function pointer nor the enum + Let me know about how to run |
Trust me. I'm aware of all of that. :-) And yes, I even tried using likely intrinsics too. I am always happy to have folks relitigate perf though. Benchmarks are done with rebar. Not |
Well you have clearly done this a lot longer than I, haha. And thanks now I can actually run benchmarks |
Currently memchr does run-time feature detection for x86, storing 'preferred' functions inside a field that is later referenced in subsequent calls.
This manifests itself as a mov and call in x86, and one or two other misc instructions along the way.
In order to avoid unnecessary code bloat and improve performance, it should be possible to bypass this mechanism if we know what is available at build time.
This would allow for inlining, and slightly lower call overhead.
That is, of course, when the Rust compiler is invoked with the correct flags to ship a binary targeting a specific processor.
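A rough sketch of the idea, under stated assumptions: when the build already guarantees AVX2, the public entry point calls the implementation directly and the run-time lookup is skipped. `find`, `find_dynamic`, `find_portable` and `find_wide` are hypothetical names, `find_wide` is only a stand-in for a real AVX2 routine, and `OnceLock` is used here for brevity rather than being what memchr does.

```rust
use std::sync::OnceLock;

type FindFn = fn(u8, &[u8]) -> Option<usize>;

fn find_portable(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

// Stand-in for an AVX2 routine (assumption: a real one would use intrinsics).
fn find_wide(needle: u8, haystack: &[u8]) -> Option<usize> {
    find_portable(needle, haystack)
}

// Run-time path: detect once, cache the function pointer, call through it.
fn find_dynamic(needle: u8, haystack: &[u8]) -> Option<usize> {
    static CHOSEN: OnceLock<FindFn> = OnceLock::new();
    let f = *CHOSEN.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if std::is_x86_feature_detected!("avx2") {
                return find_wide as FindFn;
            }
        }
        find_portable as FindFn
    });
    f(needle, haystack)
}

#[inline]
pub fn find(needle: u8, haystack: &[u8]) -> Option<usize> {
    if cfg!(target_feature = "avx2") {
        // Known at build time: a direct call the compiler is free to inline.
        find_wide(needle, haystack)
    } else {
        // Unknown at build time: fall back to the cached function pointer.
        find_dynamic(needle, haystack)
    }
}
```

Since `cfg!` is a compile-time constant, a build with `-C target-feature=+avx2` keeps only the direct branch, and whether the callee actually gets inlined is left to the compiler.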
Note
This isn't a formal feature request, it's more so a reminder for myself.
I want to reference this issue in my own codebase, and possibly PR this sometime in the future.