Make loops comparable in go and c #150
base: main
Conversation
The clang compiler performs an optimization called Scalar Replacement of Aggregates on the C code, which essentially means that it rewrites the C code to use a local variable (which is stored in a register) just for iterating through the inner loop with 100,000 sums of `j%u`, and then assigns it back to the array of int32 values. The Go compiler doesn't currently perform the same optimization. But it's trivial to rewrite the code manually to perform the same optimization. So it's not exactly the same code, but using only the features of the Go programming language, this code now performs as well as the equivalent C code on my Mac.
On the amd64 architecture, this really matters. For some reason.
I'm a bit torn on whether or not I should accept these types of changes. A similar thing was discussed here, but for Dart: https://mrale.ph/blog/2024/11/27/microbenchmarks-are-experiments.html On the one hand, I get that there are small tweaks we can make in order to coerce the compiler into performing better optimization. This is true not just in Go, but in many languages. On the other hand, it's a bit more fair to write the code "the same" and leave it up to the compiler to figure it out. More sophisticated / mature compilers will have a higher likelihood of producing better optimizations, and that is part of the big picture of choosing a language. Most people are not tweaking their code to coerce a compiler; they just write code to perform a task and let their compiler optimize it as best it can. Thoughts?
I think you should make two (or more) versions for each language for each test, and let people contribute by adding optimized code. This would point out compiler inefficiencies and would also be a good resource for learning how to optimize your language's code.
Go's compiler weakness needs to be fixed in the Go compiler itself, which, as was correctly noted, applies to other languages too :)
The main goal of the Go compiler is to prioritize compile speed, which is why it does not implement more complex optimization techniques. This PR serves as a good example for Go developers, showing how they can improve performance in key parts of the code. While the Go compiler may not always generate the most optimized code, that doesn't mean the Go runtime can't run as fast as the C version.
The PR description is not valid. SROA has no relation to the optimization being done. SROA is, for example, promoting an int[4] stack buffer to 4 individual registers. In this case, per its memory model, GCC and LLVM perform LICM - Loop-Invariant Code Motion. It's a standard optimization done by many compilers - OpenJDK HotSpot and .NET also do it (albeit worse). Moreover, the gradual improvements in Go's compiler clearly demonstrate the intention to address its shortcomings, and the incorrectness of the argument being made, which is a common excuse heard from the Go community. At the end of the day, the underwhelming quality of the Go compiler's output is what users will get when they write code "normally". The intention behind microbenchmarks like these is to help developers better understand the performance characteristics of programming languages and make better-informed decisions (like using Go less).
Sorry, I'm not a compiler expert, and I have removed the reference to SROA in my previous comment. This repository is focused on language runtime performance, not compiler performance. If we're discussing compiler performance, then maintaining a consistent code style is important. Do you think C or Java compilers always perform optimally without requiring developers to make adjustments?
This is far from the truth, at least for people who care about performance. There are endless books on this very topic for just about every mainstream language, as well as on algorithms and programming patterns in general. Benchmarks based on syntactical rather than semantic equivalence are unfair. For example, the Scalar Replacement issue aside, the C version also lacks the bounds checks that memory-safe languages like Go and C# perform for each lookup.
The compilers of Java, .NET and Rust are able to elide the bounds checks here inside the loop. I forget whether Go managed to do so - it had other issues with the way it compiled the loop, placing the return block inside of it, resulting in weird jump threading instead of a small loop body. The cost of bounds checks is only paid when the compiler cannot statically prove that the array/vec/slice/span/list accesses are in bounds, and if that fails it is possible to provide a fast path via loop cloning for certain scenarios.
I don't know about Graal (which seems to be the cream of the crop when it comes to JVMs), but HotSpot only does bounds check elimination when code is JIT-compiled, which doesn't mean always. For Rust, it doesn't seem to eliminate the bounds checks for the code benchmarked, as you can see here. Again, the specifics don't matter so much as the rule: benchmarks based on syntactical rather than semantic equivalence are unfair. It's about as correct as trying to translate human languages without understanding figures of speech and metaphors, like translating "Hals- und Beinbruch" into "breaking neck and leg", which means something entirely different from "break a leg!" as it goes in English.
Bound checks are being elided for Rust: https://godbolt.org/z/cj4GqzenK |
What is this then?

```
LBB3_53:
        adrp    x2, l___unnamed_12@PAGE
        add     x2, x2, l___unnamed_12@PAGEOFF
        mov     w1, #10000
        bl      __ZN4core9panicking18panic_bounds_check17h4fa7ca60df78ab98E
```
I think this one should be accepted. Cheating is when you solve the problem using FFI/C extensions. This is not the case; it's all native Golang code that simply accounts for the Golang compiler's specifics. At the same time, yes, it's not great that Golang makes us think about these little quirks, and it would have been much better to have these two versions of the code perform equally fast.
@omeid In the disasm, the following block corresponds to the nested loops "core" of the benchmark:

```
LBB3_42:
        mov     w12, #0
        mov     w11, #0
        ldr     w13, [x21, x8, lsl #2]
LBB3_43:
        add     w14, w12, #1
        udiv    w15, w12, w19
        msub    w15, w15, w19, w12
        udiv    w16, w14, w19
        msub    w14, w16, w19, w14
        add     w13, w15, w13
        add     w11, w14, w11
        add     w12, w12, #2
        cmp     w12, w9
        b.ne    LBB3_43         ;; <-- continue with 100k inner loop
        add     w11, w11, w13
        add     w11, w11, w0
        str     w11, [x21, x8, lsl #2]
        add     x11, x8, #1
        mov     x8, x11
        cmp     x11, x10
        b.ne    LBB3_42         ;; <-- continue with 10k outer loop
```

The bounds check failure block is for accessing the input args at the start of the program, since the length of args cannot be statically proven.

@roma-glushko In that case other languages should be allowed to do so as well - it will not be comparable to the work that other entries have to do, and then it becomes a question of which language's submission is optimized to make the compiler happy the most. Such optimizations are really fragile, done in the real world only as a last resort, and they tend to break when the compiler is updated. I could rent an M3 Mac Mini and spend a couple of hours in Apple Instruments to precisely order the code to make sure that the C# submission performs optimally on the specific hardware used by the author. I have done this with a BenchmarksGame submission in the past and it was a questionable investment of effort, because .NET 9 changed the way the compiler handles loops; I'm not going to do that again. I think it's best to let the results speak for themselves, and if you see a performance issue, the language authors will appreciate it if you submit an issue against the official repository instead, so that the general user can benefit.
@neon-sunset Well, we should distinguish between finding how to "inline" code in a smart way to make it work as fast as possible in a given setup, and a well-known specific of how the language works. The first one is definitely a waste of time and gaming the benchmark, I agree, but I would not say so for the latter, because this knowledge would apply to any other Golang application that does similarly frequent access to slice indices. If any other language has similar well-known specifics like Golang does here, they should be applied too. Otherwise, what are we trying to benchmark here? General performance of languages without using any special optimizations like PGO/asm/FFI extensions? Or maybe performance of the loop that looks as close as possible to language X? Or maybe how many optimizations the compiler does out of the box if you know nothing about the language? Or maybe something else?
No disagreement here; the Go team should fix this at some point (but given that everyone has known about this issue for years, that may not happen for whatever reason 😌 ).
In that case we should let people see that the Go compiler indeed needs more work and allow them to make informed decisions (to use better options). If the language you chose requires you to constantly babysit the compiler, perhaps it's better to have the numbers reflect that? In either case, and it's something I wrote in at least 3 other places, this microbenchmark only measures the specific way the compilers emit integer division on ARM64 inside the loop and whether they can LICM the spill to memory (which is likely hidden by the M3 anyway); it does not translate to x86_64 with the same compilers either. I only spend time here because it's good at prompting people to see that a lot of widespread assumptions about performance are incorrect. (The Go one is particularly egregious, and it's long past overdue to stop lying about its performance profile; this is why you see backlash on social media and an increasing number of people who are unreasonably negative towards the language - it's the natural outcome once you start looking into the details.)
@neon-sunset Fair, but that only happens if you opt into

```
LBB3_53:
        adrp    x2, l___unnamed_12@PAGE
        add     x2, x2, l___unnamed_12@PAGEOFF
        mov     w1, #10000
        bl      __ZN4core9panicking18panic_bounds_check17h4fa7ca60df78ab98E
```

......

So this means that as a developer, you need to understand this aspect of the compiler toolchain, in the same way that you might need to understand Go's limitations. But again, it just reinforces my point: syntactical equivalence is not a correct standard for harmonising semantics, which means fair benchmarks should be based on semantic equivalence, not on syntactical equivalence plus an arbitrary expectation of understanding the compiler. For example, in the case of Rust, you consider understanding the need to pass specific optimisation levels and understanding their trade-offs reasonable, but for Go, you consider understanding that the compiler won't do Scalar Replacement to be out of the ordinary.
@omeid Not sure why you keep insisting on the bounds checks argument. Even with a lower optimization level it still elides bounds checks. For comparison, here's the .NET output with AOT (in this case the difference from the final JIT output is going to be small): https://godbolt.org/z/1sGoMsTKd The goal of this microbenchmark is not to take the shortest path but to evaluate the cost of standard language constructs.
Fair, my mistake. I don't like the godbolt interface; it seems my copy-paste didn't work and I was still looking at opt-level=0.