Improve Go loops #44
Conversation
I have to say, I think it's pretty irresponsible of @bddicken to have published this repo on Twitter almost 3 days ago and to continue exaggerating the differences between Go and Rust or C by not pulling in this PR and updating the findings.
Very nice, thanks for contributing @gmr458.
I'm sorry. I have to eat some crow here.
@eljobe interesting, could you share more details?
So, the Go code slows down on my Mac M3 (arm64), but it improves things quite a bit on my Linux VPS at Hetzner (amd64). In summary, it actually does help a lot on the amd64 architecture, but hurts a little on arm64. One of the things I find quite puzzling is why the assembly the Go compiler generates is not essentially the same as what gcc outputs for the C code, or what the Rust and Zig compilers output. It seems like there is clearly some optimization that the Go compiler is missing for these tight loops.
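For reference, the Go hot loop under discussion, in the form it presumably has before any manual hint, looks roughly like the sketch below. Details such as argument parsing, how `u` and `r` are obtained, and the final print are assumptions and may differ from the actual code.go in this repository:

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"strconv"
)

func main() {
	// u comes from the command line (the thread runs the binary with "40").
	input, err := strconv.Atoi(os.Args[1])
	if err != nil {
		panic(err)
	}
	u := int32(input)
	r := rand.Int31n(10000) // a random value added to each element

	var a [10000]int32
	for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
		for j := int32(0); j < 100000; j++ { // 100k inner loop iterations
			a[i] = a[i] + j%u // read-modify-write of a[i] on every iteration
		}
		a[i] += r
	}
	fmt.Println(a[rand.Int31n(10000)])
}
```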
Here's a sort of fascinating follow-on. Compilers are so interesting. I thought a bit more about the optimization that the C, Rust, and Zig compilers must be doing that the Go compiler isn't doing. (BTW, I have almost zero experience with how compilers or compiler optimizations work; I'm just guessing.) It seems like the compiler is somehow analyzing the inner loop, e.g.:

```c
for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
    a[i] = a[i] + j%u; // Simple sum
}
```

and somehow simplifying it, knowing that there is a closed-form solution for the amount added to `a[i]`. So, I had my theory about what the compiler must have been optimizing away, and I decided to try to give a strong hint to the Go compiler:

```go
for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
	k := int32(0)
	for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
		k = k + j%u // Simple sum
	}
	a[i] = a[i] + k // Add the sum to the array element
	a[i] += r       // Add a random value to each element in array
}
```

And, voilà!

```
❯ go build -ldflags "-s -w" -o code code.go
❯ for x in 1 2 3; do /usr/bin/time ./code 40; done
1952997
        0.55 real         0.54 user         0.00 sys
1955906
        0.55 real         0.55 user         0.00 sys
1955778
        0.54 real         0.54 user         0.00 sys
```

Those times are pretty much identical to what I get for Rust and C on the same hardware (output from ../run.sh).

I feel like I should maybe open a feature request against the Go compiler.
On my laptop, your changes make the program 1 second faster (4 seconds total), the same as Java. You should open a PR; this change improves performance on both architectures.
But don't you think it's sort of "cheating", because it's no longer similar enough to the equivalent code written in C or Rust?
@eljobe Your guess is right. I decompiled the C version of the binary: the Clang compiler optimizes the inner loop using local variables and updates the array elements only when the inner loop completes. I also decompiled the Go version of the binary; it does not have this optimization and updates the array elements on each iteration of the inner loop.
I feel like this is relevant: golang/go#49785. One thing that puzzles me: it seems like gccgo doesn't work on Darwin any more.
@d0ngw, I'm pretty new to this sort of low-level analysis of compiler output. Can you share what exactly you did to "decompile" the C binary and the Go binary? Or maybe point me to some resources on how to do it myself?
I went ahead and opened PR #150 anyway. It really isn't the fault of the language that the compiler doesn't have this optimization ... yet.
@eljobe You can try this tool, Hopper (https://www.hopperapp.com/) 😄
It may be worth mentioning that Go's memory model requires providing much stronger guarantees than the C memory model. It basically treats every variable as a weak atomic, and this is something users passively expect in the very large portion of Go code that is concurrent. If the memory location being incremented is visible to other goroutines (which requires escape analysis for slices), then any other goroutine reading from it must only ever read values that have actually been written to that memory location. So the C compiler approach of "anything goes if a single-threaded program is left unchanged" is explicitly rejected: multithreaded programs must not be subjected to any new data races introduced by the optimization.
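A minimal sketch of the scenario described above, using a hypothetical standalone program rather than code from this repository: the slice element being incremented is visible to a second goroutine, so its reads are constrained by Go's memory model.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	a := make([]int32, 1)
	u := int32(40)

	// Writer: a benchmark-style loop updating a[0] in place.
	go func() {
		for j := int32(0); j < 100_000_000; j++ {
			a[0] = a[0] + j%u
		}
	}()

	// Racy reader: this program intentionally has a data race (run with
	// -race to see it). The memory-model point from the comment above is
	// that each read here may be stale, but it must return a value that
	// was actually stored to a[0] at some point, never an invented one.
	for k := 0; k < 5; k++ {
		fmt.Println(a[0])
		time.Sleep(time.Millisecond)
	}
}
```

Under the rewrite that keeps the running sum in a local variable and stores it once, the intermediate values would never reach memory, which changes what such a reader can observe.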
@d0ngw That Hopper tool looks nice, thanks for sharing.
@bddicken You're welcome! 😄
I'd love it if someone who cares about Go would port it to the new benchmark runner:
I changed the type from `int` to `int32`; on my laptop the time improved from 15 seconds to 5 seconds when running the binary with input 40. This is a bit fairer compared to other languages like Java or Kotlin, whose implementations in this repository use 32-bit integer types.
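To make the difference concrete, a quick standalone check (not part of the benchmark) shows why `int` and `int32` are not the same size on typical 64-bit machines, which is what makes `int32` the fairer match for the Java and Kotlin versions:

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	// On 64-bit platforms Go's int is 8 bytes; int32 is always 4 bytes,
	// matching Java's int / Kotlin's Int used by the other implementations.
	fmt.Println("sizeof(int)   =", unsafe.Sizeof(int(0)))   // typically 8
	fmt.Println("sizeof(int32) =", unsafe.Sizeof(int32(0))) // always 4
}
```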
I also added the flag `-ldflags "-s -w"` to the `go build` command in `compile.sh` to remove the symbol table and debug info; this reduces the size of the binary.