
Improve Go loops #44

Merged
merged 2 commits into bddicken:main on Nov 29, 2024

Conversation

gmr458
Contributor

@gmr458 gmr458 commented Nov 26, 2024

I changed the type from int to int32; on my laptop, the time improved from 15 seconds to 5 seconds when running the binary with input 40.

This is a bit fairer compared to other languages like Java or Kotlin, whose implementations in this repository use 32-bit integer types.

I also added the flag -ldflags "-s -w" to the go build command in compile.sh to strip the symbol table and debug info, which reduces the size of the binary.
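
For reference, the hot loop after this change looks roughly like this (a sketch based on the loop shape in code.go; a, u, and r stand for the array, the input-derived divisor, and the random value set up earlier):

	// Sketch of the loop after the int -> int32 change. Assumed context:
	// a is a slice of int32, u and r are int32 values derived from the input.
	for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
		for j := int32(0); j < 100000; j++ { // 100k inner loop iterations
			a[i] = a[i] + j%u // simple sum, now done in 32-bit arithmetic
		}
		a[i] += r // add a random value to each element
	}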

@eljobe

eljobe commented Nov 28, 2024

I have to say, I think it's pretty irresponsible of @bddicken to have published this repo on Twitter almost 3 days ago and to continue exaggerating the differences between Go and Rust or C by not pulling in this PR and updating the findings.

@bddicken bddicken merged commit 411e2b1 into bddicken:main Nov 29, 2024
@bddicken
Owner

Very nice, thanks for contributing @gmr458.

@eljobe

eljobe commented Nov 29, 2024

I'm sorry. I have to eat some crow here.
This was, evidently, a bad change. On both Linux and macOS, it actually slows down the performance of the code.
I'd recommend reverting this PR.

@gmr458
Contributor Author

gmr458 commented Nov 29, 2024

@eljobe interesting, could you share more details?

@eljobe

eljobe commented Nov 29, 2024

So, the Go code slows down on my Mac M3 (arm64).
I read a little bit about it, and it seems like arm64 is really optimized for dealing with 64-bit values, and forcing it to treat the values as int32 incurs some overhead.

❯ uname -a
Darwin pepchain.local 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:00:32 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6030 arm64
❯ git log -n 1
commit d0ae86bc824d088d15c5a1ba605dc14eb0c48c03 (HEAD -> before)
Merge: 43fbc11 c2ae46d
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 18:34:08 2024 -0700

    Merge pull request #46 from botantony/lua

    Add Lua
❯ go build -o code code.go
❯ for x in 1 2 3; do /usr/bin/time ./code 40; done
1953347
        1.35 real         1.09 user         0.00 sys
1957761
        1.10 real         1.09 user         0.00 sys
1958342
        1.10 real         1.09 user         0.00 sys
❯ git co main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
❯ git log -n 1
commit 474b0fda3945b0899e7f78babb0d145cd8ab198e (HEAD -> main, origin/main, origin/HEAD)
Merge: 359833e 74d0b2b
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 20:44:35 2024 -0700

    Merge pull request #93 from jpablo/dev

    Add Scala
❯ go build -ldflags "-s -w" -o code code.go
❯ for x in 1 2 3; do /usr/bin/time ./code 40; done
1956691
        2.13 real         1.89 user         0.01 sys
1954808
        1.89 real         1.88 user         0.00 sys
1958831
        1.89 real         1.88 user         0.00 sys

But, it does improve things quite a bit on my Linux VPS at Hetzner (amd64)

❯ uname -a
Linux pfeffer.technodabble.com 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
❯ git log -n 1
commit d0ae86bc824d088d15c5a1ba605dc14eb0c48c03 (HEAD -> before)
Merge: 43fbc11 c2ae46d
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 18:34:08 2024 -0700

    Merge pull request #46 from botantony/lua

    Add Lua
❯ go build -o code code.go
❯ for x in 1 2 3; do time ./code 40; done
1952785
./code 40  7.82s user 0.00s system 100% cpu 7.815 total
1953816
./code 40  7.81s user 0.00s system 100% cpu 7.797 total
1959204
./code 40  7.80s user 0.00s system 100% cpu 7.795 total
❯ git co main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
❯ git log -n 1
commit 474b0fda3945b0899e7f78babb0d145cd8ab198e (HEAD -> main, origin/main, origin/HEAD)
Merge: 359833e 74d0b2b
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 20:44:35 2024 -0700

    Merge pull request #93 from jpablo/dev

    Add Scala
❯ go build -ldflags "-s -w" -o code code.go
❯ for x in 1 2 3; do time ./code 40; done
1951154
./code 40  2.66s user 0.00s system 100% cpu 2.659 total
1952079
./code 40  2.65s user 0.00s system 100% cpu 2.645 total
1956297
./code 40  2.65s user 0.00s system 100% cpu 2.643 total

So, in summary, it actually does help a lot on the AMD64 architecture, but hurts a little on the ARM64 architecture. One of the things I find quite puzzling is why the assembly the Go compiler generates is not essentially the same as what gcc outputs for the C code, or what the Rust and Zig compilers output. It seems like there is clearly some optimization that the Go compiler is missing for these tight loops.
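
(For anyone who wants to poke at this: I believe the Go toolchain itself can show what it emits for these loops, for example by dumping the compiler's assembly listing or disassembling the built binary, something like the following.)

❯ go build -gcflags=-S -o code code.go
❯ go tool objdump -s main.main code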

@eljobe

eljobe commented Nov 30, 2024

Here's a sort of fascinating follow-on. Compilers are so interesting.

I thought a bit more about the optimization that the C, Rust, and Zig compilers must be doing that the Go compiler isn't doing. (BTW, I have almost zero experience with how compilers or compiler optimizations work; I'm just guessing.)

It seems like the compiler is somehow analyzing the inner loop e.g.:

    for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j%u;               // Simple sum
    }

And somehow simplifying it, knowing that there is a closed-form solution: the loop increments a[i] by some value k that is a pure function of u (given that j always iterates from 0 to 99999). And since u doesn't change for a particular run of the program, the compiler could either run the loop once and cache k for use the next time around, or replace it with some assembly instructions that calculate k in constant time instead of having to perform the iteration.
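
(Concretely, the closed form I have in mind would be something like the following; this is just my own illustration, not anything taken from the repository or from compiler output.)

	// Hypothetical closed form for k: the sum of j%u for j in [0, n) is
	// q full 0..u-1 cycles plus one partial cycle of length m.
	// Assumes u > 0; for n = 100000 and small u this fits in int32.
	func closedFormK(n, u int32) int32 {
		q, m := n/u, n%u
		return q*(u*(u-1)/2) + m*(m-1)/2
	}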

That was my theory about what the compiler must be optimizing away. So I decided to try to give a strong hint to the Go compiler by allocating a local variable k, accumulating the j%u values into k, and then incrementing a[i] by k after the inner loop finished. My guess was that (maybe) the Go compiler didn't think it was safe to collapse the loop because it couldn't be sure whether something might be concurrently modifying the a array, since it was defined outside the scope in which the loop runs, and that if I gave it a more locally defined variable to update, it might be more obvious that short-circuiting the loop was safe.

	for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
		k := int32(0)
		for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
			k = k + j%u // Simple sum
		}
		a[i] = a[i] + k // Add the sum to the array element
		a[i] += r       // Add a random value to each element in array
	}

And, voilà!

❯ go build -ldflags "-s -w" -o code code.go
❯ for x in 1 2 3; do /usr/bin/time ./code 40; done
1952997
        0.55 real         0.54 user         0.00 sys
1955906
        0.55 real         0.55 user         0.00 sys
1955778
        0.54 real         0.54 user         0.00 sys

Those times are pretty much identical to what I get for Rust and C on the same hardware (output from ../run.sh):

C = 0.56s
C = 0.55s
C = 0.55s

Go = 0.56s
Go = 0.55s
Go = 0.55s

Rust = 0.54s
Rust = 0.56s
Rust = 0.55s

Zig = 0.80s
Zig = 0.56s
Zig = 0.55s

I feel like I should maybe open a feature request against the Go compiler.

@gmr458
Contributor Author

gmr458 commented Nov 30, 2024

On my laptop, your changes make the program 1 second faster (4 seconds total), the same as Java. You should open a PR; this change improves performance on both architectures.

@eljobe

eljobe commented Nov 30, 2024

But don't you think that it's sort of "cheating", because it's no longer similar enough to the equivalent code written in C or Rust?

@gmr458
Contributor Author

gmr458 commented Nov 30, 2024

@eljobe Yes, you are right.

@bddicken Hello, we think you should revert this PR: it improves performance on AMD64, but on Apple Silicon the performance is worse.

@d0ngw

d0ngw commented Nov 30, 2024

@eljobe Your guess is right.

I decompiled the C version of the binary and it looks like this:

    do {
        r11 = 0x0;
        r10 = 0x0;
        r12 = *(int32_t *)(r22 + r8 * 0x4);
        do {
            r12 = (r11 - r11 / r19 * r19) + r12;
            r10 = (r11 + 0x1 - (r11 + 0x1) / r19 * r19) + r10;
            r11 = r11 + 0x2;
        } while (r11 != 0x186a0);
        *(int32_t *)(r22 + r8 * 0x4) = r20 + r10 + r12;
        r8 = r8 + 0x1;
    } while (r8 != 0x2710);

The Clang compiler optimizes the inner loop using local variables and updates the array elements when the inner loop completes.
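
In Go terms, that decompiled loop is roughly the following (a reconstruction for illustration, not code from the repository; r19 is u, r20 looks like the random value r, and the inner loop has also been unrolled by two, with j%u rewritten as j - j/u*u):

	// Rough Go equivalent of the decompiled Clang output above.
	for i := int32(0); i < 10000; i++ { // 0x2710 outer iterations
		acc0 := a[i]     // r12: seeded with the current array element
		acc1 := int32(0) // r10: second accumulator for the unrolled lane
		for j := int32(0); j != 100000; j += 2 { // 0x186a0 inner iterations
			acc0 += j % u
			acc1 += (j + 1) % u
		}
		a[i] = r + acc0 + acc1 // one store per outer iteration
	}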

I also decompiled the Go version of the binary, which does not have this optimization and updates the array elements on each iteration of the inner loop.

@eljobe

eljobe commented Nov 30, 2024

I feel like this is relevant: golang/go#49785

So, one thing that puzzles me: it seems like gccgo doesn't work on Darwin any more.
Is the only hope for addressing these performance differences, then, actually convincing the Go team to bring Clang's optimizations to the go/cmd/compile (a.k.a. gc) compiler?

@eljobe

eljobe commented Nov 30, 2024

@d0ngw, I'm pretty new to this sort of low-level analysis of the outputs of compilers. Can you share with me what exactly you did to "decompile" the C binary and Go binary? Or maybe point me to some resources on how to do it myself?

@eljobe

eljobe commented Nov 30, 2024

I went ahead and opened PR #150 anyway.

It really isn't the fault of the language that the compiler doesn't have this optimization ... yet.
It can be trivially achieved with a little bit of better-structured code, and the performance benefit is there.

@d0ngw

d0ngw commented Dec 1, 2024

@eljobe You can try this tool Hopper (https://www.hopperapp.com/) 😄

@saolof

saolof commented Dec 1, 2024

It may be worth mentioning that Go's memory model requires providing much stronger guarantees than the C memory model. It basically treats every variable as a weak atomic, and this is something that users passively expect in the very large portion of Go code that is concurrent.

If the memory location being incremented is visible to other goroutines (which requires escape analysis for slices), then any other goroutine reading from it must only ever read values that have actually been written to that memory location. So the C compiler approach of "anything goes as long as a single-threaded program is left unchanged" is explicitly rejected: multithreaded programs must not be subjected to any new data races introduced by the optimization.
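
(A sketch of what that constraint means in practice; this watcher goroutine is hypothetical and not part of the benchmark:)

	// If some goroutine happens to observe a[0] while the hot loop runs,
	// every value it reads must be one the program actually stored there;
	// the compiler may not materialize invented or speculative values,
	// which limits how freely it can rearrange the stores to a[i].
	go func() {
		for {
			_ = a[0] // may only observe values actually written to a[0]
		}
	}()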

@bddicken
Owner

bddicken commented Dec 1, 2024

@d0ngw That hopper tool looks nice, thanks for sharing.

@d0ngw

d0ngw commented Dec 2, 2024

@bddicken You're welcome! 😄

@PEZ
Collaborator

PEZ commented Feb 8, 2025

I'd love it if someone who cares about Go would port it to the new benchmark runner:
