
Improve Go loops #44

Merged
merged 2 commits into bddicken:main on Nov 29, 2024

Conversation

gmr458
Contributor

@gmr458 gmr458 commented Nov 26, 2024

I changed the type from int to int32; on my laptop, the time improved from 15 seconds to 5 seconds when running the binary with input 40.

This is a bit fairer compared to other languages like Java or Kotlin, whose implementations in this repository use 32-bit integer types.

I also added the flag -ldflags "-s -w" to the go build command in compile.sh to strip the symbol table and debug info, which reduces the size of the binary.
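
For reference, the hot loop after this change looks roughly like this (a sketch based on the loop shape in code.go; a, u, and r stand for the array, the input-derived divisor, and the random value set up earlier):

	// Sketch of the loop after the int -> int32 change. Assumed context:
	// a is a slice of int32, u and r are int32 values derived from the input.
	for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
		for j := int32(0); j < 100000; j++ { // 100k inner loop iterations
			a[i] = a[i] + j%u // simple sum, now done in 32-bit arithmetic
		}
		a[i] += r // add a random value to each element
	}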

@eljobe

eljobe commented Nov 28, 2024

I have to say, I think it's pretty irresponsible of @bddicken to have published this repo on Twitter almost 3 days ago and to continue exaggerating the differences between Go and Rust or C by not pulling in this PR and updating the findings.

@bddicken bddicken merged commit 411e2b1 into bddicken:main Nov 29, 2024
@bddicken
Owner

Very nice, thanks for contributing @gmr458.

@eljobe

eljobe commented Nov 29, 2024

I'm sorry. I have to eat some crow here.
This was, evidently, a bad change. On both Linux and macOS, it actually slows down the performance of the code.
I'd recommend reverting this PR.

@gmr458
Contributor Author

gmr458 commented Nov 29, 2024

@eljobe interesting, could you share more details?

@eljobe

eljobe commented Nov 29, 2024

So, the Go code slows down on my Mac M3 (arm64).
I read a little bit about it, and it seems like arm64 is really optimized for dealing with 64-bit values, and forcing it to treat the values as int32 incurs some overhead.

❯ uname -a
Darwin pepchain.local 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:00:32 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6030 arm64
❯ git log -n 1
commit d0ae86bc824d088d15c5a1ba605dc14eb0c48c03 (HEAD -> before)
Merge: 43fbc11 c2ae46d
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 18:34:08 2024 -0700

    Merge pull request #46 from botantony/lua

    Add Lua
❯ go build -o code code.go
❯ for x in 1 2 3; do /usr/bin/time ./code 40; done
1953347
        1.35 real         1.09 user         0.00 sys
1957761
        1.10 real         1.09 user         0.00 sys
1958342
        1.10 real         1.09 user         0.00 sys
❯ git co main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
❯ git log -n 1
commit 474b0fda3945b0899e7f78babb0d145cd8ab198e (HEAD -> main, origin/main, origin/HEAD)
Merge: 359833e 74d0b2b
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 20:44:35 2024 -0700

    Merge pull request #93 from jpablo/dev

    Add Scala
❯ go build -ldflags "-s -w" -o code code.go
❯ for x in 1 2 3; do /usr/bin/time ./code 40; done
1956691
        2.13 real         1.89 user         0.01 sys
1954808
        1.89 real         1.88 user         0.00 sys
1958831
        1.89 real         1.88 user         0.00 sys

But, it does improve things quite a bit on my Linux VPS at Hetzner (amd64)

❯ uname -a
Linux pfeffer.technodabble.com 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
❯ git log -n 1
commit d0ae86bc824d088d15c5a1ba605dc14eb0c48c03 (HEAD -> before)
Merge: 43fbc11 c2ae46d
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 18:34:08 2024 -0700

    Merge pull request #46 from botantony/lua

    Add Lua
❯ go build -o code code.go
❯ for x in 1 2 3; do time ./code 40; done
1952785
./code 40  7.82s user 0.00s system 100% cpu 7.815 total
1953816
./code 40  7.81s user 0.00s system 100% cpu 7.797 total
1959204
./code 40  7.80s user 0.00s system 100% cpu 7.795 total
❯ git co main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
❯ git log -n 1
commit 474b0fda3945b0899e7f78babb0d145cd8ab198e (HEAD -> main, origin/main, origin/HEAD)
Merge: 359833e 74d0b2b
Author: Benjamin <[email protected]>
Date:   Thu Nov 28 20:44:35 2024 -0700

    Merge pull request #93 from jpablo/dev

    Add Scala
❯ go build -ldflags "-s -w" -o code code.go
❯ for x in 1 2 3; do time ./code 40; done
1951154
./code 40  2.66s user 0.00s system 100% cpu 2.659 total
1952079
./code 40  2.65s user 0.00s system 100% cpu 2.645 total
1956297
./code 40  2.65s user 0.00s system 100% cpu 2.643 total

So, in summary, it actually does help a lot on the AMD64 architecture, but hurts a little on the ARM64 architecture. One of the things I find quite puzzling is why the assembly the Go compiler generates is not essentially the same as what gcc outputs for the C code, or what the Rust and Zig compilers output. It seems like there is clearly some optimization that the Go compiler is missing for these tight loops.
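
(For anyone who wants to poke at this: I believe the Go toolchain itself can show what it emits for these loops, for example by dumping the compiler's assembly listing or disassembling the built binary, something like the following.)

❯ go build -gcflags=-S -o code code.go
❯ go tool objdump -s main.main code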

@eljobe

eljobe commented Nov 30, 2024

Here's a sort of fascinating follow-on. Compilers are so interesting.

I thought a bit more about the optimization that the C, Rust, and Zig compilers must be doing that the Go compiler isn't doing. (BTW, I have almost zero experience with how compilers or compiler optimizations work; I'm just guessing.)

It seems like the compiler is somehow analyzing the inner loop e.g.:

    for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j%u;               // Simple sum
    }

And somehow simplifying it, knowing that there is a closed-form solution: the loop increments a[i] by some value k that is a pure function of u (given that j always iterates from 0 to 99999). And since u doesn't change for a particular run of the program, the compiler could either run the loop once and cache k for use the next time around, or replace it with some assembly instructions that calculate k in constant time instead of having to perform the iteration.
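
(Concretely, the closed form I have in mind would be something like the following; this is just my own illustration, not anything taken from the repository or from compiler output.)

	// Hypothetical closed form for k: the sum of j%u for j in [0, n) is
	// q full 0..u-1 cycles plus one partial cycle of length m.
	// Assumes u > 0; for n = 100000 and small u this fits in int32.
	func closedFormK(n, u int32) int32 {
		q, m := n/u, n%u
		return q*(u*(u-1)/2) + m*(m-1)/2
	}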

That was my theory about what the compiler must be optimizing away. So I decided to try to give a strong hint to the Go compiler by allocating a local variable k, accumulating the j%u values into k, and then incrementing a[i] by k after the inner loop finished. My guess was that (maybe) the Go compiler didn't think it was safe to collapse the loop because it couldn't be sure whether something might be concurrently modifying the a array, since it was defined outside the scope in which the loop runs, and that if I gave it a more locally defined variable to update, it might be more obvious that short-circuiting the loop was safe.

	for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
		k := int32(0)
		for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
			k = k + j%u // Simple sum
		}
		a[i] = a[i] + k // Add the sum to the array element
		a[i] += r       // Add a random value to each element in array
	}

And, voilà!

❯ go build -ldflags "-s -w" -o code code.go
❯ for x in 1 2 3; do /usr/bin/time ./code 40; done
1952997
        0.55 real         0.54 user         0.00 sys
1955906
        0.55 real         0.55 user         0.00 sys
1955778
        0.54 real         0.54 user         0.00 sys

Those times are pretty much identical to what I get for Rust and C on the same hardware (output from ../run.sh):

C = 0.56s
C = 0.55s
C = 0.55s

Go = 0.56s
Go = 0.55s
Go = 0.55s

Rust = 0.54s
Rust = 0.56s
Rust = 0.55s

Zig = 0.80s
Zig = 0.56s
Zig = 0.55s

I feel like I should maybe open a feature request against the Go compiler.

@gmr458
Contributor Author

gmr458 commented Nov 30, 2024

On my laptop, your changes make the program 1 second faster (4 seconds total), the same as Java. You should open a PR; this change improves performance on both architectures.

@eljobe

eljobe commented Nov 30, 2024

But don't you think that it's sort of "cheating", because it's no longer similar enough to the equivalent code written in C or Rust?

@gmr458
Contributor Author

gmr458 commented Nov 30, 2024

@eljobe Yes, you are right.

@bddicken Hello, we think you should revert this PR: it improves performance on AMD64, but on Apple Silicon the performance is worse.

@d0ngw

d0ngw commented Nov 30, 2024

@eljobe Your guess is right.

I decompiled the C version of the binary and it looks like this:

    do {
        r11 = 0x0;
        r10 = 0x0;
        r12 = *(int32_t *)(r22 + r8 * 0x4);
        do {
            r12 = (r11 - r11 / r19 * r19) + r12;
            r10 = (r11 + 0x1 - (r11 + 0x1) / r19 * r19) + r10;
            r11 = r11 + 0x2;
        } while (r11 != 0x186a0);
        *(int32_t *)(r22 + r8 * 0x4) = r20 + r10 + r12;
        r8 = r8 + 0x1;
    } while (r8 != 0x2710);

The Clang compiler optimizes the inner loop using local variables and updates the array elements when the inner loop completes.
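
In Go terms, that decompiled loop is roughly the following (a reconstruction for illustration, not code from the repository; r19 is u, r20 looks like the random value r, and the inner loop has also been unrolled by two, with j%u rewritten as j - j/u*u):

	// Rough Go equivalent of the decompiled Clang output above.
	for i := int32(0); i < 10000; i++ { // 0x2710 outer iterations
		acc0 := a[i]     // r12: seeded with the current array element
		acc1 := int32(0) // r10: second accumulator for the unrolled lane
		for j := int32(0); j != 100000; j += 2 { // 0x186a0 inner iterations
			acc0 += j % u
			acc1 += (j + 1) % u
		}
		a[i] = r + acc0 + acc1 // one store per outer iteration
	}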

I also decompiled the Go version of the binary, which does not have this optimization and updates the array elements on each iteration of the inner loop.

@eljobe

eljobe commented Nov 30, 2024

I feel like this is relevant: golang/go#49785

So, one thing that puzzles me: it seems like gccgo doesn't work on Darwin any more.
Is the only hope for addressing these performance differences, then, actually convincing the Go team to bring Clang's optimizations to the go/cmd/compile (a.k.a. gc) compiler?

@eljobe

eljobe commented Nov 30, 2024

@d0ngw, I'm pretty new to this sort of low-level analysis of the outputs of compilers. Can you share with me what exactly you did to "decompile" the C binary and Go binary? Or maybe point me to some resources on how to do it myself?

@eljobe

eljobe commented Nov 30, 2024

I went ahead and opened PR #150 anyway.

It really isn't the fault of the language that the compiler doesn't have this optimization ... yet.
It can be trivially achieved with a little bit of better-structured code, and the performance benefit is there.

@d0ngw

d0ngw commented Dec 1, 2024

@eljobe You can try this tool Hopper (https://www.hopperapp.com/) 😄

@saolof

saolof commented Dec 1, 2024

It may be worth mentioning that Go's memory model requires providing much stronger guarantees than the C memory model. It basically treats every variable as a weak atomic, and this is something that users passively expect in the very large portion of Go code that is concurrent.

If the memory location being incremented is visible to other goroutines (which requires escape analysis for slices), then any other goroutine reading from it must only ever read values that have actually been written to that memory location. So the C compiler approach of "anything goes as long as a single-threaded program is left unchanged" is explicitly rejected: multithreaded programs must not be subjected to any new data races introduced by the optimization.
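
(A sketch of what that constraint means in practice; this watcher goroutine is hypothetical and not part of the benchmark:)

	// If some goroutine happens to observe a[0] while the hot loop runs,
	// every value it reads must be one the program actually stored there;
	// the compiler may not materialize invented or speculative values,
	// which limits how freely it can rearrange the stores to a[i].
	go func() {
		for {
			_ = a[0] // may only observe values actually written to a[0]
		}
	}()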

@bddicken
Owner

bddicken commented Dec 1, 2024

@d0ngw That hopper tool looks nice, thanks for sharing.

@d0ngw

d0ngw commented Dec 2, 2024

@bddicken You're welcome! 😄

@PEZ
Collaborator

PEZ commented Feb 8, 2025

I'd love it if someone who cares about Go would port it to the new benchmark runner:
