Make loops comparable in go and c #150
base: main
Conversation
The clang compiler performs an optimization called Scalar Replacement of Aggregates on the C code, which essentially means that it rewrites the C code to use a local variable (which is stored in a register) just for iterating through the inner loop with 100,000 sums of `j%u`, and then assigns it back to the array of int32 values. The Go compiler doesn't currently perform the same optimization. But it's trivial to rewrite the code manually to perform the same optimization. So it's not exactly the same code, but using only the features of the Go programming language, this code now performs as well as the equivalent C code on my Mac.
On the amd64 architecture, this really matters. For some reason.
I'm a bit torn on whether or not I should accept these types of changes. A similar thing was discussed here, but for Dart: https://mrale.ph/blog/2024/11/27/microbenchmarks-are-experiments.html On the one hand, I get that there are small tweaks we can make in order to coerce the compiler into performing better optimization. This is true not just in Go, but in many languages. On the other hand, it's a bit more fair to write the code "the same" and leave it up to the compiler to figure it out. More sophisticated / mature compilers will have a higher likelihood of producing better optimizations, and that is part of the big picture of choosing a language. Most people are not tweaking their code to coerce a compiler; they just write code to perform a task and let their compiler optimize it as best it can. Thoughts?
I think you should make two (or more) versions for each language for each test, and let people contribute by adding optimized code. This would point out compiler inefficiencies and would also be a good resource for learning how to optimize your language's code.
Go's compiler weakness needs to be fixed in the Go compiler itself, which, as was correctly noted, applies to other languages too :)
The main goal of the Go compiler is to prioritize compile speed, which is why it does not implement more complex optimization techniques. This PR serves as a good example for Go developers, showing how they can improve performance in key parts of the code. While the Go compiler may not always generate the most optimized code, that doesn't mean the Go runtime can't run as fast as the C version.
The PR description is not valid. SROA has no relation to the optimization being done. SROA is, for example, promoting an int[4] stack buffer to 4 individual registers. In this case, per its memory model, GCC and LLVM perform LICM - Loop-Invariant Code Motion. It's a standard optimization done by many compilers - OpenJDK HotSpot and .NET also do it (albeit worse). Moreover, the gradual improvements in Go's compiler clearly demonstrate the intention to address its shortcomings, and the incorrectness of the argument being made, which is a common excuse heard from the Go community. At the end of the day, the underwhelming quality of the Go compiler's output is what users will get when they write code "normally". The intention behind microbenchmarks like these is to help developers better understand the performance characteristics of programming languages and make better-informed decisions (like using Go less).
Sorry, I'm not a compiler expert, and I have removed the reference to SROA in my previous comment. This repository is focused on language runtime performance, not compiler performance. If we're discussing compiler performance, then maintaining a consistent code style is important. Do you think C or Java compilers always perform optimally without requiring developers to make adjustments?
This is far from the truth, at least for people who care about performance. There are endless books on this very topic for just about every mainstream language, as well as on algorithms and programming patterns in general. Benchmarks based on syntactical rather than semantic equivalence are unfair. For example, the Scalar Replacement issue aside, the C version also lacks the bounds checks that memory-safe languages like Go and C# perform for each lookup.
The compilers of Java, .NET and Rust are able to elide the bounds checks here inside the loop. I forget whether Go managed to do so - it had other issues with the way it compiled the loop, placing the return block inside of it, resulting in weird jump threading instead of a small loop body. The cost of bounds checks is only paid when the compiler cannot statically prove that the array/vec/slice/span/list accesses are in bounds, and if that fails it is possible to provide a fast path via loop cloning for certain scenarios.
I don't know about Graal (which seems to be the cream of the crop when it comes to JVMs), but HotSpot only does bounds check elimination when code is JIT-compiled, which doesn't mean always. For Rust, it doesn't seem to eliminate the bounds checks for the code benchmarked, as you can see here. Again, the specifics don't matter so much as the rule: benchmarks based on syntactical rather than semantic equivalence are unfair. It's about as correct as trying to translate human languages without understanding figures of speech and metaphors, like translating "Hals- und Beinbruch" into "breaking neck and leg", which means something entirely different from "break a leg!" as it goes in English.
Bound checks are being elided for Rust: https://godbolt.org/z/cj4GqzenK |
What is this then?

```
LBB3_53:
        adrp    x2, l___unnamed_12@PAGE
        add     x2, x2, l___unnamed_12@PAGEOFF
        mov     w1, #10000
        bl      __ZN4core9panicking18panic_bounds_check17h4fa7ca60df78ab98E
```
I think this one should be accepted. Cheating is when you solve the problem using FFI/C extensions. This is not the case; it's all native Golang code that simply accounts for the Golang compiler's specifics. At the same time, yes, it's not great that Golang makes us think about these little quirks, and it would have been much better to have these two versions of the code perform equally fast.
@omeid In the disasm, the following block corresponds to the nested loops "core" of the benchmark:

```
LBB3_42:
        mov     w12, #0
        mov     w11, #0
        ldr     w13, [x21, x8, lsl #2]
LBB3_43:
        add     w14, w12, #1
        udiv    w15, w12, w19
        msub    w15, w15, w19, w12
        udiv    w16, w14, w19
        msub    w14, w16, w19, w14
        add     w13, w15, w13
        add     w11, w14, w11
        add     w12, w12, #2
        cmp     w12, w9
        b.ne    LBB3_43         ;; <-- continue with 100k inner loop
        add     w11, w11, w13
        add     w11, w11, w0
        str     w11, [x21, x8, lsl #2]
        add     x11, x8, #1
        mov     x8, x11
        cmp     x11, x10
        b.ne    LBB3_42         ;; <-- continue with 10k outer loop
```

The bounds check failure block is for accessing the input args at the start of the program, since the length of args cannot be statically proven.

@roma-glushko In that case other languages should be allowed to do so as well - it will not be comparable to the work that other entries have to do, and then it becomes a question of which language's submission is optimized to make the compiler happy the most. Such optimizations are really fragile, done in the real world only as a last resort, and they tend to break when the compiler is updated. I could rent an M3 Mac Mini and spend a couple of hours in Apple Instruments to precisely order the code to make sure that the C# submission performs optimally on the specific hardware used by the author. I have done this with a BenchmarksGame submission in the past and it was a questionable investment of effort, because .NET 9 changed the way the compiler handles loops; I'm not going to do that again. I think it's best to let the results speak for themselves, and if you see a performance issue, the language authors will appreciate it if you submit an issue against the official repository instead, so that the general user can benefit.
@neon-sunset Well, we should distinguish between finding how to "inline" code in a smart way to make it work as fast as possible in a given setup, and a well-known specific of how the language works. The first one is definitely a waste of time and gaming the benchmark, I agree, but I would not say so for the latter, because this knowledge would apply to any other Golang application that does similarly frequent access to slice indices. If any other language has similar well-known specifics like Golang does here, they should be applied too. Otherwise, what are we trying to benchmark here? General performance of languages without using any special optimizations like PGO/asm/FFI extensions? Or maybe performance of the loop that looks as close as possible to language X? Or maybe how many optimizations the compiler does out of the box if you know nothing about the language? Or maybe something else?
No disagreement here; the Go team should fix this at some point (but given that everyone has known about this issue for years, that may not happen for whatever reason 😌 ).
In that case we should let people see that the Go compiler indeed needs more work and allow them to make informed decisions (to use better options). If the language you chose requires you to constantly babysit the compiler, perhaps it's better to have the numbers reflect that? In either case, and it's something I wrote in at least 3 other places, this microbenchmark only measures the specific way the compilers emit integer division on ARM64 inside the loop and whether they can LICM the spill to memory (which is likely hidden by the M3 anyway); it does not translate to x86_64 with the same compilers either. I only spend time here because it's good at prompting people to see that a lot of widespread assumptions about performance are incorrect. (The Go one is particularly egregious, and it's long past overdue to stop lying about its performance profile; this is why you see backlash on social media and an increasing number of people who are unreasonably negative towards the language - it's the natural outcome once you start looking into the details.)
@neon-sunset Fair, but that only happens if you opt into

```
LBB3_53:
        adrp    x2, l___unnamed_12@PAGE
        add     x2, x2, l___unnamed_12@PAGEOFF
        mov     w1, #10000
        bl      __ZN4core9panicking18panic_bounds_check17h4fa7ca60df78ab98E
```

......

So this means that as a developer, you need to understand this aspect of the compiler toolchain, in the same way that you might need to understand Go's limitations. But again, it just reinforces my point: syntactical equivalence is not a correct standard for harmonising semantics, which means fair benchmarks should be based on semantic equivalence, not on syntactical equivalence plus an arbitrary expectation of understanding the compiler. For example, in the case of Rust, you consider understanding the need to pass specific optimisation levels and understanding their trade-offs reasonable, but for Go, you consider understanding that the compiler won't do Scalar Replacement to be out of the ordinary.
@omeid Not sure why you keep insisting on the bounds checks argument. Even with a lower optimization level it still elides bounds checks. For comparison, here's the .NET output with AOT (in this case the difference from the final JIT output is going to be small): https://godbolt.org/z/1sGoMsTKd The goal of this microbenchmark is not to take the shortest path but to evaluate the cost of standard language constructs.
Fair, my mistake. I don't like the godbolt interface; it seems my copy-paste didn't work and I was still looking at opt-level=0.