Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No more auto-unrolling starting with LLVM 14 #94847

Open
Urgau opened this issue Mar 11, 2022 · 7 comments
Open

No more auto-unrolling starting with LLVM 14 #94847

Urgau opened this issue Mar 11, 2022 · 7 comments
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Milestone

Comments

@Urgau
Copy link
Member

Urgau commented Mar 11, 2022

Code

I tried this code:

pub fn find_minimum(values: &[f64]) -> f64 {
    let mut min = f64::INFINITY;

    for &num in values {
        if num < min {
            min = num
        }
    }

    min
}

I expected to see this happen: I expect that the generated assembly will unroll the loop to a certain level.

Instead, this happened: There is no more unrolling perform by the compiler.

Version it worked on

.LCPI0_0:
        .quad   0x7ff0000000000000
example::find_minimum:
        test    rsi, rsi
        je      .LBB0_1
        lea     rcx, [8*rsi - 8]
        mov     edx, ecx
        shr     edx, 3
        add     edx, 1
        and     rdx, 7
        je      .LBB0_3
        neg     rdx
        movsd   xmm1, qword ptr [rip + .LCPI0_0]
        mov     rax, rdi
.LBB0_5:
        movsd   xmm0, qword ptr [rax]
        add     rax, 8
        minsd   xmm0, xmm1
        movapd  xmm1, xmm0
        inc     rdx
        jne     .LBB0_5
        cmp     rcx, 56
        jae     .LBB0_7
        jmp     .LBB0_9
.LBB0_1:
        movsd   xmm0, qword ptr [rip + .LCPI0_0]
        ret
.LBB0_3:
        movsd   xmm0, qword ptr [rip + .LCPI0_0]
        mov     rax, rdi
        cmp     rcx, 56
        jb      .LBB0_9
.LBB0_7:
        lea     rcx, [rdi + 8*rsi]
.LBB0_8:
        movsd   xmm1, qword ptr [rax]
        minsd   xmm1, xmm0
        movsd   xmm0, qword ptr [rax + 8]
        minsd   xmm0, xmm1
        movsd   xmm1, qword ptr [rax + 16]
        minsd   xmm1, xmm0
        movsd   xmm0, qword ptr [rax + 24]
        minsd   xmm0, xmm1
        movsd   xmm1, qword ptr [rax + 32]
        minsd   xmm1, xmm0
        movsd   xmm0, qword ptr [rax + 40]
        minsd   xmm0, xmm1
        movsd   xmm1, qword ptr [rax + 48]
        minsd   xmm1, xmm0
        movsd   xmm0, qword ptr [rax + 56]
        minsd   xmm0, xmm1
        add     rax, 64
        cmp     rcx, rax
        jne     .LBB0_8
.LBB0_9:
        ret

It most recently worked on: 1.59

Version with regression

.LCPI0_0:
        .quad   0x7ff0000000000000
example::find_minimum:
        test    rsi, rsi
        je      .LBB0_1
        shl     rsi, 3
        movsd   xmm1, qword ptr [rip + .LCPI0_0]
        xor     eax, eax
.LBB0_4:
        movsd   xmm0, qword ptr [rdi + rax]
        minsd   xmm0, xmm1
        add     rax, 8
        movapd  xmm1, xmm0
        cmp     rsi, rax
        jne     .LBB0_4
        ret
.LBB0_1:
        movsd   xmm0, qword ptr [rip + .LCPI0_0]
        ret

rustc --version --verbose:

rustc 1.60.0-beta.3 (e5effbd0b 2022-03-07)
binary: rustc
commit-hash: e5effbd0b34c9ede216e21d3f60dcbad0b863676
commit-date: 2022-03-07
host: x86_64-unknown-linux-gnu
release: 1.60.0-beta.3
LLVM version: 14.0.0

@rustbot modify labels: +regression-from-stable-to-beta -regression-untriaged

@Urgau Urgau added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Mar 11, 2022
@rustbot rustbot added regression-from-stable-to-beta Performance or correctness regression from stable to beta. I-prioritize Issue: Indicates that prioritization has been requested for this issue. and removed regression-untriaged Untriaged performance or correctness regression. labels Mar 11, 2022
@Urgau
Copy link
Member Author

Urgau commented Mar 11, 2022

searched nightlies: from nightly-2022-02-01 to nightly-2022-03-01
regressed nightly: nightly-2022-02-18
searched commit range: 75d9a0a...30b3f35
regressed commit: 30b3f35

bisected with cargo-bisect-rustc v0.6.1

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --test-dir=. --start=2022-02-01 --end=2022-03-01 

@apiraino apiraino added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 11, 2022
@SNCPlay42
Copy link
Contributor

The assembly is actually not vectorized in either version (each version only operates on 8 bytes (1 f64) per minsd instruction); what's changed is the loop was unrolled before and isn't any more.

@Urgau
Copy link
Member Author

Urgau commented Mar 11, 2022

I just checked and now rustc, clang and gcc are generating the same assembly code for the example.

The assembly is actually not vectorized in either version (each version only operates on 8 bytes (1 f64) per minsd instruction); what's changed is the loop was unrolled before and isn't any more.

Oh, you're right. My bad.

@Urgau Urgau changed the title No more auto-vectorization starting with LLVM 14 No more auto-unrolling starting with LLVM 14 Mar 11, 2022
@saethlin
Copy link
Member

saethlin commented Mar 11, 2022

-O for rustc is -Copt-level=2, and --release is -Copt-level=3. In clang and gcc I think it is equivalent to -O1 (at least that's what the gcc docs say) which I've never seen anyone use. You probably want to pass -Copt-level=3 in godbolt for Rust, and -O2 or -O3 for clang/gcc. -O3 is probably best if you're comparing to rustc codegen. But that's not relevant to whether we have a problem here, thank goodness.

I slapped together some extremely crude benchmarking on my 3970X and this looks like a 2x perf regression. If you replace f64 with u64 I see something closer to a 2.5x perf regression.

The perhaps good news is that if you pass -Ctarget-cpu=native on godbolt you get the old codegen back, and on my machine I see the perf regression disappear.

Also, it looks like this problem may be specific to computing the min of an array? That seems odd to me but maybe it means something to people who work on loop optimizations.

@apiraino
Copy link
Contributor

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Mar 16, 2022
@pnkfelix
Copy link
Member

pnkfelix commented Mar 31, 2022

Also, it looks like this problem may be specific to computing the min of an array? That seems odd to me but maybe it means something to people who work on loop optimizations.

I'd be curious to know if we regressed when you compute the min via other rustic-idioms like iterators.

In any case, we've been discussing this in the T-compiler meeting and we are thinking it isn't quite P-high priority, and perhaps should be P-medium. Its not clear what is the scope of kinds of loops that LLVM has started to fail to unroll; it seems somewhat limited.

@saethlin
Copy link
Member

saethlin commented Mar 31, 2022

Yup, same behavior with Iterator::min. No unrolling and a ~2x regression by default, but we get the old unrolling and no regression with -Ctarget-cpu=native: https://godbolt.org/z/4Pnz6rcrc.

@apiraino apiraino added P-medium Medium priority and removed P-high High priority labels Apr 6, 2022
@pietroalbini pietroalbini added this to the 1.60.0 milestone Apr 6, 2022
@pietroalbini pietroalbini added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Jun 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants