
[RISCV][AArch64] llvm has 9-12% performance degradation compared to gcc for spec2017/500.perlbench_r #121813

Open
michaelmaitland opened this issue Jan 6, 2025 · 6 comments

Comments

@michaelmaitland
Contributor

I've been looking into this for some time now, and I wanted to file an issue hoping that (a) others are seeing the same problem, (b) we can discuss how to close this gap, and (c) other targets can share insights from prior work that may help here.

| Comparison | Regression (%) |
| --- | --- |
| LLVM No Vec vs GCC No Vec | 12.57 |
| LLVM No Vec vs GCC Vec | 12.19 |
| LLVM Vec vs GCC No Vec | 9.72 |
| LLVM Vec vs GCC Vec | 9.32 |

It looks like there is a common scalar-related regression. These numbers are at O3 with LTO enabled. The gap is visible both in the QEMU dynamic instruction count and on hardware, and it impacts both in-order and out-of-order RISC-V cores. As per a talk at the 2021 LLVM Dev Meeting, this issue appears to exist on AArch64 as well (see slide 3). I'm not sure whether it is present on other targets.

The cycle count of S_regmatch is far worse (higher) with LLVM than with GCC. In this function, the number of dynamic stack spills and reloads is much higher (over 50% higher) with LLVM than with GCC, while the static number of spills and reloads is relatively similar. The number of dynamic branches is also relatively similar, but LLVM executes 34% more dynamic jumps.
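
As a rough illustration of how static spill/reload counts can be approximated, one option is to scan the disassembly of each binary for sp-relative memory operations. The sketch below is a simplified heuristic, not necessarily the methodology behind the numbers above, and it also picks up ordinary prologue/epilogue callee-save traffic:

```python
#!/usr/bin/env python3
# Rough heuristic: count sp-relative stores/loads in a RISC-V disassembly
# (e.g. `llvm-objdump -d` or `objdump -d` output) as a proxy for static
# spill/reload sites. Note this also counts ordinary stack traffic such as
# callee-saved register saves/restores in prologues and epilogues, and it
# ignores compressed (c.*) stack-relative forms.
import re
import sys

# Matches e.g. "sd s0, 16(sp)" or "flw fa5, -8(sp)".
STORE = re.compile(r'\b(?:sd|sw|fsd|fsw)\s+\S+,\s*-?\d+\(sp\)')
LOAD = re.compile(r'\b(?:ld|lw|fld|flw)\s+\S+,\s*-?\d+\(sp\)')

def count(path):
    stores = loads = 0
    with open(path) as f:
        for line in f:
            if STORE.search(line):
                stores += 1
            elif LOAD.search(line):
                loads += 1
    return stores, loads

if __name__ == "__main__":
    # Usage: python3 count_stack_ops.py llvm.dis gcc.dis
    for path in sys.argv[1:]:
        stores, loads = count(path)
        print(f"{path}: {stores} sp-relative stores, {loads} sp-relative loads")
```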

The issue solved by #90819 helps close the performance gap by a few percent, but there is still a significant way to go. I have run many other experiments that rule out various candidate causes, and I could chat about them in a call or add follow-up comments.

@llvmbot
Member

llvmbot commented Jan 6, 2025

@llvm/issue-subscribers-backend-risc-v


@llvmbot
Member

llvmbot commented Jan 6, 2025

@llvm/issue-subscribers-backend-aarch64


@preames
Collaborator

preames commented Jan 6, 2025

Small request: "regression" has a very particular meaning; can you edit the title and body to remove it unless this is truly a regression from a prior LLVM release? Your data doesn't seem to claim so, and I was very confused on first reading.

@michaelmaitland michaelmaitland changed the title from "[RISCV][AArch64] llvm has 9-12% regression compared to gcc for spec2017/500.perlbench_r" to "[RISCV][AArch64] llvm has 9-12% performance degradation compared to gcc for spec2017/500.perlbench_r" on Jan 6, 2025
@michaelmaitland
Contributor Author

Small request: "regression" has a very particular meaning; can you edit the title and body to remove it unless this is truly a regression from a prior LLVM release? Your data doesn't seem to claim so, and I was very confused on first reading.

Sure, I've updated the title.

@wangpc-pp
Contributor

wangpc-pp commented Jan 7, 2025

Can you help to extract these hotspots into standalone kernels so that we can easily see the differences?

@michaelmaitland
Contributor Author

michaelmaitland commented Jan 7, 2025

Can you help to extract these hotspots into standalone kernels so that we can easily see the differences?

The S_regmatch function is quite large (3233 LOC), and in my experience understanding the differences is difficult because it is so large. I've attempted to use llvm-cov to home in on problem areas, although that is quite difficult since it only gives us information about the LLVM side of things.
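
For narrowing down problem areas, one approach that can complement llvm-cov's text report is to post-process its JSON export and rank the hottest regions of a single function. The sketch below assumes the `llvm-cov export` JSON layout (data → functions → regions, with each region encoded as [line_start, col_start, line_end, col_end, execution_count, ...]); the exact field positions may differ across LLVM versions, so treat this as a starting point rather than the precise workflow used here:

```python
#!/usr/bin/env python3
# Rank the most frequently executed coverage regions of one function, using
# JSON produced by something like:
#   llvm-cov export ./perlbench_r -instr-profile=perl.profdata > coverage.json
# Assumes each region is encoded as
#   [line_start, col_start, line_end, col_end, execution_count, ...]
# which may vary between LLVM versions.
import json
import sys

def hottest_regions(export_json, function_substr, top=20):
    with open(export_json) as f:
        data = json.load(f)
    rows = []
    for export in data.get("data", []):
        for fn in export.get("functions", []):
            if function_substr not in fn.get("name", ""):
                continue
            for region in fn.get("regions", []):
                line_start, _, line_end, _, count = region[:5]
                rows.append((count, line_start, line_end, fn["name"]))
    rows.sort(reverse=True)
    return rows[:top]

if __name__ == "__main__":
    # Usage: python3 hottest_regions.py coverage.json S_regmatch
    for count, lo, hi, name in hottest_regions(sys.argv[1], sys.argv[2]):
        print(f"{count:>15}  lines {lo}-{hi}  {name}")
```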

The spilling seems to be occurring across basic blocks, so I don't think that instruction scheduling (which is per-basic-block) will have much of an impact on fixing the problem. One additional data point: @mgudim's patch helps close the gap as it relates to spilling and reloading, but even after his patch we are still far behind:

| | Base LLVM | LLVM + Mikhail's Changes | GCC |
| --- | --- | --- | --- |
| Dyn Stack Stores (billion) | 203 | 151 | 107 |
| Dyn Stack Reloads (billion) | 96 | 77 | 58 |

Some of my experiments have shown that certain middle-end optimizations lead to more spilling when enabled. I've been pursuing this line of work lately, but it can be difficult to predict which optimizations should be disabled based on down-the-line register-pressure heuristics, and doing so may lead to regressions elsewhere.

There is also a possibility that we could be doing a better job in the register allocator itself, although I don't have answers as to what we could be doing better. It has been extremely difficult to make sense of the register allocation debug output on such a large function. I've tried to reduce the function, focusing on a few spills I think are particularly bad (i.e. the ones with the highest dynamic instruction count), and it is still difficult to understand what could be improved inside register allocation.
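
As a sketch of one way to rank spill sites by dynamic cost, the sp-relative loads/stores found in the disassembly can be joined with a per-address execution-count profile (however that profile is obtained). Both input formats below are hypothetical placeholders, objdump-style address-prefixed lines and a whitespace-separated `address count` file, meant only to illustrate the idea rather than describe the exact workflow used above:

```python
#!/usr/bin/env python3
# Rank sp-relative stack accesses by dynamic execution count.
# Inputs (hypothetical formats, for illustration only):
#   disasm.txt  -- objdump-style lines, e.g. "   1a2b4:  05813423   sd  s0,8(sp)"
#   profile.txt -- one "address count" pair per line, e.g. "1a2b4 98213344"
import re
import sys

# Address, optional encoding bytes, then an sp-relative load/store
# (compressed c.* stack-relative forms are not handled).
SPILL = re.compile(r'^\s*([0-9a-fA-F]+):\s+(?:[0-9a-f ]+\s+)?'
                   r'((?:sd|sw|ld|lw|fsd|fld)\s+\S+,\s*-?\d+\(sp\))')

def load_profile(path):
    counts = {}
    with open(path) as f:
        for line in f:
            addr, count = line.split()
            counts[int(addr, 16)] = int(count)
    return counts

def rank_spill_sites(disasm_path, profile_path, top=20):
    counts = load_profile(profile_path)
    ranked = []
    with open(disasm_path) as f:
        for line in f:
            m = SPILL.match(line)
            if not m:
                continue
            addr = int(m.group(1), 16)
            ranked.append((counts.get(addr, 0), hex(addr), m.group(2)))
    ranked.sort(reverse=True)
    return ranked[:top]

if __name__ == "__main__":
    # Usage: python3 rank_spills.py disasm.txt profile.txt
    for count, addr, insn in rank_spill_sites(sys.argv[1], sys.argv[2]):
        print(f"{count:>15}  {addr}  {insn}")
```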
