
[RISCV][AArch64] llvm has 9-12% performance degradation compared to gcc for spec2017/500.perlbench_r #121813

Open
michaelmaitland opened this issue Jan 6, 2025 · 6 comments

Comments

@michaelmaitland
Contributor

I've been looking into this for some time now, and I wanted to file an issue hoping that (a) others are seeing the same problem, (b) we can discuss how to close this gap, and (c) other targets can share insights from prior work that may help here.

| Comparison | Regression (%) |
| --- | --- |
| LLVM No Vec vs GCC No Vec | 12.57 |
| LLVM No Vec vs GCC Vec | 12.19 |
| LLVM Vec vs GCC No Vec | 9.72 |
| LLVM Vec vs GCC Vec | 9.32 |

It looks like there is a common scalar-related regression. These numbers are at O3 with LTO enabled. The gap is visible both in the QEMU dynamic instruction count and on hardware, and it impacts both in-order and out-of-order RISC-V cores. As per a talk at the 2021 LLVM Dev Meeting, this issue appears to exist on AArch64 as well (see slide 3). I'm not sure whether it is present on other targets.

The cycle count of S_regmatch is far worse (higher) with LLVM than with GCC. In this function, the number of dynamic stack spills and reloads is much higher (over 50% higher) with LLVM than with GCC, while the static number of spills and reloads is relatively similar. The number of dynamic branches is also relatively similar, but LLVM executes 34% more dynamic jumps.
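
As a rough illustration of how static spill/reload counts can be approximated, one option is to scan the disassembly of each binary for sp-relative memory operations. The sketch below is a simplified heuristic, not necessarily the methodology behind the numbers above, and it also picks up ordinary prologue/epilogue callee-save traffic:

```python
#!/usr/bin/env python3
# Rough heuristic: count sp-relative stores/loads in a RISC-V disassembly
# (e.g. `llvm-objdump -d` or `objdump -d` output) as a proxy for static
# spill/reload sites. Note this also counts ordinary stack traffic such as
# callee-saved register saves/restores in prologues and epilogues, and it
# ignores compressed (c.*) stack-relative forms.
import re
import sys

# Matches e.g. "sd s0, 16(sp)" or "flw fa5, -8(sp)".
STORE = re.compile(r'\b(?:sd|sw|fsd|fsw)\s+\S+,\s*-?\d+\(sp\)')
LOAD = re.compile(r'\b(?:ld|lw|fld|flw)\s+\S+,\s*-?\d+\(sp\)')

def count(path):
    stores = loads = 0
    with open(path) as f:
        for line in f:
            if STORE.search(line):
                stores += 1
            elif LOAD.search(line):
                loads += 1
    return stores, loads

if __name__ == "__main__":
    # Usage: python3 count_stack_ops.py llvm.dis gcc.dis
    for path in sys.argv[1:]:
        stores, loads = count(path)
        print(f"{path}: {stores} sp-relative stores, {loads} sp-relative loads")
```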

The issue solved by #90819 helps close the performance gap by a few percent, but there is still a significant way to go. I have run many other experiments that rule out various candidate causes, and I could chat about them in a call or add follow-up comments.

@llvmbot
Member

llvmbot commented Jan 6, 2025

@llvm/issue-subscribers-backend-risc-v


@llvmbot
Member

llvmbot commented Jan 6, 2025

@llvm/issue-subscribers-backend-aarch64


@preames
Collaborator

preames commented Jan 6, 2025

Small request: "regression" has a very particular meaning; can you edit the title and body to remove it unless this is truly a regression from a prior LLVM release? Your data doesn't seem to claim so, and I was very confused on first reading.

@michaelmaitland michaelmaitland changed the title from "[RISCV][AArch64] llvm has 9-12% regression compared to gcc for spec2017/500.perlbench_r" to "[RISCV][AArch64] llvm has 9-12% performance degradation compared to gcc for spec2017/500.perlbench_r" on Jan 6, 2025
@michaelmaitland
Contributor Author

Small request: "regression" has a very particular meaning; can you edit the title and body to remove it unless this is truly a regression from a prior LLVM release? Your data doesn't seem to claim so, and I was very confused on first reading.

Sure, I've updated the title.

@wangpc-pp
Contributor

wangpc-pp commented Jan 7, 2025

Can you help to extract these hotspots into standalone kernels so that we can easily see the differences?

@michaelmaitland
Contributor Author

michaelmaitland commented Jan 7, 2025

Can you help to extract these hotspots into standalone kernels so that we can easily see the differences?

The S_regmatch function is quite large (3233 LOC), and in my experience understanding the differences is difficult because it is so large. I've attempted to use llvm-cov to home in on problem areas, although that is quite difficult since it only gives us information about the LLVM side of things.
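
For narrowing down problem areas, one approach that can complement llvm-cov's text report is to post-process its JSON export and rank the hottest regions of a single function. The sketch below assumes the `llvm-cov export` JSON layout (data → functions → regions, with each region encoded as [line_start, col_start, line_end, col_end, execution_count, ...]); the exact field positions may differ across LLVM versions, so treat this as a starting point rather than the precise workflow used here:

```python
#!/usr/bin/env python3
# Rank the most frequently executed coverage regions of one function, using
# JSON produced by something like:
#   llvm-cov export ./perlbench_r -instr-profile=perl.profdata > coverage.json
# Assumes each region is encoded as
#   [line_start, col_start, line_end, col_end, execution_count, ...]
# which may vary between LLVM versions.
import json
import sys

def hottest_regions(export_json, function_substr, top=20):
    with open(export_json) as f:
        data = json.load(f)
    rows = []
    for export in data.get("data", []):
        for fn in export.get("functions", []):
            if function_substr not in fn.get("name", ""):
                continue
            for region in fn.get("regions", []):
                line_start, _, line_end, _, count = region[:5]
                rows.append((count, line_start, line_end, fn["name"]))
    rows.sort(reverse=True)
    return rows[:top]

if __name__ == "__main__":
    # Usage: python3 hottest_regions.py coverage.json S_regmatch
    for count, lo, hi, name in hottest_regions(sys.argv[1], sys.argv[2]):
        print(f"{count:>15}  lines {lo}-{hi}  {name}")
```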

The spilling seems to be occurring across basic blocks, so I don't think that instruction scheduling (which is per-basic-block) will have much of an impact on fixing the problem. One additional data point: @mgudim's patch helps close the gap as it relates to spilling and reloading, but even after his patch we are still far behind:

| | Base LLVM | LLVM + Mikhail's Changes | GCC |
| --- | --- | --- | --- |
| Dyn Stack Stores (billion) | 203 | 151 | 107 |
| Dyn Stack Reloads (billion) | 96 | 77 | 58 |

Some of my experiments have shown that certain middle-end optimizations lead to more spilling when enabled. I've been pursuing this line of work lately, but it can be difficult to predict which optimizations should be disabled based on down-the-line register-pressure heuristics, and doing so may lead to regressions elsewhere.

There is also a possibility that we could be doing a better job in the register allocator itself, although I don't have answers as to what we could be doing better. It has been extremely difficult to make sense of the register allocation debug output on such a large function. I've tried to reduce the function, focusing on a few spills I think are particularly bad (i.e. the ones with the highest dynamic instruction count), and it is still difficult to understand what could be improved inside register allocation.
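
As a sketch of one way to rank spill sites by dynamic cost, the sp-relative loads/stores found in the disassembly can be joined with a per-address execution-count profile (however that profile is obtained). Both input formats below are hypothetical placeholders, objdump-style address-prefixed lines and a whitespace-separated `address count` file, meant only to illustrate the idea rather than describe the exact workflow used above:

```python
#!/usr/bin/env python3
# Rank sp-relative stack accesses by dynamic execution count.
# Inputs (hypothetical formats, for illustration only):
#   disasm.txt  -- objdump-style lines, e.g. "   1a2b4:  05813423   sd  s0,8(sp)"
#   profile.txt -- one "address count" pair per line, e.g. "1a2b4 98213344"
import re
import sys

# Address, optional encoding bytes, then an sp-relative load/store
# (compressed c.* stack-relative forms are not handled).
SPILL = re.compile(r'^\s*([0-9a-fA-F]+):\s+(?:[0-9a-f ]+\s+)?'
                   r'((?:sd|sw|ld|lw|fsd|fld)\s+\S+,\s*-?\d+\(sp\))')

def load_profile(path):
    counts = {}
    with open(path) as f:
        for line in f:
            addr, count = line.split()
            counts[int(addr, 16)] = int(count)
    return counts

def rank_spill_sites(disasm_path, profile_path, top=20):
    counts = load_profile(profile_path)
    ranked = []
    with open(disasm_path) as f:
        for line in f:
            m = SPILL.match(line)
            if not m:
                continue
            addr = int(m.group(1), 16)
            ranked.append((counts.get(addr, 0), hex(addr), m.group(2)))
    ranked.sort(reverse=True)
    return ranked[:top]

if __name__ == "__main__":
    # Usage: python3 rank_spills.py disasm.txt profile.txt
    for count, addr, insn in rank_spill_sites(sys.argv[1], sys.argv[2]):
        print(f"{count:>15}  {addr}  {insn}")
```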
