[NVPTX] Do not run the NVVMReflect pass as part of the normal pipeline #121834
Conversation
@llvm/pr-subscribers-backend-nvptx

Author: Joseph Huber (jhuber6)

Changes

Summary:
This pass lowers the `__nvvm_reflect` builtin in the IR. However, this currently runs in the standard optimization pipeline, not just the backend pipeline. This means that if the user creates LLVM-IR without an architecture set, it will always delete the reflect code even if it is intended to be used later.

Pushing this into the backend pipeline will ensure that this works as intended, allowing users to conditionally include code depending on which target architecture the user ended up using. This fixes a bug in OpenMP and missing code in `libc`.

Full diff: https://github.com/llvm/llvm-project/pull/121834.diff

6 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
index b3b2880588cc59..f6ec780d963d9a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
@@ -255,7 +255,6 @@ void NVPTXTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
PB.registerPipelineStartEPCallback(
[this](ModulePassManager &PM, OptimizationLevel Level) {
FunctionPassManager FPM;
- FPM.addPass(NVVMReflectPass(Subtarget.getSmVersion()));
// Note: NVVMIntrRangePass was causing numerical discrepancies at one
// point, if issues crop up, consider disabling.
FPM.addPass(NVVMIntrRangePass());
diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 56525a1edc7614..a0e897584a9d32 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -21,6 +21,7 @@
#include "NVPTX.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/ConstantFolding.h"
+#include "llvm/CodeGen/CommandFlags.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
@@ -219,7 +220,12 @@ bool NVVMReflect::runOnFunction(Function &F) {
return runNVVMReflect(F, SmVersion);
}
-NVVMReflectPass::NVVMReflectPass() : NVVMReflectPass(0) {}
+NVVMReflectPass::NVVMReflectPass() {
+ // Get the CPU string from the command line if not provided.
+ StringRef SM = codegen::getMCPU();
+ if (!SM.consume_front("sm_") || SM.consumeInteger(10, SmVersion))
+ SmVersion = 0;
+}
PreservedAnalyses NVVMReflectPass::run(Function &F,
FunctionAnalysisManager &AM) {
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-arch.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-arch.ll
index ac5875c6ab1043..83cb3cde48de18 100644
--- a/llvm/test/CodeGen/NVPTX/nvvm-reflect-arch.ll
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-arch.ll
@@ -1,9 +1,9 @@
; Libdevice in recent CUDA versions relies on __CUDA_ARCH reflecting GPU type.
; Verify that __nvvm_reflect() is replaced with an appropriate value.
;
-; RUN: opt %s -S -passes='default<O2>' -mtriple=nvptx64 -mcpu=sm_20 \
+; RUN: opt %s -S -passes='nvvm-reflect' -mtriple=nvptx64 -mcpu=sm_20 \
; RUN: | FileCheck %s --check-prefixes=COMMON,SM20
-; RUN: opt %s -S -passes='default<O2>' -mtriple=nvptx64 -mcpu=sm_35 \
+; RUN: opt %s -S -passes='nvvm-reflect' -mtriple=nvptx64 -mcpu=sm_35 \
; RUN: | FileCheck %s --check-prefixes=COMMON,SM35
@"$str" = private addrspace(1) constant [12 x i8] c"__CUDA_ARCH\00"
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-ocl.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-ocl.ll
index 9d383218dce86a..bf8d6e2cca3071 100644
--- a/llvm/test/CodeGen/NVPTX/nvvm-reflect-ocl.ll
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-ocl.ll
@@ -1,8 +1,8 @@
; Verify that __nvvm_reflect_ocl() is replaced with an appropriate value
;
-; RUN: opt %s -S -passes='default<O2>' -mtriple=nvptx64 -mcpu=sm_20 \
+; RUN: opt %s -S -passes='nvvm-reflect' -mtriple=nvptx64 -mcpu=sm_20 \
; RUN: | FileCheck %s --check-prefixes=COMMON,SM20
-; RUN: opt %s -S -passes='default<O2>' -mtriple=nvptx64 -mcpu=sm_35 \
+; RUN: opt %s -S -passes='nvvm-reflect' -mtriple=nvptx64 -mcpu=sm_35 \
; RUN: | FileCheck %s --check-prefixes=COMMON,SM35
@"$str" = private addrspace(4) constant [12 x i8] c"__CUDA_ARCH\00"
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
index 46ab79d9858cad..19c74df3037028 100644
--- a/llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
@@ -3,12 +3,12 @@
; RUN: cat %s > %t.noftz
; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-ftz", i32 0}' >> %t.noftz
-; RUN: opt %t.noftz -S -mtriple=nvptx-nvidia-cuda -passes='default<O2>' \
+; RUN: opt %t.noftz -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect,simplifycfg' \
; RUN: | FileCheck %s --check-prefix=USE_FTZ_0 --check-prefix=CHECK
; RUN: cat %s > %t.ftz
; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}' >> %t.ftz
-; RUN: opt %t.ftz -S -mtriple=nvptx-nvidia-cuda -passes='default<O2>' \
+; RUN: opt %t.ftz -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect,simplifycfg' \
; RUN: | FileCheck %s --check-prefix=USE_FTZ_1 --check-prefix=CHECK
@str = private unnamed_addr addrspace(4) constant [11 x i8] c"__CUDA_FTZ\00"
@@ -43,7 +43,7 @@ exit:
declare i32 @llvm.nvvm.reflect(ptr)
-; CHECK-LABEL: define noundef i32 @intrinsic
+; CHECK-LABEL: define i32 @intrinsic
define i32 @intrinsic() {
; CHECK-NOT: call i32 @llvm.nvvm.reflect
; USE_FTZ_0: ret i32 0
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect.ll b/llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
index 2ed9f7c11bcf9b..244b44fea9b83c 100644
--- a/llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
@@ -3,12 +3,12 @@
; RUN: cat %s > %t.noftz
; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-ftz", i32 0}' >> %t.noftz
-; RUN: opt %t.noftz -S -mtriple=nvptx-nvidia-cuda -passes='default<O2>' \
+; RUN: opt %t.noftz -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect,simplifycfg' \
; RUN: | FileCheck %s --check-prefix=USE_FTZ_0 --check-prefix=CHECK
; RUN: cat %s > %t.ftz
; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}' >> %t.ftz
-; RUN: opt %t.ftz -S -mtriple=nvptx-nvidia-cuda -passes='default<O2>' \
+; RUN: opt %t.ftz -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect,simplifycfg' \
; RUN: | FileCheck %s --check-prefix=USE_FTZ_1 --check-prefix=CHECK
@str = private unnamed_addr addrspace(4) constant [11 x i8] c"__CUDA_FTZ\00"
@@ -43,7 +43,8 @@ exit:
declare i32 @llvm.nvvm.reflect(ptr)
-; CHECK-LABEL: define noundef i32 @intrinsic
+; CHECK-LABEL: define i32 @intrinsic
+
define i32 @intrinsic() {
; CHECK-NOT: call i32 @llvm.nvvm.reflect
; USE_FTZ_0: ret i32 0
The problem is that libdevice depends on this pass, and it carries a fair amount of code that will no longer benefit from removal of unused conditional branches.
I don't think this will make a considerable difference, since it's usually guarding some very shallow code paths. We still get full optimizations when the backend runs. If you think this is a major issue, I could acquiesce to making the non-backend version skip lowering if no SM version is provided.
With this PR, OpenMC works again with NVIDIA cards (OpenMC has been broken on NVIDIA since #119091).
FWIW, the pass should be super cheap if it starts by looking for the intrinsic and then the uses. Running it twice is a valid option.
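As a rough illustration of that cheapness argument (a sketch, not the actual pass code; the helper name is made up), the pass could bail out before visiting any function when the module contains no reflect calls at all:

#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

// Hypothetical early-exit helper: true only if the module actually declares
// and uses one of the reflect entry points the pass rewrites.
static bool moduleHasReflectCalls(llvm::Module &M) {
  for (const char *Name : {"__nvvm_reflect", "llvm.nvvm.reflect"})
    if (llvm::Function *F = M.getFunction(Name))
      if (!F->use_empty())
        return true;
  return false;
}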
The first pass should delete all the uses. The concern is that by not trimming the intrinsic earlier we're losing some optimizations, but I feel like we're still getting a full optimization pipeline and this will likely be easily optimized out. If @Artem-B is really concerned, I'll just change it to keep the per-file run but ignore it if there's no SM passed.
The problem, IIUIC, is that in some compilation modes we may run optimization w/o the constants set properly for the reflect pass and running it may pick the wrong branch -- something that the late reflect pass would not be able to undo.
I don't think it's always the case. There are functions in libdevice where
It's a maybe. Considering that __nvvm_reflect() only depends on a string, its branches may be optimizable, as long as they have no other
The practical impact will likely be limited to the heavy functions with multiple
Reflect can be used with other parameters, so an SM-only check alone is, generally speaking, not sufficient as an on/off switch. I think the decision about where the reflect pass should run should be tied to the earliest point where the reflect inputs get set. For CUDA, that's the beginning of the pipeline. For OpenMP and stand-alone compilation it's probably somewhere closer to the back-end (or wherever we link in libdevice, do LTO, or otherwise reach the point where we finally know what we're actually compiling for).
Yeah, the point is to defer the lowering until the backend knows what the actual target is. The optimizations that run on the initial compile are usually more generic, so I wouldn't expect this to fire until the backend anyway.
Can we make the pass specialize only what is actually known at first, and do the rest later?
That's what I was suggesting, but I think it makes more sense to just make this a backend thing. The only difference it makes is having a few branches live slightly longer, and I really don't think that will make a noticeable difference.
I don't think this belongs in the backend or the middle-end optimization pipeline. It's really a job for whatever "frontend" is loading the bitcode for final code generation.
Summary: This pass lowers the `__nvvm_reflect` builtin in the IR. However, this currently runs in the standard optimization pipeline, not just the backend pipeline. This means that if the user creates LLVM-IR without an architecture set, it will always delete the reflect code even if it is intended to be used later. Pushing this into the backend pipeline will ensure that this works as intended, allowing users to conditionally include code depending on which target architecture the user ended up using. This fixes a bug in OpenMP and missing code in `libc`.
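For context, the pattern this protects looks roughly like the sketch below (the function names are invented; the `__nvvm_reflect("__CUDA_ARCH")` query is the same one the tests above check). If the call is folded before the real architecture is known, the wrong branch gets baked in and the late run cannot undo it:

extern "C" int __nvvm_reflect(const char *);

// Hypothetical stand-ins for an arch-specialized fast path and a portable
// fallback in a device library.
static double exp_sm70(double x) { return __builtin_exp(x); }
static double exp_generic(double x) { return __builtin_exp(x); }

// Library routine that branches on the reflected architecture, expecting
// NVVMReflect to fold the branch once the real sm_XX value is known.
double approx_exp(double x) {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
    return exp_sm70(x);
  return exp_generic(x);
}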
I get @jhuber6's point about target-specific specialization. There is a benefit if we could do more "library" code IR generation without specifying all target details. We kinda do that now, and it broke stuff, but the direction is good. What is the downside of multiple specialization runs, with the earlier one(s) not specializing what they do not know for sure?
I think I could probably make @Artem-B happy if I just forwent the early pass when the architecture is not known. That'd leave CUDA with identical behavior while allowing this kind of use where we only specify the target during the final link + backend stage.
That would work, too.
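A minimal sketch of what that guarded early registration could look like, assuming an SM version of 0 means "no architecture was specified" (which is what the new default NVVMReflectPass constructor in this patch falls back to); the helper name is hypothetical:

#include "NVPTX.h"                  // in-tree header declaring NVVMReflectPass
#include "llvm/IR/PassManager.h"

// Hypothetical guard: only add the reflect pass at pipeline start when an
// explicit SM version is known; otherwise leave lowering to the backend.
static void maybeAddEarlyReflect(llvm::FunctionPassManager &FPM,
                                 unsigned SmVersion) {
  if (SmVersion != 0)
    FPM.addPass(llvm::NVVMReflectPass(SmVersion));
}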
It's a little annoying for NVPTX because we just default to `sm_30`.
Can you rely on the 'cuda' part of the triple instead?
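If it were keyed off the triple as suggested, the check might look like this small sketch (an assumption about the approach, not something this PR does):

#include "llvm/TargetParser/Triple.h"

// Hypothetical predicate: treat nvptx*-nvidia-cuda triples as CUDA
// compilations where the architecture is expected up front.
static bool isCudaCompilation(const llvm::Triple &TT) {
  return TT.isNVPTX() && TT.getOS() == llvm::Triple::CUDA;
}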
Here's my attempt, hopefully it's not too invasive. Eager to get this landed so OpenMC works again.
LGTM
ParseSubtargetFeatures(TargetName, /*TuneCPU*/ TargetName, FS);
ParseSubtargetFeatures(CPU.empty() ? "sm_30" : CPU,
CPU.empty() ? "sm_30" : CPU
-> getTargetName()
@@ -35,9 +35,10 @@ void NVPTXSubtarget::anchor() {}
NVPTXSubtarget &NVPTXSubtarget::initializeSubtargetDependencies(StringRef CPU,
                                                                StringRef FS) {
  // Provide the default CPU if we don't have one.
  TargetName = std::string(CPU.empty() ? "sm_30" : CPU);
  TargetName = std::string(CPU);
It could use a comment on why we may want to keep CPU empty in some cases.
✅ With the latest revision this PR passed the C/C++ code formatter.
I just checked in e7a83fc to fix a warning from this PR.
Was in the process of doing that myself, thanks for fixing it so fast.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/55/builds/5192

Here is the relevant piece of the build log for reference.
Summary:
This pass lowers the `__nvvm_reflect` builtin in the IR. However, this currently runs in the standard optimization pipeline, not just the backend pipeline. This means that if the user creates LLVM-IR without an architecture set, it will always delete the reflect code even if it is intended to be used later.

Pushing this into the backend pipeline will ensure that this works as intended, allowing users to conditionally include code depending on which target architecture the user ended up using. This fixes a bug in OpenMP and missing code in `libc`.