-
Notifications
You must be signed in to change notification settings - Fork 754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][SYCLLowerWGLocalMemoryPass] Remove implicit dependency on AlwaysInlinerPass and move to PipelineStart #16356
base: sycl
Are you sure you want to change the base?
Conversation
…ysInlinerPass and move to PipelineStart Currently SYCLLowerWGLocalMemoryPass must run after AlwaysInlinerPass because in sycl header __sycl_allocateLocalMemory call is wrapped in group_local_memory/group_local_memory_for_overwrite function. Each call to __sycl_allocateLocalMemory represents a unique local memory, so group_local_memory/group_local_memory_for_overwrite must be inlined. The dependency is implicit and prohibits SYCLLowerWGLocalMemoryPass being moved around in the pass pipeline. Since the pass transforms __sycl_allocateLocalMemory call to access of global variable @WGLocalMem, moving the pass to beginning of pipeline could enable more optimization than the function call does. In addition, intel gpu compiler has a pass to transform global variable in addrspace(3) to alloca that runs after pipeline basic simplification. Therefore, we shall run SYCLLowerWGLocalMemoryPass ealier.
Looks like you need to rebase to pick up the new changes in this pass first. |
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
continue; | ||
} | ||
std::string FName = llvm::demangle(Caller->getName()); | ||
if (FName.find("sycl::_V1::ext::oneapi::group_local_memory") == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoding current function name from DPC++ library is unfortunate. The code in the DPC++ header files can be changed at any time.
To make it more robust, I thought we could go up in the call stack up-to the kernel function ignoring all functions in sycl::
namespace. This will require SYCL kernel to be inlined into kernel function wrapper.
@Naghasan, do you have any thoughts on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is unfortunate, especially w.r.t. upstreaming. I don't know what the plans are for this one but if it is seen as important, we might want to improve this.
This will require SYCL kernel to be inlined into kernel function wrapper.
I don't think this is an issue TBH, I don't see any benefit in not inline the SYCL kernel in the wrapper, even in SPIR-V.
I think relying on an attribute is probably the most flexible: this makes the compiler agnostic to header refactor and changes in API. It is also cheap to add.
I also just realized syclcompat::local_mem
uses it, it isn't technically a valid usage of it w.r.t. the extension but something the attribute would allow to correctly handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think relying on an attribute is probably the most flexible: this makes the compiler agnostic to header refactor and changes in API. It is also cheap to add.
A new attribute "sycl_forceinline" is added in a4fe915
Please review.
return false; | ||
|
||
bool Changed = false; | ||
for (auto *U : ALMFunc->users()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to use a work list here rather than the simple loop.
This function https://github.com/intel/llvm/blob/sycl/sycl/include/syclcompat/memory.hpp#L71 needs to be updated as well, and this function won't be able to handle the nesting. The CI is currently green because there is no test requesting 2 distinct local memory objects using this function in the same kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, thank you for the suggestion. Now I understand what you mean by syclcompat::local_mem
.
Also added a new test sycl/test/check_device_code/syclcompat_local_mem.cpp
that has two calls to syclcompat::local_mem
in a kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FE changes look okay to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SYCL Changes LGTM.
kindly ping @Naghasan @intel/dpcpp-tools-reviewers @intel/syclcompat-lib-reviewers for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the new test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The syclcompat
changes look good to me, though I'll appreciate if @joeatodd also has a quick look to confirm, if possible.
kindly ping @intel/dpcpp-tools-reviewers for review |
@@ -84,6 +86,42 @@ ModulePass *llvm::createSYCLLowerWGLocalMemoryLegacyPass() { | |||
return new SYCLLowerWGLocalMemoryLegacy(); | |||
} | |||
|
|||
// In sycl header __sycl_allocateLocalMemory builtin call is wrapped in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can we not rewrite the SYCL headers to 'inline' these calls? Is there a specific reason? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can we not rewrite the SYCL headers to 'inline' these calls? Is there a specific reason? Thanks
We can't ask users to call __sycl_allocateLocalMemory
internal intrsinsic when documented interface is sycl::ext::something::something::group_local_memory<T>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look OK to me. It will help to get an answer to my high-level question before submission. May be I am missing something here.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, the extension spec explicitly says that those functions can only be within kernel functor scope, but the spec doesn't prevent us from lifting that restriction if we want and can do that.
I assume that this PR exists because syclcompat
has a bug where it uses the exception in a way which is not guaranteed to work (and it doesn't).
I understand that we are trying to fix that issue in syclcompat
to make it work, but by doing so we put ourselves into a weird situation.
The thing is that we are introducing several conceptually incorrect things to our project:
syclcompat
is documented as header-only library. This PR breaks that by introducing compiler support which is necessary for correctness of certainsyclcompat
features. Tagging @intel/syclcompat-lib-reviewers for awareness.- we are setting an example of incorrect usage of the extension in our implementation and looking at the code my guess that it won't work the same way anywhere outside of
syclcompat
library which may confuse those who will dig into implementation details.
The proper fix I think is to either completely and officially remove the limitation of work_group_memory
being only useable at SYCL kernel functor scope, or to rewrite syclcompat
to use some other mechanism for obtaining local memory (we do have several of them; not every is already implemented and not every will fit, but still, worth exploring).
@@ -84,6 +86,42 @@ ModulePass *llvm::createSYCLLowerWGLocalMemoryLegacyPass() { | |||
return new SYCLLowerWGLocalMemoryLegacy(); | |||
} | |||
|
|||
// In sycl header __sycl_allocateLocalMemory builtin call is wrapped in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can we not rewrite the SYCL headers to 'inline' these calls? Is there a specific reason? Thanks
We can't ask users to call __sycl_allocateLocalMemory
internal intrsinsic when documented interface is sycl::ext::something::something::group_local_memory<T>
// distinct global variable. Inlining them here so that this pass doesn't have | ||
// implicit dependency on AlwaysInlinerPass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, instead of having two building blocks (i.e. passes) where each does a specific thing, we now have one of them doing both things.
This doesn't sound good from the high-level design point of view. I understand that having an implicit dependency is probably not a good thing, but are there any reasons to remove the dependency completely?
Because many passes have dependencies on each other and there are mechanism to explicitly tell the pass manager about them: The AnalysisUsage::addRequired<>
and AnalysisUsage::addRequiredTransitive<>
methods. They are mostly used for requesting results of a certain analysis, but they can also be used to request specific transformations to be performed before a certain pass is run. You can find examples of that with LoopSimplify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't sound good from the high-level design point of view. I understand that having an implicit dependency is probably not a good thing, but are there any reasons to remove the dependency completely?
I agree, but I think it is good to making this pass self-contained.
Because many passes have dependencies on each other and there are mechanism to explicitly tell the pass manager about them: The
AnalysisUsage::addRequired<>
andAnalysisUsage::addRequiredTransitive<>
methods. They are mostly used for requesting results of a certain analysis, but they can also be used to request specific transformations to be performed before a certain pass is run. You can find examples of that withLoopSimplify
This would inline other functions that are not related to what we handle in this pass.
@AlexeySachkov The change you reviewed exist because the pass was moved in the pipeline and runs before inlining, the change required because of that, nothing to do with
|
Ok, even if |
The restriction that work_group_memory is only useable at kernel function scope might be a language behavior, rather than implementation limitation due to inlining. The behavior aligns with OpenCL. I have filed a feature request #16617 to address the issue of dependency on inlining.
|
The request may take time to discuss. |
@intel/dpcpp-tools-reviewers Please approve or comment. Thanks. |
Currently SYCLLowerWGLocalMemoryPass must run after AlwaysInlinerPass because in sycl header __sycl_allocateLocalMemory call is wrapped in group_local_memory/group_local_memory_for_overwrite function. Each call to __sycl_allocateLocalMemory represents a unique local memory, so group_local_memory/group_local_memory_for_overwrite must be inlined.
The dependency is implicit and prohibits SYCLLowerWGLocalMemoryPass being moved around in the pass pipeline.
Since the pass transforms __sycl_allocateLocalMemory call to access of global variable @WGLocalMem, moving the pass to beginning of pipeline could enable more optimization than the function call does.
We can't assume backend compiler lowers the global variable after AlwaysInlinerPass.