Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][SYCLLowerWGLocalMemoryPass] Remove implicit dependency on AlwaysInlinerPass and move to PipelineStart #16356

Open
wants to merge 13 commits into
base: sycl
Choose a base branch
from

Conversation

wenju-he
Copy link
Contributor

@wenju-he wenju-he commented Dec 13, 2024

Currently SYCLLowerWGLocalMemoryPass must run after AlwaysInlinerPass because in sycl header __sycl_allocateLocalMemory call is wrapped in group_local_memory/group_local_memory_for_overwrite function. Each call to __sycl_allocateLocalMemory represents a unique local memory, so group_local_memory/group_local_memory_for_overwrite must be inlined.

The dependency is implicit and prohibits SYCLLowerWGLocalMemoryPass being moved around in the pass pipeline.

Since the pass transforms __sycl_allocateLocalMemory call to access of global variable @WGLocalMem, moving the pass to beginning of pipeline could enable more optimization than the function call does.

We can't assume backend compiler lowers the global variable after AlwaysInlinerPass.

…ysInlinerPass and move to PipelineStart

Currently SYCLLowerWGLocalMemoryPass must run after AlwaysInlinerPass
because in sycl header __sycl_allocateLocalMemory call is wrapped in
group_local_memory/group_local_memory_for_overwrite function. Each call
to __sycl_allocateLocalMemory represents a unique local memory, so
group_local_memory/group_local_memory_for_overwrite must be inlined.

The dependency is implicit and prohibits SYCLLowerWGLocalMemoryPass
being moved around in the pass pipeline.

Since the pass transforms __sycl_allocateLocalMemory call to access of
global variable @WGLocalMem, moving the pass to beginning of pipeline
could enable more optimization than the function call does.

In addition, intel gpu compiler has a pass to transform global variable
in addrspace(3) to alloca that runs after pipeline basic simplification.
Therefore, we shall run SYCLLowerWGLocalMemoryPass ealier.
clang/lib/CodeGen/BackendUtil.cpp Outdated Show resolved Hide resolved
llvm/lib/SYCLLowerIR/LowerWGLocalMemory.cpp Outdated Show resolved Hide resolved
llvm/lib/SYCLLowerIR/LowerWGLocalMemory.cpp Show resolved Hide resolved
llvm/lib/SYCLLowerIR/LowerWGLocalMemory.cpp Outdated Show resolved Hide resolved
@jsji
Copy link
Contributor

jsji commented Dec 13, 2024

Looks like you need to rebase to pick up the new changes in this pass first.

@wenju-he
Copy link
Contributor Author

Looks like you need to rebase to pick up the new changes in this pass first.

done

@wenju-he wenju-he requested a review from jsji December 13, 2024 02:24
Copy link
Contributor

@jsji jsji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

continue;
}
std::string FName = llvm::demangle(Caller->getName());
if (FName.find("sycl::_V1::ext::oneapi::group_local_memory") ==
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding current function name from DPC++ library is unfortunate. The code in the DPC++ header files can be changed at any time.

To make it more robust, I thought we could go up in the call stack up-to the kernel function ignoring all functions in sycl:: namespace. This will require SYCL kernel to be inlined into kernel function wrapper.

@Naghasan, do you have any thoughts on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is unfortunate, especially w.r.t. upstreaming. I don't know what the plans are for this one but if it is seen as important, we might want to improve this.

This will require SYCL kernel to be inlined into kernel function wrapper.

I don't think this is an issue TBH, I don't see any benefit in not inline the SYCL kernel in the wrapper, even in SPIR-V.

I think relying on an attribute is probably the most flexible: this makes the compiler agnostic to header refactor and changes in API. It is also cheap to add.

I also just realized syclcompat::local_mem uses it, it isn't technically a valid usage of it w.r.t. the extension but something the attribute would allow to correctly handle.

cc @elizabethandrews @joeatodd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think relying on an attribute is probably the most flexible: this makes the compiler agnostic to header refactor and changes in API. It is also cheap to add.

A new attribute "sycl_forceinline" is added in a4fe915
Please review.

@wenju-he wenju-he requested a review from a team as a code owner December 16, 2024 03:28
return false;

bool Changed = false;
for (auto *U : ALMFunc->users()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to use a work list here rather than the simple loop.

This function https://github.com/intel/llvm/blob/sycl/sycl/include/syclcompat/memory.hpp#L71 needs to be updated as well, and this function won't be able to handle the nesting. The CI is currently green because there is no test requesting 2 distinct local memory objects using this function in the same kernel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thank you for the suggestion. Now I understand what you mean by syclcompat::local_mem.
Also added a new test sycl/test/check_device_code/syclcompat_local_mem.cpp that has two calls to syclcompat::local_mem in a kernel.

Copy link
Contributor

@premanandrao premanandrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FE changes look okay to me.

Copy link
Contributor

@uditagarwal97 uditagarwal97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYCL Changes LGTM.

@wenju-he
Copy link
Contributor Author

wenju-he commented Jan 6, 2025

kindly ping @Naghasan @intel/dpcpp-tools-reviewers @intel/syclcompat-lib-reviewers for review

Copy link
Contributor

@Naghasan Naghasan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the new test

Copy link
Contributor

@GeorgeWeb GeorgeWeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The syclcompat changes look good to me, though I'll appreciate if @joeatodd also has a quick look to confirm, if possible.

@wenju-he
Copy link
Contributor Author

kindly ping @intel/dpcpp-tools-reviewers for review

@@ -84,6 +86,42 @@ ModulePass *llvm::createSYCLLowerWGLocalMemoryLegacyPass() {
return new SYCLLowerWGLocalMemoryLegacy();
}

// In sycl header __sycl_allocateLocalMemory builtin call is wrapped in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we not rewrite the SYCL headers to 'inline' these calls? Is there a specific reason? Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we not rewrite the SYCL headers to 'inline' these calls? Is there a specific reason? Thanks

We can't ask users to call __sycl_allocateLocalMemory internal intrsinsic when documented interface is sycl::ext::something::something::group_local_memory<T>

Copy link
Contributor

@asudarsa asudarsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look OK to me. It will help to get an answer to my high-level question before submission. May be I am missing something here.

Thanks

Copy link
Contributor

@AlexeySachkov AlexeySachkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the extension spec explicitly says that those functions can only be within kernel functor scope, but the spec doesn't prevent us from lifting that restriction if we want and can do that.

I assume that this PR exists because syclcompat has a bug where it uses the exception in a way which is not guaranteed to work (and it doesn't).

I understand that we are trying to fix that issue in syclcompat to make it work, but by doing so we put ourselves into a weird situation.

The thing is that we are introducing several conceptually incorrect things to our project:

  • syclcompat is documented as header-only library. This PR breaks that by introducing compiler support which is necessary for correctness of certain syclcompat features. Tagging @intel/syclcompat-lib-reviewers for awareness.
  • we are setting an example of incorrect usage of the extension in our implementation and looking at the code my guess that it won't work the same way anywhere outside of syclcompat library which may confuse those who will dig into implementation details.

The proper fix I think is to either completely and officially remove the limitation of work_group_memory being only useable at SYCL kernel functor scope, or to rewrite syclcompat to use some other mechanism for obtaining local memory (we do have several of them; not every is already implemented and not every will fit, but still, worth exploring).

@@ -84,6 +86,42 @@ ModulePass *llvm::createSYCLLowerWGLocalMemoryLegacyPass() {
return new SYCLLowerWGLocalMemoryLegacy();
}

// In sycl header __sycl_allocateLocalMemory builtin call is wrapped in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we not rewrite the SYCL headers to 'inline' these calls? Is there a specific reason? Thanks

We can't ask users to call __sycl_allocateLocalMemory internal intrsinsic when documented interface is sycl::ext::something::something::group_local_memory<T>

Comment on lines +92 to +93
// distinct global variable. Inlining them here so that this pass doesn't have
// implicit dependency on AlwaysInlinerPass.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, instead of having two building blocks (i.e. passes) where each does a specific thing, we now have one of them doing both things.

This doesn't sound good from the high-level design point of view. I understand that having an implicit dependency is probably not a good thing, but are there any reasons to remove the dependency completely?

Because many passes have dependencies on each other and there are mechanism to explicitly tell the pass manager about them: The AnalysisUsage::addRequired<> and AnalysisUsage::addRequiredTransitive<> methods. They are mostly used for requesting results of a certain analysis, but they can also be used to request specific transformations to be performed before a certain pass is run. You can find examples of that with LoopSimplify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't sound good from the high-level design point of view. I understand that having an implicit dependency is probably not a good thing, but are there any reasons to remove the dependency completely?

I agree, but I think it is good to making this pass self-contained.

Because many passes have dependencies on each other and there are mechanism to explicitly tell the pass manager about them: The AnalysisUsage::addRequired<> and AnalysisUsage::addRequiredTransitive<> methods. They are mostly used for requesting results of a certain analysis, but they can also be used to request specific transformations to be performed before a certain pass is run. You can find examples of that with LoopSimplify

This would inline other functions that are not related to what we handle in this pass.

@Naghasan
Copy link
Contributor

I assume that this PR exists because syclcompat has a bug where it uses the exception in a way which is not guaranteed to work (and it doesn't).

@AlexeySachkov The change you reviewed exist because the pass was moved in the pipeline and runs before inlining, the change required because of that, nothing to do with syclcompat.

syclcompat came into the picture because I highlighted that the existing tests couldn't highlight an issue the patch was bringing.

@AlexeySachkov
Copy link
Contributor

I assume that this PR exists because syclcompat has a bug where it uses the exception in a way which is not guaranteed to work (and it doesn't).

@AlexeySachkov The change you reviewed exist because the pass was moved in the pipeline and runs before inlining, the change required because of that, nothing to do with syclcompat.

syclcompat came into the picture because I highlighted that the existing tests couldn't highlight an issue the patch was bringing.

Ok, even if syclcompat wasn't the origin of how this issue was found, the use of the extension from syclcompat still violates the extension spec and we are still introducing compiler dependency into syclcompat by adding that attribute into the header.

@wenju-he
Copy link
Contributor Author

The proper fix I think is to either completely and officially remove the limitation of work_group_memory being only useable at SYCL kernel functor scope

The restriction that work_group_memory is only useable at kernel function scope might be a language behavior, rather than implementation limitation due to inlining. The behavior aligns with OpenCL.

I have filed a feature request #16617 to address the issue of dependency on inlining.
The feature request also address

we are still introducing compiler dependency into syclcompat by adding that attribute into the header

@wenju-he
Copy link
Contributor Author

I have filed a feature request #16617 to address the issue of dependency on inlining.

The request may take time to discuss.
@AlexeySachkov @intel/dpcpp-tools-reviewers is it ok to merge this pr as it is to unblock our work?

@jsji
Copy link
Contributor

jsji commented Jan 23, 2025

@intel/dpcpp-tools-reviewers Please approve or comment. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants