[OpenMP][OMPIRBuilder] Set num_teams, thread_limit and trip_count arguments in call to __tgt_target_kernel
#241
After adding support for populating the target-cpu and target-features kernel attributes, the default numbers of teams and threads used by clang and flang-new still don't match. The reason for this divergence must lie elsewhere.
Force-pushed from f8dafce to a86cf6f.
The latest commits add support for non-constant num_teams and thread_limit clauses, as well as a variable trip count. The variable trip count does not work yet, though, because the extra kernel arguments are not being added to the kernel_args structure. As a result, GPU execution hits a memory access fault (probably from dereferencing a pointer that the host never set). I'm currently investigating this problem.
After some investigation, the remaining issues seem to be unrelated to the trip count. Rather, there seems to be a problem with mapping in certain situations.
I have not been able to introduce any kind of no-ops like the ones in Example 6 that would work in a
All examples compiled with

Example 1 -- Runtime crash
program main
integer :: v(3), n = 3
do i=1, n
v(i) = 0
end do
!$omp target teams distribute parallel do map(tofrom: v)
do i=1, n
v(i) = i
end do
!$omp end target teams distribute parallel do
end program main

Example 2 -- Explicit map, computation ignored
program main
integer :: v(3), k = 5
do i=1, 3
v(i) = 0
end do
!$omp target map(tofrom: v) map(to:k)
v(1) = k
!$omp end target
do i=1, 3
print *, "v(", i, ") = ", v(i)
end do
end program main

Example 3 -- Manual outlining, computation performed
program main
integer :: v(3), k = 5
do i=1, 3
v(i) = 0
end do
!$omp target map(tofrom: v) map(to:k)
call writeSingle(v, 1, k)
!$omp end target
do i=1, 3
print *, "v(", i, ") = ", v(i)
end do
end program main
subroutine writeSingle(v, i, k)
integer, intent(inout) :: v(3)
integer, intent(in) :: i, k
!$omp declare target
v(i) = k
end subroutine writeSingle

Example 4 -- Inline and outline modifications, both ignored
program main
integer :: v(3), k = 5
do i=1, 3
v(i) = 0
end do
!$omp target map(tofrom: v) map(to:k)
v(1) = k
call writeSingle(v, 2, k)
!$omp end target
do i=1, 3
print *, "v(", i, ") = ", v(i)
end do
end program main
subroutine writeSingle(v, i, k)
integer, intent(inout) :: v(3)
integer, intent(in) :: i, k
!$omp declare target
v(i) = k
end subroutine writeSingle

Example 5 -- Inline and outline modifications, both executed
program main
integer :: v(3), k = 5
do i=1, 3
v(i) = 0
end do
!$omp target map(tofrom: v) map(to:k)
call writeSingle(v, 2, k)
v(1) = k
!$omp end target
do i=1, 3
print *, "v(", i, ") = ", v(i)
end do
end program main
subroutine writeSingle(v, i, k)
integer, intent(inout) :: v(3)
integer, intent(in) :: i, k
!$omp declare target
v(i) = k
end subroutine writeSingle

Example 6 -- No-op introduced, computation performed
program main
integer :: v(3), k = 5
do i=1, 3
v(i) = 0
end do
!$omp target map(tofrom: v) map(to:k)
! Can be replaced by `v(1) = v(1)`, with the same result.
call doNothing(v)
v(1) = k
v(2) = k+1
v(3) = k+2
!$omp end target
do i=1, 3
print *, "v(", i, ") = ", v(i)
end do
end program main
subroutine doNothing(v)
integer, intent(inout) :: v(3)
!$omp declare target
end subroutine doNothing
…uments in call to `__tgt_target_kernel`

This patch gathers the information during translation to LLVM IR about the values for num_teams, thread_limit and trip count to later use them to set up the RTL call to `__tgt_target_kernel`. It takes into account different allowed combinations of the teams, distribute, parallel and do directives inside a target region and it does not impact device codegen.

However, this approach has some known issues:

- If the thread_limit clause is set for the omp.target operation, it will result in a compiler error due to an attempt to remove the operation from which it is initialized. It complains about removing it while there are uses remaining. This seems likely to be related to the implicit mapping, but we can ignore this limitation for milestone2.
- The LLVM values for the num_teams and thread_limit clauses of the omp.teams operation must be compile-time constants, as well as the calculated trip count of the omp.wsloop at the core of the kernel. Otherwise, it will result in a compiler crash. The reason for this is that these values are defined in an inner scope, so they are not visible to the parent omp.target that tries to access them after. This results in the num_teams and thread_limit clauses being defined after their use, as well as the trip count produced by the CanonicalLoopInfo class. In addition, the definition of many of these values ends up outlined to a different function.

The parallel_test application can be compiled with these limitations, but it's likely real applications won't due to the need for a constant trip count (i.e. `do i=1,n` will crash whereas `do i=1,10` will not).

Even though the arguments passed to `__tgt_target_kernel` now match between flang-new and clang, switching on libomptarget traces shows that the number of teams and threads with which the kernel is launched does not match. It's likely that some kernel attributes that are only set by clang are impacting the defaults for num_teams and thread_limit. As a workaround, the num_teams and thread_limit clauses of the teams directive can be used to set these values by hand.
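For illustration, the workaround could look roughly like the sketch below. This is not taken from the patch or its tests: the array size and the values `2` and `64` are arbitrary placeholders, and the clauses are placed on a separate teams directive (rather than on the target directive) only because of the thread_limit limitation listed above. The loop bounds are kept constant because of the trip count limitation.

```fortran
program main
  implicit none
  integer :: v(10), i

  v = 0

  ! Map the array explicitly and keep the target directive free of
  ! num_teams/thread_limit clauses.
  !$omp target map(tofrom: v)
  ! Hypothetical launch configuration: 2 teams, at most 64 threads each.
  ! Both values are compile-time constants, as the patch currently requires.
  !$omp teams distribute parallel do num_teams(2) thread_limit(64)
  do i = 1, 10   ! constant trip count; `do i = 1, n` is reported to crash
    v(i) = i
  end do
  !$omp end teams distribute parallel do
  !$omp end target

  print *, "sum(v) =", sum(v)
end program main
```

With libomptarget traces switched on, the reported launch configuration should then reflect the requested values instead of the diverging defaults, although that has not been verified here.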
…hread_limit set in the omp.target op
Force-pushed from 97501bc to 9d6fe47.
Merged commit c57822a into ROCm-Developer-Tools:amd-trunk-dev.
Comparing the generated LLVM IR for host and device on these example applications, I've been able to figure out the source of the issue impacting examples 2-6. The problem is that the order in which kernel arguments are set for the outlined target region differs between host and device in certain situations. Some examples are failing because the pointers for

program main
integer :: x = 1, y = 2
!$omp target map(tofrom:y, x)
y = x
!$omp end target
print *, "x =", x, ", y =", y
end program main