[OpenMP][OMPIRBuilder] Set num_teams, thread_limit and trip_count arguments in call to `__tgt_target_kernel` #241

skatrak · 2023-10-10T11:04:00Z

During translation to LLVM IR, this patch gathers the values of num_teams, thread_limit and the trip count so that they can later be used to set up the RTL call to `__tgt_target_kernel`. It handles the different allowed combinations of the teams, distribute, parallel and do directives inside a target region, and it does not impact device codegen.

However, this approach has some known issues:

- If the thread_limit clause is set for the omp.target operation, compilation fails with an error caused by an attempt to remove the operation from which the clause value is initialized. The compiler complains that the operation is being removed while it still has uses. This seems likely to be related to the implicit mapping, but we can ignore this limitation for milestone2b.
- The LLVM values for the num_teams and thread_limit clauses of the omp.teams operation must be compile-time constants, and so must the calculated trip count of the omp.wsloop at the core of the kernel. Otherwise, the compiler crashes. The reason is that these values are defined in an inner scope, so they are not visible to the parent omp.target that tries to access them afterwards. As a result, the num_teams and thread_limit values, as well as the trip count produced by the CanonicalLoopInfo class, end up defined after their use. In addition, the definitions of many of these values end up outlined to a different function.

The parallel_test application can be compiled with these limitations, but real applications likely cannot, because a constant trip count is required (e.g. `do i=1,n` will crash whereas `do i=1,10` will not).

Even though the arguments passed to __tgt_target_kernel now match between flang-new and clang, switching on libomptarget traces shows that the number of teams and threads with which the kernel is launched does not match. It's likely that some kernel attributes that are only set by clang are impacting the defaults for num_teams and thread_limit. As a workaround, the num_teams and thread_limit clauses of the teams directive can be used to set these values by hand.
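The workaround above could look like the following minimal sketch. The values `4` and `64` are arbitrary placeholders chosen for illustration, not defaults used by clang or flang-new, and the loop uses a constant trip count to stay within the limitations listed earlier.

```fortran
program main
  implicit none
  integer :: v(3), i

  ! Initialize on the host.
  do i = 1, 3
    v(i) = 0
  end do

  ! Set the launch configuration by hand instead of relying on defaults.
  ! num_teams(4) and thread_limit(64) are illustrative values only.
  !$omp target teams distribute parallel do num_teams(4) thread_limit(64) map(tofrom: v)
  do i = 1, 3
    v(i) = i
  end do
  !$omp end target teams distribute parallel do

  print *, v
end program main
```

With `LIBOMPTARGET_INFO=-1` set, the libomptarget traces should then show whether the kernel was launched with the requested number of teams and threads.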

skatrak · 2023-10-10T12:49:45Z

After adding support for populating the target-cpu and target-features kernel attributes, the default number of teams and threads used by clang and flang-new still doesn't match. The reason for this divergence must be elsewhere.

skatrak · 2023-10-12T14:48:26Z

Latest commits add support for non-constant num_teams and thread_limit clauses, and also variable trip count. The variable trip count does not currently work, though, because extra arguments to the kernel are not being added to the kernel_args structure. So GPU execution results in a memory access fault (probably trying to dereference a pointer that has not been set by the host). I'm currently investigating this problem.

skatrak · 2023-10-13T11:13:44Z

After some investigation, the remaining issues seem to be unrelated to the trip count. Rather, there seems to be a problem with mapping in certain situations.

- First, implicit mapping for scalars does not seem to be working ~~(this patch is currently based on an older ATD, so maybe rebasing would address this)~~. This causes Example 1 below to fail at run time, because n is not added to the kernel_args structure even though the kernel expects it.
- Explicitly mapping every variable used inside the target region works around this limitation. However, using these outside values doesn't seem to work well either. Example 2 is a simple test case where the first element of the array is expected to be set to the value of k, and yet it prints zeros after the target region is executed.
- Interestingly, in Example 3, manually outlining the kernel body into a declare target subroutine produces the expected result. This happens even though the same variables are mapped in the same way as in Example 2.
- Even more interestingly, Example 4 fails again, leaving all elements of the array set to zero, even though it calls the subroutine that was previously shown to perform the expected computation. However, if these statements are reordered as shown in Example 5, the two array elements are updated as expected.
- It turns out that passing the array variable to some other subroutine makes subsequent updates work. This is shown in Example 6, where the called subroutine does nothing, but as long as it runs before the updates to the array, updates that otherwise wouldn't work now do. That no-op call can also be replaced by a v(1) = v(1) statement, with the same behavior.

I have not been able to introduce any no-ops like the ones in Example 6 into a target teams distribute parallel do that would make Example 1 run. However, this patch does set the trip count, num_teams and thread_limit properly even when they are not compile-time constants, as corroborated by libomptarget traces. These other limitations are what currently prevent us from running these loops.

All examples were compiled with `flang-new -save-temps -fopenmp --offload-arch=gfx1030 test.f90 -o main` and executed with `LIBOMPTARGET_INFO=-1 OMP_TARGET_OFFLOAD=MANDATORY ./main`.

Example 1 -- Runtime crash

program main
  integer :: v(3), n = 3
  do i=1, n
    v(i) = 0
  end do

  !$omp target teams distribute parallel do map(tofrom: v)
  do i=1, n
    v(i) = i
  end do
  !$omp end target teams distribute parallel do
end program main

Example 2 -- Explicit map, computation ignored

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    v(1) = k
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

Example 3 -- Manual outlining, computation performed

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    call writeSingle(v, 1, k)
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine writeSingle(v, i, k)
  integer, intent(inout) :: v(3)
  integer, intent(in) :: i, k
  !$omp declare target
  v(i) = k
end subroutine writeSingle

Example 4 -- Inline and outline modifications, both ignored

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    v(1) = k
    call writeSingle(v, 2, k)
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine writeSingle(v, i, k)
  integer, intent(inout) :: v(3)
  integer, intent(in) :: i, k
  !$omp declare target
  v(i) = k
end subroutine writeSingle

Example 5 -- Inline and outline modifications, both executed

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    call writeSingle(v, 2, k)
    v(1) = k
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine writeSingle(v, i, k)
  integer, intent(inout) :: v(3)
  integer, intent(in) :: i, k
  !$omp declare target
  v(i) = k
end subroutine writeSingle

Example 6 -- No-op introduced, computation performed

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    ! Can be replaced by `v(1) = v(1)`, with the same result.
    call doNothing(v)
    v(1) = k
    v(2) = k+1
    v(3) = k+2
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine doNothing(v)
  integer, intent(inout) :: v(3)
  !$omp declare target
end subroutine doNothing


skatrak · 2023-10-13T15:22:48Z

By comparing the generated LLVM IR for host and device on these example applications, I've been able to identify the source of the issue impacting Examples 2-6. The problem is that the order in which kernel arguments are set for the outlined target region differs between host and device in certain situations. Some examples fail because the pointers for k and for v are swapped on the host but kept in the same order on the device. As a result, this minimal reproducer prints "x = 2 , y = 2" if executed on the device and "x = 1 , y = 1" if executed on the host.

program main
  integer :: x = 1, y = 2
  !$omp target map(tofrom:y, x)
    y = x
  !$omp end target
  print *, "x =", x, ", y =", y
end program main
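The effect of the swap can be modelled on the host alone, without any offloading. The sketch below is only an illustration of the symptom, not the generated code: the `kernel` subroutine and its `dst`/`src` argument names are hypothetical stand-ins for the outlined target region, which expects its arguments in the order (y, x).

```fortran
program swapped_args
  implicit none
  integer :: x = 1, y = 2

  ! The kernel body implements `y = x`, so the matching call would be
  ! kernel(y, x). Passing the arguments in the opposite order mimics
  ! the host and device disagreeing on the kernel argument order.
  call kernel(x, y)

  ! Both variables now hold 2, matching the device result reported above.
  print *, "x =", x, ", y =", y

contains

  ! Hypothetical stand-in for the outlined target region.
  subroutine kernel(dst, src)
    integer, intent(inout) :: dst
    integer, intent(in) :: src
    dst = src
  end subroutine kernel
end program swapped_args
```

Changing the call to `kernel(y, x)` restores the host behavior, where both variables end up holding 1.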

skatrak self-assigned this Oct 10, 2023

skatrak force-pushed the num-teams-init branch from f8dafce to a86cf6f Compare October 10, 2023 13:41

skatrak added 5 commits October 13, 2023 12:23

Remove duplicated definition and call to convertOmpTeams

e02b868

Add target-cpu and target-features kernel attributes.

9d88d66

Support variable num_teams and thread_limit, and avoid crashes when t…

8c3204b

…hread_limit set in the omp.target op

1/2 Support for variable trip count

9d6fe47

skatrak force-pushed the num-teams-init branch from 97501bc to 9d6fe47 Compare October 13, 2023 12:26

gregrodgers merged commit c57822a into ROCm-Developer-Tools:amd-trunk-dev Oct 13, 2023
2 checks passed
