This repository has been archived by the owner on Jan 20, 2024. It is now read-only.

[OpenMP][OMPIRBuilder] Set num_teams, thread_limit and trip_count arguments in call to __tgt_target_kernel #241

Merged

@skatrak skatrak commented Oct 10, 2023

This patch gathers the information during translation to LLVM IR about the values for num_teams, thread_limit and trip count to later use them to set up the RTL call to __tgt_target_kernel. It takes into account different allowed combinations of the teams, distribute, parallel and do directives inside a target region and it does not impact device codegen.

However, this approach has some known issues:

  • If the thread_limit clause is set for the omp.target operation, it will result in a compiler error due to an attempt to remove the operation from which it is initialized. The compiler complains that the operation is being removed while it still has uses. This seems likely to be related to the implicit mapping, but we can ignore this limitation for milestone2b.
  • The LLVM values for the num_teams and thread_limit clauses of the omp.teams operation must be compile-time constants, as must the calculated trip count of the omp.wsloop at the core of the kernel. Otherwise, the compiler crashes. The reason is that these values are defined in an inner scope, so they are not visible to the parent omp.target that tries to access them afterwards. As a result, the num_teams and thread_limit values, as well as the trip count produced by the CanonicalLoopInfo class, end up being defined after their use. In addition, the definition of many of these values ends up outlined to a different function.
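
To illustrate why the loop bounds matter for the second limitation, here is a minimal sketch, assuming the iteration-count formula from the Fortran standard for a do loop. It is not necessarily how CanonicalLoopInfo derives the value, and trip_count is a hypothetical helper written only for this illustration.

```fortran
program trip_count_demo
  implicit none
  integer :: n

  n = 3
  ! Literal bounds: the result can be folded to a compile-time constant.
  print *, trip_count(1, 10, 1)
  ! Bound held in a variable: the result depends on the run-time value of n.
  print *, trip_count(1, n, 1)

contains

  ! Iteration count of "do i = first, last, step" (step must be nonzero).
  pure integer function trip_count(first, last, step)
    integer, intent(in) :: first, last, step
    trip_count = max((last - first + step) / step, 0)
  end function trip_count

end program trip_count_demo
```

With literal bounds the count is known while compiling. With a bound such as n it is only available as a value computed at run time, which is the case this version of the patch cannot handle.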

The parallel_test application can be compiled with these limitations, but it's likely that real applications won't, due to the need for a constant trip count (e.g. do i=1,n will crash whereas do i=1,10 will not).

Even though the arguments passed to __tgt_target_kernel now match between flang-new and clang, switching on libomptarget traces shows that the number of teams and threads with which the kernel is launched does not match. It's likely that some kernel attributes that are only set by clang are impacting the defaults for num_teams and thread_limit. As a workaround, the num_teams and thread_limit clauses of the teams directive can be used to set these values by hand.
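
As a sketch of that workaround, the following program sets both clauses explicitly on the teams directive. The values 2 and 64 are arbitrary assumptions chosen for illustration, and the loop uses constant bounds to stay within the trip count limitation described above.

```fortran
program main
  implicit none
  integer :: v(3), i

  v = 0

  !$omp target map(tofrom: v)
  ! Assumed values: request 2 teams with at most 64 threads each.
  !$omp teams num_teams(2) thread_limit(64)
  !$omp distribute parallel do
  do i = 1, 3
    v(i) = i
  end do
  !$omp end distribute parallel do
  !$omp end teams
  !$omp end target

  print *, v
end program main
```

The clauses are placed on a separate teams directive rather than on a combined construct so that it is unambiguous that they belong to teams and not to target.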

@skatrak skatrak self-assigned this Oct 10, 2023

skatrak commented Oct 10, 2023

After adding support for populating the target-cpu and target-features kernel attributes, the default number of teams and threads used by clang and flang-new still doesn't match. The reason for this divergence must be elsewhere.


skatrak commented Oct 12, 2023

Latest commits add support for non-constant num_teams and thread_limit clauses, and also variable trip count. The variable trip count does not currently work, though, because extra arguments to the kernel are not being added to the kernel_args structure. So GPU execution results in a memory access fault (probably trying to dereference a pointer that has not been set by the host). I'm currently investigating this problem.


skatrak commented Oct 13, 2023

After some investigation, the remaining issues seem to be unrelated to the trip count. Rather, there seems to be a problem with mapping in certain situations.

  • First, implicit mapping for scalars does not seem to be working (this patch is currently based on an older ATD, so maybe rebasing would address this). This results in Example 1 below failing to run, because n is not added to the kernel_args structure even though the kernel expects it.
  • By explicitly mapping every variable used inside the target region, we can work around this limitation. However, using these outside values doesn't seem to work well either. Example 2 is a simple test case where the first element of the array is expected to be set to the value of k, and yet it prints zeros after the target region is executed.
  • Interestingly, in Example 3, by manually outlining the kernel body into a declare target subroutine, we obtain the expected result. This happens even though the same variables are mapped in the same way as in Example 2.
  • What's even more interesting is that Example 4 fails again, leaving all elements of the array set to zero, even though it calls the subroutine that we previously verified to perform the expected computation. However, if we reorder these statements as shown in Example 5, then the two array elements are updated as expected.
  • It turns out that passing the array variable to some other subroutine first makes subsequent updates work. This is shown in Example 6, where the called subroutine does nothing, but as long as it runs before the updates to the array, updates that otherwise wouldn't work now do. That no-op call can also be replaced by a v(1) = v(1) statement with the same behavior.

I have not been able to find any no-op like the ones in Example 6 that would work in a target teams distribute parallel do and so make Example 1 run as a workaround. However, this patch does set the trip count, num_teams and thread_limit properly even when they are not compile-time constants, as corroborated by libomptarget traces. These other limitations are what currently prevent us from being able to run these loops.

All examples were compiled with flang-new -save-temps -fopenmp --offload-arch=gfx1030 test.f90 -o main and executed with LIBOMPTARGET_INFO=-1 OMP_TARGET_OFFLOAD=MANDATORY ./main.

Example 1 -- Runtime crash

program main
  integer :: v(3), n = 3
  do i=1, n
    v(i) = 0
  end do

  !$omp target teams distribute parallel do map(tofrom: v)
  do i=1, n
    v(i) = i
  end do
  !$omp end target teams distribute parallel do
end program main

Example 2 -- Explicit map, computation ignored

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    v(1) = k
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

Example 3 -- Manual outlining, computation performed

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    call writeSingle(v, 1, k)
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine writeSingle(v, i, k)
  integer, intent(inout) :: v(3)
  integer, intent(in) :: i, k
  !$omp declare target
  v(i) = k
end subroutine writeSingle

Example 4 -- Inline and outline modifications, both ignored

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    v(1) = k
    call writeSingle(v, 2, k)
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine writeSingle(v, i, k)
  integer, intent(inout) :: v(3)
  integer, intent(in) :: i, k
  !$omp declare target
  v(i) = k
end subroutine writeSingle

Example 5 -- Inline and outline modifications, both executed

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    call writeSingle(v, 2, k)
    v(1) = k
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine writeSingle(v, i, k)
  integer, intent(inout) :: v(3)
  integer, intent(in) :: i, k
  !$omp declare target
  v(i) = k
end subroutine writeSingle

Example 6 -- No-op introduced, computation performed

program main
  integer :: v(3), k = 5
  do i=1, 3
    v(i) = 0
  end do

  !$omp target map(tofrom: v) map(to:k)
    ! Can be replaced by `v(1) = v(1)`, with the same result.
    call doNothing(v)
    v(1) = k
    v(2) = k+1
    v(3) = k+2
  !$omp end target

  do i=1, 3
    print *, "v(", i, ") = ", v(i)
  end do
end program main

subroutine doNothing(v)
  integer, intent(inout) :: v(3)
  !$omp declare target
end subroutine doNothing

@gregrodgers gregrodgers merged commit c57822a into ROCm-Developer-Tools:amd-trunk-dev Oct 13, 2023
2 checks passed

skatrak commented Oct 13, 2023

Comparing the generated LLVM IR for host and device on these example applications, I've been able to figure out the source of the issue impacting Examples 2-6. The problem is that the order in which kernel arguments are set for the outlined target region differs between host and device in certain situations. Some examples fail because the pointers for k and v are swapped on the host but kept in the original order on the device. As a result, this minimal reproducer prints "x = 2 , y = 2" if executed on the device and "x = 1 , y = 1" if executed on the host.

program main
  integer :: x = 1, y = 2
  !$omp target map(tofrom:y, x)
    y = x
  !$omp end target
  print *, "x =", x, ", y =", y
end program main
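
The effect can be mimicked on the host alone, without any offloading. The following is a minimal sketch, assuming the mismatch amounts to the caller and the callee disagreeing on argument order. The subroutine kernel and its dummy argument names are hypothetical stand-ins for the outlined target region, not the code that flang-new actually generates.

```fortran
program swapped_args
  implicit none
  integer :: x = 1, y = 2

  ! The "kernel" expects its arguments as (y, x), but the caller passes
  ! (x, y), imitating the host/device disagreement on argument order.
  call kernel(x, y)

  print *, "x =", x, ", y =", y

contains

  ! Stand-in for the outlined target region, whose body is "y = x".
  subroutine kernel(y_arg, x_arg)
    integer, intent(inout) :: y_arg, x_arg
    y_arg = x_arg
  end subroutine kernel

end program swapped_args
```

Because the two arguments are swapped at the call site, the assignment ends up writing the value of y into x, so both variables hold 2 afterwards. This matches the values observed when the reproducer runs on the device.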
