
[Issue]: Multiple issues in training scripts #2077

Open
IMbackK opened this issue Jan 7, 2025 · 1 comment
IMbackK commented Jan 7, 2025

Problem Description

There appear to be multiple issues in the training scripts:

First we run a simple test application with ROCBLAS_LEVEL=4; from this we obtain a list of executed rocblas calls to tune for:

- { rocblas_function: "rocblas_gemm_batched_ex", atomics_mode: atomics_allowed, a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f32_r", transA: 'T', transB: 'N', M: 128, N: 512, K: 8192, alpha: 2.0, lda: 8192, ldb: 8192, beta: 2.0, ldc: 128, ldd: 128, batch_count: 64, algo: 0, solution_index: 0, flags: none, call_count: 480 }
- { rocblas_function: "rocblas_gemm_ex", atomics_mode: atomics_allowed, a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f32_r", transA: 'T', transB: 'N', M: 4096, N: 2048, K: 14336, alpha: 1.0, lda: 14336, ldb: 14336, beta: 0.0, ldc: 4096, ldd: 4096, algo: 0, solution_index: 0, flags: none, call_count: 744 }

Issue 1

We then try to provision a tuning session like so:
./scripts/provision_tuning.sh tune rocblas.log test.yaml arcturus -m

The above does not work, as the configurations contain flags: none, which is converted to '--flags none' by ConvertToRocBlasBenchCall at:

benchLine = ConvertToRocBlasBenchCall(line)

and then passed to rocblasParser, which does not accept 'none' because it defines --flags as an argument of type int:
lineParser.add_argument("--flags",dest="flags", type=int,default=0)
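The failure can be reproduced in isolation. A minimal sketch (only the parser line is copied from rocblasParser; the argument list is a stand-in for what ConvertToRocBlasBenchCall produces):

```python
import argparse

# Same definition as in rocblasParser: --flags is declared as an int.
lineParser = argparse.ArgumentParser()
lineParser.add_argument("--flags", dest="flags", type=int, default=0)

# ConvertToRocBlasBenchCall passes the YAML value through verbatim,
# so a log entry with "flags: none" becomes "--flags none".
try:
    lineParser.parse_args(["--flags", "none"])
except SystemExit:
    # argparse reports "invalid int value: 'none'" and exits.
    print("parse failed on '--flags none'")

# A numeric value, as the parser expects, is accepted.
print(lineParser.parse_args(["--flags", "0"]).flags)
```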

We work around this issue with the following patch:

diff --git a/tuning/automation/ExtractSizes.py b/tuning/automation/ExtractSizes.py
index 00c5b619..58c24db5 100644
--- a/tuning/automation/ExtractSizes.py
+++ b/tuning/automation/ExtractSizes.py
@@ -731,6 +731,8 @@ def ConvertToRocBlasBenchCall(line):
     sameParams = set(['b_type','c_type','d_type','compute_type','lda','ldb','ldc','ldd','batch','batch_count','algo','solution_index','flags','stride_a','stride_b','stride_c','stride_d','alpha','beta'])
 
     for item in range(2,len(line)):
+        if line[item] == 'flags' and line[item+1] == 'none':
+            line[item+1] = '0'
         if line[item] in sameParams:
             benchLine += ('--'+line[item]+' '+line[item+1]+' ')
         if line[item] == 'transA':

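An alternative to special-casing 'none' in ConvertToRocBlasBenchCall would be to make the parser itself tolerant. A sketch (the name table is an assumption; only 'none' appears in the logs above, so any other symbolic flag names would have to be added):

```python
import argparse

def rocblas_flags(value):
    """Accept either an integer or a symbolic rocblas flag name."""
    names = {"none": 0}  # assumed table; extend if other names show up in logs
    if value in names:
        return names[value]
    return int(value)

lineParser = argparse.ArgumentParser()
lineParser.add_argument("--flags", dest="flags", type=rocblas_flags, default=0)

print(lineParser.parse_args(["--flags", "none"]).flags)  # 0
print(lineParser.parse_args(["--flags", "4"]).flags)     # 4
```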
Issue 2

We immediately encounter the next issue:

From the above rocblas configurations Tensile has generated the following benchmark configuration:

GlobalParameters: {CMakeBuildType: Release, DataInitTypeAB: 0, DataInitTypeBeta: 0, Device: 0, EnqueuesPerSync: 1, ForceRedoBenchmarkProblems: true, ForceRedoLibraryClient: true, ForceRedoLibraryLogic: true, KernelTime: true, LibraryPrintDebug: false, MergeFiles: false, MinimumRequiredVersion: 4.2.0, NumElementsToValidate: 0, PinClocks: false, Platform: 0, PrintSolutionRejectionReason: true, PrintWinnersOnly: 1, ShortNames: false, SleepPercent: 50, SolutionSelectionAlg: 1, SyncsPerBenchmark: 1, ValidationMaxToPrint: 4, ValidationPrintValids: false}
BenchmarkProblems:
- - {Batched: true, ComputeDataType: h, DataType: h, DestDataType: h, OperationType: GEMM,
    TransposeA: true, TransposeB: false, UseBeta: true}
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - GlobalSplitU: [1]
    - FractionalLoad: [1]
    - PrefetchLocalRead: [true]
    - PrefetchGlobalRead: [true]
    - AssertSummationElementMultiple: [2]
    - AssertFree0ElementMultiple: [2]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [4096, 2048, 1, 14336, 4096, 4096, 14336, 14336]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 24, 32]
    - VectorWidth: [2, 4, 8]
    - TransposeLDS: [0, 1]
    - ThreadTile:
      - [4, 4]
      - [8, 4]
      - [4, 8]
      - [8, 8]
      - [6, 4]
      - [4, 6]
    - WorkGroup:
      - [16, 16, 1]
      - [16, 8, 1]
      - [8, 16, 1]
      - [4, 16, 1]
      - [16, 4, 1]
      - [8, 8, 1]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - FractionalLoad: [1]
    - PrefetchLocalRead: [true]
    - PrefetchGlobalRead: [true]
    - AssertSummationElementMultiple: [2]
    - AssertFree0ElementMultiple: [2]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [128, 512, 64, 8192, 128, 128, 8192, 8192]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 24, 32]
    - VectorWidth: [2, 4, 8]
    - GlobalSplitU: [1, 16]
    - TransposeLDS: [0, 1]
    - ThreadTile:
      - [4, 4]
      - [8, 4]
      - [4, 8]
      - [8, 8]
      - [6, 4]
      - [4, 6]
    - WorkGroup:
      - [16, 16, 1]
      - [16, 8, 1]
      - [8, 16, 1]
      - [4, 16, 1]
      - [16, 4, 1]
      - [8, 8, 1]
    InitialSolutionParameters: null
    JoinParameters: null
LibraryLogic:
  ArchitectureName: gfx908
  DeviceNames: [Device 7380, Device 7388, Device 738c, Device 7390]
  ScheduleName: arcturus
LibraryClient:

Unfortunately this is wrong: both rocblas configurations have compute_type: "f32_r", yet the benchmark problem now proudly proclaims ComputeDataType: h. This of course means that whatever we benchmark next will not use MFMA and will never be chosen. Ignoring that we are now benchmarking the wrong thing, we continue with runTensileTuning-all.sh, which succeeds.
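For reference, a sketch of what the translation should produce. The single-character type codes ('h', 's', 'd') are Tensile's; the helper itself and the HighPrecisionAccumulate handling are an assumption about how the f16-in/f32-accumulate case ought to be emitted, not the actual ExtractSizes.py logic:

```python
# Assumed mapping from rocblas type strings to Tensile's one-character codes.
ROCBLAS_TO_TENSILE = {"f16_r": "h", "f32_r": "s", "f64_r": "d"}

def problem_types(a_type, compute_type):
    data = ROCBLAS_TO_TENSILE[a_type]
    compute = ROCBLAS_TO_TENSILE[compute_type]
    # f16 storage with f32 accumulation is the HPA case: the problem
    # should carry ComputeDataType: s, not h, so MFMA solutions apply.
    return {
        "DataType": data,
        "DestDataType": data,
        "ComputeDataType": compute,
        "HighPrecisionAccumulate": data != compute,
    }

# The second GEMM from the log above: a_type f16_r, compute_type f32_r.
print(problem_types("f16_r", "f32_r"))
```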

Issue 3

We next run:
./scripts/provision_verification.sh tune tune/tensile/Tensile arcturus
Unfortunately this script is broken, as

cp "${REFERENCE_LIBRARY_ASM}"/* "${ASM_PATH}"

tries to copy the contents of the path defined here:
REFERENCE_LIBRARY_ASM=${ROCBLAS_PATH}/library/src/blas3/Tensile/Logic/asm_full

using cp. Unfortunately, ${ROCBLAS_PATH}/library/src/blas3/Tensile/Logic/asm_full contains only directories, which cp will not copy without -r, so this operation fails.

Issue 4

Ignoring issue 3, we encounter the next problem:
provision_verification.sh executes ${TENSILE_PATH}/Tensile/Utilities/merge.py, which contains:

from Tensile.Utilities.ConditionalImports import yamlLoader, yamlDumper

This of course requires Tensile to be installed, and to happen to be a version compatible with the one provision_tuning.sh installed, which it is not, so this fails.
We work around this using this patch:

diff --git a/tuning/scripts/provision_verification.sh b/tuning/scripts/provision_verification.sh
index a0fe5a05..82fe55d6 100755
--- a/tuning/scripts/provision_verification.sh
+++ b/tuning/scripts/provision_verification.sh
@@ -193,6 +193,7 @@ if [ "${LIBRARY}" == arcturus ]; then
   fi
 fi
 
+export PYTHONPATH=${TENSILE_PATH}
 MERGE_SCRIPT=${TENSILE_PATH}/Tensile/Utilities/merge.py
 MASSAGE_SCRIPT=${REFERENCE_LIBRARY_ARCHIVE}/massage.py

Issue 5

Unfortunately merge.py remains broken: tune/logs/merge.log is empty and no merged logic files are created.
At this point I gave up. Tensile/tuning is broken to a degree that would be funny if it weren't sad.

Operating System

Ubuntu 24.04

CPU

Epyc 7552

GPU

MI100

ROCm Version

ROCm 6.3.0

ROCm Component

rocBLAS, Tensile

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @IMbackK. Internal ticket has been created to investigate your issue. Thanks!
