
[Issue]: Multiple issues in training scripts #2077

Open
IMbackK opened this issue Jan 7, 2025 · 1 comment
IMbackK commented Jan 7, 2025

Problem Description

There appear to be multiple issues in the training scripts:

First we run a simple test application with ROCBLAS_LEVEL=4; from this we obtain a list of executed rocblas calls to tune for:

- { rocblas_function: "rocblas_gemm_batched_ex", atomics_mode: atomics_allowed, a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f32_r", transA: 'T', transB: 'N', M: 128, N: 512, K: 8192, alpha: 2.0, lda: 8192, ldb: 8192, beta: 2.0, ldc: 128, ldd: 128, batch_count: 64, algo: 0, solution_index: 0, flags: none, call_count: 480 }
- { rocblas_function: "rocblas_gemm_ex", atomics_mode: atomics_allowed, a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f32_r", transA: 'T', transB: 'N', M: 4096, N: 2048, K: 14336, alpha: 1.0, lda: 14336, ldb: 14336, beta: 0.0, ldc: 4096, ldd: 4096, algo: 0, solution_index: 0, flags: none, call_count: 744 }

Issue 1

We then try to provision a tuning session like so:
./scripts/provision_tuning.sh tune rocblas.log test.yaml arcturus -m

The above does not work, as the configurations contain flags: none, which is converted to '--flags none' by ConvertToRocBlasBenchCall at:

benchLine = ConvertToRocBlasBenchCall(line)

and then passed to rocblasParser, which does not accept 'none' because it defines --flags as an argument of type int:
lineParser.add_argument("--flags",dest="flags", type=int,default=0)
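The failure can be reproduced in isolation. A minimal sketch (only the parser line is copied from rocblasParser; the argument list is a stand-in for what ConvertToRocBlasBenchCall produces):

```python
import argparse

# Same definition as in rocblasParser: --flags is declared as an int.
lineParser = argparse.ArgumentParser()
lineParser.add_argument("--flags", dest="flags", type=int, default=0)

# ConvertToRocBlasBenchCall passes the YAML value through verbatim,
# so a log entry with "flags: none" becomes "--flags none".
try:
    lineParser.parse_args(["--flags", "none"])
except SystemExit:
    # argparse reports "invalid int value: 'none'" and exits.
    print("parse failed on '--flags none'")

# A numeric value, as the parser expects, is accepted.
print(lineParser.parse_args(["--flags", "0"]).flags)
```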

We work around this issue with the following patch:

diff --git a/tuning/automation/ExtractSizes.py b/tuning/automation/ExtractSizes.py
index 00c5b619..58c24db5 100644
--- a/tuning/automation/ExtractSizes.py
+++ b/tuning/automation/ExtractSizes.py
@@ -731,6 +731,8 @@ def ConvertToRocBlasBenchCall(line):
     sameParams = set(['b_type','c_type','d_type','compute_type','lda','ldb','ldc','ldd','batch','batch_count','algo','solution_index','flags','stride_a','stride_b','stride_c','stride_d','alpha','beta'])
 
     for item in range(2,len(line)):
+        if line[item] == 'flags' and line[item+1] == 'none':
+            line[item+1] = '0'
         if line[item] in sameParams:
             benchLine += ('--'+line[item]+' '+line[item+1]+' ')
         if line[item] == 'transA':

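An alternative to special-casing 'none' in ConvertToRocBlasBenchCall would be to make the parser itself tolerant. A sketch (the name table is an assumption; only 'none' appears in the logs above, so any other symbolic flag names would have to be added):

```python
import argparse

def rocblas_flags(value):
    """Accept either an integer or a symbolic rocblas flag name."""
    names = {"none": 0}  # assumed table; extend if other names show up in logs
    if value in names:
        return names[value]
    return int(value)

lineParser = argparse.ArgumentParser()
lineParser.add_argument("--flags", dest="flags", type=rocblas_flags, default=0)

print(lineParser.parse_args(["--flags", "none"]).flags)  # 0
print(lineParser.parse_args(["--flags", "4"]).flags)     # 4
```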
Issue 2

We immediately encounter the next issue:

From the above rocblas configurations Tensile has generated the following benchmark configuration:

GlobalParameters: {CMakeBuildType: Release, DataInitTypeAB: 0, DataInitTypeBeta: 0, Device: 0, EnqueuesPerSync: 1, ForceRedoBenchmarkProblems: true, ForceRedoLibraryClient: true, ForceRedoLibraryLogic: true, KernelTime: true, LibraryPrintDebug: false, MergeFiles: false, MinimumRequiredVersion: 4.2.0, NumElementsToValidate: 0, PinClocks: false, Platform: 0, PrintSolutionRejectionReason: true, PrintWinnersOnly: 1, ShortNames: false, SleepPercent: 50, SolutionSelectionAlg: 1, SyncsPerBenchmark: 1, ValidationMaxToPrint: 4, ValidationPrintValids: false}
BenchmarkProblems:
- - {Batched: true, ComputeDataType: h, DataType: h, DestDataType: h, OperationType: GEMM,
    TransposeA: true, TransposeB: false, UseBeta: true}
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - GlobalSplitU: [1]
    - FractionalLoad: [1]
    - PrefetchLocalRead: [true]
    - PrefetchGlobalRead: [true]
    - AssertSummationElementMultiple: [2]
    - AssertFree0ElementMultiple: [2]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [4096, 2048, 1, 14336, 4096, 4096, 14336, 14336]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 24, 32]
    - VectorWidth: [2, 4, 8]
    - TransposeLDS: [0, 1]
    - ThreadTile:
      - [4, 4]
      - [8, 4]
      - [4, 8]
      - [8, 8]
      - [6, 4]
      - [4, 6]
    - WorkGroup:
      - [16, 16, 1]
      - [16, 8, 1]
      - [8, 16, 1]
      - [4, 16, 1]
      - [16, 4, 1]
      - [8, 8, 1]
    InitialSolutionParameters: null
    JoinParameters: null
  - BenchmarkCommonParameters:
    - EdgeType: [ShiftPtr]
    - KernelLanguage: [Assembly]
    - LoopTail: [true]
    - FractionalLoad: [1]
    - PrefetchLocalRead: [true]
    - PrefetchGlobalRead: [true]
    - AssertSummationElementMultiple: [2]
    - AssertFree0ElementMultiple: [2]
    BenchmarkFinalParameters:
    - ProblemSizes:
      - Exact: [128, 512, 64, 8192, 128, 128, 8192, 8192]
    BenchmarkForkParameters: null
    BenchmarkJoinParameters: null
    ForkParameters:
    - WorkGroupMapping: [1, 8]
    - DepthU: [8, 16, 24, 32]
    - VectorWidth: [2, 4, 8]
    - GlobalSplitU: [1, 16]
    - TransposeLDS: [0, 1]
    - ThreadTile:
      - [4, 4]
      - [8, 4]
      - [4, 8]
      - [8, 8]
      - [6, 4]
      - [4, 6]
    - WorkGroup:
      - [16, 16, 1]
      - [16, 8, 1]
      - [8, 16, 1]
      - [4, 16, 1]
      - [16, 4, 1]
      - [8, 8, 1]
    InitialSolutionParameters: null
    JoinParameters: null
LibraryLogic:
  ArchitectureName: gfx908
  DeviceNames: [Device 7380, Device 7388, Device 738c, Device 7390]
  ScheduleName: arcturus
LibraryClient:

Unfortunately this is wrong: both rocblas configurations have compute_type: "f32_r", yet the benchmark problem now proudly proclaims ComputeDataType: h. This of course means that whatever we benchmark next will not use MFMA and will never be chosen. Ignoring that we are now benchmarking the wrong thing, we continue with runTensileTuning-all.sh, which succeeds.
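For reference, a sketch of what the translation should produce. The single-character type codes ('h', 's', 'd') are Tensile's; the helper itself and the HighPrecisionAccumulate handling are an assumption about how the f16-in/f32-accumulate case ought to be emitted, not the actual ExtractSizes.py logic:

```python
# Assumed mapping from rocblas type strings to Tensile's one-character codes.
ROCBLAS_TO_TENSILE = {"f16_r": "h", "f32_r": "s", "f64_r": "d"}

def problem_types(a_type, compute_type):
    data = ROCBLAS_TO_TENSILE[a_type]
    compute = ROCBLAS_TO_TENSILE[compute_type]
    # f16 storage with f32 accumulation is the HPA case: the problem
    # should carry ComputeDataType: s, not h, so MFMA solutions apply.
    return {
        "DataType": data,
        "DestDataType": data,
        "ComputeDataType": compute,
        "HighPrecisionAccumulate": data != compute,
    }

# The second GEMM from the log above: a_type f16_r, compute_type f32_r.
print(problem_types("f16_r", "f32_r"))
```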

Issue 3

We next run:
./scripts/provision_verification.sh tune tune/tensile/Tensile arcturus
Unfortunately this script is broken, as

cp "${REFERENCE_LIBRARY_ASM}"/* "${ASM_PATH}"

tries to copy the contents of the path defined here:
REFERENCE_LIBRARY_ASM=${ROCBLAS_PATH}/library/src/blas3/Tensile/Logic/asm_full

using cp. Unfortunately, ${ROCBLAS_PATH}/library/src/blas3/Tensile/Logic/asm_full contains only directories, which cp will not copy without -r, so this operation fails.

Issue 4

Ignoring issue 3, we encounter the next problem:
provision_verification.sh executes ${TENSILE_PATH}/Tensile/Utilities/merge.py, which contains:

from Tensile.Utilities.ConditionalImports import yamlLoader, yamlDumper

This of course requires Tensile to be installed, and to happen to be a version compatible with the one provision_tuning.sh installed, which it is not, so this fails.
We work around this using this patch:

diff --git a/tuning/scripts/provision_verification.sh b/tuning/scripts/provision_verification.sh
index a0fe5a05..82fe55d6 100755
--- a/tuning/scripts/provision_verification.sh
+++ b/tuning/scripts/provision_verification.sh
@@ -193,6 +193,7 @@ if [ "${LIBRARY}" == arcturus ]; then
   fi
 fi
 
+export PYTHONPATH=${TENSILE_PATH}
 MERGE_SCRIPT=${TENSILE_PATH}/Tensile/Utilities/merge.py
 MASSAGE_SCRIPT=${REFERENCE_LIBRARY_ARCHIVE}/massage.py

Issue 5

Unfortunately merge.py remains broken: tune/logs/merge.log is empty and no merged logic files are created.
At this point I gave up. Tensile/tuning is broken to a degree that would be funny if it weren't sad.

Operating System

Ubuntu 24.04

CPU

Epyc 7552

GPU

MI100

ROCm Version

ROCm 6.3.0

ROCm Component

rocBLAS, Tensile

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @IMbackK. Internal ticket has been created to investigate your issue. Thanks!
