
Fix: Driver --batch option sets Window Dimensions. #3770

Open

wants to merge 7 commits into base: develop
Conversation

lakhinderwalia (Contributor):

No description provided.

@lakhinderwalia self-assigned this Jan 20, 2025
@causten (Collaborator) commented Jan 22, 2025

Take a look at the models that are failing in CI. You likely have caught some input parameter assumptions.


codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.29%. Comparing base (5dc0199) to head (d25a969).

Files with missing lines Patch % Lines
src/onnx/onnx_parser.cpp 93.75% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3770      +/-   ##
===========================================
- Coverage    92.29%   92.29%   -0.01%     
===========================================
  Files          519      519              
  Lines        22233    22241       +8     
===========================================
+ Hits         20520    20527       +7     
- Misses        1713     1714       +1     

☔ View full report in Codecov by Sentry.

@lakhinderwalia (Contributor, Author):

The vicuna-fastchat model is failing due to the unspecified dynamic dimensions of its input_ids:
vicuna/encoder_model.onnx --exhaustive-tune --fill1 input_ids --input-dim @input_ids.
The attention_mask input should also be specified.
Reference: #3770 (comment)

@@ -97,6 +97,7 @@ struct onnx_parser
std::unordered_map<std::string, instruction_ref> instructions;
program prog = program();
shape::dynamic_dimension default_dyn_dim_value = {1, 1};
size_t default_dim_value = 0;
Collaborator:

This should be set to 1. The value 0 is not a valid value.


Collaborator:

This should be named batch_dimension to make it clearer that it is setting the batch dimension. Also, a --default-dim flag should be added to the driver to enable the old behavior, since the --batch flag is currently used to set a default dimension.

Contributor (Author):

It isn't so much about the internal variable names as the realization that we might need to preserve the default behavior; otherwise we should raise the exception as is. Thus, either an environment variable or a command-line option should explicitly turn on the old behavior.

Collaborator:

Internal variable names are still important for clarity (as I thought this did something else), but I am also talking about the driver flags. I don't think you understand what the --batch flag does. It does 2 things:

  • Sets the default dim for parameterized dims in the onnx model (similar to --default-dyn-dim)
  • Adjusts the rate on the perf report

The problem you are trying to fix is related to the first item, because it will inadvertently set the window dimensions. We can have --batch set only the first dimension when it is a parameterized dim, and when it is not, we just ignore it so it can still adjust the perf report. The only way for us to know in the parser that we want to set the first dimension is by having a variable such as batch_dimension in the onnx parser.

However, we have used the --batch flag to set the default dims in the past, as it is easier than using the --default-dyn-dim flag for that. So to make this complete we should also add a --default-dim flag which will set the default_dyn_dim_value variable in the parser. This should be a simple change to the driver.
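The split described in this comment (batch for index 0, a separate default for every other parameterized dim) can be sketched as follows. This is an illustrative assumption, not MIGraphX's actual parser code: `resolve_dims` is a hypothetical helper, and -1 is used here to mark a parameterized dim.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch only (resolve_dims is not a MIGraphX function):
// -1 marks a parameterized (dynamic) dim. The batch value replaces
// index 0 only when it is parameterized; every other parameterized
// dim falls back to default_dim rather than the batch.
std::vector<std::size_t> resolve_dims(const std::vector<long>& tensor_dims,
                                      std::size_t batch,
                                      std::size_t default_dim)
{
    std::vector<std::size_t> result;
    for(std::size_t i = 0; i < tensor_dims.size(); ++i)
    {
        long d = tensor_dims[i];
        if(d >= 0)
            result.push_back(static_cast<std::size_t>(d)); // fixed dim
        else if(i == 0)
            result.push_back(batch); // batch only applies at index 0
        else
            result.push_back(default_dim); // other parameterized dims
    }
    return result;
}
```

Under this sketch, an input declared as [N, 3, 224, 224] with --batch 64 resolves to [64, 3, 224, 224], while a second parameterized dim is filled from default_dim rather than the batch.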

Collaborator:

I'd prefer to have it error out so the user is very aware that the batch option is not applicable to the onnx file they are running. Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.

The --batch flag is used on onnx files that already have the batch set, to update the perf report. What do you suggest we do to update the perf report with the batch size? It would be too much of a breaking change to change this.

@lakhinderwalia (Contributor, Author) commented Jan 31, 2025:

> @lakhinderwalia this is very close to what my thoughts were at the end of the call. Although my preference would be to not just ignore the command line argument when the batch is not a parameterized dim, I'd prefer to have it error out so the user is very aware that the batch option is not applicable to the onnx file they are running. Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.

This PR now additionally prints a raw rate besides the previously printed rate, so that any existing test scripts can continue without breaking.
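As a rough sketch of the two statistics being discussed (the struct, helper name `compute_rates`, and exact formulas are assumptions, not the driver's actual code): the raw rate counts whole-model executions per second independent of --batch, while the previously printed rate scales by the batch value.

```cpp
// Hypothetical sketch of the two statistics being discussed; the names
// and formulas are assumptions, not the driver's actual code.
struct perf_rates
{
    double raw_rate; // model executions per second, independent of --batch
    double rate;     // previously printed rate, scaled by the batch value
};

perf_rates compute_rates(double total_seconds, int runs, int batch)
{
    double raw = runs / total_seconds; // executions per second
    return {raw, raw * batch};         // scaled rate multiplies by batch
}
```

With both numbers printed, a batch sweep where only the scaled rate changes (and the raw rate stays flat) is immediately visible.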

Collaborator:

> Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.

Also I don't think there is an easy way to check for an error to avoid this case without preventing valid cases. The current error checking doesn't check if the batch is unused; it is only an error if there is a parameterized dimension, which means we will still have an issue with batch-sweeps on onnx files that don't have a dynamic batch.

Also, throwing an error when it finds a parameterized dim that's not the first dimension will prevent valid use cases beyond changing the perf report. For example, if a model takes two parameters, the first parameter has a batch, and the second parameter has a parameterized dim that's not the first, then setting the batch and using the default dim on the second parameter will throw an error when there is no user error. Furthermore, parse_type can't see all the parameters to check for such errors, and refactoring to handle all these scenarios would make the code so complicated that it is not worth it.

Either way, if QA does a batch-sweep on a model that doesn't support it, we will still get a bug report, but instead of an error about batch it will be a slower rate. Since we are adding a "Raw Rate" we can easily see that the batch was never changing.

Contributor:

Hmm, I see. I don't really like the idea of the --batch argument having 2 different meanings/functions (i.e. sometimes changing the input shape for compile, and always modifying the QPS in the perf report). Ideally a perf report should just say "Given input shape [a, b, c, d], the model takes x s to execute" and that's it. Anything more should really be for the user/tester to interpret (i.e. the user/tester can decide that since the shape is [a, b, c, d], the batch size is a and so the "Rate" is a/x).

But I guess it's too late for that now since the current calculation is expected in many workflows, so I think it's OK to ignore and move on for the static batch dim for now. But I do still think that --batch should only try to alter index 0 of an input shape, and if there are more dynamic dims then just ask the user to specify shapes using --input-dim.

Contributor (Author):

We are adding an additional statistic for a raw rate, which hopefully helps with the perf numbers. And we don't throw any exceptions here, and keep the current workflow.

The main reason for the exception is a serious issue: avoiding an input dimension of size 1px × 1px! We have also had another case where the encoder_sequence_length was being default-set to 1; luckily this PR caught it.
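The hazard described here can be illustrated with a toy version of the old fill-everything behavior (the helper `fill_all_dynamic_dims` and the use of -1 for a parameterized dim are hypothetical, not parser code): with --batch 1, an input declared as 3 × height × width silently becomes 3 × 1 × 1.

```cpp
#include <vector>

// Toy model of the old behavior described above (fill_all_dynamic_dims
// is hypothetical): the batch value fills *every* parameterized dim,
// so --batch 1 on an input declared as [3, height, width] silently
// produces a 3x1x1 tensor instead of failing loudly.
std::vector<int> fill_all_dynamic_dims(const std::vector<int>& dims, int batch)
{
    std::vector<int> out;
    for(int d : dims)
        out.push_back(d < 0 ? batch : d); // -1 marks a parameterized dim
    return out;
}
```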

@@ -646,6 +649,10 @@ shape onnx_parser::parse_type(const onnx::TypeProto& t) const
}
else
{
                if(idx && default_dim_value)
                    MIGRAPHX_THROW("Batch inserted at index " + std::to_string(idx) +
                                   " of " + name);
Collaborator:

What is this error even saying?

Contributor (Author):

The message is a boilerplate string. What it is saying is that a default batch is being inserted at index X, whereas typically the batch should only be at index 0. In this case, X > 0. Please feel free to suggest a different message.

Collaborator:

This shouldn't be an error. The default_dim should be applied to all parameterized dimensions, not just the batch.

@@ -617,12 +617,13 @@ literal onnx_parser::parse_tensor(const onnx::TensorProto& t) const
MIGRAPHX_THROW("PARSE_TENSOR: Invalid tensor type");
}

shape onnx_parser::parse_type(const onnx::TypeProto& t) const
shape onnx_parser::parse_type(const std::string& name, const onnx::TypeProto& t) const
Collaborator:

I don't think a name parameter should be passed to this function. It's not needed.

Contributor (Author):

This was for the exception, so that the user knows where it is coming from.

Collaborator:

The exception should be removed.

idx++;
dynamic_dims.push_back(default_dyn_dim_value);
}
}
Collaborator:

Use std::transform, then check and update the batch afterwards, instead of using the idx increment, which is hard to follow and fragile with early exits:

    std::transform(tensor_dims.begin(),
                   tensor_dims.end(),
                   std::back_inserter(dynamic_dims),
                   [&](auto&& d) -> shape::dynamic_dimension {
                        ...
                   });
    const auto& batch_tensor_dim = tensor_dims.front();
    if(batch_tensor_dim.has_dim_param() and not contains(dim_params, batch_tensor_dim.dim_param()))
        dynamic_dims.front() = {batch_dimension, batch_dimension};
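For illustration, the elided snippet above could be completed roughly as below. `tensor_dim`, `dyn_dim`, and `to_dynamic_dims` are stand-in names, since the real parser works on the ONNX protobuf types (onnx::TensorShapeProto) rather than these toy structs; this is a sketch of the transform-then-patch-the-front pattern, not the actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Stand-in for one ONNX tensor dimension: a fixed value (>= 0) or a
// named parameter (non-empty dim_param). Hypothetical types for
// illustration only; the real parser uses onnx::TensorShapeProto.
struct tensor_dim
{
    long value = -1;
    std::string dim_param;
    bool has_dim_param() const { return not dim_param.empty(); }
};

using dyn_dim = std::pair<std::size_t, std::size_t>; // {min, max}

std::vector<dyn_dim> to_dynamic_dims(const std::vector<tensor_dim>& tensor_dims,
                                     const std::set<std::string>& dim_params,
                                     std::size_t batch_dimension,
                                     dyn_dim default_dyn_dim_value)
{
    std::vector<dyn_dim> dynamic_dims;
    std::transform(tensor_dims.begin(),
                   tensor_dims.end(),
                   std::back_inserter(dynamic_dims),
                   [&](const tensor_dim& d) -> dyn_dim {
                       if(d.value >= 0) // fixed dim: min == max
                           return {static_cast<std::size_t>(d.value),
                                   static_cast<std::size_t>(d.value)};
                       return default_dyn_dim_value; // parameterized dim
                   });
    // Check and update the batch afterwards, instead of tracking an
    // index inside the loop.
    if(not tensor_dims.empty())
    {
        const auto& batch_tensor_dim = tensor_dims.front();
        if(batch_tensor_dim.has_dim_param() and
           dim_params.count(batch_tensor_dim.dim_param) == 0)
            dynamic_dims.front() = {batch_dimension, batch_dimension};
    }
    return dynamic_dims;
}
```

The design point being made: the transform stays a pure element-wise mapping, and the batch special case is a single, clearly visible patch applied afterwards, so there is no fragile index bookkeeping inside the loop.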

Contributor (Author):

It is fragile as it is, agreed, and I would prefer to make it a regular for-loop with an index in it; I will push those changes.

(The limitation of your proposed solution is that, if the transform loop is left as it is, it doesn't raise the exception that we need to flag. There is a recent case where a user unwittingly set up an experiment with batch=1. Because of this, for an input tensor of size 3 × height × width, the 1 is the default value that gets filled in for height and width, and the input tensor size becomes a dangerous 3 × 1 × 1.)

Collaborator:

> It is fragile as it is, agreed,

Then change it to what I showed above.

> I would prefer to make it a regular for-loop

STL algorithms should be preferred. It also forces us to write the code in a cleaner, less fragile way.


{
    if(idx != 0 and default_dim_value != 0)
        MIGRAPHX_THROW("Batch inserted at non-zero index: " + std::to_string(idx) +
                       ", instead set input-dim option for node '" + name + "'");
Collaborator:

Remove the error, since this will throw in valid cases. It will throw if a parameter has multiple parameterized dims and we are setting the batch dimension, and also when there are parameters with batches and other parameters without.

Contributor (Author):

If a parameter has multiple parameterized dims, it should be explicitly set, don't you agree? Setting it all from batch is fraught with serious errors, as a recent user experience showed us, where the width and height were being set to 1, unwittingly.

Collaborator:

> If a parameter has multiple parameterized dims, it should be explicitly set, don't you agree?

No, because it should use the default_dim_value, not the batch_dimension. There are many cases where 1 is an acceptable value, so we don't set anything.

> Setting it all from batch is fraught with serious errors

Of course, I am not suggesting that. I am saying that only the batch dimension should be set by batch; the other dims should be set by the default_dim_value or set explicitly.

> where the width and height was being set to 1, unwittingly.

But with this error we can't even read the model in the driver to show the parameters (i.e. the params command) so that we could update the dims. This error should be removed.

@migraphx-bot (Collaborator):
Test  Batch  Rate new (d25a96)  Rate old (5dc019)  Diff
torchvision-resnet50 64 3,232.09 3,229.26 0.09%
torchvision-resnet50_fp16 64 6,862.17 6,862.73 -0.01%
torchvision-densenet121 32 2,430.50 2,429.91 0.02%
torchvision-densenet121_fp16 32 4,184.12 4,173.61 0.25%
torchvision-inceptionv3 32 1,612.55 1,611.02 0.10%
torchvision-inceptionv3_fp16 32 2,683.53 2,683.26 0.01%
cadene-inceptionv4 16 748.79 749.06 -0.04%
cadene-resnext64x4 16 808.78 808.64 0.02%
slim-mobilenet 64 6,652.94 6,657.58 -0.07%
slim-nasnetalarge 64 198.96 198.90 0.03%
slim-resnet50v2 64 3,425.10 3,424.77 0.01%
bert-mrpc-onnx 8 1,134.94 1,138.15 -0.28%
bert-mrpc-tf 1 474.94 470.57 0.93%
pytorch-examples-wlang-gru 1 424.81 425.82 -0.24%
pytorch-examples-wlang-lstm 1 390.08 397.58 -1.89%
torchvision-resnet50_1 1 781.57 789.46 -1.00%
cadene-dpn92_1 1 412.56 415.00 -0.59%
cadene-resnext101_1 1 388.68 389.35 -0.17%
onnx-taau-downsample 1 371.90 372.43 -0.14%
dlrm-criteoterabyte 1 30.55 30.53 0.07%
dlrm-criteoterabyte_fp16 1 49.13 49.09 0.07%
agentmodel 1 7,535.40 7,273.63 3.60% 🔆
unet_fp16 2 57.60 57.82 -0.39%
resnet50v1_fp16 1 1,005.19 978.89 2.69%
resnet50v1_int8 1 796.05 781.27 1.89%
bert_base_cased_fp16 64 1,172.27 1,171.32 0.08%
bert_large_uncased_fp16 32 362.24 362.15 0.02%
bert_large_fp16 1 198.03 198.55 -0.26%
distilgpt2_fp16 16 2,215.50 2,214.29 0.05%
yolov5s 1 522.24 515.26 1.35%
tinyllama 1 43.40 43.42 -0.04%
vicuna-fastchat 1 43.77 43.81 -0.10%
whisper-tiny-encoder 1 409.92 410.65 -0.18%
whisper-tiny-decoder 1 406.59 406.71 -0.03%

Check results before merge 🔆

@migraphx-bot (Collaborator):

     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

     ✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

❌ dlrm-criteoterabyte: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:650: parse_type: Batch inserted at non-zero index: 1, instead set input-dim option for node 'lS_i'


     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

❌ vicuna-fastchat: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:650: parse_type: Batch inserted at non-zero index: 1, instead set input-dim option for node 'input_ids'


     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance
