
Fix: Driver --batch option sets Window Dimensions. #3770

Open

wants to merge 7 commits into base: develop
Conversation

lakhinderwalia (Contributor):

No description provided.

@lakhinderwalia self-assigned this Jan 20, 2025
@causten (Collaborator) commented Jan 22, 2025

Take a look at the models that are failing in CI. You likely have caught some input parameter assumptions.


codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.29%. Comparing base (5dc0199) to head (d25a969).

Files with missing lines Patch % Lines
src/onnx/onnx_parser.cpp 93.75% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3770      +/-   ##
===========================================
- Coverage    92.29%   92.29%   -0.01%     
===========================================
  Files          519      519              
  Lines        22233    22241       +8     
===========================================
+ Hits         20520    20527       +7     
- Misses        1713     1714       +1     

☔ View full report in Codecov by Sentry.

@lakhinderwalia (Contributor, Author):

The vicuna-fastchat model is failing due to the unspecified dynamic dimensions of its input_ids:
vicuna/encoder_model.onnx --exhaustive-tune --fill1 input_ids --input-dim @input_ids.
The attention_mask input should also be specified.
Reference: #3770 (comment)

@@ -97,6 +97,7 @@ struct onnx_parser
std::unordered_map<std::string, instruction_ref> instructions;
program prog = program();
shape::dynamic_dimension default_dyn_dim_value = {1, 1};
size_t default_dim_value = 0;
Collaborator:

This should be set to 1. The value 0 is not a valid value.


Collaborator:

This should be named batch_dimension to make it clearer that it is setting the batch dimension. Also, a --default-dim flag should be added to the driver to enable the old behavior, since the --batch flag is currently used to set a default dimension.

Contributor (Author):

It isn't so much about the internal variable names as the realization that we might need to preserve the default behavior; otherwise we should raise the exception as is. Thus, either an environment variable or a command-line option should explicitly turn on the old behavior.

Collaborator:

Internal variable names are still important for clarity (as I thought this did something else), but I am also talking about the driver flags. I don't think you understand what the --batch flag does. It does 2 things:

  • Sets the default dim for parameterized dims in the onnx model (similar to --default-dyn-dim)
  • Adjusts the rate on the perf report

The problem you are trying to fix is related to the first item, because it will inadvertently set the window dimensions. We can have --batch set only the first dimension when it is a parameterized dim, and when it is not, we just ignore it so it can still adjust the perf report. The only way for us to know in the parser that we want to set the first dimension is by having a variable such as batch_dimension in the onnx parser.

However, we have used the --batch flag to set the default dims in the past, as it is easier than using the --default-dyn-dim flag for that. So to make this complete we should also add a --default-dim flag which will set the default_dyn_dim_value variable in the parser. This should be a simple change to the driver.
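The split described in this comment (batch for index 0, a separate default for every other parameterized dim) can be sketched as follows. This is an illustrative assumption, not MIGraphX's actual parser code: `resolve_dims` is a hypothetical helper, and -1 is used here to mark a parameterized dim.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch only (resolve_dims is not a MIGraphX function):
// -1 marks a parameterized (dynamic) dim. The batch value replaces
// index 0 only when it is parameterized; every other parameterized
// dim falls back to default_dim rather than the batch.
std::vector<std::size_t> resolve_dims(const std::vector<long>& tensor_dims,
                                      std::size_t batch,
                                      std::size_t default_dim)
{
    std::vector<std::size_t> result;
    for(std::size_t i = 0; i < tensor_dims.size(); ++i)
    {
        long d = tensor_dims[i];
        if(d >= 0)
            result.push_back(static_cast<std::size_t>(d)); // fixed dim
        else if(i == 0)
            result.push_back(batch); // batch only applies at index 0
        else
            result.push_back(default_dim); // other parameterized dims
    }
    return result;
}
```

Under this sketch, an input declared as [N, 3, 224, 224] with --batch 64 resolves to [64, 3, 224, 224], while a second parameterized dim is filled from default_dim rather than the batch.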

Collaborator:

I'd prefer to have it error out so the user is very aware that the batch option is not applicable to the onnx file they are running. Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.

The --batch flag is used on onnx files that already have the batch set, to update the perf report. What do you suggest we do to update the perf report with the batch size? It would be too much of a breaking change to change this.

@lakhinderwalia (Contributor, Author) commented Jan 31, 2025:

> @lakhinderwalia this is very close to what my thoughts were at the end of the call. Although my preference would be to not just ignore the command line argument when the batch is not a parameterized dim, I'd prefer to have it error out so the user is very aware that the batch option is not applicable to the onnx file they are running. Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.

This PR now additionally prints a raw rate besides the previously printed rate, so that any existing test scripts can continue without breaking.
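As a rough sketch of the two statistics being discussed (the struct, helper name `compute_rates`, and exact formulas are assumptions, not the driver's actual code): the raw rate counts whole-model executions per second independent of --batch, while the previously printed rate scales by the batch value.

```cpp
// Hypothetical sketch of the two statistics being discussed; the names
// and formulas are assumptions, not the driver's actual code.
struct perf_rates
{
    double raw_rate; // model executions per second, independent of --batch
    double rate;     // previously printed rate, scaled by the batch value
};

perf_rates compute_rates(double total_seconds, int runs, int batch)
{
    double raw = runs / total_seconds; // executions per second
    return {raw, raw * batch};         // scaled rate multiplies by batch
}
```

With both numbers printed, a batch sweep where only the scaled rate changes (and the raw rate stays flat) is immediately visible.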

Collaborator:

> Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.

Also I don't think there is an easy way to check for an error to avoid this case without preventing valid cases. The current error checking doesn't check if the batch is unused; it is only an error if there is a parameterized dimension, which means we will still have an issue with batch-sweeps on onnx files that don't have a dynamic batch.

Also, throwing an error when it finds a parameterized dim that's not the first dimension will prevent valid use cases beyond changing the perf report. For example, if a model takes two parameters, the first parameter has a batch, and the second parameter has a parameterized dim that's not the first, then setting the batch and using the default dim on the second parameter will throw an error when there is no user error. Furthermore, parse_type can't see all the parameters to check for such errors, and refactoring to handle all these scenarios would make the code so complicated that it is not worth it.

Either way, if QA does a batch-sweep on a model that doesn't support it, we will still get a bug report, but instead of an error about batch it will be a slower rate. Since we are adding a "Raw Rate" we can easily see that the batch was never changing.

Contributor:

Hmm, I see. I don't really like the idea of the --batch argument having 2 different meanings/functions (i.e. sometimes changing the input shape for compile, and always modifying the QPS in the perf report). Ideally a perf report should just say "Given input shape [a, b, c, d], the model takes x s to execute" and that's it. Anything more should really be for the user/tester to interpret (i.e. the user/tester can decide that since the shape is [a, b, c, d], the batch size is a and so the "Rate" is a/x).

But I guess it's too late for that now since the current calculation is expected in many workflows, so I think it's OK to ignore and move on for the static batch dim for now. But I do still think that --batch should only try to alter index 0 of an input shape, and if there are more dynamic dims then just ask the user to specify shapes using --input-dim.

Contributor (Author):

We are adding an additional statistic for a raw rate, which hopefully helps with the perf numbers. And we don't throw any exceptions here, and keep the current workflow.

The main reason for the exception is a serious issue: avoiding an input dimension of size 1px × 1px! We have also had another case where the encoder_sequence_length was being default-set to 1; luckily this PR caught it.
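The hazard described here can be illustrated with a toy version of the old fill-everything behavior (the helper `fill_all_dynamic_dims` and the use of -1 for a parameterized dim are hypothetical, not parser code): with --batch 1, an input declared as 3 × height × width silently becomes 3 × 1 × 1.

```cpp
#include <vector>

// Toy model of the old behavior described above (fill_all_dynamic_dims
// is hypothetical): the batch value fills *every* parameterized dim,
// so --batch 1 on an input declared as [3, height, width] silently
// produces a 3x1x1 tensor instead of failing loudly.
std::vector<int> fill_all_dynamic_dims(const std::vector<int>& dims, int batch)
{
    std::vector<int> out;
    for(int d : dims)
        out.push_back(d < 0 ? batch : d); // -1 marks a parameterized dim
    return out;
}
```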

@@ -646,6 +649,10 @@ shape onnx_parser::parse_type(const onnx::TypeProto& t) const
}
else
{
                if(idx && default_dim_value)
                    MIGRAPHX_THROW("Batch inserted at index " + std::to_string(idx) +
                                   " of " + name);
Collaborator:

What is this error even saying?

Contributor (Author):

The message is a boilerplate string. What it is saying is that a default batch is being inserted at index X, whereas typically the batch should only be at index 0. In this case, X > 0. Please feel free to suggest a different message.

Collaborator:

This shouldn't be an error. The default_dim should be applied to all parameterized dimensions, not just the batch.

@@ -617,12 +617,13 @@ literal onnx_parser::parse_tensor(const onnx::TensorProto& t) const
MIGRAPHX_THROW("PARSE_TENSOR: Invalid tensor type");
}

shape onnx_parser::parse_type(const onnx::TypeProto& t) const
shape onnx_parser::parse_type(const std::string& name, const onnx::TypeProto& t) const
Collaborator:

I don't think a name parameter should be passed to this function. It's not needed.

Contributor (Author):

This was for the exception, so that the user knows where it is coming from.

Collaborator:

The exception should be removed.

idx++;
dynamic_dims.push_back(default_dyn_dim_value);
}
}
Collaborator:

Use std::transform, then check and update the batch afterwards, instead of using the idx increment, which is hard to follow and fragile with early exits:

    std::transform(tensor_dims.begin(),
                   tensor_dims.end(),
                   std::back_inserter(dynamic_dims),
                   [&](auto&& d) -> shape::dynamic_dimension {
                        ...
                   });
    const auto& batch_tensor_dim = tensor_dims.front();
    if(batch_tensor_dim.has_dim_param() and not contains(dim_params, batch_tensor_dim.dim_param()))
        dynamic_dims.front() = {batch_dimension, batch_dimension};
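For illustration, the elided snippet above could be completed roughly as below. `tensor_dim`, `dyn_dim`, and `to_dynamic_dims` are stand-in names, since the real parser works on the ONNX protobuf types (onnx::TensorShapeProto) rather than these toy structs; this is a sketch of the transform-then-patch-the-front pattern, not the actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Stand-in for one ONNX tensor dimension: a fixed value (>= 0) or a
// named parameter (non-empty dim_param). Hypothetical types for
// illustration only; the real parser uses onnx::TensorShapeProto.
struct tensor_dim
{
    long value = -1;
    std::string dim_param;
    bool has_dim_param() const { return not dim_param.empty(); }
};

using dyn_dim = std::pair<std::size_t, std::size_t>; // {min, max}

std::vector<dyn_dim> to_dynamic_dims(const std::vector<tensor_dim>& tensor_dims,
                                     const std::set<std::string>& dim_params,
                                     std::size_t batch_dimension,
                                     dyn_dim default_dyn_dim_value)
{
    std::vector<dyn_dim> dynamic_dims;
    std::transform(tensor_dims.begin(),
                   tensor_dims.end(),
                   std::back_inserter(dynamic_dims),
                   [&](const tensor_dim& d) -> dyn_dim {
                       if(d.value >= 0) // fixed dim: min == max
                           return {static_cast<std::size_t>(d.value),
                                   static_cast<std::size_t>(d.value)};
                       return default_dyn_dim_value; // parameterized dim
                   });
    // Check and update the batch afterwards, instead of tracking an
    // index inside the loop.
    if(not tensor_dims.empty())
    {
        const auto& batch_tensor_dim = tensor_dims.front();
        if(batch_tensor_dim.has_dim_param() and
           dim_params.count(batch_tensor_dim.dim_param) == 0)
            dynamic_dims.front() = {batch_dimension, batch_dimension};
    }
    return dynamic_dims;
}
```

The design point being made: the transform stays a pure element-wise mapping, and the batch special case is a single, clearly visible patch applied afterwards, so there is no fragile index bookkeeping inside the loop.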

Contributor (Author):

It is fragile as it is, agreed, and I would prefer to make it a regular for-loop with an index in it; I will push those changes.

(The limitation of your proposed solution is that, if the transform loop is left as it is, it doesn't raise the exception that we need to flag. There is a recent case where a user unwittingly set up an experiment with batch=1. Because of this, for an input tensor of size 3 × height × width, the 1 is the default value that gets filled in for height and width, and the input tensor size becomes a dangerous 3 × 1 × 1.)

Collaborator:

> It is fragile as it is, agreed,

Then change it to what I showed above.

> I would prefer to make it a regular for-loop

STL algorithms should be preferred. It also forces us to write the code in a cleaner, less fragile way.


{
    if(idx != 0 and default_dim_value != 0)
        MIGRAPHX_THROW("Batch inserted at non-zero index: " + std::to_string(idx) +
                       ", instead set input-dim option for node '" + name + "'");
Collaborator:

Remove the error, since this will throw in valid cases. It will throw if a parameter has multiple parameterized dims and we are setting the batch dimension, and also when there are parameters with batches and other parameters without.

Contributor (Author):

If a parameter has multiple parameterized dims, it should be explicitly set, don't you agree? Setting it all from batch is fraught with serious errors, as a recent user experience showed us, where the width and height were being set to 1, unwittingly.

Collaborator:

> If a parameter has multiple parameterized dims, it should be explicitly set, don't you agree?

No, because it should use the default_dim_value, not the batch_dimension. There are many cases where 1 is an acceptable value, so we don't set anything.

> Setting it all from batch is fraught with serious errors

Of course, I am not suggesting that. I am saying that only the batch dimension should be set by batch; the other dims should be set by the default_dim_value or set explicitly.

> where the width and height was being set to 1, unwittingly.

But with this error we can't even read the model in the driver to show the parameters (i.e. the params command) so that we could update the dims. This error should be removed.

@migraphx-bot (Collaborator):
Test  Batch  Rate new (d25a96)  Rate old (5dc019)  Diff
torchvision-resnet50 64 3,232.09 3,229.26 0.09%
torchvision-resnet50_fp16 64 6,862.17 6,862.73 -0.01%
torchvision-densenet121 32 2,430.50 2,429.91 0.02%
torchvision-densenet121_fp16 32 4,184.12 4,173.61 0.25%
torchvision-inceptionv3 32 1,612.55 1,611.02 0.10%
torchvision-inceptionv3_fp16 32 2,683.53 2,683.26 0.01%
cadene-inceptionv4 16 748.79 749.06 -0.04%
cadene-resnext64x4 16 808.78 808.64 0.02%
slim-mobilenet 64 6,652.94 6,657.58 -0.07%
slim-nasnetalarge 64 198.96 198.90 0.03%
slim-resnet50v2 64 3,425.10 3,424.77 0.01%
bert-mrpc-onnx 8 1,134.94 1,138.15 -0.28%
bert-mrpc-tf 1 474.94 470.57 0.93%
pytorch-examples-wlang-gru 1 424.81 425.82 -0.24%
pytorch-examples-wlang-lstm 1 390.08 397.58 -1.89%
torchvision-resnet50_1 1 781.57 789.46 -1.00%
cadene-dpn92_1 1 412.56 415.00 -0.59%
cadene-resnext101_1 1 388.68 389.35 -0.17%
onnx-taau-downsample 1 371.90 372.43 -0.14%
dlrm-criteoterabyte 1 30.55 30.53 0.07%
dlrm-criteoterabyte_fp16 1 49.13 49.09 0.07%
agentmodel 1 7,535.40 7,273.63 3.60% 🔆
unet_fp16 2 57.60 57.82 -0.39%
resnet50v1_fp16 1 1,005.19 978.89 2.69%
resnet50v1_int8 1 796.05 781.27 1.89%
bert_base_cased_fp16 64 1,172.27 1,171.32 0.08%
bert_large_uncased_fp16 32 362.24 362.15 0.02%
bert_large_fp16 1 198.03 198.55 -0.26%
distilgpt2_fp16 16 2,215.50 2,214.29 0.05%
yolov5s 1 522.24 515.26 1.35%
tinyllama 1 43.40 43.42 -0.04%
vicuna-fastchat 1 43.77 43.81 -0.10%
whisper-tiny-encoder 1 409.92 410.65 -0.18%
whisper-tiny-decoder 1 406.59 406.71 -0.03%

Check results before merge 🔆

@migraphx-bot (Collaborator):

     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

     ✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

❌ dlrm-criteoterabyte: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:650: parse_type: Batch inserted at non-zero index: 1, instead set input-dim option for node 'lS_i'


     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

❌ vicuna-fastchat: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:650: parse_type: Batch inserted at non-zero index: 1, instead set input-dim option for node 'input_ids'


     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance
