Fix: Driver --batch option sets Window Dimensions. #3770
base: develop
Conversation
Take a look at the models that are failing in CI. You likely have caught some input parameter assumptions.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           develop    #3770    +/-   ##
===========================================
- Coverage    92.29%   92.29%   -0.01%
===========================================
  Files          519      519
  Lines        22233    22241       +8
===========================================
+ Hits         20520    20527       +7
- Misses        1713     1714       +1

☔ View full report in Codecov by Sentry.
vicuna-fastchat model is failing due to its unspecified dynamic dimensions for
@@ -97,6 +97,7 @@ struct onnx_parser
    std::unordered_map<std::string, instruction_ref> instructions;
    program prog = program();
    shape::dynamic_dimension default_dyn_dim_value = {1, 1};
    size_t default_dim_value = 0;
This should be set to 1. The value 0 is not a valid value.
This should be named batch_dimension to make it clearer that it is setting the batch dimension. Also, the --default-dim flag should be added to the driver to enable the old behavior, since the --batch flag is currently used to set a default dimension.
It isn't so much about the internal variable names, but the realization that we might need to preserve the default behavior. Otherwise, we should raise the exception as it is. Thus, either an environment variable or a command line option should explicitly turn on the old behavior.
Internal variable names are still important for clarity (as I thought this did something else), but I am also talking about the driver flags. I don't think you understand what the --batch flag does. It does two things:
- Sets the default dim for parameterized dims in the onnx model (similar to --default-dyn-dim)
- Adjusts the rate on the perf report

The problem you are trying to fix is related to the first item, because it will inadvertently set the window dimensions. We can have --batch set only the first dimension when it is a parameterized dim, and when it is not, we just ignore it so it can still adjust the perf report. The only way for us to know in the parser that we want to set the first dimension is by having a variable such as batch_dimension in the onnx parser.

However, we have used the --batch flag to set the default dims in the past, as it is easier than using the --default-dyn-dim flag for that. So to make this complete we should also add a --default-dim flag, which will set the default_dyn_dim_value variable in the parser. This should be a simple change to the driver.
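The two-flag proposal can be sketched roughly as follows. This is a hypothetical Python stand-in, not MIGraphX code; the function and parameter names (resolve_dims, batch, default_dim) are illustrative only:

```python
# Hypothetical sketch of the proposed policy: --batch overrides only a
# *parameterized* first dimension, while a separate --default-dim value
# fills every other parameterized dimension. Fixed dims are never touched.
def resolve_dims(tensor_dims, batch=None, default_dim=1):
    """tensor_dims: ints (fixed dims) or strings (parameterized dim names)."""
    resolved = []
    for idx, d in enumerate(tensor_dims):
        if isinstance(d, int):
            resolved.append(d)               # fixed dim: keep as-is
        elif idx == 0 and batch is not None:
            resolved.append(batch)           # parameterized batch dim
        else:
            resolved.append(default_dim)     # other parameterized dims
    return resolved

# Input ["batch", 3, "height", "width"] with --batch 4 --default-dim 224:
print(resolve_dims(["batch", 3, "height", "width"], batch=4, default_dim=224))
# Static-batch model: --batch is ignored for shapes (perf report only):
print(resolve_dims([8, 3, 224, 224], batch=4))
```

Under this sketch the first call yields [4, 3, 224, 224] and the second leaves the static shape [8, 3, 224, 224] untouched, which matches the behavior described above.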
I'd prefer to have it error out so the user is very aware that the batch option is not applicable to the onnx file they are running. Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.
The --batch flag is used on onnx files that have the batch already set to update the perf report. What do you suggest we do to update the perf report with the batch size? It would be too much of a breaking change to change this.
@lakhinderwalia this is very close to what my thoughts were at the end of the call. Although my preference would be to not just ignore the command line argument when the batch is not a parameterized dim, I'd prefer to have it error out so the user is very aware that the batch option is not applicable to the onnx file they are running. Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.
This PR now additionally prints a raw rate besides the previously printed rate, so that any existing test scripts might continue without breaking.
Ignoring it to "hack" perf reports will only enable QA and others to run things like batch-sweeps on onnx files that do not support it.
Also I don't think there is an easy way to check for an error to avoid this case without preventing valid cases. The current error checking doesn't check if the batch is unused; it is only an error if there is a parameterized dimension, which means we will still have an issue with batch-sweeps on onnx files that don't have a dynamic batch.
Also, throwing an error when it finds a parameterized dim that's not the first dimension will prevent valid use cases beyond changing the perf report. For example, if a model takes two parameters, and the first parameter has a batch while the second parameter has a parameterized dim that's not the first, then setting the batch and using the default dim on the second parameter will throw an error when there is no user error. Furthermore, parse_type can't see all the parameters to check for such errors, and refactoring to handle all these scenarios would make the code so complicated that it is not worth it.
Either way, if QA does a batch-sweep on a model that doesn't support it, we will still get a bug report, but instead of an error about batch it will be a slower rate. Since we are adding a "Raw Rate" we can easily see that the batch was never changing.
Hmm, I see. I don't really like the idea of the --batch argument having two different meanings/functions (i.e. sometimes changing the input shape for compile, and always modifying the QPS in the perf report). Ideally a perf report should just say "Given input shape [a, b, c, d], the model takes x seconds to execute" and that's it. Anything more should really be for the user/tester to interpret (i.e. the user/tester can decide that since the shape is [a, b, c, d], the batch size is a and so the "Rate" is a/x).
But I guess it's too late for that now, since the current calculation is expected in many workflows. I think it's ok to ignore it and move on for the static batch dim for now. But I do still think that --batch should only try to alter index 0 of an input shape, and if there are more dynamic dims then just ask the user to specify shapes using --input-dim.
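The relationship between the two statistics discussed here can be made concrete with a small sketch. This is an illustrative Python snippet, not the driver's actual implementation; the function name and its parameters are assumptions:

```python
# Hedged sketch of the two perf-report statistics: the existing "Rate"
# scales throughput by the batch size (batch / x), while the proposed
# "Raw rate" reports model executions per second regardless of batch.
def perf_rates(total_time_s, iterations, batch):
    time_per_run = total_time_s / iterations  # x: seconds per execution
    raw_rate = 1.0 / time_per_run             # executions per second
    rate = batch / time_per_run               # samples per second (batch / x)
    return rate, raw_rate

# 100 runs in 2 seconds at batch 8: Rate 400 samples/s, Raw rate 50 runs/s
rate, raw_rate = perf_rates(total_time_s=2.0, iterations=100, batch=8)
print(rate, raw_rate)
```

With a static batch that --batch cannot actually change, only the raw rate reflects reality, which is why the "Raw Rate" column makes a silently ignored batch visible.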
We are adding an additional statistic for a Raw rate, so that hopefully helps with the perf numbers. And we don't throw any exceptions here, and keep the current workflow.
The main reason for the exception is a serious issue: avoiding an input dimension of size 1px x 1px! Also, we have had another case where the encoder_sequence_length was being default-set to 1 -- luckily this PR caught it.
src/onnx/onnx_parser.cpp
Outdated
@@ -646,6 +649,10 @@ shape onnx_parser::parse_type(const onnx::TypeProto& t) const
    }
    else
    {
        if(idx && default_dim_value)
            MIGRAPHX_THROW("Batch inserted at index " + std::to_string(idx) +
                           " of " + name);
What is this error even saying?
The message is a boilerplate string. What it is saying is that a default batch is being inserted at index X, whereas typically the batch should only be at index 0. In this case, X > 0. Please feel free to suggest a different message.
This shouldn't be an error. The default_dim should be applied to all parameterized dimensions not just the batch.
@@ -617,12 +617,13 @@ literal onnx_parser::parse_tensor(const onnx::TensorProto& t) const
    MIGRAPHX_THROW("PARSE_TENSOR: Invalid tensor type");
}

-shape onnx_parser::parse_type(const onnx::TypeProto& t) const
+shape onnx_parser::parse_type(const std::string& name, const onnx::TypeProto& t) const
I don't think a name parameter should be passed to this function. It's not needed.
This was for the exception: so that the user knows where it is coming from.
The exception should be removed.
        idx++;
        dynamic_dims.push_back(default_dyn_dim_value);
    }
}
Use std::transform and check and update the batch afterwards, instead of using an idx increment, which is hard to follow and fragile with early exits.
std::transform(tensor_dims.begin(),
tensor_dims.end(),
std::back_inserter(dynamic_dims),
[&](auto&& d) -> shape::dynamic_dimension {
...
});
const auto& batch_tensor_dim = tensor_dims.front();
if(batch_tensor_dim.has_dim_param() and not contains(dim_params, batch_tensor_dim.dim_param()))
dynamic_dims.front() = {batch_dimension, batch_dimension};
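The same "map everything uniformly first, patch the front afterwards" pattern can be sketched in Python. This is a simplified stand-in for the C++ suggestion above, not the parser's real code; the helper name and the tuple representation of a dynamic dimension are assumptions:

```python
# Python mimic of the suggested pattern: transform all dims with the
# default value first (no index bookkeeping), then overwrite the front
# element if it is an unbound dim_param, i.e. the batch dimension.
def make_dynamic_dims(tensor_dims, default_dyn_dim, batch, dim_params):
    """tensor_dims: ints (fixed) or strings (dim_param names).
    dim_params: names the user has already bound explicitly."""
    # step 1: uniform transform over every dimension
    dynamic_dims = [(d, d) if isinstance(d, int) else default_dyn_dim
                    for d in tensor_dims]
    # step 2: patch the batch dim afterwards, only if it is an
    # unbound dim_param
    front = tensor_dims[0]
    if isinstance(front, str) and front not in dim_params:
        dynamic_dims[0] = (batch, batch)
    return dynamic_dims

print(make_dynamic_dims(["batch", 3, 224, 224], (1, 1), 4, dim_params=set()))
```

Because the transform step is total and the batch fix-up happens once at the end, there is no per-iteration counter that an early exit could leave out of sync.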
It is fragile as it is, agreed, and I would prefer to make it a regular for-loop with an index in it, and will push those changes.
(The limitation of your proposed solution is that, if the transform loop is left as it is, it doesn't raise the exception that we need to flag. There is a recent case where a user unwittingly set up an experiment with batch=1. Because of this, for an input tensor of size 3 x height x width, the 1 is the default value that gets filled in for height and width, and the input tensor size becomes a dangerous 3x1x1.)
It is fragile as it is, agreed,
Then change it to what I showed above.
I would prefer to make it a regular for-loop
STL algorithms should be preferred. They also force us to write the code in a cleaner, less fragile way.
{ | ||
if(idx != 0 and default_dim_value != 0) | ||
MIGRAPHX_THROW("Batch inserted at non-zero index: " + std::to_string(idx) + | ||
+", instead set input-dim option for node \'" + name + "\'"); |
Remove the error, since this will throw on valid cases. It will throw an error if a parameter has multiple parameterized dims and we are setting the batch dimension, and also when there are parameters with batches and other parameters without.
If a parameter has multiple parameterized dims, it should be explicitly set, don't you agree? Setting it all from batch is fraught with serious errors, as a recent user experience showed us, where the width and height were being set to 1, unwittingly.
If a parameter has multiple parameterized dims, it should be explicitly set, don't you agree?

No, because it should use the default_dim_value, not the batch_dimension. There are many cases where 1 is an acceptable value, so we don't set anything.

Setting it all from batch is fraught with serious errors

Of course, I am not suggesting that. I am saying that only the batch dimension should be set by batch; the other dims should be set by the default_dim_value or set explicitly.

where the width and height was being set to 1, unwittingly.

But with this error we can't even read the model anymore in the driver to show the parameters (i.e. the params command) so that we could update the dims. This error should be removed.
Check results before merge 🔆
❌ dlrm-criteoterabyte: ERROR - check error output
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in <module>
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
    model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:650: parse_type: Batch inserted at non-zero index: 1, instead set input-dim option for node 'lS_i'

🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output

❌ vicuna-fastchat: ERROR - check error output
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in <module>
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
    model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:650: parse_type: Batch inserted at non-zero index: 1, instead set input-dim option for node 'input_ids'
No description provided.