
Filter out devices that are not run #6277

Merged — 12 commits merged into pytorch:main on Feb 19, 2025

Conversation

huydhn
Contributor

@huydhn huydhn commented Feb 11, 2025

Fixes pytorch/executorch#7986 (for real this time)

The problem is that the records in the benchmark database alone don't have the information to differentiate between benchmarks that failed to run and benchmarks that were not run. Both show up as 0 on the dashboard. Note that we could do a join with the workflow_job table to get this information, but that's a rather slow and expensive route. So, I opted for a quicker approach: keep track of the list of valid devices on the dashboard side. A valid device is one that was run by the selected commit and has at least one non-zero record there.
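As a rough illustration, the valid-device check could look like the sketch below (hypothetical record shape and helper name, not the actual torchci code):

```ts
interface BenchmarkRecord {
  device: string; // e.g. "Apple iPhone 15 Pro"
  arch: string; // e.g. "iOS 17.4"
  model: string;
  metric: string;
  value: number;
}

// A device is "valid" for a commit if that commit ran it and produced
// at least one non-zero record for it.
function getValidDevices(records: BenchmarkRecord[]): Set<string> {
  const validDevices = new Set<string>();
  for (const r of records) {
    if (r.value !== 0) {
      validDevices.add(`${r.device} (${r.arch})`);
    }
  }
  return validDevices;
}
```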

Testing

https://torchci-git-fork-huydhn-better-model-filter-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch

@huydhn huydhn requested review from guangy10 and yangw-dev February 11, 2025 17:31
@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 11, 2025

vercel bot commented Feb 11, 2025

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.


vercel bot commented Feb 11, 2025

The latest updates on your projects:

Name | Status | Updated (UTC)
torchci | ✅ Ready | Feb 19, 2025 8:16pm

@guangy10

guangy10 commented Feb 12, 2025

Thanks for the fix!

Case 1
I compared two commits from main where meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8 and meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 passed on the base commit but failed on the new commit. It shows as a regression on the dashboard (highlighted in red), which is expected. View it on the dashboard

Update: Oh, actually case 1 has the same issue as case 2, where it displays lots of entries as "0->number" but without highlighting them in green, so it's not obvious. Apple jobs aren't run on the new commit, so all entries with Device "Apple" should be hidden from the view.

Case 2
If I swap the base and new commits, I'd expect the dashboard to show it as a "fix of regression", i.e. the red items in the previous view turn green. However, it's behaving differently. View it on the dashboard. Another issue: the Apple jobs are not triggered on the base commit, so all entries with Device "Apple" should be hidden from the view (nothing to compare).

Case 3
I compared two commits from main where meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8 and meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 failed on both commits. These models cannot be found on the dashboard, which is expected, I think, because it's neither a fix nor a regression. Here is the dashboard view


@guangy10 guangy10 left a comment


There are some issues to look into further

@@ -27,12 +27,22 @@ export const IS_INCREASING_METRIC_VALUE_GOOD: { [k: string]: boolean } = {
flops_utilization: true,
"compilation_time(s)": false,
speedup: true,
"avg_inference_latency(ms)": false,
Contributor Author


I forgot that we need to update this to highlight the metrics correctly. A better solution is probably needed as part of pytorch/executorch#8239

@huydhn
Contributor Author

huydhn commented Feb 12, 2025

@guangy10 The new version is ready. Please take a look

@huydhn huydhn requested a review from guangy10 February 12, 2025 21:05
@huydhn
Contributor Author

huydhn commented Feb 12, 2025

Case 4

There is a case where there is no overlap between the two commits at all, e.g. one runs all Android jobs and the other runs all iOS jobs. In this case, I opted to just show the results from the right commit, but without any highlighting. For example:

  • Right commit has all iOS run, left commit has all Android run dashboard
  • And the other way around dashboard
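A rough sketch of that fallback, assuming a simple row shape (the names here are hypothetical, not the actual torchci code):

```ts
// If the base (left) commit shares no devices with the new (right) commit,
// show the right commit's rows as-is and skip comparison highlighting.
function rowsToDisplay<T extends { device: string }>(
  leftDevices: Set<string>,
  rightRows: T[]
): { rows: T[]; highlight: boolean } {
  const overlap = rightRows.filter((r) => leftDevices.has(r.device));
  if (overlap.length === 0) {
    return { rows: rightRows, highlight: false }; // no common devices
  }
  return { rows: overlap, highlight: true }; // compare only shared devices
}
```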

@guangy10

Case 4

There is a case where there is no overlap between the two commits at all, e.g. one runs all Android jobs and the other runs all iOS jobs. In this case, I opted to just show the results from the right commit, but without any highlighting. For example:

  • Right commit has all iOS run, left commit has all Android run dashboard
  • And the other way around dashboard

Should we show nothing (an empty board) instead, since there is no overlap?

Discovered a new issue: if you toggle the "Platform" filter, the irrelevant entries come back.

@guangy10

qq: If a benchmark job passed after a retry, would the new result be uploaded and merged into the DB? I know the upload-benchmark-results step is rerun on retry, but I'm not sure how the data gets updated in the DB. I want to make sure it's behaving correctly.

@guangy10

guangy10 commented Feb 12, 2025

Case 4

There is a case where there is no overlap between the two commits at all, e.g. one runs all Android jobs and the other runs all iOS jobs. In this case, I opted to just show the results from the right commit, but without any highlighting. For example:

  • Right commit has all iOS run, left commit has all Android run dashboard
  • And the other way around dashboard

Actually I'm thinking of this UX.

  • By default, can we disable cross-commit comparison? Just display all entries from the new commit.
  • Compare entries only if the user selects a base commit. And the list of selectable commits should only show commits that have overlapping entries with the new commit. (This way there is no edge case 4.)

This seems like a UI improvement that @yangw-dev can help with, if you decide to handle it separately?

@huydhn
Contributor Author

huydhn commented Feb 13, 2025

qq: If a benchmark job passed after a retry, would the new result be uploaded and merged into the DB? I know the upload-benchmark-results step is rerun on retry, but I'm not sure how the data gets updated in the DB. I want to make sure it's behaving correctly.

No worries, a retry would just upload the result as usual because the upload is a step in the workflow

@huydhn
Contributor Author

huydhn commented Feb 13, 2025

Should we show nothing (an empty board) instead, since there is no overlap?

I feel that showing an empty page is just as confusing. And yes, we can default to showing all entries from the latest (right) commit on the landing page.

Discovered a new issue: if you toggle the "Platform" filter, the irrelevant entries come back.

Oh ok, this needs to be handled... Wait, this is actually correct: when you select a Platform, for example, all the commits that don't have that Platform are removed. Notice that the list of commits changes. That was the original implementation I had to fix this issue, until we decided to add the All Platforms option. So they are not irrelevant entries, but legit discrepancies.

@huydhn
Contributor Author

huydhn commented Feb 13, 2025

Actually I'm thinking of this UX.

  • By default, can we disable cross-commit comparison? Just display all entries from the new commit.
  • Compare entries only if the user selects a base commit. And the list of selectable commits should only show commits that have overlapping entries with the new commit. (This way there is no edge case 4.)

This seems like a UI improvement that @yangw-dev can help with, if you decide to handle it separately?

Agree, let's do this in a separate PR.

@guangy10

Actually I'm thinking of this UX.

  • By default, can we disable cross-commit comparison? Just display all entries from the new commit.
  • Compare entries only if the user selects a base commit. And the list of selectable commits should only show commits that have overlapping entries with the new commit. (This way there is no edge case 4.)

This seems like a UI improvement that @yangw-dev can help with, if you decide to handle it separately?

Agree, let's do this in a separate PR.

if we are not doing this in this PR, can you file a GitHub issue for it?

@huydhn
Contributor Author

huydhn commented Feb 14, 2025

if we are not doing this in this PR, can you file a GitHub issue for it?

#6293 captures the gist of it

So, here are the cases that have been fixed (a short sketch of the device rule follows the list):

  • Only show devices that are covered by both the base and the new commit
  • If a model is failing or not running on both commits, it won't show up at all, as there are no records in either case
  • If both commits run the same type of benchmark (Android or iOS), only highlight the models from the new commit (dashboard)
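The first rule amounts to intersecting the two commits' entries. A minimal sketch, assuming a simple string key (not the PR's exact code):

```ts
// Keep an entry only when both the base (left) and new (right) commits
// have a record for the same (model, device) pair.
type EntryKey = string; // e.g. `${model}|${device}`

function visibleEntries(
  leftKeys: Set<EntryKey>,
  rightKeys: Set<EntryKey>
): Set<EntryKey> {
  return new Set([...rightKeys].filter((k) => leftKeys.has(k)));
}
```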

@guangy10

Something is still quite confusing.

[Screenshot taken 2025-02-13 at 7:58 PM]

Why do some entries show only one data point? Shouldn't it always be like "num1 -> num2", since we always compare the base and new commits?
Also, for llama3_coreml_ane, it failed on base but passed on new; shouldn't it be displayed as a fix of a regression?

See the details of the base and new commits below:
[Screenshot taken 2025-02-13 at 8:04 PM]

@huydhn
Contributor Author

huydhn commented Feb 14, 2025

Something is still quite confusing.

Oh, I realized that I had logic to show only one number if the value from the base and the new commit is the same. I have removed that. So now, whenever there is only one number, it means that only the new commit has benchmark results.

The highlight issue has been fixed, I hope (dashboard). The tweaking here is difficult to get right because I need to deduce the info there instead of just showing a failed benchmark run. I created an issue for that here: #6294
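In other words, the cell formatter now behaves roughly like this (a hedged sketch, not the actual torchci code):

```ts
// A lone number now means the base (left) commit has no result for the
// entry; otherwise both values are shown as L → R, even when equal.
function formatCell(left: number | undefined, right: number): string {
  return left === undefined ? `${right}` : `${left} → ${right}`;
}
```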

@guangy10

Something is still quite confusing.

Oh, I realized that I had logic to show only one number if the value from the base and the new commit is the same. I have removed that. So now, whenever there is only one number, it means that only the new commit has benchmark results.

The highlight issue has been fixed, I hope (dashboard). The tweaking here is difficult to get right because I need to deduce the info there instead of just showing a failed benchmark run. I created an issue for that here: #6294

In this new revision, there are lots of entries marked as "0->number" even though the jobs ran successfully on both commits.

@guangy10

If we merge the PR like this, the UX won't be better, as many entries are still misleading. We should probably narrow this PR to only fix what can be fixed independently of saving failed runs to the DB. Then we should think about how we can split the issue: what you can help with and what Yang can take over.

@huydhn
Contributor Author

huydhn commented Feb 19, 2025

To sum up, what has been fixed by this PR is:

  1. Show only devices and models that are available from both the left and right commits
  2. If there is no corresponding benchmark on the left commit, show only the value from the right commit instead of L → R

This ensures that anything highlighted on the dashboard is legit.

What has not yet been fixed is that CI failures won't show up on the dashboard, as there is currently no way to differentiate between having no data because of a CI failure vs. having no data because the benchmark didn't run. This will be covered next by #6294.

@guangy10

To sum up, what has been fixed by this PR is:

  1. Show only devices and models that are available from both the left and right commits
  2. If there is no corresponding benchmark on the left commit, show only the value from the right commit instead of L → R

This ensures that anything highlighted on the dashboard is legit.

What has not yet been fixed is that CI failures won't show up on the dashboard, as there is currently no way to differentiate between having no data because of a CI failure vs. having no data because the benchmark didn't run. This will be covered next by #6294.

Yeah, it sounds like a concrete fix that we can land without #6294 and #6293. Just to clarify, when we compare the devices, does it include both the device name and the OS version, and must both match exactly?

@huydhn
Contributor Author

huydhn commented Feb 19, 2025

Yeah, it sounds like a concrete fix that we can land without #6294 and #6293. Just to clarify, when we compare the devices, does it include both the device name and the OS version, and must both match exactly?

Yup, they must match on both the left and right commits

@guangy10

Yeah, it sounds like a concrete fix that we can land without #6294 and #6293. Just to clarify, when we compare the devices, does it include both the device name and the OS version, and must both match exactly?

Yup, they must match on both the left and right commits

Can we loosen the device match to only cover the major OS version instead? That is, "iPhone 15 Plus iOS 17.4" on the left commit would be considered a match with "iPhone 15 Plus iOS 17.2" on the right commit.

@huydhn
Contributor Author

huydhn commented Feb 19, 2025

Can we loosen the device match to only cover the major OS version instead? That is, "iPhone 15 Plus iOS 17.4" on the left commit would be considered a match with "iPhone 15 Plus iOS 17.2" on the right commit.

Done, you can see that the matching conditions https://github.com/pytorch/test-infra/pull/6277/files#diff-8b946f720239823df7504798036de4008e6028f8961d7a1d04de6969d9b784b7R112 (sketched below) are:

  1. Device name without the arch part (e.g. "iOS 17.2")
  2. Model + backend name
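A hedged sketch of such a match key (the device-string format and helper names are assumptions, not the PR's exact code):

```ts
// Drop the arch suffix, e.g. "Apple iPhone 15 Plus (iOS 17.2)" ->
// "Apple iPhone 15 Plus", so OS-version bumps between commits still match.
function deviceKey(deviceName: string): string {
  return deviceName.replace(/\s*\([^)]*\)\s*$/, "").trim();
}

// Entries from the two commits match when model, backend, and the
// arch-stripped device name all agree.
function matchKey(model: string, backend: string, device: string): string {
  return `${model}|${backend}|${deviceKey(device)}`;
}
```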


@guangy10 guangy10 left a comment


Good to go! 🚀

@huydhn huydhn merged commit 5de69da into pytorch:main Feb 19, 2025
6 checks passed
Labels
CLA Signed

Successfully merging this pull request may close these issues.

[DevX] Commits comparison via dashboard is misleading