
Filter out devices that are not run #6277

Merged — 12 commits merged into pytorch:main on Feb 19, 2025

Conversation

huydhn
Contributor

@huydhn huydhn commented Feb 11, 2025

Fixes pytorch/executorch#7986 (for real this time)

The problem is that the records in the benchmark database alone don't have the information to differentiate between benchmarks that failed to run and benchmarks that were not run. Both show up as 0 on the dashboard. Note that we could do a join with the workflow_job table to get this information, but that's a rather slow and expensive route. So, I opted for a quicker approach: keep track of the list of valid devices on the dashboard side. A valid device is one that was run by the selected commit and has at least one non-zero record there.
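As a rough illustration, the valid-device check could look like the sketch below (hypothetical record shape and helper name, not the actual torchci code):

```ts
interface BenchmarkRecord {
  device: string; // e.g. "Apple iPhone 15 Pro"
  arch: string; // e.g. "iOS 17.4"
  model: string;
  metric: string;
  value: number;
}

// A device is "valid" for a commit if that commit ran it and produced
// at least one non-zero record for it.
function getValidDevices(records: BenchmarkRecord[]): Set<string> {
  const validDevices = new Set<string>();
  for (const r of records) {
    if (r.value !== 0) {
      validDevices.add(`${r.device} (${r.arch})`);
    }
  }
  return validDevices;
}
```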

Testing

https://torchci-git-fork-huydhn-better-model-filter-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch

@huydhn huydhn requested review from guangy10 and yangw-dev February 11, 2025 17:31
@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 11, 2025

vercel bot commented Feb 11, 2025

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.


vercel bot commented Feb 11, 2025

The latest updates on your projects:

Name | Status | Updated (UTC)
torchci | ✅ Ready | Feb 19, 2025 8:16pm

@guangy10

guangy10 commented Feb 12, 2025

Thanks for the fix!

Case 1
I compared two commits from main where meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8 and meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 passed on the base commit but failed on the new commit. It shows as a regression on the dashboard (highlighted in red), which is expected. View it on the dashboard

Update: Oh, actually case 1 has the same issue as case 2, where it displays lots of entries as "0->number" but without highlighting them in green, so it's not obvious. Apple jobs aren't run on the new commit, so all entries with Device "Apple" should be hidden from the view.

Case 2
If I swap the base and new commits, I'd expect the dashboard to show it as a "fix of regression", i.e. the red items in the previous view turn green. However, it's behaving differently. View it on the dashboard. Another issue: the Apple jobs are not triggered on the base commit, so all entries with Device "Apple" should be hidden from the view (nothing to compare).

Case 3
I compared two commits from main where meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8 and meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 failed on both commits. These models cannot be found on the dashboard, which is expected, I think, because it's neither a fix nor a regression. Here is the dashboard view


@guangy10 guangy10 left a comment


There are some issues to look into further

@@ -27,12 +27,22 @@ export const IS_INCREASING_METRIC_VALUE_GOOD: { [k: string]: boolean } = {
flops_utilization: true,
"compilation_time(s)": false,
speedup: true,
"avg_inference_latency(ms)": false,
Contributor Author


I forgot that we need to update this to highlight the metrics correctly. A better solution is probably needed as part of pytorch/executorch#8239

@huydhn
Contributor Author

huydhn commented Feb 12, 2025

@guangy10 The new version is ready. Please take a look

@huydhn huydhn requested a review from guangy10 February 12, 2025 21:05
@huydhn
Contributor Author

huydhn commented Feb 12, 2025

Case 4

There is a case where there is no overlap between the two commits at all, e.g. one runs all Android jobs and the other runs all iOS jobs. In this case, I opted to just show the results from the right commit, but without any highlighting. For example:

  • Right commit has all iOS run, left commit has all Android run dashboard
  • And the other way around dashboard
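A rough sketch of that fallback, assuming a simple row shape (the names here are hypothetical, not the actual torchci code):

```ts
// If the base (left) commit shares no devices with the new (right) commit,
// show the right commit's rows as-is and skip comparison highlighting.
function rowsToDisplay<T extends { device: string }>(
  leftDevices: Set<string>,
  rightRows: T[]
): { rows: T[]; highlight: boolean } {
  const overlap = rightRows.filter((r) => leftDevices.has(r.device));
  if (overlap.length === 0) {
    return { rows: rightRows, highlight: false }; // no common devices
  }
  return { rows: overlap, highlight: true }; // compare only shared devices
}
```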

@guangy10

Case 4

There is a case where there is no overlap between the two commits at all, e.g. one runs all Android jobs and the other runs all iOS jobs. In this case, I opted to just show the results from the right commit, but without any highlighting. For example:

  • Right commit has all iOS run, left commit has all Android run dashboard
  • And the other way around dashboard

Should we show nothing (an empty board) instead, since there is no overlap?

Discovered a new issue: if you toggle the "Platform" filter, the irrelevant entries come back.

@guangy10

qq: If a benchmark job passed after a retry, would the new result be uploaded and merged into the DB? I know the upload-benchmark-results step is rerun on retry, but I'm not sure how the data gets updated in the DB. I want to make sure it's behaving correctly.

@guangy10

guangy10 commented Feb 12, 2025

Case 4

There is a case where there is no overlap between the two commits at all, e.g. one runs all Android jobs and the other runs all iOS jobs. In this case, I opted to just show the results from the right commit, but without any highlighting. For example:

  • Right commit has all iOS run, left commit has all Android run dashboard
  • And the other way around dashboard

Actually I'm thinking of this UX.

  • By default, can we disable cross-commit comparison? Just display all entries from the new commit.
  • Compare entries only if the user selects a base commit. And the list of selectable commits should only show commits that have overlapping entries with the new commit. (This way there is no edge case 4.)

This seems like a UI improvement that @yangw-dev can help with, if you decide to handle it separately?

@huydhn
Contributor Author

huydhn commented Feb 13, 2025

qq: If a benchmark job passed after a retry, would the new result be uploaded and merged into the DB? I know the upload-benchmark-results step is rerun on retry, but I'm not sure how the data gets updated in the DB. I want to make sure it's behaving correctly.

No worries, a retry would just upload the result as usual because the upload is a step in the workflow

@huydhn
Contributor Author

huydhn commented Feb 13, 2025

Should we show nothing (an empty board) instead, since there is no overlap?

I feel that showing an empty page is just as confusing. And yes, we can default to showing all entries from the latest (right) commit on the landing page.

Discovered a new issue: if you toggle the "Platform" filter, the irrelevant entries come back.

Oh ok, this needs to be handled... Wait, this is actually correct: when you select a Platform, for example, all the commits that don't have that Platform are removed. Notice that the list of commits changes. That was the original implementation I had to fix this issue, until we decided to add the All Platforms option. So they are not irrelevant entries, but legit discrepancies.

@huydhn
Contributor Author

huydhn commented Feb 13, 2025

Actually I'm thinking of this UX.

  • By default, can we disable cross-commit comparison? Just display all entries from the new commit.
  • Compare entries only if the user selects a base commit. And the list of selectable commits should only show commits that have overlapping entries with the new commit. (This way there is no edge case 4.)

This seems like a UI improvement that @yangw-dev can help with, if you decide to handle it separately?

Agree, let's do this in a separate PR.

@guangy10

Actually I'm thinking of this UX.

  • By default, can we disable cross-commit comparison? Just display all entries from the new commit.
  • Compare entries only if the user selects a base commit. And the list of selectable commits should only show commits that have overlapping entries with the new commit. (This way there is no edge case 4.)

This seems like a UI improvement that @yangw-dev can help with, if you decide to handle it separately?

Agree, let's do this in a separate PR.

if we are not doing this in this PR, can you file a GitHub issue for it?

@huydhn
Contributor Author

huydhn commented Feb 14, 2025

if we are not doing this in this PR, can you file a GitHub issue for it?

#6293 captures the gist of it

So, here are the cases that have been fixed (a short sketch of the device rule follows the list):

  • Only show devices that are covered by both the base and the new commit
  • If a model is failing or not running on both commits, it won't show up at all, as there are no records in either case
  • If both commits run the same type of benchmark (Android or iOS), only highlight the models from the new commit (dashboard)
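The first rule amounts to intersecting the two commits' entries. A minimal sketch, assuming a simple string key (not the PR's exact code):

```ts
// Keep an entry only when both the base (left) and new (right) commits
// have a record for the same (model, device) pair.
type EntryKey = string; // e.g. `${model}|${device}`

function visibleEntries(
  leftKeys: Set<EntryKey>,
  rightKeys: Set<EntryKey>
): Set<EntryKey> {
  return new Set([...rightKeys].filter((k) => leftKeys.has(k)));
}
```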

@guangy10

Something is still quite confusing.

[Screenshot taken 2025-02-13 at 7:58 PM]

Why do some entries show only one data point? Shouldn't it always be like "num1 -> num2", since we always compare the base and new commits?
Also, for llama3_coreml_ane, it failed on base but passed on new; shouldn't it be displayed as a fix of a regression?

See the details of the base and new commits below:
[Screenshot taken 2025-02-13 at 8:04 PM]

@huydhn
Contributor Author

huydhn commented Feb 14, 2025

Something is still quite confusing.

Oh, I realized that I had logic to show only one number if the value from the base and the new commit is the same. I have removed that. So now, whenever there is only one number, it means that only the new commit has benchmark results.

The highlight issue has been fixed, I hope (dashboard). The tweaking here is difficult to get right because I need to deduce the info there instead of just showing a failed benchmark run. I created an issue for that here: #6294
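In other words, the cell formatter now behaves roughly like this (a hedged sketch, not the actual torchci code):

```ts
// A lone number now means the base (left) commit has no result for the
// entry; otherwise both values are shown as L → R, even when equal.
function formatCell(left: number | undefined, right: number): string {
  return left === undefined ? `${right}` : `${left} → ${right}`;
}
```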

@guangy10

Something is still quite confusing.

Oh, I realized that I had logic to show only one number if the value from the base and the new commit is the same. I have removed that. So now, whenever there is only one number, it means that only the new commit has benchmark results.

The highlight issue has been fixed, I hope (dashboard). The tweaking here is difficult to get right because I need to deduce the info there instead of just showing a failed benchmark run. I created an issue for that here: #6294

In this new revision, there are lots of entries marked as "0->number" even though the jobs ran successfully on both commits.

@guangy10

If we merge the PR like this, the UX won't be better, as many entries are still misleading. We should probably narrow this PR to only fix what can be fixed independently of saving failed runs to the DB. Then we should think about how we can split the issue: what you can help with and what Yang can take over.

@huydhn
Contributor Author

huydhn commented Feb 19, 2025

To sum up, what has been fixed by this PR is:

  1. Show only devices and models that are available from both the left and right commits
  2. If there is no corresponding benchmark on the left commit, show only the value from the right commit instead of L → R

This ensures that anything highlighted on the dashboard is legit.

What has not yet been fixed is that CI failures won't show up on the dashboard, as there is currently no way to differentiate between having no data because of a CI failure vs. having no data because the benchmark didn't run. This will be covered next by #6294.

@guangy10

To sum up, what has been fixed by this PR is:

  1. Show only devices and models that are available from both the left and right commits
  2. If there is no corresponding benchmark on the left commit, show only the value from the right commit instead of L → R

This ensures that anything highlighted on the dashboard is legit.

What has not yet been fixed is that CI failures won't show up on the dashboard, as there is currently no way to differentiate between having no data because of a CI failure vs. having no data because the benchmark didn't run. This will be covered next by #6294.

Yeah, it sounds like a concrete fix that we can land without #6294 and #6293. Just to clarify, when we compare the devices, does it include both the device name and the OS version, and must both match exactly?

@huydhn
Contributor Author

huydhn commented Feb 19, 2025

Yeah, it sounds like a concrete fix that we can land without #6294 and #6293. Just to clarify, when we compare the devices, does it include both the device name and the OS version, and must both match exactly?

Yup, they must match on both the left and right commits

@guangy10

Yeah, it sounds like a concrete fix that we can land without #6294 and #6293. Just to clarify, when we compare the devices, does it include both the device name and the OS version, and must both match exactly?

Yup, they must match on both the left and right commits

Can we loosen the device match to only cover the major OS version instead? That is, "iPhone 15 Plus iOS 17.4" on the left commit would be considered a match with "iPhone 15 Plus iOS 17.2" on the right commit.

@huydhn
Contributor Author

huydhn commented Feb 19, 2025

Can we loosen the device match to only cover the major OS version instead? That is, "iPhone 15 Plus iOS 17.4" on the left commit would be considered a match with "iPhone 15 Plus iOS 17.2" on the right commit.

Done, you can see that the matching conditions https://github.com/pytorch/test-infra/pull/6277/files#diff-8b946f720239823df7504798036de4008e6028f8961d7a1d04de6969d9b784b7R112 (sketched below) are:

  1. Device name without the arch part (e.g. "iOS 17.2")
  2. Model + backend name
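A hedged sketch of such a match key (the device-string format and helper names are assumptions, not the PR's exact code):

```ts
// Drop the arch suffix, e.g. "Apple iPhone 15 Plus (iOS 17.2)" ->
// "Apple iPhone 15 Plus", so OS-version bumps between commits still match.
function deviceKey(deviceName: string): string {
  return deviceName.replace(/\s*\([^)]*\)\s*$/, "").trim();
}

// Entries from the two commits match when model, backend, and the
// arch-stripped device name all agree.
function matchKey(model: string, backend: string, device: string): string {
  return `${model}|${backend}|${deviceKey(device)}`;
}
```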


@guangy10 guangy10 left a comment


Good to go! 🚀

@huydhn huydhn merged commit 5de69da into pytorch:main Feb 19, 2025
6 checks passed
Labels
CLA Signed

Successfully merging this pull request may close these issues.

[DevX] Commits comparison via dashboard is misleading