DevX: Track the benchmark infra health and usage #8247
Comments
This is something that we can build over time, after we have auto regression detection in place.
Good question. I'd turn this into a discussion item and clarify which questions you want the dashboard to answer; from there we can work out what the metrics should be and how and where they are presented.
@byjlw This task is not about deciding which metrics to show OSS users on the benchmark dashboard. It's mainly to give the infra admins/developers an easier way to monitor the infra health and usage. If you have thoughts on what to show on the dashboard, we can open a new GitHub issue for it.
Hi @yangw-dev, @huydhn and I discussed this yesterday. I'd like to minimize the effort on this task (size M -> S or XS) and focus instead on enabling auto regression detection and alerts with enough detail to debug. See #8239 for details. For this specific task, I expect us to find a better way than monitoring the CI health via HUD using this link: https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=-perf&mergeLF=true. The main problems with this view on HUD are:
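As a rough illustration of one possible alternative (not a decided design), the same signal that the HUD link surfaces could be pulled programmatically via the GitHub Checks API. The endpoints and response fields below are standard GitHub REST API; the `-perf` substring filter (mirroring the HUD `name_filter`) and the token handling are assumptions for this sketch:

```python
# Sketch: pull check-run conclusions for the latest commit on main and
# report failing benchmark jobs, instead of eyeballing the HUD page.
# Assumes a GITHUB_TOKEN env var; the "-perf" name filter mirrors the HUD
# filter and is an assumption about which jobs we care about here.
import os
import requests

REPO = "pytorch/executorch"
API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def check_runs(ref: str) -> list[dict]:
    """Return all check runs for a commit ref, following pagination."""
    runs, page = [], 1
    while True:
        resp = requests.get(
            f"{API}/repos/{REPO}/commits/{ref}/check-runs",
            headers=HEADERS,
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()["check_runs"]
        runs.extend(batch)
        if len(batch) < 100:
            return runs
        page += 1

# Latest commit on main, then summarize benchmark job health.
head = requests.get(f"{API}/repos/{REPO}/commits/main", headers=HEADERS).json()["sha"]
for run in check_runs(head):
    if "-perf" not in run["name"]:  # assumed filter, mirroring the HUD view
        continue
    if run["conclusion"] not in (None, "success", "skipped"):
        print(f"{run['conclusion']:10} {run['name']}")
```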
Hi Guang, sounds good! I will take a look into this.
Here is another issue that is a good example of why monitoring via HUD is not efficient. When testing pytorch/test-infra#6277, @huydhn and I noticed that some entries were marked as "0 -> new number" on the dashboard even though the model + benchmark_config passed on both the base and new commits. After digging into it, it turned out that they were actually running on different devices, i.e. one ran on an iPhone 15 Plus and the other didn't. Without looking into each individual job (on HUD, for example), it's impossible to notice this discrepancy.
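As a small illustration of the kind of guardrail that could catch this automatically, a regression check could refuse to compare two benchmark records unless they ran on the same device. The record shape below (fields like `model`, `benchmark_config`, `device`) is hypothetical and only stands in for whatever the benchmark database actually stores:

```python
# Hypothetical sketch: only compare benchmark entries whose
# (model, benchmark_config, device) keys match exactly, so an
# "iPhone 15 Plus vs. something else" mismatch is surfaced instead of
# showing up as a fake 0 -> N change.
from collections import defaultdict

def index_by_key(records):
    """Group records by (model, benchmark_config, device)."""
    out = defaultdict(list)
    for r in records:
        out[(r["model"], r["benchmark_config"], r["device"])].append(r)
    return out

def comparable_pairs(base_records, new_records):
    """Yield (base, new) record groups with identical keys; report the rest."""
    base_idx = index_by_key(base_records)
    new_idx = index_by_key(new_records)
    for key in sorted(set(base_idx) | set(new_idx)):
        if key in base_idx and key in new_idx:
            yield base_idx[key], new_idx[key]
        else:
            side = "base" if key in base_idx else "new"
            print(f"skipping {key}: only present on the {side} commit "
                  f"(likely a device mismatch, not a regression)")
```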
It's a good idea to build a feature that allows ExecuTorch to self-manage the device pool going forward. Dev Infra folks can do so via the AWS console at https://us-west-2.console.aws.amazon.com/devicefarm/home?region=us-east-1#/mobile/projects/02a2cf0f-6d9b-45ee-ba1a-a086587469e6/settings, so we need to expose this functionality somehow. Let me create a tracking issue for this.
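If we do expose device-pool management to ExecuTorch, one low-effort starting point could be a small script against the Device Farm API rather than the console. The boto3 calls below (`list_device_pools`, `update_device_pool`) are real Device Farm APIs; the project ARN placeholder and what gets printed are assumptions for this sketch:

```python
# Sketch: inspect the ExecuTorch Device Farm project's device pools without
# going through the AWS console. The project ARN below is a placeholder;
# the real one corresponds to the project ID in the console URL above.
import boto3

# Device Farm lives in us-west-2.
client = boto3.client("devicefarm", region_name="us-west-2")

PROJECT_ARN = "arn:aws:devicefarm:us-west-2:<account-id>:project:<project-id>"

# Show each private device pool and the rules that select its devices.
for pool in client.list_device_pools(arn=PROJECT_ARN, type="PRIVATE")["devicePools"]:
    print(pool["name"], pool["arn"])
    for rule in pool.get("rules", []):
        print("  ", rule.get("attribute"), rule.get("operator"), rule.get("value"))

# Updating a pool (e.g. pinning it to a single device type) would go through
# client.update_device_pool(arn=..., rules=[...]) once we decide who owns it.
```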
Today I'm monitoring the infra health only via the HUD by filtering jobs with "-perf": https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=-perf&mergeLF=true
I'm wondering if there is a better way to monitor the health, with detailed metrics. It could be something like https://hud.pytorch.org/metrics, where I can see the historical runs and success rates of the benchmark jobs (nightly vs. on-demand), the most frequent failures, hotspot devices, etc.
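As a rough sketch of the kind of metric this could surface (not a committed design), the GitHub Actions runs API already distinguishes scheduled (nightly) from manually dispatched (on-demand) runs, so success rates could be aggregated per workflow. The workflow file names (`android-perf.yml`, `apple-perf.yml`) and the 30-day window are assumptions:

```python
# Sketch: per-workflow success rates for the benchmark workflows over the
# last 30 days, split by trigger (schedule == nightly, workflow_dispatch ==
# on-demand). Workflow file names are assumptions; GITHUB_TOKEN is required.
import os
from collections import Counter
from datetime import datetime, timedelta, timezone

import requests

REPO = "pytorch/executorch"
API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}
SINCE = (datetime.now(timezone.utc) - timedelta(days=30)).strftime("%Y-%m-%d")

def runs(workflow: str, event: str):
    """Yield workflow runs for one workflow file and trigger type."""
    page = 1
    while True:
        resp = requests.get(
            f"{API}/repos/{REPO}/actions/workflows/{workflow}/runs",
            headers=HEADERS,
            params={"event": event, "created": f">={SINCE}",
                    "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()["workflow_runs"]
        yield from batch
        if len(batch) < 100:
            return
        page += 1

for workflow in ("android-perf.yml", "apple-perf.yml"):   # assumed file names
    for event, label in (("schedule", "nightly"), ("workflow_dispatch", "on-demand")):
        tally = Counter(r["conclusion"] for r in runs(workflow, event))
        total = sum(tally.values())
        rate = tally["success"] / total if total else 0.0
        print(f"{workflow:18} {label:10} {total:4} runs, {rate:.0%} success")
```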
cc: @kimishpatel @digantdesai
cc @huydhn @kirklandsign @shoumikhin @mergennachin @byjlw