[Track] enable e2e process to add new training machine #1910

sunya-ch · 2025-01-15T05:01:06Z

What would you like to be added?

This issue is split out from #1906 to focus on its first task.
We need automation in place so that a new machine can be integrated into the kepler-metal-ci training and validation report.
In addition, we need to prepare a dispatch CI workflow that pushes the trained model to kepler-model-db, making it available there and to Kepler as well.

Previous issue: sustainable-computing-io/kepler-model-server#258

Why is this needed?

[Track] Call for Machine/Device Sponsorship #1906
[Discussion and action required] ec2 data collection workflows (need links to Kepler) kepler-model-server#258
Add workflow to notify Kepler to update the models once there is a new model release kepler-model-db#24

Action items

Clean up unused workflow files
Generalize the training and validation flow to support different actions for creating a runner
Add a machine layer to the validation result page
Add a manual workflow that fetches kepler-model-db, adds the trained model, and pushes a PR signed by the machine-specific account

Next step (future action items)

Dispatch the PR push workflow on release
Dispatch a PR to the kepler repo once the newly trained model is merged
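
As a rough illustration of the dispatch steps above, one option is GitHub's `repository_dispatch` REST endpoint (`POST /repos/{owner}/{repo}/dispatches`). The sketch below only builds the request and does not send it. The function name, the event type, and the payload fields are hypothetical and not taken from any existing workflow; the receiving repository would need a workflow that listens on `repository_dispatch`.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, event_type, payload, token):
    """Build (but do not send) a repository_dispatch request.

    event_type and payload are free-form; the values used by the real
    workflows are not defined in this issue, so treat them as placeholders.
    """
    url = f"{GITHUB_API}/repos/{owner}/{repo}/dispatches"
    body = json.dumps(
        {"event_type": event_type, "client_payload": payload}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical event type and payload, for illustration only.
    req = build_dispatch_request(
        "sustainable-computing-io",
        "kepler-model-db",
        "model-released",
        {"machine": "example-machine"},
        token="<token>",
    )
    print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` with a token that has write access to the target repository.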

sunya-ch · 2025-01-24T02:51:48Z

During the development of kepler and kepler-model-server, we have to retrain and validate the model for every change to ensure that there is no regression.
The results will be reported to the GitHub page in the same way as in https://sustainable-computing-io.github.io/kepler-metal-ci/kepler-model-train-validate.html, but per machine (meaning there is no need to upload the retrained model during development).

On each release, we want the model to be exported and made available in kepler-model-db, where other users can use it.
To export a model, we apply a threshold on the validation result so that only accurate models are pushed and shared.
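
A minimal sketch of such a threshold gate is shown below. It assumes the validation result is an error metric where lower is better (MAPE is used here purely as an example); the `ValidationResult` type, the `exportable_models` function, and the 20% default are hypothetical and are not the actual values or names used in kepler-metal-ci.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationResult:
    machine: str       # machine the model was trained and validated on
    model_name: str    # name of the trained model
    mape: float        # validation error in percent (assumed metric)


def exportable_models(
    results: List[ValidationResult], max_mape: float = 20.0
) -> List[ValidationResult]:
    """Return only the models whose validation error is within the threshold."""
    return [r for r in results if r.mape <= max_mape]


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    results = [
        ValidationResult("machine-a", "model-1", 12.5),
        ValidationResult("machine-a", "model-2", 35.0),
    ]
    for r in exportable_models(results):
        print(f"{r.machine}/{r.model_name} passes with MAPE {r.mape}%")
```

During development the same check could be run in report-only mode, with the export step enabled only on release.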

sunya-ch added the kind/feature, model-server, model-db, metal-ci, and gh-action labels Jan 15, 2025

sunya-ch mentioned this issue Jan 22, 2025

Cleanup redundant GH workflows sustainable-computing-io/kepler-metal-ci#351

Closed

sunya-ch added this to the 2/2 POC for issue #1910 milestone Jan 22, 2025

sunya-ch mentioned this issue Jan 22, 2025

[CI][brainstorming] make model training as a github action based on tekton sustainable-computing-io/kepler-model-server#212

Open

SamYuan1990 mentioned this issue Jan 23, 2025

try to locate the file which going to PR back to kepler-model-db sustainable-computing-io/kepler-metal-ci#356

Open
