[Track] enable e2e process to add new training machine #1910

Open
6 tasks
sunya-ch opened this issue Jan 15, 2025 · 1 comment
Labels

  • gh-action: This issue is related to kepler-action
  • kind/feature: New feature or request
  • metal-ci: This issue is related to kepler-metal-ci
  • model-db: This issue is related to kepler-model-db
  • model-server: This issue is related to kepler-model-server

sunya-ch commented Jan 15, 2025

What would you like to be added?

This issue is broken out from #1906 to focus on its first task.
We need automation in place so that a new machine can be integrated into the kepler-metal-ci training and validation report.
In addition, we need a dispatch CI workflow that pushes the trained model to kepler-model-db, making it available there and to kepler as well.

Previous issue: sustainable-computing-io/kepler-model-server#258

Why is this needed?

Action items

  • clean up unused workflow files
  • generalize the training and validation flow to support different actions for creating a runner
  • add a machine layer to the validation result page
  • add a manual workflow that fetches kepler-model-db, adds the trained model, and pushes a PR signed by the machine-specific account

Next step (future action items)

  • dispatch the PR push workflow on release
  • dispatch the PR to the kepler repo once the newly trained model is merged
@sunya-ch

During the development of kepler and kepler-model-server, we have to retrain and validate the model on every change to ensure there is no regression.
The results will be reported to the GitHub page in the same way as at https://sustainable-computing-io.github.io/kepler-metal-ci/kepler-model-train-validate.html, but per machine (meaning there is no need to upload the retrained model during development).

On each release, we want the model to be exported and made available in kepler-model-db, where other users can use it.
To export a model, we apply a threshold on the validation result so that only accurate models are pushed and shared.
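
As a rough illustration of the export gate, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ValidationResult` shape, the use of MAPE as the validation metric, the `max_mape` default of 10.0, and the machine and model names. The actual metric and threshold are still to be decided.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """Validation outcome for one trained model on one machine (hypothetical shape)."""
    machine: str
    model_name: str
    mape: float  # mean absolute percentage error in percent (assumed metric)


def models_to_export(results, max_mape=10.0):
    """Return only the results whose error is at or below the threshold.

    max_mape is a placeholder value, not an agreed project threshold.
    """
    return [r for r in results if r.mape <= max_mape]


# Hypothetical validation results for a single machine.
results = [
    ValidationResult("machine-a", "model-x", 4.2),
    ValidationResult("machine-a", "model-y", 23.7),
]

for r in models_to_export(results):
    print(f"export {r.model_name} from {r.machine}")
```

A release workflow could run a check like this after validation and only open the kepler-model-db PR for the models that pass.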
