Overview and Goals
This issue lists the goals for the future state of the tuner, focusing on better testing and easier setup and use.
In the simplest terms, the end goal is for the tuner to require little to no setup: if a user is able to compile and run a program, they should be able to tune it (nearly) just as easily. This means that nearly all of the current tuning process needs to be automated and hooked into components generated directly by the compiler, which leads to the next point:
Another focus of this issue is to continue hooking the tuner into components directly generated by the compiler. Today, the tuner requires the user to know about many special flags (marking root ops, dumping benchmarks, etc.) and then to manually arrange the necessary inputs (flag file, benchmark files) and outputs (concatenated tuning TD spec). All inputs to the tuner should be generated directly by the compiler, and all outputs should be generated directly by the tuner.
Future Tasks
There is a lot to be done, so I will try to break down some of the work into smaller sub-projects:
Extracting Dispatches to Tune
In the current state, the first manual step of tuning is to collect a Tracy profile and pick out the top dispatches to tune based on their share of the model's total runtime. This should ultimately be automated; a possible selection heuristic is sketched below.
Once timings are collected, the compiler or tuner should be able to automatically dump the tunable benchmark files, which can then be directly ingested by the tuner.
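As a strawman for the selection step, here is a minimal Python sketch of picking the top dispatches once per-dispatch timings are available. The data source (a Tracy export, compiler instrumentation, etc.) and the runtime-fraction cutoff are placeholders, not an existing tuner API:

```python
def pick_top_dispatches(
    dispatch_times_ms: dict[str, float],
    runtime_fraction_cutoff: float = 0.8,
) -> list[str]:
    """Return the most expensive dispatches, in descending order of cost,
    until they cover `runtime_fraction_cutoff` of the total runtime."""
    total = sum(dispatch_times_ms.values())
    if total <= 0:
        return []
    picked: list[str] = []
    covered = 0.0
    for name, time_ms in sorted(
        dispatch_times_ms.items(), key=lambda kv: kv[1], reverse=True
    ):
        if covered / total >= runtime_fraction_cutoff:
            break
        picked.append(name)
        covered += time_ms
    return picked


# Example: with an 80% cutoff, only the two dominant matmuls are selected.
print(pick_top_dispatches({"matmul_0": 50.0, "matmul_1": 35.0, "copy_2": 15.0}))
```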
Offload Work to the Compiler
There is a lot of Python code in the tuner to go from a benchmark to a candidate TD spec. Ideally, the compiler should generate something that is easy for the tuner to ingest, making the TD spec very simple to create.
Create friendlier TransformDialect ops for tuning. We currently use transform.iree.match.cast_compatible_dag_from_root to match the operation, but this op is very sensitive to extra attributes, so we need to be careful about which attributes are present in the TD spec. Ideally there should be a TD op designed specifically for tuning spec matching that is less sensitive to extraneous attributes.
Expose utilities for finding tunable ops through the Python bindings. We currently match the root op of a dispatch via a hacky attribute set by a compiler flag; instead, there should be an exposed function for finding the set of tunable ops in a dispatch (see the sketch after this list).
Use the Python bindings for more of the tuner (building TD specs, finding contraction dimensions, etc.).
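To make the previous items concrete, here is a rough sketch of what such an exposed utility could look like from the Python side, built on the iree.compiler.ir bindings. The root_op unit attribute name is an assumption (a stand-in for whatever the compiler flag currently sets), and a proper utility would identify tunable ops directly rather than matching on an attribute:

```python
from iree.compiler import ir


def find_root_ops(module: ir.Module) -> list[ir.Operation]:
    """Collect every op in the module carrying the assumed `root_op`
    unit attribute (the marker currently set via a compiler flag)."""
    root_ops: list[ir.Operation] = []

    def walk(op: ir.Operation) -> None:
        if "root_op" in op.attributes:
            root_ops.append(op)
        for region in op.regions:
            for block in region.blocks:
                for nested in block.operations:
                    walk(nested.operation)

    walk(module.operation)
    return root_ops


# Usage sketch; assumes the dispatch IR parses in this context
# (dialect registration elided).
with ir.Context():
    module = ir.Module.parse(open("dispatch.mlir").read())
    for op in find_root_ops(module):
        print(op.name)
```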
Tuner Ease of Use
This refers to an overall easier user experience: reducing the number of flags required of the user and automating the tuner's setup process.
Automate generation of compile/benchmark flag files. This should be done in the compiler, so that a user who compiles and benchmarks a program can simply add an option to dump the flags for later use in tuning.
Create better defaults for tuner flags. This includes things like the codegen pipeline, the search space for GPU pipeline options, and the number of each type of candidate. The user should not have to be aware of any tuner implementation details, and these flags should have defaults that work well out of the box.
Create a general tuning loop that can automagically tune a model, given the compilation and benchmarking flags. We have been relying on the examples/simple example for tuning, but that is only meant as an example of how to build a tuning client. There should be a central tuning loop, and it should be obvious to the user how to use it (a sketch of the intended shape follows this list).
Automatically generate concatenated TD specs after tuning.
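To illustrate the intended user experience, here is a hypothetical sketch of what driving such a central tuning loop could look like. The libtuner entry points, argument names, and flag-file format are all assumptions for illustration, not an existing API:

```python
from pathlib import Path

import libtuner  # hypothetical central tuning-loop module


def main() -> None:
    args = libtuner.parse_arguments()  # hypothetical argument helper

    # Compiler-dumped artifacts: tunable benchmark files plus the
    # compile/benchmark flag file.
    benchmarks = sorted(Path(args.benchmark_dir).glob("*.mlir"))
    compile_flags = Path(args.flag_file).read_text().splitlines()

    # One call drives candidate generation, compilation, benchmarking,
    # and selection of the winning configurations.
    best_specs = libtuner.tune(
        benchmarks=benchmarks,
        compile_flags=compile_flags,
        num_candidates=args.num_candidates,
    )

    # The concatenated TD spec is the tuner's only output.
    libtuner.write_concatenated_spec(best_specs, Path("tuning_spec.mlir"))


if __name__ == "__main__":
    main()
```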
Further Tuning Support
Support more dispatch types (NCHW conv, Attention, fusions)
Improve Test Coverage in the Tuner
The poor test coverage became very clear during the last SDXL tuning sprint, when many bugs were found in the new tuner path once real model tuning loops were being run. There needs to be better overall test coverage and error handling in the tuner, since each bug hit at the end of a tuning run costs a lot of time, which matters greatly under time pressure.
Add tests for runtime failures of all external calls within the tuner (see the mocking sketch after this list).
Restructure code to make more parts testable with mocking. All code in the tuner should have tests, and large functions should be broken down into smaller ones that can be easily mocked and tested.
Add e2e tuning loop tests. This would probably require CPU tuning to be implemented, since we do not want to require GPU runners for tuning tests, but it would be good to eventually have e2e tests of the full tuner flow running in CI.
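As an example of the kind of test this implies, here is a minimal pytest sketch that mocks subprocess.run to exercise the failure path of an external call. The compile_candidate helper and its iree-compile invocation are hypothetical stand-ins for whatever the tuner actually shells out to:

```python
import subprocess
from unittest import mock

import pytest


def compile_candidate(input_file: str) -> str:
    """Hypothetical tuner helper that shells out to the compiler."""
    result = subprocess.run(
        ["iree-compile", input_file, "-o", "out.vmfb"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"compilation failed: {result.stderr}")
    return "out.vmfb"


def test_compile_candidate_surfaces_compiler_error():
    # Simulate the external call failing without running a real compiler.
    failed = subprocess.CompletedProcess(
        args=["iree-compile"], returncode=1, stdout="", stderr="error: oops"
    )
    with mock.patch("subprocess.run", return_value=failed):
        with pytest.raises(RuntimeError, match="error: oops"):
            compile_candidate("dispatch.mlir")
```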
Another big action item should be to automatically collect profiles so that users don't have to collect Tracy traces and manually select ops to tune. This is described in the original tuner issue: iree-org/iree#16952. This will require compiler support as well.
One more thing: support dispatches with dynamic shapes. This requires us to add support for generating benchmarks for dynamic shapes: iree-org/iree#19518
Thanks for the suggestions! I'll add them to the task list. When you say automatically collect profiles, do you specifically mean Tracy profiles? One of my tasks above talks about adding some simple hooks in the compiler to track total runtime, but I did not include automating the full Tracy trace, since I didn't think it was necessary for the tuning loop.
Not exactly tracy profiles but something equivalent with enough fidelity for the tuner to identify top dispatches. Ideally we should survey existing profile data formats used in PGO/AutoFDO and pick something portable, if that exists.