[Tuner] Improving ease of use for the tuner #814

Open
Max191 opened this issue Jan 10, 2025 · 4 comments
Max191 commented Jan 10, 2025

Overview and Goals

This issue is for listing out the goals for the future state of the tuner, focusing on better testing and ease of setup/use.

In the simplest terms, the end goal of this issue is for the tuner to require little to no setup: if a user can compile and run a program, they should be able to tune it (nearly) just as easily. This means that nearly all of the current tuning process needs to be automated and hooked into components generated directly by the compiler, which leads to the next point:

Another focus of this issue is to continue hooking the tuner into components directly generated by the compiler. The current state of the tuner requires the user to know about many special flags (marking root ops, dumping benchmarks, etc.), and then manually arrange the necessary inputs (flag file, benchmark files) and outputs (concatenated tuning TD spec). All inputs to the tuner should be directly generated by the compiler, and all outputs should be directly generated by the tuner.

Future Tasks

There is a lot to be done, so I will try to break down some of the work into smaller sub-projects:

Extracting Dispatches to Tune

In the current state, the first manual step of the tuner is to collect a tracy profile, and pick out the top dispatches to tune based on the runtime percentage in the model. This should ultimately be automated somehow.
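To make the selection step concrete, here is a minimal sketch of what automated dispatch selection could look like. Everything here is hypothetical: the function name, the profile format (a plain mapping from dispatch name to runtime fraction), and the thresholds are invented for illustration, not part of the tuner today.

```python
# Hypothetical sketch: pick the top dispatches to tune from a profile
# summary. Assumes the profile has already been reduced to per-dispatch
# runtime fractions; the data format here is invented for illustration.

def select_top_dispatches(profile, coverage=0.8, max_dispatches=10):
    """Pick dispatches covering `coverage` of total runtime, highest first."""
    ranked = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0.0
    for name, fraction in ranked:
        if covered >= coverage or len(selected) >= max_dispatches:
            break
        selected.append(name)
        covered += fraction
    return selected

# Example: a toy profile where two dispatches dominate runtime.
profile = {
    "dispatch_0_matmul": 0.55,
    "dispatch_1_conv": 0.30,
    "dispatch_2_elementwise": 0.10,
    "dispatch_3_transpose": 0.05,
}
print(select_top_dispatches(profile))  # the two dominant dispatches
```

The same ranking logic would apply whether the input comes from a Tracy trace or from a lighter-weight per-dispatch timing hook in the runtime.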

Offload Work to the Compiler

There is a lot of python code in the tuner to go from benchmark -> candidate TD spec. Ideally, the compiler should generate something that is easy for the tuner to ingest, and the TD spec should be very simple to create.

  • Create friendlier TransformDialect ops for tuning. We currently use transform.iree.match.cast_compatible_dag_from_root to match the operation, but this op is very sensitive to extra attributes, and we need to be careful about which attributes are present in the TD spec. Ideally there should be a TD op designed for tuning spec matching that is less sensitive to extraneous attributes.
  • Expose utils for finding tunable ops to the python bindings. We are currently using a hacky attribute set by a compiler flag to mark the root op of a dispatch, but there should be an exposed function for finding the set of tunable ops in a dispatch.
  • Use python bindings for more of the tuner (building TD specs, finding contraction dimensions, etc.).
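The matching rule the first bullet asks for can be illustrated in isolation. The sketch below uses plain dicts as a stand-in for payload ops; it is not a TransformDialect op, only a demonstration of the desired behavior: attributes named by the pattern must match, while extra attributes on the payload op are ignored rather than causing a mismatch.

```python
# Minimal sketch of attribute-insensitive matching, with dicts standing in
# for ops. A real solution would be a TransformDialect op; the point here
# is only the matching rule.

def matches(payload_op, pattern):
    """True if payload_op has the pattern's name and required attrs."""
    if payload_op["name"] != pattern["name"]:
        return False
    required = pattern.get("attrs", {})
    actual = payload_op.get("attrs", {})
    # Only the attributes named by the pattern are compared; anything else
    # on the payload op (e.g. a lowering_config added by the compiler) is
    # allowed to be absent from the pattern.
    return all(actual.get(k) == v for k, v in required.items())

op = {"name": "linalg.matmul",
      "attrs": {"indexing_maps": "m0", "lowering_config": "cfg"}}
pattern = {"name": "linalg.matmul", "attrs": {"indexing_maps": "m0"}}
print(matches(op, pattern))  # True despite the extra lowering_config attr
```

By contrast, an exact-DAG matcher like cast_compatible_dag_from_root would reject `op` here unless the spec reproduced every attribute exactly.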

Tuner Ease of Use

This refers to an overall easier user experience. This means reducing the number of flags required by the user, and automating the setup process for the tuner.

  • Automate generation of compile/benchmark flag files. This should be done in the compiler, so a user who compiles and benchmarks a program can simply add an option to dump the flags to be later used for tuning.
  • Create better defaults for tuner flags. This includes things like the codegen-pipeline, the search space for gpu pipeline options, the number of each type of candidate. The user should not have to be aware of any tuner implementation, and these flags should have defaults that work well out of the box.
  • Create a general tuning loop that can be used to automagically tune a model, given the compilation and benchmarking flags. We have been relying on the examples/simple example for tuning, but that is only meant to be an example for how to make a tuning client. There should be a central tuning loop, and it should be obvious to the user how to use it.
  • Automatically generate concatenated TD specs after tuning.
  • Generate better logs with more condensed and organized information.
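The central tuning loop from the third bullet can be sketched as follows. Every helper here is a hypothetical stand-in (the real tuner would invoke the compiler and a benchmark runner); the structure is the point: generate candidates, compile each one, benchmark, keep the best, and emit the winning spec.

```python
# Hedged sketch of a central tuning loop. All helpers are invented
# stand-ins for illustration, not real tuner APIs.
import random

def generate_candidates(num_candidates, seed=0):
    """Stand-in for candidate TD spec generation."""
    rng = random.Random(seed)
    return [{"tile_size": rng.choice([16, 32, 64, 128])}
            for _ in range(num_candidates)]

def compile_candidate(candidate):
    """Stand-in for compiling the program with the candidate spec."""
    return {"spec": candidate}

def benchmark(artifact):
    """Stand-in benchmark: pretend tiles near 64 run fastest."""
    tile = artifact["spec"]["tile_size"]
    return abs(64 - tile) + 1.0  # fake latency in ms

def tune(num_candidates=8):
    """Generate, compile, and benchmark candidates; return the best."""
    best_spec, best_time = None, float("inf")
    for candidate in generate_candidates(num_candidates):
        artifact = compile_candidate(candidate)
        time_ms = benchmark(artifact)
        if time_ms < best_time:
            best_spec, best_time = candidate, time_ms
    return best_spec, best_time

spec, latency = tune()
print(spec, latency)
```

With good defaults baked in, a user-facing entry point would need little more than the compile and benchmark flag files as input.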

Further Tuning Support

Improve Test Coverage in the Tuner

The poor test coverage became very clear in the last sprint of SDXL tuning, as many bugs were found in the new tuner path once real model tuning loops were being used. The tuner needs better overall test coverage and error handling, since each bug hit at the end of a tuning run costs a lot of time, which matters greatly when working under time pressure.

  • Add tests for runtime failures of all external calls within the tuner.
  • Restructure code to make more parts testable with mocking. All code in the tuner should have tests written for it, and large functions should be broken down into smaller functions that can be easily mocked and tested.
  • Eventually add e2e tuning loop tests. This would probably require CPU tuning to be implemented, since we do not want to require GPU runners for tuning tests, but it would be good to have e2e tests of the full tuner flow running in the CI.
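The first two bullets can be illustrated with a small mocking example. `run_benchmark` below is a hypothetical tuner helper that shells out to a benchmark binary (the command line and output format are assumptions); the test exercises its failure handling without a GPU or the real binary, using `unittest.mock` to substitute the subprocess call.

```python
# Sketch of testing an external call with mocking. `run_benchmark` is a
# hypothetical helper; the flags and output parsing are invented here.
import subprocess
from unittest import mock

def run_benchmark(flagfile):
    """Run the (hypothetical) benchmark tool and return latency in ms."""
    result = subprocess.run(
        ["iree-benchmark-module", f"--flagfile={flagfile}"],
        capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"benchmark failed: {result.stderr.strip()}")
    return float(result.stdout.strip())

def test_run_benchmark_failure():
    # Simulate the external tool failing, and check the error surfaces.
    fake = mock.Mock(returncode=1, stdout="", stderr="device not found")
    with mock.patch("subprocess.run", return_value=fake):
        try:
            run_benchmark("flags.txt")
        except RuntimeError as e:
            assert "device not found" in str(e)
        else:
            raise AssertionError("expected RuntimeError")

test_run_benchmark_failure()
print("mocked failure test passed")
```

Structuring tuner code so external calls sit behind small functions like this is what makes such tests cheap to write.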

kuhar commented Jan 10, 2025

> We currently use transform.iree.match.cast_compatible_dag_from_root to match the operation,

Another issue is that it does not support matching constants that may be used in bodies of linalg ops.


kuhar commented Jan 10, 2025

Another big action item should be to automatically collect profiles so that users don't have to collect Tracy traces and manually select ops to tune. This is described in the original tuner issue: iree-org/iree#16952 . This will require compiler support as well.

One more thing: support dispatches with dynamic shapes. This requires us to add support for generating benchmarks for dynamic shapes: iree-org/iree#19518


Max191 commented Jan 10, 2025

> Another big action item should be to automatically collect profiles so that users don't have to collect Tracy traces and manually select ops to tune. This is described in the original tuner issue: iree-org/iree#16952 . This will require compiler support as well.
>
> One more thing: support dispatches with dynamic shapes. This requires us to add support for generating benchmarks for dynamic shapes: iree-org/iree#19518

Thanks for the suggestions! I'll add them to the task list. When you say automatically collect profiles, do you specifically mean Tracy profiles? One of my tasks above talks about adding some simple hooks in the compiler to track total run time, but I did not include automating the full Tracy trace, since I didn't think the full Tracy trace was necessary for the tuning loop.


kuhar commented Jan 10, 2025

Not exactly tracy profiles but something equivalent with enough fidelity for the tuner to identify top dispatches. Ideally we should survey existing profile data formats used in PGO/AutoFDO and pick something portable, if that exists.
