Overview and Goals
This issue lists the goals for the future state of the tuner, focusing on better testing and easier setup and use.
In the simplest terms, the end goal is for the tuner to require little to no setup: if a user is able to compile and run a program, they should be able to tune it (nearly) just as easily. This means that nearly all of the current tuning process needs to be automated and hooked into components generated directly by the compiler, which leads to the next point:
Another focus of this issue is to continue hooking the tuner into components directly generated by the compiler. Today, the tuner requires the user to know about many special flags (marking root ops, dumping benchmarks, etc.) and then to manually arrange the necessary inputs (flag file, benchmark files) and outputs (concatenated tuning TD spec). All inputs to the tuner should be generated directly by the compiler, and all outputs should be generated directly by the tuner.
Future Tasks
There is a lot to be done, so I will try to break down some of the work into smaller sub-projects:
Extracting Dispatches to Tune
In the current state, the first manual step of tuning is to collect a Tracy profile and pick out the top dispatches to tune based on their share of the model's total runtime. This should ultimately be automated; a possible selection heuristic is sketched below.
Once timings are collected, the compiler or tuner should be able to automatically dump the tunable benchmark files, which can then be directly ingested by the tuner.
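As a strawman for the selection step, here is a minimal Python sketch of picking the top dispatches once per-dispatch timings are available. The data source (a Tracy export, compiler instrumentation, etc.) and the runtime-fraction cutoff are placeholders, not an existing tuner API:

```python
def pick_top_dispatches(
    dispatch_times_ms: dict[str, float],
    runtime_fraction_cutoff: float = 0.8,
) -> list[str]:
    """Return the most expensive dispatches, in descending order of cost,
    until they cover `runtime_fraction_cutoff` of the total runtime."""
    total = sum(dispatch_times_ms.values())
    if total <= 0:
        return []
    picked: list[str] = []
    covered = 0.0
    for name, time_ms in sorted(
        dispatch_times_ms.items(), key=lambda kv: kv[1], reverse=True
    ):
        if covered / total >= runtime_fraction_cutoff:
            break
        picked.append(name)
        covered += time_ms
    return picked


# Example: with an 80% cutoff, only the two dominant matmuls are selected.
print(pick_top_dispatches({"matmul_0": 50.0, "matmul_1": 35.0, "copy_2": 15.0}))
```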
Offload Work to the Compiler
There is a lot of Python code in the tuner to go from a benchmark to a candidate TD spec. Ideally, the compiler should generate something that is easy for the tuner to ingest, making the TD spec very simple to create.
Create friendlier TransformDialect ops for tuning. We currently use transform.iree.match.cast_compatible_dag_from_root to match the operation, but this op is very sensitive to extra attributes, so we need to be careful about which attributes are present in the TD spec. Ideally there should be a TD op designed specifically for tuning spec matching that is less sensitive to extraneous attributes.
Expose utilities for finding tunable ops through the Python bindings. We currently match the root op of a dispatch via a hacky attribute set by a compiler flag; instead, there should be an exposed function for finding the set of tunable ops in a dispatch (see the sketch after this list).
Use the Python bindings for more of the tuner (building TD specs, finding contraction dimensions, etc.).
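To make the previous items concrete, here is a rough sketch of what such an exposed utility could look like from the Python side, built on the iree.compiler.ir bindings. The root_op unit attribute name is an assumption (a stand-in for whatever the compiler flag currently sets), and a proper utility would identify tunable ops directly rather than matching on an attribute:

```python
from iree.compiler import ir


def find_root_ops(module: ir.Module) -> list[ir.Operation]:
    """Collect every op in the module carrying the assumed `root_op`
    unit attribute (the marker currently set via a compiler flag)."""
    root_ops: list[ir.Operation] = []

    def walk(op: ir.Operation) -> None:
        if "root_op" in op.attributes:
            root_ops.append(op)
        for region in op.regions:
            for block in region.blocks:
                for nested in block.operations:
                    walk(nested.operation)

    walk(module.operation)
    return root_ops


# Usage sketch; assumes the dispatch IR parses in this context
# (dialect registration elided).
with ir.Context():
    module = ir.Module.parse(open("dispatch.mlir").read())
    for op in find_root_ops(module):
        print(op.name)
```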
Tuner Ease of Use
This refers to an overall easier user experience: reducing the number of flags required of the user and automating the tuner's setup process.
Automate generation of compile/benchmark flag files. This should be done in the compiler, so that a user who compiles and benchmarks a program can simply add an option to dump the flags for later use in tuning.
Create better defaults for tuner flags. This includes things like the codegen pipeline, the search space for GPU pipeline options, and the number of each type of candidate. The user should not have to be aware of any tuner implementation details, and these flags should have defaults that work well out of the box.
Create a general tuning loop that can automagically tune a model, given the compilation and benchmarking flags. We have been relying on the examples/simple example for tuning, but that is only meant as an example of how to build a tuning client. There should be a central tuning loop, and it should be obvious to the user how to use it (a sketch of the intended shape follows this list).
Automatically generate concatenated TD specs after tuning.
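To illustrate the intended user experience, here is a hypothetical sketch of what driving such a central tuning loop could look like. The libtuner entry points, argument names, and flag-file format are all assumptions for illustration, not an existing API:

```python
from pathlib import Path

import libtuner  # hypothetical central tuning-loop module


def main() -> None:
    args = libtuner.parse_arguments()  # hypothetical argument helper

    # Compiler-dumped artifacts: tunable benchmark files plus the
    # compile/benchmark flag file.
    benchmarks = sorted(Path(args.benchmark_dir).glob("*.mlir"))
    compile_flags = Path(args.flag_file).read_text().splitlines()

    # One call drives candidate generation, compilation, benchmarking,
    # and selection of the winning configurations.
    best_specs = libtuner.tune(
        benchmarks=benchmarks,
        compile_flags=compile_flags,
        num_candidates=args.num_candidates,
    )

    # The concatenated TD spec is the tuner's only output.
    libtuner.write_concatenated_spec(best_specs, Path("tuning_spec.mlir"))


if __name__ == "__main__":
    main()
```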
Further Tuning Support
Support more dispatch types (NCHW conv, Attention, fusions)
Improve Test Coverage in the Tuner
The poor test coverage became very clear during the last SDXL tuning sprint, when many bugs were found in the new tuner path once real model tuning loops were being run. There needs to be better overall test coverage and error handling in the tuner, since each bug hit at the end of a tuning run costs a lot of time, which matters greatly under time pressure.
Add tests for runtime failures of all external calls within the tuner (see the mocking sketch after this list).
Restructure code to make more parts testable with mocking. All code in the tuner should have tests, and large functions should be broken down into smaller ones that can be easily mocked and tested.
Add e2e tuning loop tests. This would probably require CPU tuning to be implemented, since we do not want to require GPU runners for tuning tests, but it would be good to eventually have e2e tests of the full tuner flow running in CI.
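As an example of the kind of test this implies, here is a minimal pytest sketch that mocks subprocess.run to exercise the failure path of an external call. The compile_candidate helper and its iree-compile invocation are hypothetical stand-ins for whatever the tuner actually shells out to:

```python
import subprocess
from unittest import mock

import pytest


def compile_candidate(input_file: str) -> str:
    """Hypothetical tuner helper that shells out to the compiler."""
    result = subprocess.run(
        ["iree-compile", input_file, "-o", "out.vmfb"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"compilation failed: {result.stderr}")
    return "out.vmfb"


def test_compile_candidate_surfaces_compiler_error():
    # Simulate the external call failing without running a real compiler.
    failed = subprocess.CompletedProcess(
        args=["iree-compile"], returncode=1, stdout="", stderr="error: oops"
    )
    with mock.patch("subprocess.run", return_value=failed):
        with pytest.raises(RuntimeError, match="error: oops"):
            compile_candidate("dispatch.mlir")
```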
Another big action item should be to automatically collect profiles so that users don't have to collect Tracy traces and manually select ops to tune. This is described in the original tuner issue: iree-org/iree#16952. This will require compiler support as well.
One more thing: support dispatches with dynamic shapes. This requires us to add support for generating benchmarks for dynamic shapes: iree-org/iree#19518
Thanks for the suggestions! I'll add them to the task list. When you say automatically collect profiles, do you specifically mean Tracy profiles? One of my tasks above talks about adding some simple hooks in the compiler to track total runtime, but I did not include automating the full Tracy trace, since I didn't think it was necessary for the tuning loop.
Not exactly tracy profiles but something equivalent with enough fidelity for the tuner to identify top dispatches. Ideally we should survey existing profile data formats used in PGO/AutoFDO and pick something portable, if that exists.