removing old tuning files #2081

Open
wants to merge 1 commit into
base: develop

babakpst
Collaborator

@babakpst babakpst commented Jan 9, 2025

Summary:

What is being changed and why?
Old tuning files that are no longer supported were deleted.
Outcomes:

What is the result of this change? What components of the project does it affect?
No impact on the current tuning process.
Notable changes:

Are there any changes that are of particular importance?
No.
Testing and Environment:

What environment are you targeting (OS, ROCm version, Python versions, etc.)?

What testing did you do to ensure this change will integrate successfully?
The deleted files are not part of the Tensile workflow.

@IMbackK

IMbackK commented Jan 9, 2025

This PR fails to remove the extensive references to these scripts in the documentation.

I would also like to know what replacement for these scripts exists or is planned, as they are invaluable: rocBLAS/Tensile is very poor at selecting the correct kernel unless it has been tuned for the specific problem size. PyTorch contains a workaround for this in the form of TunableOp, but any other client is left in the cold by this change.

@TorreZuk
Contributor

TorreZuk commented Jan 9, 2025

@IMbackK can you clarify whether you are trying to compare existing kernels that can solve a problem, or to generate and tune your own custom kernels? I'm just wondering how this relates to https://github.com/ROCm/rocBLAS/blob/develop/clients/samples/example_user_driven_tuning.cpp . Thanks.

@IMbackK

IMbackK commented Jan 9, 2025

@TorreZuk I noticed that rocBLAS is very poor at selecting the fastest kernel, and that experimenting with the solution_index by hand can result in very significant uplifts, often on the order of 100% or more. This is semi-viable in my own code, but of course not where third-party library code is used. AMD's PyTorch team clearly made a similar observation, as TunableOp was introduced solely for the ROCm back end of PyTorch with similar goals.
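As a rough illustration of the by-hand experiment described above, the sketch below times each candidate solution index and keeps the fastest one. It does not call rocBLAS: `run` is a hypothetical callable that would wrap the actual GEMM call for one fixed problem shape (and synchronize the device before returning), and `candidates` is assumed to be a list of solution indices obtained elsewhere. The function name and parameters are illustrative only.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def pick_fastest_solution(
    run: Callable[[int], None],
    candidates: Iterable[int],
    repeats: int = 5,
    warmup: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[int], float]:
    """Time every candidate index and return (best_index, best_time).

    `run(index)` is a hypothetical wrapper that executes the GEMM with the
    given solution index and blocks until the work has finished.
    """
    best_index: Optional[int] = None
    best_time = float("inf")
    for index in candidates:
        try:
            # Warm-up runs are not timed (first launches are often slower).
            for _ in range(warmup):
                run(index)
            samples = []
            for _ in range(repeats):
                start = clock()
                run(index)
                samples.append(clock() - start)
        except Exception:
            # Assumption: a candidate that cannot solve this problem raises.
            continue
        elapsed = min(samples)  # the minimum is less sensitive to noise
        if elapsed < best_time:
            best_index, best_time = index, elapsed
    return best_index, best_time
```

Because timings depend on system load and clocks, a result obtained this way only holds for the shape, data types and conditions that were actually measured.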

As an experiment, I wanted to tune Tensile for the exact problem shapes encountered in my use case. In theory this should make it easy for Tensile to select a good solution, since the library logic files would then contain the required configuration exactly. To this end I fixed various problems in the scripts that this PR removes (see #2079) and tuned Tensile on some of the sizes in my use case. The internals of Tensile are opaque to me, and I'm not sure how the tuning process and the kernel generation process interact, but the conjecture that this improves the selection outcome seems correct.

An additional goal here was to understand Tensile a bit better, to maybe figure out why the Arcturus solutions in hipBLASLt are so slow; they usually reach only about 1/4 of the performance of the equivalent rocBLAS call.
https://github.com/ROCm/hipBLASLt/tree/develop/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/arcturus/GridBased
The tuning scripts here and the accompanying documentation are, as far as I know, the best resource for beginning to understand Tensile's workings.

There are also other issues at play here that I think are best discussed in separate places.

@TorreZuk
Contributor

TorreZuk commented Jan 9, 2025

@IMbackK thanks for the quick feedback. rocBLAS defers to either Tensile or hipBLASLt to select the best kernel, so https://github.com/ROCm/rocBLAS/blob/develop/clients/samples/example_user_driven_tuning.cpp does relate to your use case, as it provides the list of potential kernels. We will analyze how we might improve this from the rocBLAS client side, but yes, understanding Tensile or hipBLASLt solution selection is relevant to what is used by default, as only superficial selection logic lives in the rocBLAS repo (although the Tensile build gets embedded into the rocBLAS library). The best kernel for a shape can vary with node load and clocks, so running alongside other kernels will definitely affect the "winner". Tensile will have to provide more documentation to guide you.
