Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add batched_transpose implement #1660

Open
wants to merge 18 commits into
base: develop
Choose a base branch
from
Open

Conversation

fangche123
Copy link

No description provided.

@fangche123 fangche123 marked this pull request as draft November 13, 2024 03:03
@fangche123 fangche123 marked this pull request as ready for review November 13, 2024 03:06
@fangche123 fangche123 marked this pull request as draft November 13, 2024 03:06
@ThomasNing
Copy link
Contributor

@fangche123 PTAL at the comments.

@ThomasNing
Copy link
Contributor

Could we also measure the latency and TFLOPs of this operator? Thanks!

@ThomasNing ThomasNing marked this pull request as ready for review January 16, 2025 19:41
@ThomasNing ThomasNing requested a review from a team as a code owner January 17, 2025 07:52
@@ -0,0 +1,9 @@
add_executable(tile_example_batched_transpose EXCLUDE_FROM_ALL batched_transpose.cpp batched_transpose_api.cpp)
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid repeating the target name, can you save the name into a variabble and reuse it later in CMake?

set(TARGET_NAME tile_example_batched_transpose)
add_executable(${TARGET_NAME} EXCLUDE_FROM_ALL batched_transpose.cpp batched_transpose_api.cpp)
target_include_directories(${TARGET_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been addressed

index_t batch;
index_t height;
index_t width;
index_t dim_blocks;
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dim_blocks, dim_block_h and dim_block_w are never used in kernel, suggest to remove them

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

index_t batch;
index_t height;
index_t width;
index_t dim_blocks;
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you never use dim_blocks in rest of code. suggest to remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

bool is_valid = is_kernel_valid(N, C, H, W, &iter_kparam, layout_in);
if(is_valid)
{
kparam = iter_kparam;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are doing part of kernel selection here, I suggest to pass kparam directly to batched_transpose() API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need us to pass args as well would you suggest me to do that? That would duplicate few parts of the code!

uint32_t height = nchw2nhwc ? C : H * W;
uint32_t width = nchw2nhwc ? H * W : C;
uint32_t dim_block_h = (height + kparam.tile_y - 1) / kparam.tile_y;
uint32_t dim_block_w = (width + kparam.tile_x - 1) / kparam.tile_x;
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to pass kparam directly to batched_transpose() API, so that we won't have to expose implementation detail here. Then we can move those computations into the BATCHED_TRANSPOSE_DISPATCH() macro.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To find karg we need both args(N, C, H, W) as well as kparam do you suggest to pass both these into batched_transpose()?

bool transpose_kernel_is_valid(uint32_t batch,
uint32_t height,
uint32_t width,
const transpose_kernel_param_t* kparam)
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use pointer instead of reference const transpose_kernel_param_t& here? is kparam nullable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

uint32_t width,
const transpose_kernel_param_t* kparam)
{
(void)batch;
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove the name of unused parameter. So you won't have to do this.

bool transpose_kernel_is_valid(uint32_t, // 'batch' removed
                               uint32_t height,
                               uint32_t width,
                               const transpose_kernel_param_t* kparam)
{ /* ... */ }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

uint32_t c,
uint32_t h,
uint32_t w,
const transpose_kernel_param_t* kparam,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use reference here: const transpose_kernel_param_t& if kparam is not nullable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

#include <string>
#include <type_traits>

#ifndef BATCHED_TRANSPOSE_USE_RAW_TILE_WINDOW
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove no longer used macro

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

x_dev.ToDevice(x_host.data());

transpose_kernel_param_t kparam;
for(auto iter_kparam : get_transpose_all_kernel(prec))
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You always set tile size=16x16 in BATCHED_TRANSPOSE_DISPATCH(). Why do we use other variables: kparam.tile_x/kparam.tile_y to decide the grid size?

static constexpr bool kPadN = Problem::kPadN;

static constexpr ck_tile::index_t kMPerThread = Problem::kMPerThread;
static constexpr ck_tile::index_t kNPerThread = Problem::kNPerThread;
Copy link
Contributor

@poyenc poyenc Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add static_assert() to specify the limitations here

static_assert(kMPerThread == 1 && kNPerThread == 1);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Copy link
Contributor

@spolifroni-amd spolifroni-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing to be reviewed for docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants