-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add batched_transpose implement #1660
base: develop
Are you sure you want to change the base?
Conversation
include/ck_tile/ops/batched_transpose/pipeline/batched_transpose_pipeline.hpp
Outdated
Show resolved
Hide resolved
@fangche123 PTAL at the comments. |
include/ck_tile/ops/batched_transpose/kernel/batched_transpose_kernel.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/batched_transpose/pipeline/batched_transpose_pipeline.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/batched_transpose/pipeline/batched_transpose_problem.hpp
Outdated
Show resolved
Hide resolved
Could we also measure the latency and TFLOPs of this operator? Thanks! |
include/ck_tile/ops/batched_transpose/kernel/batched_transpose_kernel.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/batched_transpose/kernel/batched_transpose_kernel.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/batched_transpose/pipeline/batched_transpose_pipeline.hpp
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,9 @@ | |||
add_executable(tile_example_batched_transpose EXCLUDE_FROM_ALL batched_transpose.cpp batched_transpose_api.cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid repeating the target name, can you save the name into a variabble and reuse it later in CMake?
set(TARGET_NAME tile_example_batched_transpose)
add_executable(${TARGET_NAME} EXCLUDE_FROM_ALL batched_transpose.cpp batched_transpose_api.cpp)
target_include_directories(${TARGET_NAME} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been addressed
index_t batch; | ||
index_t height; | ||
index_t width; | ||
index_t dim_blocks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dim_blocks
, dim_block_h
and dim_block_w
are never used in kernel, suggest to remove them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
index_t batch; | ||
index_t height; | ||
index_t width; | ||
index_t dim_blocks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you never use dim_blocks
in rest of code. suggest to remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
bool is_valid = is_kernel_valid(N, C, H, W, &iter_kparam, layout_in); | ||
if(is_valid) | ||
{ | ||
kparam = iter_kparam; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are doing part of kernel selection here, I suggest to pass kparam directly to batched_transpose() API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would need us to pass args as well would you suggest me to do that? That would duplicate few parts of the code!
uint32_t height = nchw2nhwc ? C : H * W; | ||
uint32_t width = nchw2nhwc ? H * W : C; | ||
uint32_t dim_block_h = (height + kparam.tile_y - 1) / kparam.tile_y; | ||
uint32_t dim_block_w = (width + kparam.tile_x - 1) / kparam.tile_x; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to pass kparam
directly to batched_transpose()
API, so that we won't have to expose implementation detail here. Then we can move those computations into the BATCHED_TRANSPOSE_DISPATCH()
macro.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To find karg we need both args(N, C, H, W) as well as kparam do you suggest to pass both these into batched_transpose()?
bool transpose_kernel_is_valid(uint32_t batch, | ||
uint32_t height, | ||
uint32_t width, | ||
const transpose_kernel_param_t* kparam) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use pointer instead of reference const transpose_kernel_param_t&
here? is kparam
nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
uint32_t width, | ||
const transpose_kernel_param_t* kparam) | ||
{ | ||
(void)batch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just remove the name of unused parameter. So you won't have to do this.
bool transpose_kernel_is_valid(uint32_t, // 'batch' removed
uint32_t height,
uint32_t width,
const transpose_kernel_param_t* kparam)
{ /* ... */ }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
uint32_t c, | ||
uint32_t h, | ||
uint32_t w, | ||
const transpose_kernel_param_t* kparam, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use reference here: const transpose_kernel_param_t&
if kparam
is not nullable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
#include <string> | ||
#include <type_traits> | ||
|
||
#ifndef BATCHED_TRANSPOSE_USE_RAW_TILE_WINDOW |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove no longer used macro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
x_dev.ToDevice(x_host.data()); | ||
|
||
transpose_kernel_param_t kparam; | ||
for(auto iter_kparam : get_transpose_all_kernel(prec)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You always set tile size=16x16 in BATCHED_TRANSPOSE_DISPATCH()
. Why do we use other variables: kparam.tile_x
/kparam.tile_y
to decide the grid size?
static constexpr bool kPadN = Problem::kPadN; | ||
|
||
static constexpr ck_tile::index_t kMPerThread = Problem::kMPerThread; | ||
static constexpr ck_tile::index_t kNPerThread = Problem::kNPerThread; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add static_assert()
to specify the limitations here
static_assert(kMPerThread == 1 && kNPerThread == 1);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing to be reviewed for docs.
No description provided.