[cker] Introduce DepthwiseConv2D gradient kernel #12448
Conversation
LGTM 👍
And IMO the title is better as [cker] Introduce ..., because this code is not in onert.
@jyoungyun I am waiting for the commit/PR title change as @YongseopKim commented.
Oh, I missed that.
This commit introduces the DepthwiseConv2D gradient kernel. The depthwise_conv_op.h file came from the TensorFlow code. This commit transforms the depthwise_conv related code to fit onert and executes this kernel function. ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
LGTM
const int thread_count = _dconv_kernel->getThreadCount() + 1;
Could you please let me know why the thread count should be increased by 1?
@ragmani
The +1 is for the main thread. numThreads() in the Eigen library returns the thread count excluding the main thread, but this function can also be executed on the main thread depending on the situation. So I added 1 for the main thread.
I don't fully understand. Out of curiosity, I searched the TensorFlow code (e.g. depthwise_conv_op.cc), and it seems to use the worker thread count. I can't find a +1, but I found the following:
auto worker_threads = *(ctx->device()->tensorflow_cpu_worker_threads());
I guess the thread_count should be the number of worker threads. Is the main thread part of the worker threads?
Is there any reason to add 1 for the main thread in the thread count? As far as I know, the main thread is usually not used in optimized kernels.
Is there any reason to add 1 for the main thread in the thread count? As far as I know, the main thread is usually not used in optimized kernels.

Could you give a link related to "main thread is usually not used in optimized kernel"? When I tested this kernel with several cases, the main thread was also used. In particular, when the batch size is 1, the work usually runs on the main thread; in other cases, which thread is used depends on the situation.
if (n <= 1 || numThreads() == 1 ||
CostModel::numThreads(n, cost, static_cast<int>(numThreads())) == 1) {
f(0, n);
return;
}
In this code, n is the batch size. When the batch size is 1, the work runs on the main thread.
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff7a48780 (LWP 1326007) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=0, limit=1) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
2 Thread 0x7ffff797e700 (LWP 1326011) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0c9d8) at ../sysdeps/nptl/futex-internal.h:183
3 Thread 0x7ffff717d700 (LWP 1326012) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ca58) at ../sysdeps/nptl/futex-internal.h:183
4 Thread 0x7ffff697c700 (LWP 1326013) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cad8) at ../sysdeps/nptl/futex-internal.h:183
5 Thread 0x7ffff617b700 (LWP 1326014) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cb58) at ../sysdeps/nptl/futex-internal.h:183
6 Thread 0x7ffff597a700 (LWP 1326015) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cbd8) at ../sysdeps/nptl/futex-internal.h:183
7 Thread 0x7ffff5179700 (LWP 1326016) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cc58) at ../sysdeps/nptl/futex-internal.h:183
8 Thread 0x7ffff4978700 (LWP 1326017) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ccd8) at ../sysdeps/nptl/futex-internal.h:183
9 Thread 0x7ffff4177700 (LWP 1326018) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cd58) at ../sysdeps/nptl/futex-internal.h:183
10 Thread 0x7ffff3976700 (LWP 1326019) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cdd8) at ../sysdeps/nptl/futex-internal.h:183
11 Thread 0x7ffff3175700 (LWP 1326020) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ce58) at ../sysdeps/nptl/futex-internal.h:183
12 Thread 0x7ffff2974700 (LWP 1326021) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ced8) at ../sysdeps/nptl/futex-internal.h:183
13 Thread 0x7ffff2173700 (LWP 1326022) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=8, limit=9) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
14 Thread 0x7ffff1972700 (LWP 1326023) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=9, limit=10) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
15 Thread 0x7ffff1171700 (LWP 1326024) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=6, limit=7) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
16 Thread 0x7ffff0970700 (LWP 1326025) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=4, limit=5) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
17 Thread 0x7ffff016f700 (LWP 1326026) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=3, limit=4) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
18 Thread 0x7fffef96e700 (LWP 1326027) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=7, limit=8) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
* 19 Thread 0x7fffef16d700 (LWP 1326028) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=5, limit=6) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
20 Thread 0x7fffee96c700 (LWP 1326029) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=1, limit=2) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
21 Thread 0x7fffee16b700 (LWP 1326030) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=2, limit=3) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
And in the multiple-batch case, the main thread is also used to execute the f function. So I added the +1 for the main thread.
What is returned by _dconv_kernel->getThreadCount()? Is the number controllable?

It returns the number of CPU cores, and we cannot change this value.
In the above code, the main thread schedules the multi-threading and also executes handleRange itself.

It seems that you are right. The main thread may execute the code below:
if (last - first <= block_size) {
// Single block or less, execute directly.
f(first, last);
barrier.Notify();
return;
}
I guess I already understand what you are saying. I want to keep our code that uses external libraries consistent. A few years ago, there was a requirement to limit the total number of threads. onert::util::config::RUY_THREADS makes it easy to control how many threads are allowed for the ruy kernel, and we can also tune how many threads work best by changing that number.
cker has already been using Eigen::ThreadPoolDevice, which has as many threads as the number of cores: for example, the Conv kernel for inference, the optimizer kernels for training, and the BroadcastTo kernel. If we only used kernels whose thread count is easy to control, we would have to implement our own replacements for the kernels that are hard to control.
@jyoungyun
Is there a way to apply the optimized DepthwiseConv kernels without calling getThreadCount()?
If we need to control the number of threads in Eigen::ThreadPoolDevice, we can modify the code below:
ONE/compute/cker/include/cker/eigen/EigenSupport.h, lines 85 to 114 at 2a4d8d6:
struct EigenContext
{
  constexpr static int default_num_threadpool_threads = 4;
  std::unique_ptr<Eigen::ThreadPoolInterface> thread_pool_wrapper;
  std::unique_ptr<Eigen::ThreadPoolDevice> device;
  EigenContext()
  {
    int num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0)
    {
      num_threads = default_num_threadpool_threads;
    }
    device.reset(); // destroy before we invalidate the thread pool
    thread_pool_wrapper.reset(new EigenThreadPoolWrapper(new Eigen::ThreadPool(num_threads)));
    device.reset(new Eigen::ThreadPoolDevice(thread_pool_wrapper.get(), num_threads));
  }
  static inline EigenContext &GetEigenContext()
  {
    static EigenContext instance;
    return instance;
  }
};

inline const Eigen::ThreadPoolDevice *GetThreadPoolDevice()
{
  auto &ctx = EigenContext::GetEigenContext();
  return ctx.device.get();
}
@jyoungyun and @ragmani Thank you for the explanation. For Eigen, the number of threads is determined by the number of cores on the device; it is not controllable like the ruy backend.
3ff2926 — Add comments for thread count
Co-authored-by: Jang Jiseob <[email protected]>
ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
@nnfw-bot test onert-cross-release
LGTM
LGTM
This commit introduces the DepthwiseConv2D gradient kernel. The depthwise_conv_op.h file came from the TensorFlow code. This commit transforms the depthwise_conv related code to fit onert and executes this kernel function.
ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
Draft: #12392