
[cker] Introduce DepthwiseConv2D gradient kernel #12448

Merged
merged 3 commits into from
Jan 18, 2024

Conversation

jyoungyun
Contributor

@jyoungyun jyoungyun commented Jan 11, 2024

This commit introduces the DepthwiseConv2D gradient kernel. The depthwise_conv_op.h file comes from TensorFlow code. This commit adapts the depthwise_conv-related code to fit onert and executes this kernel function.

ONE-DCO-1.0-Signed-off-by: Jiyoung Yun [email protected]

Draft: #12392

@jyoungyun jyoungyun added the PR/ready for review It is ready to review. Please review it. label Jan 11, 2024
@jyoungyun jyoungyun requested a review from a team January 11, 2024 09:24
YongseopKim
YongseopKim previously approved these changes Jan 12, 2024
Contributor

@YongseopKim YongseopKim left a comment


LGTM 👍
And IMO the title would be better as [cker] Introduce ..., because this code is not in onert.

@YongseopKim YongseopKim requested a review from a team January 12, 2024 00:20
@jyoungyun jyoungyun changed the title [onert] Implement DepthwiseConv2D gradient kernel [onert] Introduce DepthwiseConv2D gradient kernel Jan 12, 2024
@glistening
Contributor

glistening commented Jan 12, 2024

@jyoungyun I am waiting for the commit/PR title change as @YongseopKim commented: [onert] to [cker]

@jyoungyun
Contributor Author

Oh, I missed the [cker] prefix. I will update it.

This commit introduces the DepthwiseConv2D gradient kernel.
The depthwise_conv_op.h file comes from TensorFlow code.
This commit adapts the depthwise_conv-related code to fit onert
and executes this kernel function.

ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
@jyoungyun jyoungyun changed the title [onert] Introduce DepthwiseConv2D gradient kernel [cker] Introduce DepthwiseConv2D gradient kernel Jan 12, 2024
glistening
glistening previously approved these changes Jan 12, 2024
Contributor

@glistening glistening left a comment


LGTM

}

{
const int thread_count = _dconv_kernel->getThreadCount() + 1;
Contributor


Could you please let me know why the thread count should be increased by 1?

Contributor Author

@jyoungyun jyoungyun Jan 15, 2024


@ragmani
The +1 accounts for the main thread. numThreads() in the Eigen library returns only the number of pool threads, excluding the main thread. Depending on the situation, this kernel function can also execute on the main thread, so I added 1 for it.

Contributor

@glistening glistening Jan 15, 2024


I don't fully understand. Out of curiosity, I searched the TensorFlow code (e.g. depthwise_conv_op.cc), and it seems to use the worker thread count. I can't find a +1, but I found the following:

auto worker_threads = *(ctx->device()->tensorflow_cpu_worker_threads());

I guess the thread_count should be the number of worker threads.
Is the main thread part of the worker threads?

Contributor


Is there any reason to add 1 for the main thread in the thread count? As far as I know, the main thread is usually not used in an optimized kernel.

Contributor Author


Is there any reason to add 1 for the main thread in the thread count? As far as I know, the main thread is usually not used in an optimized kernel.

Could you give a link showing that the main thread is usually not used in an optimized kernel? When I tested this kernel with several cases, the main thread was also used. In particular, if the batch size is 1, the work usually runs on the main thread. In other cases, which thread is used depends on the situation.

https://github.com/libigl/eigen/blob/1f05f51517ec4fd91eed711e0f89e97a7c028c0e/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h#L175-L179

    if (n <= 1 || numThreads() == 1 ||
        CostModel::numThreads(n, cost, static_cast<int>(numThreads())) == 1) {
      f(0, n);
      return;
    }

In this code, n is the batch size. And when the batch size is 1, the work runs on the main thread.

(gdb) info threads
  Id   Target Id                                       Frame 
  1    Thread 0x7ffff7a48780 (LWP 1326007) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=0, limit=1) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  2    Thread 0x7ffff797e700 (LWP 1326011) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0c9d8) at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0x7ffff717d700 (LWP 1326012) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ca58) at ../sysdeps/nptl/futex-internal.h:183
  4    Thread 0x7ffff697c700 (LWP 1326013) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cad8) at ../sysdeps/nptl/futex-internal.h:183
  5    Thread 0x7ffff617b700 (LWP 1326014) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cb58) at ../sysdeps/nptl/futex-internal.h:183
  6    Thread 0x7ffff597a700 (LWP 1326015) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cbd8) at ../sysdeps/nptl/futex-internal.h:183
  7    Thread 0x7ffff5179700 (LWP 1326016) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cc58) at ../sysdeps/nptl/futex-internal.h:183
  8    Thread 0x7ffff4978700 (LWP 1326017) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ccd8) at ../sysdeps/nptl/futex-internal.h:183
  9    Thread 0x7ffff4177700 (LWP 1326018) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cd58) at ../sysdeps/nptl/futex-internal.h:183
  10   Thread 0x7ffff3976700 (LWP 1326019) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cdd8) at ../sysdeps/nptl/futex-internal.h:183
  11   Thread 0x7ffff3175700 (LWP 1326020) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ce58) at ../sysdeps/nptl/futex-internal.h:183
  12   Thread 0x7ffff2974700 (LWP 1326021) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ced8) at ../sysdeps/nptl/futex-internal.h:183
  13   Thread 0x7ffff2173700 (LWP 1326022) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=8, limit=9) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  14   Thread 0x7ffff1972700 (LWP 1326023) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=9, limit=10) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  15   Thread 0x7ffff1171700 (LWP 1326024) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=6, limit=7) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  16   Thread 0x7ffff0970700 (LWP 1326025) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=4, limit=5) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  17   Thread 0x7ffff016f700 (LWP 1326026) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=3, limit=4) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  18   Thread 0x7fffef96e700 (LWP 1326027) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=7, limit=8) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
* 19   Thread 0x7fffef16d700 (LWP 1326028) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=5, limit=6) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  20   Thread 0x7fffee96c700 (LWP 1326029) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=1, limit=2) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  21   Thread 0x7fffee16b700 (LWP 1326030) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=2, limit=3) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592

And in the multi-batch case, the main thread is also used to execute the f function. So I added the +1 for the main thread.

Contributor Author

@jyoungyun jyoungyun Jan 16, 2024


@glistening

What is returned by _dconv_kernel->getThreadCount()? Is the number controllable?

It returns the number of CPU cores, and we cannot change this value.

Contributor


@jyoungyun

In the above code, the main thread schedules the work for multi-threading and also executes handleRange itself.

It seems that you are right. The main thread may call the code below.

      if (last - first <= block_size) {
        // Single block or less, execute directly.
        f(first, last);
        barrier.Notify();
        return;
      }

Contributor

@ragmani ragmani Jan 17, 2024


@glistening

I think I already understand what you are saying. I want to keep our code that uses external libraries consistent. A few years ago, there was a requirement to limit the total number of threads. I guess onert::util::config::RUY_THREADS makes it easy to control how many threads are allowed for the ruy kernel. We can also tune how many threads work best by changing that number.

cker already uses Eigen::ThreadPoolDevice, which has as many threads as the number of cores, in kernels such as the Conv kernel for inference, the optimizer kernels for training, and the BroadcastTo kernel. If we used only kernels whose thread counts are easy to control, we would have to implement replacements ourselves for the kernels that are hard to control.

@jyoungyun
Is there a way to apply the optimized DepthwiseConv kernels without calling getThreadCount()?

Contributor

@ragmani ragmani Jan 17, 2024


If we need to control the number of threads in Eigen::ThreadPoolDevice, we can modify the code below:

struct EigenContext
{
  constexpr static int default_num_threadpool_threads = 4;
  std::unique_ptr<Eigen::ThreadPoolInterface> thread_pool_wrapper;
  std::unique_ptr<Eigen::ThreadPoolDevice> device;
  EigenContext()
  {
    int num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0)
    {
      num_threads = default_num_threadpool_threads;
    }
    device.reset(); // destroy before we invalidate the thread pool
    thread_pool_wrapper.reset(new EigenThreadPoolWrapper(new Eigen::ThreadPool(num_threads)));
    device.reset(new Eigen::ThreadPoolDevice(thread_pool_wrapper.get(), num_threads));
  }
  static inline EigenContext &GetEigenContext()
  {
    static EigenContext instance;
    return instance;
  }
};

inline const Eigen::ThreadPoolDevice *GetThreadPoolDevice()
{
  auto &ctx = EigenContext::GetEigenContext();
  return ctx.device.get();
}

Contributor


@jyoungyun and @ragmani Thank you for the explanation. For Eigen, the number of threads is determined by the number of cores on the device; it is not controllable the way the ruy backend is.

@jyoungyun jyoungyun dismissed stale reviews from glistening and YongseopKim via 3ff2926 January 15, 2024 02:28
jyoungyun and others added 2 commits January 15, 2024 11:28
Add comments for thread count

ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
@ragmani
Contributor

ragmani commented Jan 16, 2024

[2024-01-16T02:01:00.847Z] Cannot contact xu4-ubuntu-01: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException

@ragmani
Contributor

ragmani commented Jan 16, 2024

@nnfw-bot test onert-cross-release

Contributor

@glistening glistening left a comment


LGTM

Contributor

@ragmani ragmani left a comment


LGTM

@glistening glistening merged commit 92af8cf into Samsung:master Jan 18, 2024
3 checks passed
@jyoungyun jyoungyun deleted the onert/dconv2d_cker branch January 18, 2024 01:34