
[cker] Introduce DepthwiseConv2D gradient kernel #12448

Merged
merged 3 commits into from
Jan 18, 2024

Conversation

jyoungyun
Contributor

@jyoungyun jyoungyun commented Jan 11, 2024

This commit introduces the DepthwiseConv2D gradient kernel. The depthwise_conv_op.h file comes from TensorFlow code. This commit adapts the depthwise_conv-related code to fit onert and executes this kernel function.

ONE-DCO-1.0-Signed-off-by: Jiyoung Yun [email protected]

Draft: #12392

@jyoungyun jyoungyun added the PR/ready for review It is ready to review. Please review it. label Jan 11, 2024
@jyoungyun jyoungyun requested a review from a team January 11, 2024 09:24
YongseopKim
YongseopKim previously approved these changes Jan 12, 2024
Contributor

@YongseopKim YongseopKim left a comment


LGTM 👍
And IMO the title would be better as [cker] Introduce ..., because this code is not in onert.

@YongseopKim YongseopKim requested a review from a team January 12, 2024 00:20
@jyoungyun jyoungyun changed the title [onert] Implement DepthwiseConv2D gradient kernel [onert] Introduce DepthwiseConv2D gradient kernel Jan 12, 2024
@glistening
Contributor

glistening commented Jan 12, 2024

@jyoungyun I am waiting for the commit/PR title change as @YongseopKim commented: [onert] to [cker]

@jyoungyun
Contributor Author

Oh, I missed the [cker] prefix. I will update it.

This commit introduces the DepthwiseConv2D gradient kernel.
The depthwise_conv_op.h file comes from TensorFlow code.
This commit adapts the depthwise_conv-related code to fit onert
and executes this kernel function.

ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
@jyoungyun jyoungyun changed the title [onert] Introduce DepthwiseConv2D gradient kernel [cker] Introduce DepthwiseConv2D gradient kernel Jan 12, 2024
glistening
glistening previously approved these changes Jan 12, 2024
Contributor

@glistening glistening left a comment


LGTM

}

{
const int thread_count = _dconv_kernel->getThreadCount() + 1;
Contributor


Could you please let me know why the thread count should be increased by 1?

Contributor Author

@jyoungyun jyoungyun Jan 15, 2024


@ragmani
The +1 accounts for the main thread. numThreads() in the Eigen library returns only the number of pool threads, excluding the main thread. Depending on the situation, this kernel function can also execute on the main thread, so I added 1 for it.

Contributor

@glistening glistening Jan 15, 2024


I don't fully understand. Out of curiosity, I searched the TensorFlow code (e.g. depthwise_conv_op.cc), and it seems to use the worker thread count. I can't find a +1, but I found the following:

auto worker_threads = *(ctx->device()->tensorflow_cpu_worker_threads());

I guess the thread_count should be the number of worker threads.
Is the main thread part of the worker threads?

Contributor


Is there any reason to add 1 for the main thread in the thread count? As far as I know, the main thread is usually not used in an optimized kernel.

Contributor Author


Is there any reason to add 1 for the main thread in the thread count? As far as I know, the main thread is usually not used in an optimized kernel.

Could you give a link showing that the main thread is usually not used in an optimized kernel? When I tested this kernel with several cases, the main thread was also used. In particular, if the batch size is 1, the work usually runs on the main thread. In other cases, which thread is used depends on the situation.

https://github.com/libigl/eigen/blob/1f05f51517ec4fd91eed711e0f89e97a7c028c0e/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h#L175-L179

    if (n <= 1 || numThreads() == 1 ||
        CostModel::numThreads(n, cost, static_cast<int>(numThreads())) == 1) {
      f(0, n);
      return;
    }

In this code, n is the batch size. And when the batch size is 1, the work runs on the main thread.

(gdb) info threads
  Id   Target Id                                       Frame 
  1    Thread 0x7ffff7a48780 (LWP 1326007) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=0, limit=1) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  2    Thread 0x7ffff797e700 (LWP 1326011) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0c9d8) at ../sysdeps/nptl/futex-internal.h:183
  3    Thread 0x7ffff717d700 (LWP 1326012) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ca58) at ../sysdeps/nptl/futex-internal.h:183
  4    Thread 0x7ffff697c700 (LWP 1326013) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cad8) at ../sysdeps/nptl/futex-internal.h:183
  5    Thread 0x7ffff617b700 (LWP 1326014) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cb58) at ../sysdeps/nptl/futex-internal.h:183
  6    Thread 0x7ffff597a700 (LWP 1326015) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cbd8) at ../sysdeps/nptl/futex-internal.h:183
  7    Thread 0x7ffff5179700 (LWP 1326016) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cc58) at ../sysdeps/nptl/futex-internal.h:183
  8    Thread 0x7ffff4978700 (LWP 1326017) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ccd8) at ../sysdeps/nptl/futex-internal.h:183
  9    Thread 0x7ffff4177700 (LWP 1326018) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cd58) at ../sysdeps/nptl/futex-internal.h:183
  10   Thread 0x7ffff3976700 (LWP 1326019) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0cdd8) at ../sysdeps/nptl/futex-internal.h:183
  11   Thread 0x7ffff3175700 (LWP 1326020) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ce58) at ../sysdeps/nptl/futex-internal.h:183
  12   Thread 0x7ffff2974700 (LWP 1326021) "test_cker" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555555b0ced8) at ../sysdeps/nptl/futex-internal.h:183
  13   Thread 0x7ffff2173700 (LWP 1326022) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=8, limit=9) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  14   Thread 0x7ffff1972700 (LWP 1326023) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=9, limit=10) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  15   Thread 0x7ffff1171700 (LWP 1326024) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=6, limit=7) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  16   Thread 0x7ffff0970700 (LWP 1326025) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=4, limit=5) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  17   Thread 0x7ffff016f700 (LWP 1326026) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=3, limit=4) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  18   Thread 0x7fffef96e700 (LWP 1326027) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=7, limit=8) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
* 19   Thread 0x7fffef16d700 (LWP 1326028) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=5, limit=6) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  20   Thread 0x7fffee96c700 (LWP 1326029) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=1, limit=2) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592
  21   Thread 0x7fffee16b700 (LWP 1326030) "test_cker" nnfw::cker::depthwise_conv_op::LaunchDepthwiseConvBackpropInputOp<Eigen::ThreadPoolDevice, float>::operator()(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float const*, float*, float*, bool, float*, float*)::{lambda(long, long)#1}::operator()(long, long) const (this=0x555555b0c560, start=2, limit=3) at /home/jyoung/git/npu/ONE/compute/cker/include/cker/eigen/depthwise_conv_op.h:592

And in the multi-batch case, the main thread is also used to execute the f function. So I added the +1 for the main thread.

Contributor Author

@jyoungyun jyoungyun Jan 16, 2024


@glistening

What is returned by _dconv_kernel->getThreadCount()? Is the number controllable?

It returns the number of CPU cores, and we cannot change this value.

Contributor


@jyoungyun

In the above code, the main thread schedules the work for multi-threading and also executes handleRange itself.

It seems that you are right. The main thread may call the code below.

      if (last - first <= block_size) {
        // Single block or less, execute directly.
        f(first, last);
        barrier.Notify();
        return;
      }

Contributor

@ragmani ragmani Jan 17, 2024


@glistening

I think I already understand what you are saying. I want to keep our code that uses external libraries consistent. A few years ago, there was a requirement to limit the total number of threads. I guess onert::util::config::RUY_THREADS makes it easy to control how many threads are allowed for the ruy kernel. We can also tune how many threads work best by changing that number.

cker already uses Eigen::ThreadPoolDevice, which has as many threads as the number of cores, in kernels such as the Conv kernel for inference, the optimizer kernels for training, and the BroadcastTo kernel. If we used only kernels whose thread counts are easy to control, we would have to implement replacements ourselves for the kernels that are hard to control.

@jyoungyun
Is there a way to apply the optimized DepthwiseConv kernels without calling getThreadCount()?

Contributor

@ragmani ragmani Jan 17, 2024


If we need to control the number of threads in Eigen::ThreadPoolDevice, we can modify the code below:

struct EigenContext
{
  constexpr static int default_num_threadpool_threads = 4;
  std::unique_ptr<Eigen::ThreadPoolInterface> thread_pool_wrapper;
  std::unique_ptr<Eigen::ThreadPoolDevice> device;
  EigenContext()
  {
    int num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0)
    {
      num_threads = default_num_threadpool_threads;
    }
    device.reset(); // destroy before we invalidate the thread pool
    thread_pool_wrapper.reset(new EigenThreadPoolWrapper(new Eigen::ThreadPool(num_threads)));
    device.reset(new Eigen::ThreadPoolDevice(thread_pool_wrapper.get(), num_threads));
  }
  static inline EigenContext &GetEigenContext()
  {
    static EigenContext instance;
    return instance;
  }
};

inline const Eigen::ThreadPoolDevice *GetThreadPoolDevice()
{
  auto &ctx = EigenContext::GetEigenContext();
  return ctx.device.get();
}

Contributor


@jyoungyun and @ragmani Thank you for the explanation. For Eigen, the number of threads is determined by the number of cores on the device; it is not controllable the way the ruy backend is.

@jyoungyun jyoungyun dismissed stale reviews from glistening and YongseopKim via 3ff2926 January 15, 2024 02:28
jyoungyun and others added 2 commits January 15, 2024 11:28
Add comments for thread count

ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
@ragmani
Contributor

ragmani commented Jan 16, 2024

[2024-01-16T02:01:00.847Z] Cannot contact xu4-ubuntu-01: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException

@ragmani
Contributor

ragmani commented Jan 16, 2024

@nnfw-bot test onert-cross-release

Contributor

@glistening glistening left a comment


LGTM

Contributor

@ragmani ragmani left a comment


LGTM

@glistening glistening merged commit 92af8cf into Samsung:master Jan 18, 2024
3 checks passed
@jyoungyun jyoungyun deleted the onert/dconv2d_cker branch January 18, 2024 01:34