Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Appropriate "number of threads" for planning with OpenCilk #369

Open
ProExpertProg opened this issue Oct 2, 2024 · 2 comments
Open

Appropriate "number of threads" for planning with OpenCilk #369

ProExpertProg opened this issue Oct 2, 2024 · 2 comments

Comments

@ProExpertProg
Copy link

ProExpertProg commented Oct 2, 2024

tl;dr: What's the proper "number of threads" when using OpenCilk as the parallelism backend for FFTW?

I am trying to parallelize FFTW using Cilk. Following the manual, I have provided Cilk to FFTW using fftw_threads_set_callback. However, I've noticed that:

  1. The parallel routine only gets called if the "number of threads" is set using fftw_plan_with_nthreads.
  2. The number of jobs (njobs parameter) never seems to exceed the number of threads previously passed in.

While this makes sense for a pthreads or OpenMP-based parallelism backend where the work is divided evenly among threads, it does not extend well to Cilk (or TBB) where all semantic parallelism can be expressed. So when parallelizing with 8 threads, the parallel loop (cilk_for) should really contain at least an order of magnitude more iterations.

To remedy this, I tried setting the number of threads to the problem size, which causes a SEGFAULT. If I just pass 1024, it seems to work, and the parallel_loop hook will receive njobs equal to 1024, 2, or 512.

1024 is likely enough parallelism, but 1024 is just an arbitrary number, so I wanted to know if established practice existed. I'm benchmarking a cilk_for scheduling change so I don't want to be making guesses.

Full code example (compiled with OpenCilk 2.0.0):

#include "fftw3.h"

void parallel_for(void *(*work)(char *), char *jobdata, size_t elsize,
                  int njobs, void *data) {
  // std::cout << "parallel_for: " << njobs << std::endl;
  cilk_for(int i = 0; i < njobs; ++i) { work(jobdata + i * elsize); }
}

int main(int argc, char **argv) {
  fftw_init_threads();
  fftw_plan_with_nthreads(1024);
  fftw_threads_set_callback(parallel_for, nullptr);

  int const N = 16 * 1024 * 1024;

  // this causes a segfault
  // fftw_plan_with_nthreads(N);

  fftw_complex *buf = fftw_malloc(...);
  fftw_plan_dft(...);
  fftw_execute();
}
@stevengj
Copy link
Contributor

stevengj commented Oct 6, 2024

I suspect that a good rule of thumb is probably some multiple of the number of physical cores, e.g. 4x or 8x, so that the threads are granular enough to allow Cilk to load-balance other work (if any) but coarse enough to keep the threading overhead low. @matteo-frigo would be the expert here, however.

To remedy this, I tried setting the number of threads to the problem size

This is definitely too much — FFTW never recurses deep enough to have that much parallelism, because it coarsens the base cases.

@ProExpertProg
Copy link
Author

ProExpertProg commented Oct 18, 2024

Thank you for your response!

This is definitely too much — FFTW never recurses deep enough to have that much parallelism, because it coarsens the base cases.

Does parallel_for get called for multiple levels of recursion?

I also realized that instead of varying the number of jobs and the coarsening of the cilk_for loop, I could just vary the number of jobs and keep the cilk_for loop fixed at grainsize 1 to find the best number of jobs and hence measure the effect of coarsening on performance. I'll do the experiment and report results here - maybe it's useful for someone else in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants