Appropriate "number of threads" for planning with OpenCilk #369

ProExpertProg · 2024-10-02T22:30:38Z

tl;dr: What's the proper "number of threads" when using OpenCilk as the parallelism backend for FFTW?

I am trying to parallelize FFTW using Cilk. Following the manual, I have provided Cilk to FFTW using fftw_threads_set_callback. However, I've noticed that:

The parallel routine only gets called if the "number of threads" is set using fftw_plan_with_nthreads.
The number of jobs (njobs parameter) never seems to exceed the number of threads previously passed in.

While this makes sense for a pthreads or OpenMP-based parallelism backend where the work is divided evenly among threads, it does not extend well to Cilk (or TBB) where all semantic parallelism can be expressed. So when parallelizing with 8 threads, the parallel loop (cilk_for) should really contain at least an order of magnitude more iterations.

To remedy this, I tried setting the number of threads to the problem size, which causes a segfault. If I pass 1024 instead, it seems to work, and the parallel_loop hook (parallel_for in the code below) receives njobs equal to 1024, 2, or 512.

1024 is likely enough parallelism, but it is an arbitrary number, so I'd like to know whether there is an established practice. I'm benchmarking a change to cilk_for scheduling, so I don't want to guess.

Full code example (compiled with OpenCilk 2.0.0):

#include <cilk/cilk.h>  // cilk_for
#include <cstddef>      // size_t
#include "fftw3.h"

// Callback handed to FFTW: runs the njobs work items, job i at jobdata + i * elsize.
void parallel_for(void *(*work)(char *), char *jobdata, size_t elsize,
                  int njobs, void *data) {
  // std::cout << "parallel_for: " << njobs << std::endl;
  cilk_for(int i = 0; i < njobs; ++i) { work(jobdata + i * elsize); }
}

int main(int argc, char **argv) {
  fftw_init_threads();
  fftw_plan_with_nthreads(1024);
  fftw_threads_set_callback(parallel_for, nullptr);

  int const N = 16 * 1024 * 1024;

  // this causes a segfault
  // fftw_plan_with_nthreads(N);

  fftw_complex *buf = fftw_malloc(...);
  fftw_plan_dft(...);
  fftw_execute();
}
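For readers who want to check the callback contract in isolation: the loop above calls work(jobdata + i * elsize) once for each i in [0, njobs). The sketch below exercises the same addressing scheme without FFTW or Cilk, using a serial loop. DemoJob, demo_work, serial_loop and run_demo are made-up names for illustration only; FFTW's real job records are opaque to the callback.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical job record; stands in for FFTW's opaque per-job data.
struct DemoJob {
  int input;
  int output;
};

// Hypothetical work function with the same shape as the `work` argument.
void *demo_work(char *job) {
  DemoJob *j = reinterpret_cast<DemoJob *>(job);
  j->output = j->input * j->input;
  return nullptr;
}

// Serial stand-in for parallel_for: job i lives at jobdata + i * elsize.
void serial_loop(void *(*work)(char *), char *jobdata, std::size_t elsize,
                 int njobs, void *data) {
  (void)data;  // unused in this sketch
  assert(njobs >= 0);
  for (int i = 0; i < njobs; ++i) {  // would be cilk_for with OpenCilk
    work(jobdata + i * elsize);
  }
}

// Runs four demo jobs and returns the sum of their outputs.
int run_demo() {
  DemoJob jobs[4] = {{1, 0}, {2, 0}, {3, 0}, {4, 0}};
  serial_loop(demo_work, reinterpret_cast<char *>(jobs), sizeof(DemoJob), 4,
              nullptr);
  int sum = 0;
  for (const DemoJob &j : jobs) sum += j.output;
  return sum;
}
```

Because each iteration touches only its own record, the iterations are independent, which is what makes swapping the for loop for cilk_for safe in this sketch.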


stevengj · 2024-10-06T19:50:58Z

I suspect that a good rule of thumb is probably some multiple of the number of physical cores, e.g. 4x or 8x, so that the threads are granular enough to allow Cilk to load-balance other work (if any) but coarse enough to keep the threading overhead low. @matteo-frigo would be the expert here, however.
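As a rough sketch of that rule of thumb (not a tuned recommendation), the thread count could be derived from the core count at startup. The helper names are hypothetical. Note that std::thread::hardware_concurrency() reports logical hardware threads, which may exceed the number of physical cores on machines with SMT, so treat its result as an assumption to adjust.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: oversubscribe the core count by a small factor.
int suggested_nthreads(unsigned cores, int multiple) {
  assert(multiple > 0);
  // hardware_concurrency() may return 0 when the count is unknown,
  // so fall back to a single core in that case.
  unsigned c = std::max(cores, 1u);
  return static_cast<int>(c) * multiple;
}

// Convenience wrapper using the logical thread count reported by the runtime.
int suggested_nthreads_for_this_machine(int multiple) {
  return suggested_nthreads(std::thread::hardware_concurrency(), multiple);
}
```

The result would then be passed to the planner, e.g. fftw_plan_with_nthreads(suggested_nthreads_for_this_machine(8)), before creating any plans.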

> To remedy this, I tried setting the number of threads to the problem size

This is definitely too much — FFTW never recurses deep enough to have that much parallelism, because it coarsens the base cases.

ProExpertProg · 2024-10-18T00:33:59Z

Thank you for your response!

> This is definitely too much. FFTW never recurses deep enough to have that much parallelism, because it coarsens the base cases.

Does parallel_for get called for multiple levels of recursion?
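One way to check this empirically would be to instrument the callback. The sketch below is an assumption about how one might do that, not something taken from FFTW: it counts calls and tracks how many loops are in flight at once. A high-water mark above 1 means a loop started while another was still running, which for a single calling thread implies the inner loop was launched from inside a job. All names are hypothetical, and the mock work functions exist only to exercise the counters.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical counters, not part of FFTW.
std::atomic<int> g_active_loops{0};  // loops currently in flight
std::atomic<int> g_max_active{0};    // high-water mark of g_active_loops
std::atomic<long> g_calls{0};        // total number of callback invocations

void counting_loop(void *(*work)(char *), char *jobdata, std::size_t elsize,
                   int njobs, void *data) {
  (void)data;
  g_calls.fetch_add(1);
  int now = g_active_loops.fetch_add(1) + 1;
  // Raise the high-water mark if this call pushed the count past it.
  int seen = g_max_active.load();
  while (now > seen && !g_max_active.compare_exchange_weak(seen, now)) {
  }
  for (int i = 0; i < njobs; ++i) {  // would be cilk_for with OpenCilk
    work(jobdata + i * elsize);
  }
  g_active_loops.fetch_sub(1);
}

// Mock work functions used only to exercise the counters.
void *leaf_work(char *) { return nullptr; }
void *nesting_work(char *job) {
  counting_loop(leaf_work, job, 1, 1, nullptr);  // loop started inside a job
  return nullptr;
}

// Runs two mock jobs that each start an inner loop; returns the high-water mark.
int demo_max_depth() {
  char jobs[2] = {0, 0};
  counting_loop(nesting_work, jobs, 1, 2, nullptr);
  return g_max_active.load();
}
```

Registering counting_loop via fftw_threads_set_callback and printing g_calls and g_max_active after fftw_execute would show whether nested invocations occur for a given plan.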

I also realized that instead of varying the number of jobs and the coarsening of the cilk_for loop, I could just vary the number of jobs and keep the cilk_for loop fixed at grainsize 1 to find the best number of jobs and hence measure the effect of coarsening on performance. I'll do the experiment and report results here - maybe it's useful for someone else in the future.
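A minimal sketch of how such a sweep could be organized is shown below. It is a generic timing harness under stated assumptions, not measured results: prepare(n) is a hypothetical hook that would call fftw_plan_with_nthreads(n) and build the plan, and execute() is a hypothetical hook that would call fftw_execute on it. Only execute() is timed, and the best of several repetitions is kept.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

struct SweepResult {
  int nthreads;    // value passed to prepare()
  double seconds;  // best observed execute() time
};

// Candidate values: cores * 1, 2, 4, ... while the factor is <= max_multiple.
std::vector<int> candidate_nthreads(int cores, int max_multiple) {
  assert(cores > 0 && max_multiple > 0);
  std::vector<int> out;
  for (int m = 1; m <= max_multiple; m *= 2) out.push_back(cores * m);
  return out;
}

// For each candidate: prepare once, then time execute() `reps` times.
std::vector<SweepResult> sweep(const std::vector<int> &candidates, int reps,
                               const std::function<void(int)> &prepare,
                               const std::function<void()> &execute) {
  assert(reps > 0);
  std::vector<SweepResult> results;
  for (int n : candidates) {
    prepare(n);
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
      auto t0 = std::chrono::steady_clock::now();
      execute();
      std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
      if (best < 0.0 || dt.count() < best) best = dt.count();
    }
    results.push_back({n, best});
  }
  return results;
}
```

With 8 cores, candidate_nthreads(8, 128) yields 8, 16, ..., 1024, which covers the range between the multiples suggested above and the 1024 used in the original example.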

SallySoul mentioned this issue Dec 30, 2024

Add support for multi-threaded FFTW operations. TEALab-org/nhls#11

Closed
