tl;dr: What's the proper "number of threads" when using OpenCilk as the parallelism backend for FFTW?
I am trying to parallelize FFTW using Cilk. Following the manual, I have registered a Cilk-based callback with FFTW via fftw_threads_set_callback. However, I've noticed that:
The parallel routine only gets called if the "number of threads" is set using fftw_plan_with_nthreads.
The number of jobs (njobs parameter) never seems to exceed the number of threads previously passed in.
While this makes sense for a pthreads- or OpenMP-based parallelism backend, where the work is divided evenly among threads, it does not extend well to Cilk (or TBB), where it is best to express all of the available parallelism and let the runtime do the scheduling. So when parallelizing with 8 threads, the parallel loop (cilk_for) should really contain at least an order of magnitude more iterations than there are workers.
To remedy this, I tried setting the number of threads to the problem size, which causes a segfault. If I instead pass 1024, it seems to work, and the parallel_loop hook receives njobs equal to 1024, 2, or 512.
1024 is likely enough parallelism, but it is an arbitrary number, so I wanted to know whether an established practice exists. I'm benchmarking a cilk_for scheduling change, so I don't want to be guessing.
Full code example (compiled with OpenCilk 2.0.0):
#include <cilk/cilk.h>
#include "fftw3.h"

void parallel_for(void *(*work)(char *), char *jobdata, size_t elsize,
                  int njobs, void *data) {
  // std::cout << "parallel_for: " << njobs << std::endl;
  cilk_for (int i = 0; i < njobs; ++i) { work(jobdata + i * elsize); }
}

int main(int argc, char **argv) {
  fftw_init_threads();
  fftw_plan_with_nthreads(1024);
  fftw_threads_set_callback(parallel_for, nullptr);
  int const N = 16 * 1024 * 1024;
  // this causes a segfault
  // fftw_plan_with_nthreads(N);
  fftw_complex *buf = fftw_malloc(...);
  fftw_plan p = fftw_plan_dft(...);
  fftw_execute(p);
}
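For reference, a build invocation along these lines should work with OpenCilk's clang, assuming FFTW was configured with --enable-threads (the exact source filename, library names, and paths here are assumptions that depend on the install):

```shell
clang++ -fopencilk -O2 example.cpp -lfftw3_threads -lfftw3 -lm -o example
```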
I suspect that a good rule of thumb is probably some multiple of the number of physical cores, e.g. 4x or 8x, so that the jobs are fine-grained enough to allow Cilk to load-balance other work (if any) but coarse enough to keep the threading overhead low. @matteo-frigo would be the expert here, however.
To remedy this, I tried setting the number of threads to the problem size
This is definitely too much — FFTW never recurses deep enough to have that much parallelism, because it coarsens the base cases.
Does parallel_for get called for multiple levels of recursion?
I also realized that instead of varying both the number of jobs and the coarsening of the cilk_for loop, I could just vary the number of jobs while keeping the cilk_for loop fixed at grainsize 1, to find the best number of jobs and hence measure the effect of coarsening on performance. I'll run the experiment and report the results here; maybe it will be useful to someone else in the future.