-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel.js is slow due to unnecessary per-job overhead #189
Comments
Test codeThis is a reduced test case. It's not the exact code I ran, but similar to what I tested with. Syncronous
Using paralleljs
Split-up the array
|
CauseLooking at the Parallel.js code, the code does seem to try to re-use workers in The message passing between controlling process and the spawned worker process might be slow. I can see in the code that the data is passed to the worker one by one, and returned one by one, one job at a time. That's 180000 messages for the first test case. It is quite likely this constant back and forth between the processes that is so slow. It would match and explain the numbers that I measured. FixThe fix is relatively simple: Pass all values at once into a worker. I.e. do what I did with the split-up code. That leads to a 3x to 4x speedup for workloads with 100000 calls. Edge caseThere might be special cases where there are few jobs, but the running time varies dramatically between different jobs. For such cases, the current design of one by one is better. So, you could enable it based on the number of jobs/values, e.g. 100, where the back and forth overhead becomes significant and differences between the job execution times likely level out. |
There seems to be a significant overhead per job, not just per worker.
My task is to run the levenshtein algo, comparing a given string with about 200000 target strings and finding the best match. This is ideal for parallelism, because it runs the exact same algo thousands of times, the algo is small, needs little data and no context. Without parallelism, it takes a significant time that's noticeable for the end user.
So, I've used paralleljs to run it on multiple CPU cores at the same time. However, I find that using paralleljs actually makes it slower than before with the synchronous version.
Environment:
Terminology:
I understand the purpose of Parallel.js to create the workers and to distribute the jobs to the workers in an efficient manner and with a comfortable and easy to use API.
Test results
Synchronous
No parallel.js used, all evaluations run in a single JS thread
664 ms with 88959 (longer) values
358 ms with 89548 (shorter) values
93 ms with 31552 values
10 ms with 2994 values
0 ms with 79 values
1 ms with 137 values
1 ms with 823 values
0 ms with 360 values
This is too slow and what we want to improve using parallel.js
Using paralleljs with default options
Same with maxWorkers = 16 (= number of SMT virtual CPU cores)
2067 ms with 88959 values
2036 ms with 89548 values
907 ms with 31552 values
332 ms with 2994 values
272 ms with 79 values
287 ms with 137 values
287 ms with 823 values
280 ms with 360 values
It's 300 times slower for small amount of data. This is might be the overhead of creating worker threads.
It's even 3-6 times as slow for large amount of data. This is somewhat surprising, because it cannot be the overhead of creating the worker threads, because they should be limited to 16 and be re-used for additional data, so I would expect at worst the time of the synchronous execution, plus the 280ms overhead of creating the workers, plus a tiny little overhead for the copy of the data. That would be around 700 ms for the second test, but it's actually 2000 ms, almost 3 times as long.
Interestingly, the first test (longer strings) and second test (shorter test) cost about the same time. This makes a big difference in the synchronous execution, but not with parallel.js, where the larger and shorter strings take about the same time. So, the slowdown isn't due to the pure data copy time. (See also the note to the last test below.) This shows that the amount of data has less of an influence than other factors.
This indicates that there might be something wrong (very inefficient) in parallel.js. There is some significant overhead per job, e.g. that workers are not re-used, by re-created all the time for each job, or some other huge overhead per job.
with maxWorkers = 8
(= number of physical CPU cores without SMT)
2002 ms with 88959 (longer) values
1782 ms with 89548 (shorter) values
778 ms with 31552 values
198 ms with 2994 values
156 ms with 79 values
166 ms with 137 values
173 ms with 823 values
153 ms with 360 values
Split up array
Starting only as many jobs as CPU cores, by manually splitting up the value array before running
Parallel
593ms with 88959 values
526ms with 89548 values
357ms with 31552 values
304ms with 2994 values
285ms with 79 values
285ms with 137 values
290ms with 823 values
284ms with 360 values
Here, I first manually split up the data array into one array per CPU core, leading to an nested array with 16 elements, each being an array with thousands of values. This feeds a large array to each worker, one per CPU core. Then, within (!) the worker, I make a synchronous loop over the second level array and the thousands of values.
This is to test whether parallel.js is inefficiently distributing the work, e.g. by creating too many workers, or something with a similar effect. This appears to be the case.
I would expect that Parallel.js creates 1 worker per
maxWorker
, then makes a synchronous loop over the data and runs the function on each value. If it did, it should get speeds very similar to this here. Given that Parallel.js is about 4 times slower, that indicates that there's something to improve.Split up array, with 8 workers
578ms with 88959 values
457ms with 89548 values
228ms with 31552 values
167ms with 2994 values
142ms with 79 values
147ms with 137 values
155ms with 823 values
149ms with 360 values
This is coming closest to the synchronous version, because it's also sl
1 Worker, 1 Job
This is the same as the last 2 cases, just with core = 1. This means there is no parallelism and the entire data array is passed into a single worker. This is effectively the same as the synchronous version without any parallel.js, just that it's running in a single worker created by Parallel.js.
1178ms with 88959 values
838ms with 89548 values
316ms with 31552 values
83ms with 2994 values
48ms with 79 values
45ms with 137 values
60ms with 823 values
52ms with 360 values
This should be slightly slower than the synchronous version, due to the the overhead of creating the workers, and a lot slower than the parallel one, due to only 1 CPU core being used. However, it's a lot faster than the non-split parallel version, which should not be the case.
This also shows that the data copy, which still has to happen here, is a lesser factor. The data copy needs to happen here as well. Yet, it is a lot faster than the non-split version, which indicates that the data copy is not the primary reason for the slowness.
Conclusion
Using Parallel.js in this case is dramatically slower than the synchronous version, by 1 or 2 orders of magnitude.
Inherent thread creation overhead
There is a certain overhead of using Parallel.js, which is very significant for small inputs: hundreds of times slower, i.e. 2 orders of magnitude slower. That is due to the fundamental cost of creating threads or processes.
This could be worked around by the caller, by checking the amount of input and avoiding Parallel.js in these cases. The factors to consider are: Amount of input data, execution time per function call, and execution time per thread creation.
Data copy overhead
There is also an overhead for copying the data. However, while that is there, results show that this is not the primary reasons for the slowdown.
Needless overhead
There is a huge overhead per job, not just per worker.
The results suggest that Parallel.js creates more workers than
maxWorkers
, or has other significant overheads per job with a similar effect.There should be no overhead per job, only per worker and for the data transfer.
It would be good to look into this.
The text was updated successfully, but these errors were encountered: