poor kernel performance with saturated addition on uchar array #85
One thing I can see is that you should be up to 16 times faster if you actually make use of the native vector width, by processing 16 values at once with the vector types (e.g. float16 / uchar16).
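For illustration only (this kernel is not taken from the issue; the argument names and the use of a single constant offset are assumptions), a 16-wide saturated addition could look roughly like this, using the standard OpenCL built-in `add_sat`:

```c
// Hypothetical 16-wide variant: each work-item processes 16 uchar values,
// matching the 16-element native vector width of the QPUs.
__kernel void add_sat16(__global const uchar16 *in,
                        __global uchar16 *out,
                        const uchar offset)
{
    const size_t gid = get_global_id(0);
    // add_sat() clamps the result to the uchar range (0..255)
    out[gid] = add_sat(in[gid], (uchar16)(offset));
}
```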
This is something I have looked into (detecting and automatically vectorizing such code in the compiler), but there is no proper implementation yet. |
Hello, I have built the kernel and the main program with:
I know that I can use float16, but the GPU should still be 10x faster than the CPU and it is not. Am I doing something wrong, or is this the standard performance of the VideoCore IV on the Raspberry Pi Zero? Or is it possible that I have badly built and installed VC4C, VC4CL and VC4CLStdLib or other software? Thank you for your support. Kernel:
main:
Martin |
It probably won't ever be 10x faster than the CPU value unless you actually use the parallelization features of the GPU. Let's try some estimations:
The superior processing power of the GPU (24 GFLOPS GPU vs. 1 GFLOP CPU theoretical maximum) can only be used when the parallelization features (16-way SIMD vectors on 12 QPU processors, preferably using both ALUs) are actually utilized. Let's assume we use the full 16-element vector width.
This gives us the following approximation:
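(The numbers of the original approximation are not preserved in this extract; the following is only a rough back-of-the-envelope reconstruction using the figures quoted elsewhere in this thread:)

```
~700 ms   scalar baseline mentioned in the next paragraph
 / 16     processing 16 values per instruction (float16 / uchar16)
 ~ 44 ms  for the same work, before memory and scheduling overhead
```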
You won't reach that theoretical time, because a large part of the original 700 ms will be CPU-side and scheduling overhead, but I assume you should be able to significantly lower the execution time. |
Thank you very much for your explanation, now it makes sense; I didn't know there was such a big overhead. I read that the VideoCore IV has a maximum processing power of 24 GFLOPS, and while I expected some overhead, I still thought I would be able to multiply 3 giga-samples per second. Thank you |
The maximum actual performance I ever measured was just above 8 GFLOPS, running the clpeak floating-point test. So, going by the calculation above, you can definitely achieve that. The problem with your kernel is that you have 3 memory accesses (2 reads and 1 write) per handful of arithmetic operations, so most of the time will be spent waiting for I/O, since memory access from the VideoCore IV is not that fast.
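As an illustration (the original kernel is not reproduced in this extract; an element-wise multiply is assumed here, as suggested by the comments above), the same 2-reads-plus-1-write pattern in a 16-wide kernel touches memory only three times per 16 results:

```c
// Hypothetical 16-wide multiply: still 2 reads and 1 write per work-item,
// but every access now moves 16 values, so there are 16x fewer memory
// transactions per element.
__kernel void mul16(__global const float16 *a,
                    __global const float16 *b,
                    __global float16 *c)
{
    const size_t gid = get_global_id(0);
    c[gid] = a[gid] * b[gid];   // 2 reads + 1 write for 16 results
}
```
|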
I have now modified the kernel to remove the memory operations, and 1.2M operations took 28 ms. Memory operations are very expensive. But why does the CPU have faster memory access? It is the same memory. Can I somehow speed up the memory access? Thank you. |
Hey,
I'm trying to increase the brightness of a greyscale image using the kernel below.
My problem is that I want to execute this operation on a 640x480 image at 25 fps, which means it can't take more than roughly 15 ms, but the execution of this kernel takes far too long.
Here are the results I got using OpenCL's profiling events for the kernel below:
Frames captured: 100
Average FPS: 9.2
Average time per frame: 109.21 ms
Average processing time: 96.21 ms
OpenCL clEnqueueWriteBuffer: 1.792 ms <---- writing the input array to the GPU memory
OpenCL Kernel execution time: 85.851 ms <---- kernel execution time
OpenCL clEnqueueReadBuffer: 1.581 ms <---- reading the GPU output memory into the output array
Even stranger is the execution time I got when I changed the line
C[i] = (A[i]+B) >= 255 ? 255 : (A[i]+B);
to
C[i] = A[i];
Frames captured: 160
Average FPS: 5.0
Average time per frame: 199.57 ms
Average processing time: 187.72 ms
OpenCL clEnqueueWriteBuffer: 1.266 ms
OpenCL Kernel execution time: 177.103 ms
OpenCL clEnqueueReadBuffer: 1.656 ms
The kernel:
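(The kernel source itself is not reproduced in this extract; based on the line quoted above, it was presumably of roughly the following shape. This is a reconstruction, not the original code, and the argument types and names are guesses:)

```c
// Reconstructed sketch, not the original kernel: per-pixel brightness
// increase with manual clamping to 255, one uchar handled per work-item.
__kernel void brightness(__global const uchar *A,
                         __global uchar *C,
                         const int B)
{
    const int i = get_global_id(0);
    C[i] = (A[i] + B) >= 255 ? 255 : (A[i] + B);
}
```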
Snippets of the .cpp file (not including the GPU setup code):
What am I doing wrong?
Thanks
FMaier