Running final binary as sudo #41
You get two different error codes:
OpenCV seems unable to handle small work-group sizes, or at least has a lower bound larger than the work-group size supported by VC4CL.
This seems to be that the call to
fails here: Mailbox::executeQPU. On sending the ioctl, everything hangs.
During compilation:
Any ideas? >: QPULib examples work... (https://github.com/mn416/QPULib)
Sometimes, if the code generated for a kernel does something completely wrong, the QPUs get into a hung state. It looks like that is happening here.
And how do I deal with it? Is it a compiler problem or my code?
Most likely a compiler problem. I found an issue which also occurs in the kernel given, and will try to fix it.
What exactly? Maybe I could change my kernel to work around this problem.
This commit does not fix the issue, it still hangs. But confirmed: it is the wrong division; the modified kernel works like a charm:
Yeah, the integer division (and modulo) is still wrong. I could not figure out the reason for this yesterday.
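For what it's worth, a common workaround while integer division is broken is to route the division through floats, which the VideoCore IV can do natively. A scalar C sketch (my naming and shape, not VC4C's generated code; non-negative operands only):

```c
/* Hedged workaround sketch: emulate integer division with a float divide
 * plus one correction step, since the VideoCore IV has no integer-divide
 * hardware. Exact while the values fit the float's 24-bit mantissa; the
 * correction step absorbs the rounding in either direction. */
static unsigned div_via_float(unsigned a, unsigned b)
{
    unsigned q = (unsigned)((float)a / (float)b);
    if (q * b > a)             q--;  /* float quotient rounded up */
    else if ((q + 1) * b <= a) q++;  /* float quotient rounded down */
    return q;
}
```

The same trick gives modulo as `a - b * div_via_float(a, b)`.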
Does it mean float division will work? I need that for atan2, which contains the same integer division (for the offset), plus the y/x inside atan2... BTW, in your source there is a comment there; maybe you do x/y?
Floating point division works, at least for the tests I ran, with worse-than-allowed accuracy in some cases. I don't know if the atan2 function works...
Yes, that's weird. OpenCL on NVIDIA gives slightly different results than opencv-default (CPU?) does:
Kernel is
On RPI something is broken completely:
Used the NVIDIA reference implementation (http://developer.download.nvidia.com/cg/atan2.html); on desktop it's OK.
On RPI it works as well:
Almost... the negatives may mean it does not do "select" properly either, in this line:
Depending on the compilation flags you specified, the NVIDIA code may use faster but inaccurate operations (e.g. due to
This looks like the
Thanks for the link, this might come in very handy, if I can figure out its license...
Still can't make "select" work >: Is it broken too? Because I depended on it in another kernel as well (the initial one I simplified for the test).
I was using progs.build("-cl-opt-disable");
Okay, replaced "select" with a = fmod(a + pi2, pi2);
OK, made a custom select for float16; for integers it must be even simpler:
BTW, this works 10 times faster. With the original select I had to do convert_int16 because the result was float16 on the RPi, and that was 200 ms instead of the 17 ms I get now for the kernel on NVIDIA. On the RPi, though, there is no difference in speed (still not sure the original select was working at all; most likely not).
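For reference, a scalar C sketch of what such a hand-rolled select has to do: OpenCL's select(a, b, c) picks b wherever the MSB of the mask element is set, which can be mimicked branch-free with bit operations (function name and shapes are mine, not from either codebase):

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch of OpenCL-style select(a, b, c): returns b if the sign
 * bit (MSB) of c is set, a otherwise, using only bit operations -- which
 * is what the vector version must do independently per lane.
 * Note: c >> 31 relies on arithmetic right shift, which is
 * implementation-defined in C but universal on common compilers. */
static float select_msb(float a, float b, int32_t c)
{
    uint32_t ua, ub, out_bits;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    uint32_t m = (uint32_t)(c >> 31);       /* all-ones if MSB set, else 0 */
    out_bits = (ub & m) | (ua & ~m);
    float out;
    memcpy(&out, &out_bits, sizeof out);
    return out;
}
```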
Do I need to rebuild the compiler and VC4CL if the stdlib is updated? Or just reinstall the stdlib (including the PCH)?
You need to re-trigger building of the PCH and BC for vc4cl-stdlib, which is done by the script located in
OK... so no compiler rebuild if the PCH is built as a separate package like here: https://github.com/alexzk1/vc4_stdlib_arch/blob/master/PKGBUILD ? BTW, building VC4C gives many such warnings; is that OK? (GCC 8) Also: warning: "_GNU_SOURCE" redefined. I think you are kinda missing an inline or so... or maybe an #ifdef or #undef.
Well, the warnings are both libstdc++-internal warnings. I can try to disable the warning, but I don't think I can do anything to fix them.
fmod(float16, float16) is still broken; the Pi hangs. Got one more error (works on desktop):
To write to any memory? Or kind of "system memory"? Because OpenCL assumes such a written buffer must stay in GPU memory until explicitly copied to system memory in C++ code (which is slow).
There is just one memory: the GPU directly accesses the CPU memory; it does not have any of its own. For
On the host-side, since the memory is really shared, accessing a mapped buffer (e.g. via |
So... there is no way to beat that? >: The usual technique was to split the task into a couple of kernels and keep passing buffers between them, with only the final one copied to system memory. So I run 3 kernels. As you see in the 2nd kernel example: loading all data from the 1st + calculations = 8 ms. Storing data to pass to the 3rd = 90 ms extra. I don't get why loads are so much faster than stores, if it uses the same unit linearly.
So what if you check how the buffer was created, and if it is not accessed from the host side (a kind of GPU-only buffer), you skip the locks? It is the kernel developer's problem to keep everything consistent.
No, that is a hardware limitation. Unless of course there is some other hardware component I don't know about, or there is some trick to be done with the known components...
As I wrote in the post above, loads (if the same parameter is not written into, e.g. by
How would you do that? At compilation time I don't know which memory is really accessed host-side (at least not for
Do you take the "restrict" keyword into account too? As I understood it, it tells the compiler that no overlaps are expected on this pointer. ...OK, well, is there any way to take the lock once for a couple of vstores? For example, store 16 float16s at once under 1 lock?
Currently no, I more or less assume all parameters to be restricted, which might not be right. But treating them correctly would decrease performance...
It is done.
Well, I have a loop there which does vstore16 for a float16, and the loop itself runs over 16 items. I was thinking of caching it and storing 16 float16s (256 numbers) under 1 lock.
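The batching idea described here can be sketched in scalar C (an assumed shape; memcpy stands in for the single locked vstore/DMA burst, and all names are mine):

```c
#include <string.h>

#define TILE 16  /* 16 vectors of 16 floats = 256 values per flush */

/* Sketch: gather the results of the loop in a local cache, then flush
 * them with one bulk copy, so the lock/DMA setup cost is paid once per
 * 256 floats instead of once per 16. */
static void store_batched(float *dst, float src[TILE][16])
{
    float cache[TILE * 16];
    for (int i = 0; i < TILE; ++i)        /* fill the cache per iteration */
        memcpy(&cache[i * 16], src[i], 16 * sizeof(float));
    memcpy(dst, cache, sizeof cache);     /* one bulk store */
}
```

Whether this actually wins depends on whether the backend can turn the bulk store into a single VPM/DMA transaction, which is the question the rest of the thread explores.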
Ah, okay.
Okay... tried this one (I will join the code into 1 piece for easier copy/paste):
It is slower than in a loop. Same: a bit slower than a loop >:
How big is the VPM cache? What if you do the copy to RAM on an explicit call to cl::copy (or on cache overflow)? Otherwise, if kernels are chained, you will have the same address out/in and can keep using the VPM.
If I lower the number of work items (i.e. do deeply nested loops in the kernel), will it speed up the vstores somehow?
64 rows (actually more, but only 64 can be addressed) with 16 words (32-bit) per row, but addressing works per row.
Using the VPM as cache and only writing back when the cache is full would be a useful optimization, but also very hard to realize (esp. dynamically, e.g. with loops). The cache space needs to be split into blocks per QPU (core), and for each core the number of remaining free cache lines needs to be tracked...
This would be very useful, but the compiler cannot know if the input comes from another kernel or the host.
I doubt it; it will probably stay the same.
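Plugging in the numbers from the reply above gives a quick back-of-envelope picture (NUM_QPUS = 12 is the VideoCore IV QPU count; the static split is my illustration of why per-core caching is cramped, not how VC4C actually allocates the VPM):

```c
/* Back-of-envelope VPM numbers: 64 addressable rows of 16 32-bit words. */
enum { VPM_ROWS = 64, WORDS_PER_ROW = 16, WORD_BYTES = 4, NUM_QPUS = 12 };

static int vpm_addressable_bytes(void)
{
    return VPM_ROWS * WORDS_PER_ROW * WORD_BYTES;   /* 64 * 16 * 4 = 4 KB */
}

static int rows_per_qpu_static_split(void)
{
    /* a naive even split leaves each core only a handful of cache rows,
       which is why dynamic tracking would be needed to make this pay off */
    return VPM_ROWS / NUM_QPUS;
}
```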
Damn, then it kills all use of the GPU >: A 1:10 ratio of calculation to storing the result is crazy.
Look here, this guy describes getting higher speeds:
He introduces special functions to do asynchronous load and store. One thing, though, that I saw in his code, or more accurately didn't see in his code, was mutex locking...
Do you use the TMU too? Is it for loads only?
Yeah, for parameters which are only read and not written to, the TMU is used. That is the component I mentioned, which is mainly for reading colors from textures and can read in parallel
Can you add a non-OpenCL function which accepts an ARRAY of values or a pointer (and its size) and does the vstore under a single lock? So usage can be like #ifdef V4C. Will it speed up stores?
What exactly do you mean? With the current code you may be able to do the following:
This uses internal VC4CLStdLib functions, but they might change at any time without warning!
Yes, I was wondering whether using a single mutex to store many values instead of 1 would speed up the process. vstore takes 1 value (variable); I was thinking of using an array of variables, i.e. in my example it stores 256 floats at once. Also, the compiler should provide some #define so this can be recognized at compile time.
It should accept an array of values (float, float8, float16, any kind), the size of this array, the output start (as vstore does), and some "stride", so each array element is written to stride * index + pointer_ram (if it is a vector, all its elements go out sequentially starting from the calculated address).
Ok, tried that:
Gives these errors:
I do not understand why this error is thrown. Did you declare the function before use? The error looks like it was not declared, and as OpenCL is basically C, any undeclared function is assumed to be of type
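The implicit-declaration rule referred to here can be shown in plain C: without a prototype visible before the call site, pre-C99 C (and hence OpenCL C) assumes the function returns int, so a float result gets misread; a forward declaration avoids it. A minimal sketch (names are mine):

```c
/* Without this prototype, an implicit-int compiler would assume scale()
 * returns int at the call site below and misinterpret the float result.
 * Declaring before use fixes the error. */
static float scale(float x);          /* forward declaration */

static float twice_scaled(float x)
{
    return 2.0f * scale(x);           /* call site sees the correct type */
}

static float scale(float x)           /* definition appears later */
{
    return x * 0.5f;
}
```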
No, I didn't declare anything, including mutexes. OK, the forward declaration worked, but the idea seems bad: 4 times slower in total than direct vstore16.
For that I would need to know that the function is not declared. This is already handled by Clang by converting it to this strange bit-cast function call. From a VC4C point of view, this is just a bit-cast function call, which is not supported.
I didn't notice it before, but why do you write every element of the vector separately? You could also use
Well, this works slightly slower (about 20%) than the original per-element vstore16.
Can you send me a full example kernel of the memory access you do? So I can better see what improvements I can make that will help you. |
Here are hysterisis and non_maximum. You can check the history of the file to see what I tried. I use GitHub to copy-paste code to the Pi, so everything is there.
I tried the 1st kernel there: with all vstores disabled it takes 35 ms; once the vstores are enabled it takes 200+ ms. Not sure why they made such a chip >: just useless.
...and I am getting
What is it a problem of? >: Unsupported features, permissions, OpenCL?
Made this just in case
and it didn't work