"Execution: failed" when running simple dot product kernel #67
Unless you explicitly disabled This could be due to one of two problems:
Can you run the kernel in emulation mode and send the output (or at least the error)? |
Building VC4CL with The crash is in
65| void* clGetExtensionFunctionAddress(const char* name)
66| {
67+>     return VC4CL_FUNC(clGetExtensionFunctionAddressForPlatform)(Platform::getVC4CLPlatform().toBase(), name);
68| }
I believe it's because Let me know if I'm building VC4CL incorrectly. |
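When VC4CL is built in emulation mode (MOCK_HAL enabled), it is built without ICD loader support. Thus, to run clinfo with the emulation version, you need to make sure clinfo loads libVC4CL.so as libOpenCL.so:
1. Create a symlink ln --symbolic libVC4CL.so libOpenCL.so within the build/src directory
2. Run clinfo with LD_PRELOAD=path/to/libOpenCL.so clinfo
Any program which relies on the ICD loader (any program linking in libOpenCL.so and not directly libVC4CL.so) will need to be started with the LD_PRELOAD environment variable. |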
Thanks! I'll give this a shot.
…On Thu, Mar 28, 2019, 10:03 AM doe300 ***@***.***> wrote:
When VC4CL is built in emulation mode (MOCK_HAL enabled), it is built
without ICD loader support.
Thus, to run clinfo with the emulation version, you need to make sure,
clinfo loads the libVC4CL.so as libOpenCL.so:
1. Create a symlink ln --symbolic libVC4CL.so libOpenCL.so within the
build/src directory
2. Run clinfo with LD_PRELOAD=path/to/libOpenCL.so clinfo
Any program which relies on the ICD loader (any program linking in
libOpenCL.so and not directly libVC4CL.so) will need to be started with
the LD_PRELOAD environment variable.
|
Alright, with Compared to running directly on the device, with emulation, all work groups ran. Any other ideas about what could be going on here? Snipped output below (removed the middle part that was repetitive, full output here)
|
I tried it out myself too yesterday and it took a while but finished. Which is bad in this case, since it doesn't help to find the problem. So from the The One thing you could check: |
Your suspicions seem to be correct. Without any changes:
Changing the 30s timeout to 60s:
Changing the 60s timeout to 120s:
|
Ping. :) Any luck? |
Sorry, didn't have time to look into that. |
So it looks like for a work-group size of 12 (the maximum), the execution hangs, probably somewhere in A side note: |
* Improves a few optimizations * Adapts emulator to semaphore changes See doe300/VC4CL#67 Fixes: boost_compute/test_reduce_by_key OpenCL-CTS/basic/kernel_limit_constants
So, turns out I did semaphore access wrong, that is why the execution timed out. |
Hi. I tried to reproduce this bug, and I found that VC4CL on my Raspberry Pi 3 B+ has some floating-point precision errors. I ran 2 tests: dot_product (the same as in this issue) and a basic saxpy program ( dest[i] = src[i] * 3.14 ). In both cases I get different values than on the host:
[0] Host: 3313.5571 Device: 3313.5571
The dot product has problems with local memory and computation. I don't know why. I checked your other issues but I didn't find a workaround. For example, if I do a simple operation, the final debug values are just wrong:
Without BUG: With BUG (the same with multiplication, etc): Results: What can be wrong? |
The difference between For the second part: |
I have created this repo: https://github.com/rNoz/opencl_embedded_tests
Precision
See the differences when using FACTOR=3.1415 or FACTOR=2.0. Is this the expected behavior? This is the first time I have read about ULP. I checked that OpenCL is less restrictive (less precise) than CUDA, and I assume the Embedded Profile relaxes this even further. Any advice on writing "unit test"/benchmark checks for embedded devices?
$ FACTOR=3.1415 VECTOR=12 CHECK=1 sudo -E ./build/saxpy saxpy.cl
vector_len: 12
check results: true
factor: 3.141500
using platform.device: 0.0
saxpy.cl
operation: saxpy
=== 1 OpenCL platform(s) found: ===
-- 0 --
PROFILE = EMBEDDED_PROFILE
VERSION = OpenCL 1.2 VC4CL 0.4.9999
NAME = OpenCL for the Raspberry Pi VideoCore IV GPU
VENDOR = doe300
EXTENSIONS = cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_vc4cl_performance_counters
=== 2 OpenCL device(s) found on platform:
-- 0 --
DEVICE_NAME = VideoCore IV GPU
DEVICE_VENDOR = Broadcom
DEVICE_VERSION = OpenCL 1.2 VC4CL 0.4.9999
DRIVER_VERSION = 0.4.9999
DEVICE_MAX_COMPUTE_UNITS = 1
DEVICE_MAX_CLOCK_FREQUENCY = 300
DEVICE_GLOBAL_MEM_SIZE = 67108864
DEVICE_MAX_WG_SIZE X=12,Y=12,Z=12
Creating context...
Creating command queue...
Creating program...
Building program from source...
attempting to create input buffer
attempting to create output buffer
attempting to create kernel
setting up kernel args cl_mem: 0x7f4800bc38
attempting to enqueue write buffer
attempting to enqueue kernel
Enqueue'd kerenel
time(ns):590937
Result:
[0] Host: 263.944977 Device: 263.944977
[1] Host: 123.895401 Device: 123.895393
[FAILURE] at index 1: 123.895401 != 123.895393
[2] Host: 246.010620 Device: 246.010605
[FAILURE] at index 2: 246.010620 != 246.010605
[FAILURE] at index 4: 286.394043 != 286.394012
[FAILURE] at index 6: 105.310226 != 105.310219
[FAILURE] at index 9: 174.029678 != 174.029663
computed 12 elements
$ FACTOR=2.0 VECTOR=12 CHECK=1 sudo -E ./build/saxpy saxpy.cl
vector_len: 12
check results: true
factor: 2.000000
using platform.device: 0.0
saxpy.cl
operation: saxpy
=== 1 OpenCL platform(s) found: ===
-- 0 --
PROFILE = EMBEDDED_PROFILE
VERSION = OpenCL 1.2 VC4CL 0.4.9999
NAME = OpenCL for the Raspberry Pi VideoCore IV GPU
VENDOR = doe300
EXTENSIONS = cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_vc4cl_performance_counters
=== 2 OpenCL device(s) found on platform:
-- 0 --
DEVICE_NAME = VideoCore IV GPU
DEVICE_VENDOR = Broadcom
DEVICE_VERSION = OpenCL 1.2 VC4CL 0.4.9999
DRIVER_VERSION = 0.4.9999
DEVICE_MAX_COMPUTE_UNITS = 1
DEVICE_MAX_CLOCK_FREQUENCY = 300
DEVICE_GLOBAL_MEM_SIZE = 67108864
DEVICE_MAX_WG_SIZE X=12,Y=12,Z=12
Creating context...
Creating command queue...
Creating program...
Building program from source...
attempting to create input buffer
attempting to create output buffer
attempting to create kernel
setting up kernel args cl_mem: 0x7f5400bdc8
attempting to enqueue write buffer
attempting to enqueue kernel
Enqueue'd kerenel
time(ns):592813
Result:
[0] Host: 168.037552 Device: 168.037552
[1] Host: 78.876587 Device: 78.876587
[2] Host: 156.619843 Device: 156.619843
computed 12 elements
Non deterministic behavior?
I ran:
$ CHECK=1 VECTOR=12 sudo -E ./build/vectors vecmul.cl
vector: 12
check results: true
using platform.device: 0.0
operation: vecmul
max wg size: 12
[0] OpenCL (2.00000) Host (2.00000)
[1] OpenCL (8.00000) Host (8.00000)
[2] OpenCL (18.00000) Host (18.00000)
[3] OpenCL (32.00000) Host (32.00000)
[8] OpenCL (162.00000) Host (162.00000)
[9] OpenCL (200.00000) Host (200.00000)
[10] OpenCL (242.00000) Host (242.00000)
[11] OpenCL (288.00000) Host (288.00000)
Everything seems to work fine!
$ CHECK=1 VECTOR=1024 sudo -E ./build/vectors vecmul.cl
...
Everything seems to work fine!
$ CHECK=1 VECTOR=102400 sudo -E ./build/vectors vecmul.cl
...
[FAILURE] [102371] OpenCL (20960051200.00000) Host (20960053248.00000)
[FAILURE] [102372] OpenCL (20960460800.00000) Host (20960462848.00000)
[FAILURE] [102373] OpenCL (20960870400.00000) Host (20960872448.00000)
[FAILURE] [102374] OpenCL (20961280000.00000) Host (20961282048.00000)
[FAILURE] [102375] OpenCL (20961689600.00000) Host (20961691648.00000)
[FAILURE] [102376] OpenCL (20962099200.00000) Host (20962101248.00000)
[102396] OpenCL (20970291200.00000) Host (20970291200.00000)
[102397] OpenCL (20970700800.00000) Host (20970700800.00000)
[102398] OpenCL (20971110400.00000) Host (20971110400.00000)
[102399] OpenCL (20971520000.00000) Host (20971520000.00000)
And then, after this execution with all these failures, even the 1024 case fails now:
$ CHECK=1 VECTOR=1024 sudo -E ./build/vectors vecmul.cl
vector: 1024
check results: true
using platform.device: 0.0
operation: vecmul
max wg size: 12
[0] OpenCL (2.00000) Host (2.00000)
[1] OpenCL (8.00000) Host (8.00000)
[2] OpenCL (18.00000) Host (18.00000)
[3] OpenCL (32.00000) Host (32.00000)
[FAILURE] [16] OpenCL (3447362.00000) Host (578.00000)
[FAILURE] [17] OpenCL (3650184.00000) Host (648.00000)
[FAILURE] [18] OpenCL (3853010.00000) Host (722.00000)
[FAILURE] [19] OpenCL (4055840.00000) Host (800.00000)
[FAILURE] [20] OpenCL (4258674.00000) Host (882.00000)
[FAILURE] [21] OpenCL (4461512.00000) Host (968.00000)
[FAILURE] [22] OpenCL (4664354.00000) Host (1058.00000)
[FAILURE] [23] OpenCL (4867200.00000) Host (1152.00000)
[FAILURE] [24] OpenCL (5070050.00000) Host (1250.00000)
...
[FAILURE] [1020] OpenCL (209094672.00000) Host (2084882.00000)
[1021] OpenCL (209301504.00000) Host (2088968.00000)
[FAILURE] [1021] OpenCL (209301504.00000) Host (2088968.00000)
[1022] OpenCL (209508352.00000) Host (2093058.00000)
[FAILURE] [1022] OpenCL (209508352.00000) Host (2093058.00000)
[1023] OpenCL (209715200.00000) Host (2097152.00000)
[FAILURE] [1023] OpenCL (209715200.00000) Host (2097152.00000)
# with 128, to show that it does not always start failing at index 16:
$ CHECK=1 VECTOR=128 sudo -E ./build/vectors vecadd.cl
vector: 128
check results: true
using platform.device: 0.0
operation: vecadd
max wg size: 12
[0] OpenCL (3.00000) Host (3.00000)
[1] OpenCL (6.00000) Host (6.00000)
[2] OpenCL (9.00000) Host (9.00000)
[3] OpenCL (12.00000) Host (12.00000)
[FAILURE] [28] OpenCL (58.00000) Host (87.00000)
[FAILURE] [29] OpenCL (60.00000) Host (90.00000)
[FAILURE] [30] OpenCL (62.00000) Host (93.00000)
[FAILURE] [31] OpenCL (64.00000) Host (96.00000)
[124] OpenCL (375.00000) Host (375.00000)
[125] OpenCL (378.00000) Host (378.00000)
[126] OpenCL (381.00000) Host (381.00000)
[127] OpenCL (384.00000) Host (384.00000)
But small tests like 12 or 24 work:
$ CHECK=1 VECTOR=24 sudo -E ./build/vectors vecmul.cl
vector: 24
check results: true
using platform.device: 0.0
operation: vecmul
max wg size: 12
[0] OpenCL (2.00000) Host (2.00000)
[1] OpenCL (8.00000) Host (8.00000)
[2] OpenCL (18.00000) Host (18.00000)
[3] OpenCL (32.00000) Host (32.00000)
[20] OpenCL (882.00000) Host (882.00000)
[21] OpenCL (968.00000) Host (968.00000)
[22] OpenCL (1058.00000) Host (1058.00000)
[23] OpenCL (1152.00000) Host (1152.00000)
Everything seems to work fine!
It is interesting (but chaotic):
Maybe you know what is going on here.
Local memory
I understood that this is a bug and that you will work on it. If you need any more tests from my side, please let me know. Also, I don't know whether the kernel/device/build process could have an effect. Thank you. |
Thanks for investing time to better analyse this error.
The difference here is that I don't know what is going wrong with the indeterministic results, I have to look further into that. |
So I can partially reproduce the behavior you are seeing: |
What can I do to help detect the issue? Maybe run low-level tests of memory? Of float ops? |
So for the float multiplication, I did some testing and it looks like the No, I don't think the build process has anything to do with that. About the memory issue: Did you run the code with the latest VC4C/VC4CL version? Or which version are you using? |
So it very much looks like the CPU uses the "normal" IEEE 754 round-to-nearest-even rounding mode, which is also the default rounding mode for the OpenCL 1.2 full profile. So to test for correctness, a comparison with 1 ULP of allowed error has to be used instead of the equality comparison. |
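A minimal sketch of such a tolerance check on the host side, for illustration only. The helper names `float_to_ordered`, `ulp_distance` and `within_ulps` are hypothetical and not part of VC4CL or the test repository; the sketch assumes 32-bit IEEE 754 floats and no NaN inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float to an integer that grows monotonically with the float's value,
 * so that adjacent representable floats map to adjacent integers. */
int64_t float_to_ordered(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits); /* bit copy, avoids aliasing problems */
    /* Negative floats are stored as sign + magnitude; mirror them below zero. */
    return bits < 0 ? (int64_t)INT32_MIN - bits : (int64_t)bits;
}

/* Number of representable float steps between a and b (0 means same value). */
int64_t ulp_distance(float a, float b)
{
    assert(a == a && b == b); /* NaN compares unequal to itself; not handled here */
    int64_t d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}

/* Non-zero if actual is at most max_ulps float steps away from expected. */
int within_ulps(float expected, float actual, int64_t max_ulps)
{
    return ulp_distance(expected, actual) <= max_ulps;
}
```

With a check like this, `within_ulps(268517152.0f, 268517120.0f, 1)` accepts the host/device pair reported for index 11586, while a pair such as 578.0 versus 3447362.0 is still rejected as a real failure.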
Thank you for your time. Here are some further tests:
They differ partially. Important note: I started writing this comment 6h ago, but I wanted to be completely sure that everything I wrote was correct, so I re-ran the experiments... my bad. Now they differ, just like yesterday with the non-deterministic behavior. In the end, I have decided to skip the inconsistent results from this morning and explore further. The main thing was that this morning I saw failures at index 11584, but I cannot reproduce that anymore. Below are the two sections with only the relevant parts (bus error, ULP):
6h ago experiments
Rpi3B:
$ CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 + 2 )) sudo -E ./build/vectors vecmul.cl
vector: 11586
...
[11584] OpenCL (268424448.00000) Host (268424448.00000)
[11585] OpenCL (268470784.00000) Host (268470784.00000)
Everything seems to work fine!
$ CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 + 3 )) sudo -E ./build/vectors vecmul.cl
...
[11585] OpenCL (268470784.00000) Host (268470784.00000)
[11586] OpenCL (268517120.00000) Host (268517152.00000)
[FAILURE] [11586] OpenCL (268517120.00000) Host (268517152.00000)
$ CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 + 4 )) sudo -E ./build/vectors vecmul.cl
vector: 11588
...
[11586] OpenCL (268517120.00000) Host (268517152.00000)
[FAILURE] [11586] OpenCL (268517120.00000) Host (268517152.00000)
[11587] OpenCL (268563488.00000) Host (268563488.00000)
Rpi3B+:
$ CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 )) sudo -E ./build/vectors vecmul.cl
vector: 11584
...
[11582] OpenCL (268331776.00000) Host (268331776.00000)
[11583] OpenCL (268378112.00000) Host (268378112.00000)
Everything seems to work fine!
$ CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 + 1)) sudo -E ./build/vectors vecmul.cl
vector: 11585
check results: true
using platform.device: 0.0
operation: vecmul
max wg size: 12
[1] 1802 bus error CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 + 1)) sudo -E vecmul.c
$ CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 + 2 )) sudo -E ./build/vectors vecmul.cl
vector: 11586
...
[11584] OpenCL (268424448.00000) Host (268424448.00000)
[11585] OpenCL (268470784.00000) Host (268470784.00000)
Everything seems to work fine!
$ CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 64 + 4 )) sudo -E ./build/vectors vecmul.cl
vector: 11588
...
[11585] OpenCL (268470784.00000) Host (268470784.00000)
[11586] OpenCL (268517120.00000) Host (268517152.00000)
[FAILURE] [11586] OpenCL (268517120.00000) Host (268517152.00000)
[11587] OpenCL (268563488.00000) Host (268563488.00000)
It starts failing at index 11586:
Note that sometimes I get a "bus error", maybe because of some problem when I use an odd number of elements (alignment?), but it only affects the Rpi3B+, not the Rpi3B (which finishes). Also, we are quite far from the memory limits because
All the failures may be related to floating-point precision, but I would like to know how I can write the C test for this kernel. I was doing some tests using nextafterf/nexttowardf, but I didn't fully understand it. Also, I found this page in your Wiki: https://github.com/doe300/VC4CL/wiki/NumericalCompliance You refer to
So, if I have the first two operations performed on the host (assigning values to the array) and the last one performed on the device (get, multiply, assign), what can I expect regarding the 1 ULP?
#include <math.h>
#include <stdio.h>
int main(void){
    float f;
    float floats[] = {
        11587.0f,
        23174.0f,
        268540312.0f,
    };
    int nfloats = sizeof(floats)/sizeof(float);
    // print the next representable floats around each value
    for (int nfloat=0; nfloat<nfloats; ++nfloat){
        float f0 = floats[nfloat];
        int times = 3;
        float fto = f0 * 1.10;
        float fton = f0 - f0 * 0.10;
        float fprev = nextafterf(f0, fton); // get the previous
        float ffrom = fprev;
        printf("Finding above %10.10f (starting with %10.10f)\n", f0, ffrom);
        for (int i=0; i<times; ++i){
            f = nextafterf(ffrom, fto);
            printf("[%d] %12s f0 %10.10f f1 %10.10f => %f\n", i, "nextafterf", ffrom, fto, f);
            ffrom = f;
            fto = f * 1.10;
        }
    }
    // ints to float: report the first integer that does not survive the round trip
    for (int i=0; i<200000; ++i){
        float ffrom = (float)i;
        if ((float)(int)ffrom != ffrom){
            printf("not representable %10.10f (%d)\n", ffrom, i);
            break;
        }
    }
    float a = 11587.0;
    float b = (11586+1) * 2.0f;
    printf("a %10.10f\n", a);
    printf("b %10.10f\n", b);
    float c = a * b; // product rounded to float by the host
    printf("c = a * b = %10.10f\n", c);
    // same product in double precision (the non-standard 2.0d suffix is replaced by 2.0)
    printf("double c = %10.10f\n", ((double)(11586 + 1)) * (((double)(11586 + 1) * 2.0)));
    return 0;
}
In the Rpi host:
With this, I am checking that on the host the values 11587 and 23174 are exactly representable, right? Then the GPU fetches both values, multiplies them and stores the result, getting a final 268517120.0 compared with 268517152.0 on the host.
These may be "basic" questions, but I would really appreciate knowing how to write the host code to check this and more complex kernels.
Last hour experiments
On both the Rpi3B+ and the Rpi3B, I ran the following:
$ for i in `seq 1 1 10`; do for offset in 2 4 8 16 32; do CHECK=1 VECTOR=$(( 10240 * 1 + 512 * 2 + 128 * 2 + 62 + $offset )) sudo -E ./build/vectors vecmul.cl | grep -E '(vector:|FAILURE)' | head; done; done
vector: 11584
vector: 11586
vector: 11590
[FAILURE] [11586] OpenCL (268517120.00000) Host (268517152.00000)
[FAILURE] [11588] OpenCL (268609824.00000) Host (268609856.00000)
vector: 11598
[FAILURE] [11586] OpenCL (268517120.00000) Host (268517152.00000)
[FAILURE] [11588] OpenCL (268609824.00000) Host (268609856.00000)
[FAILURE] [11594] OpenCL (268888032.00000) Host (268888064.00000)
[FAILURE] [11596] OpenCL (268980800.00000) Host (268980832.00000)
vector: 11614
[FAILURE] [11586] OpenCL (268517120.00000) Host (268517152.00000)
[FAILURE] [11588] OpenCL (268609824.00000) Host (268609856.00000)
[FAILURE] [11594] OpenCL (268888032.00000) Host (268888064.00000)
[FAILURE] [11596] OpenCL (268980800.00000) Host (268980832.00000)
[FAILURE] [11602] OpenCL (269259200.00000) Host (269259232.00000)
[FAILURE] [11604] OpenCL (269352032.00000) Host (269352064.00000)
[FAILURE] [11610] OpenCL (269630624.00000) Host (269630656.00000)
[FAILURE] [11612] OpenCL (269723520.00000) Host (269723552.00000)
... # all the same
The only difference here is that the Rpi3B+ needs 3m52s to finish, while the Rpi3B needs 1m12s. I rebooted both and ran it again. Same results (+-4s).
$ grep '^[^#]' /boot/config.txt
dtparam=spi=on
dtoverlay=w1-gpio
dtparam=audio=on
With clinfo, the Platform/Device versions are:
4c4
< Platform Version OpenCL 1.2 VC4CL 0.4.9999
---
> Platform Version OpenCL 1.2 VC4CL 0.4.138
14,15c14,15
< Device Version OpenCL 1.2 VC4CL 0.4.9999
< Driver Version 0.4.9999
---
> Device Version OpenCL 1.2 VC4CL 0.4.138
> Driver Version 0.4.138
22a23
> Available core IDs 0, 64
24c25
< Core Temperature (Altera) 42 C
---
> Core Temperature (Altera) 54 C
53c54
< Global memory size 79691776 (76MiB)
---
> Global memory size 67108864 (64MiB)
55c56
< Max memory allocation 79691776 (76MiB)
---
> Max memory allocation 67108864 (64MiB)
64c65
< Local memory size 79691776 (76MiB)
---
> Local memory size 67108864 (64MiB)
66c67
< Max constant buffer size 79691776 (76MiB)
---
> Max constant buffer size 67108864 (64MiB)
87a89
>
My next test will be to install a full Raspbian on the Rpi3B+ to rule out issues with Manjaro ARM or its config files/drivers. I have a question for you, since I was having a look at some parts of your code. How did you do this massive work? I couldn't find any funding/project info. Do you work for free on VC4CL? Or did you extract this as an open-source project that is backed/funded by real industry projects from which you get your salary? To me, it looks like a full-time job that you have carried out for 2 years... |
You didn't miss something, I did. The floating point multiplication has to be correctly rounded (0 ULP), similar to the addition and subtraction, according to the OpenCL 1.2 specification. I will update the wiki page with the correct allowed error and the newly tested behavior (e.g. rounding mode).
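The mismatch at index 11586 can be reproduced on the host alone. The exact product 11587 * 23174 = 268517138 lies between the neighboring floats 268517120 and 268517152, so the rounding mode decides which one is stored. The sketch below is only an illustration: the function names are hypothetical, and reading the device result as round-toward-zero is an assumption based on the numbers in this thread, not a statement about the VC4CL implementation.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Product rounded to the nearest float (host default rounding mode).
 * A double holds the product of two 24-bit significands exactly, so the
 * only rounding happens in the final conversion to float. */
float mul_round_nearest(float a, float b)
{
    return (float)((double)a * (double)b);
}

/* Product rounded toward zero (truncated). Assumption: this models the
 * device behavior seen in the logs above. */
float mul_round_toward_zero(float a, float b)
{
    double exact = (double)a * (double)b;
    float rounded = (float)exact;
    assert(rounded >= -FLT_MAX && rounded <= FLT_MAX); /* finite results only */

    double mag_exact = exact < 0.0 ? -exact : exact;
    double mag_rounded = rounded < 0.0f ? -(double)rounded : (double)rounded;
    if (mag_rounded > mag_exact) {
        /* Rounding moved away from zero: step one float back toward zero by
         * decrementing the magnitude bits (valid for IEEE 754 binary32). */
        uint32_t bits;
        memcpy(&bits, &rounded, sizeof bits);
        bits -= 1u;
        memcpy(&rounded, &bits, sizeof rounded);
    }
    return rounded;
}
```

For the inputs 11587.0f and 23174.0f, the first function returns the host value from the logs (268517152) and the second returns the device value (268517120). The two results differ by exactly one float step.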
The GPU sees (as values read from memory) the values you give into the buffer, they are only memory copied, so no conversion there.
It's the error of the operation (or of the result of the operation).
About the huge run-time difference: Maybe the debug build is so much slower? But the difference is really huge, so I am not sure this is the sole reason. Yes, the total error of e.g. an algorithm is calculated by adding up the allowed/actual errors of the operations performed.
It started out as the programming part of my Master's thesis pretty much 3 years ago. 2 years ago I first published it on GitHub when I finalized my thesis, and since then I have basically continued it as a hobby. I sadly do not get paid for it and have to work an actual job, which is also why progress is sometimes slower than I would hope ;) |
Using the most recent commits from VC4CLStdLib, VC4C, and VC4CL (while commenting out the offending lines from #66 ), I'm running into a failure when running the code at https://github.com/kernhanda/opencl_dot_product.
To repro:
Output:
clinfo works as expected: