Add Numba GPU backend #38
base: main
Conversation
@mlazzarin thanks for this. If I understand correctly, Numba is 2x slower than CuPy in the low-qubit regime, even if the dry run is smaller, correct? Another possible issue is ROCm; I don't believe Numba supports it.
Yes, even if I'm not sure whether my implementation is optimal. The problem is that some operations, like moving arrays from host to device while casting to a different type, are a bit impractical, so maybe there is some overhead in my implementation caused by memory management more than by the actual kernels. Also, some features available in CuPy are not supported, e.g. operations like […]. I'm also not sure how to port the multi-qubit kernels, as array creation and array methods are unsupported inside kernels: https://numba.readthedocs.io/en/stable/cuda/cudapysupported.html#numpy-support

Regarding the dry run, with CuPy we compile all kernels but only a few are actually used. My guess is that Numba compiles only those that are actually used, so the dry-run overhead is smaller. We may still fix this with CuPy by using many […].
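For context, here is a minimal sketch of the cast-and-transfer issue described above; the array name, size, and dtypes are illustrative assumptions, not taken from the PR:

```python
import numpy as np
from numba import cuda

state = np.random.rand(2 ** 20)  # hypothetical float64 host array

# With Numba, the cast and the transfer are separate steps: astype()
# allocates an extra host-side copy before anything reaches the GPU.
d_state = cuda.to_device(state.astype(np.complex128))

# With CuPy the same operation is a single call, e.g.:
#   import cupy as cp
#   d_state = cp.asarray(state, dtype=cp.complex128)
```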
It seems like ROCm support was dropped: https://numba.readthedocs.io/en/stable/reference/deprecation.html?highlight=rocm#dropping-support-for-the-rocm-target
Thanks. Indeed, my next question is which container we use for the arrays on GPU.
I used the […].
The state vector is directly allocated in device memory. The same should be true for gates. Actually, I noticed that I had forgotten to manually cast some arrays from host to device (there were some Numba performance warnings). I fixed it in the last commit, following the CuPy backend, but the results are a bit worse than before (maybe it's due to a sub-optimal implementation of the […]).
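A hedged sketch of what direct device allocation could look like with Numba; the kernel, names, and sizes here are illustrative assumptions, not the PR's actual code:

```python
import numpy as np
from numba import cuda

nqubits = 20  # illustrative
nstates = 2 ** nqubits

# Allocate the state vector directly in device memory, with no host copy.
state = cuda.device_array(nstates, dtype=np.complex128)

@cuda.jit
def initial_state(state):
    # Set the |0...0> amplitude to 1 and everything else to 0.
    i = cuda.grid(1)
    if i < state.size:
        state[i] = 1.0 if i == 0 else 0.0

initial_state.forall(nstates)(state)
```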
Codecov Report
```diff
@@            Coverage Diff             @@
##              main      #38       +/-  ##
============================================
- Coverage   100.00%   72.84%   -27.16%
============================================
  Files            9       11        +2
  Lines          983     1370      +387
============================================
+ Hits           983      998       +15
- Misses           0      372      +372
```
In this PR I implemented a GPU backend that uses Numba. The goal is to assess its performance and see whether it may be a good alternative to CuPy.
This new backend is only a partial implementation, as the density matrix calls still have to be fixed. Moreover, some operations are not supported (e.g. arithmetic operations between a device array and an int or float in host code, as illustrated in the sketch below). This breaks some functionality of Qibo, but we can still run benchmarks.
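As a concrete example of that limitation (assumed behaviour; the names are hypothetical), rescaling a device array by a scalar from host code works out of the box with CuPy but needs an explicit kernel with Numba:

```python
import numpy as np
from numba import cuda

d_state = cuda.to_device(np.ones(8, dtype=np.complex128))

# CuPy device arrays overload arithmetic operators, so this would be
# a one-liner executed on the GPU:
#   d_state = d_state / np.sqrt(2)
# Numba device arrays do not, so the same rescaling needs a kernel:

@cuda.jit
def scale(arr, factor):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] = arr[i] * factor

scale.forall(d_state.size)(d_state, 1 / np.sqrt(2))
```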
As always, fine-tuning may change these results, but CuPy seems faster, with a substantial advantage in circuits with a small number of qubits. Numba, instead, has a smaller compilation overhead. Note that for this benchmark I deactivated compilation during import (for CuPy). EDIT: for both CuPy and Numba I used 1024 threads per block (see the sketch below).
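For reference, a minimal sketch of fixing the Numba launch configuration at 1024 threads per block; the kernel itself is just an illustrative placeholder, not the backend's actual code:

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def apply_global_phase(state, phase):
    # Multiply every amplitude by a scalar phase factor.
    i = cuda.grid(1)
    if i < state.size:
        state[i] = state[i] * phase

state = cuda.to_device(np.ones(2 ** 20, dtype=np.complex128))
threads_per_block = 1024  # value used in the benchmarks above
blocks = math.ceil(state.size / threads_per_block)
apply_global_phase[blocks, threads_per_block](state, 1j)
```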
[Benchmark plots omitted: qft, variational, supremacy, bv, qv]