Add Numba GPU backend #38

Open · wants to merge 10 commits into main
Conversation

@mlazzarin (Contributor) commented Nov 7, 2021

In this PR I implemented a GPU backend that uses Numba. The goal is to assess its performance and see if it may be a good alternative to Cupy.

This new backend is only a partial implementation, as the density matrix calls still need to be fixed. Moreover, some operations are not supported (e.g. arithmetic between a device array and an int or float in host code). This breaks some functionality of Qibo, but we can still run benchmarks.

As always, fine-tuning may change these results, but Cupy seems faster, with a substantial advantage in circuits with a small number of qubits. Numba, instead, has a smaller compilation overhead. Note that for this benchmark I deactivated the compilation during the import (for Cupy). EDIT: for both Cupy and Numba I used 1024 threads per block.

qft
nqubits | simulation time (cupy) | simulation time (numba_gpu) | dry run overhead (cupy) | dry run overhead (numba_gpu)
3 0.00038 0.00085 0.9151764750480652 0.39372308254241944
4 0.00061 0.00132 0.9182806253433228 0.22470632791519166
5 0.00088 0.00180 0.9194815993309021 0.22513166666030884
6 0.00119 0.00245 0.9178803920745849 0.22589269876480103
7 0.00155 0.00314 0.9203890085220336 0.22932177782058716
8 0.00209 0.00397 0.9176576495170593 0.22936758995056153
9 0.00254 0.00474 0.9229723453521729 0.23387645483016967
10 0.00299 0.00573 0.9197258353233337 0.23524923324584962
11 0.00380 0.00688 0.9223066568374634 0.2391204357147217
12 0.00455 0.00801 0.9196613073348999 0.24190555810928344
13 0.00508 0.00963 0.922712767124176 0.24445995092391967
14 0.00599 0.01076 0.923897922039032 0.24830663204193115
15 0.00684 0.01181 0.9168319225311279 0.25066194534301756
16 0.00779 0.01375 0.9255693435668946 0.25664674043655394
17 0.00865 0.01540 0.9193841457366944 0.25849658250808716
18 0.01021 0.01706 0.9206625461578369 0.2662771582603455
19 0.01165 0.01940 0.9283653974533081 0.270539379119873
20 0.01474 0.02163 0.9286629796028137 0.2747878789901733
21 0.02116 0.02864 0.9329722881317138 0.2743494987487793
22 0.03240 0.04036 0.9313127994537354 0.2810977458953857
23 0.05469 0.06387 0.9304193496704102 0.287307071685791
24 0.10079 0.11123 0.9303715705871582 0.29200992584228513
25 0.19675 0.20855 0.9393755912780761 0.29770574569702146
26 0.40241 0.41607 0.9422370910644531 0.30370230674743653
27 0.83491 0.85256 0.9445147037506103 0.3073667049407959
28 1.75634 1.77792 0.937534475326538 0.311820888519287
29 3.69874 3.73336 0.9295771598815916 0.3057109355926513
30 7.82428 7.89393 0.9280393600463865 0.2857973575592041
variational
nqubits | simulation time (cupy) | simulation time (numba_gpu) | dry run overhead (cupy) | dry run overhead (numba_gpu)
3 0.00033 0.00092 0.9150363087654114 0.22908270359039307
4 0.00046 0.00129 0.9197784900665283 0.1870747208595276
5 0.00052 0.00144 0.9180811166763305 0.1875584363937378
6 0.00067 0.00181 0.9185503363609314 0.18921416997909546
7 0.00071 0.00201 0.9182541370391846 0.1906563401222229
8 0.00084 0.00237 0.9137699723243713 0.19081240892410278
9 0.00093 0.00255 0.9161326646804809 0.19328233003616332
10 0.00109 0.00287 0.9196553349494934 0.19483915567398072
11 0.00125 0.00309 0.917859923839569 0.19484854936599733
12 0.00138 0.00346 0.9205231308937073 0.19646294116973878
13 0.00146 0.00369 0.9118951082229614 0.1988992929458618
14 0.00163 0.00400 0.9169878005981446 0.19912371635437012
15 0.00166 0.00410 0.918331241607666 0.19701597690582276
16 0.00183 0.00463 0.9174162983894348 0.20104413032531737
17 0.00196 0.00486 0.9211397647857666 0.20119340419769288
18 0.00222 0.00538 0.9172272682189941 0.20273176431655884
19 0.00300 0.00609 0.9149930238723755 0.20416228771209716
20 0.00427 0.00745 0.9210782289505005 0.20260899066925048
21 0.00686 0.01060 0.9215971469879151 0.20592575073242186
22 0.01225 0.01668 0.9182041645050049 0.20579996109008789
23 0.02296 0.02772 0.9179378986358643 0.20579948425292968
24 0.04586 0.05155 0.9217719078063965 0.2065950393676758
25 0.09224 0.09876 0.9224301338195801 0.2060943603515625
26 0.19064 0.19917 0.9178159713745118 0.20902915000915528
27 0.38908 0.40134 0.9189221858978271 0.20647945404052737
28 0.81006 0.82714 0.9233675956726074 0.20700583457946775
29 1.65995 1.69524 0.9113995552062988 0.19528136253356942
30 3.45581 3.52614 0.9119315624237059 0.17198967933654785
supremacy
nqubits | simulation time (cupy) | simulation time (numba_gpu) | dry run overhead (cupy) | dry run overhead (numba_gpu)
3 0.00040 0.00108 0.9176580905914307 0.27566167116165163
4 0.00052 0.00137 0.9113833189010621 0.28220831155776976
5 0.00064 0.00166 0.9142333269119263 0.279436194896698
6 0.00074 0.00204 0.9112993955612183 0.282186484336853
7 0.00087 0.00225 0.9120476841926575 0.28242557048797606
8 0.00100 0.00249 0.9143061161041259 0.2817721128463745
9 0.00110 0.00279 0.9137215971946716 0.28554747104644773
10 0.00133 0.00317 0.9120726585388184 0.283611798286438
11 0.00151 0.00347 0.9124491095542908 0.286725652217865
12 0.00166 0.00375 0.9178303599357605 0.2862760066986084
13 0.00176 0.00399 0.9218409419059753 0.28840529918670654
14 0.00193 0.00426 0.9154765367507934 0.2936093330383301
15 0.00206 0.00483 0.9136180758476258 0.2893231511116028
16 0.00216 0.00495 0.9176264524459838 0.29177074432373046
17 0.00228 0.00544 0.914232587814331 0.29025317430496217
18 0.00280 0.00579 0.9172996520996094 0.29468594789505004
19 0.00373 0.00673 0.9167650580406189 0.294001567363739
20 0.00536 0.00848 0.9170765042304992 0.29374865293502805
21 0.00857 0.01238 0.9198744297027588 0.296476411819458
22 0.01543 0.02003 0.9161161422729492 0.29425773620605467
23 0.02923 0.03409 0.9173520088195801 0.2934697151184082
24 0.05768 0.06308 0.9222365856170655 0.2961618423461914
25 0.11734 0.12410 0.9197750568389893 0.29944143295288084
26 0.24299 0.25171 0.9249423980712891 0.2987250328063965
27 0.49692 0.50981 0.9189109802246094 0.29301977157592773
28 1.02467 1.04238 0.9177283287048339 0.29071445465087886
29 2.11942 2.15624 0.9144937992095947 0.28010339736938494
30 4.41276 4.48546 0.9086195945739748 0.2575901508331295
bv
nqubits | simulation time (cupy) | simulation time (numba_gpu) | dry run overhead (cupy) | dry run overhead (numba_gpu)
3 0.00032 0.00093 0.9139754295349121 0.3507705569267273
4 0.00041 0.00121 0.9115342020988464 0.22122472524642944
5 0.00051 0.00146 0.914733064174652 0.22314796447753907
6 0.00062 0.00172 0.9129144430160523 0.22320787906646727
7 0.00071 0.00199 0.9078160285949707 0.22330312728881835
8 0.00081 0.00225 0.9152618408203125 0.2229264497756958
9 0.00090 0.00254 0.9098666787147522 0.225579571723938
10 0.00105 0.00281 0.911300265789032 0.22629990577697753
11 0.00122 0.00305 0.9165602326393127 0.22596946954727173
12 0.00138 0.00337 0.9166404366493225 0.22831435203552247
13 0.00145 0.00354 0.9150871872901917 0.22561317682266235
14 0.00154 0.00395 0.9260912299156189 0.22945905923843385
15 0.00167 0.00418 0.919365406036377 0.2293491005897522
16 0.00180 0.00439 0.9197435140609741 0.2288865327835083
17 0.00191 0.00485 0.9181190609931946 0.23270596265792848
18 0.00219 0.00537 0.9156004905700683 0.2308575391769409
19 0.00307 0.00608 0.9218876481056213 0.23130202293395996
20 0.00450 0.00760 0.9182038068771362 0.23100180625915528
21 0.00733 0.01105 0.9204535007476806 0.23305702209472656
22 0.01308 0.01767 0.9231822490692139 0.23313617706298828
23 0.02499 0.02984 0.9191287517547607 0.23281116485595704
24 0.04978 0.05538 0.9155448913574219 0.23374557495117188
25 0.10107 0.10794 0.9192240715026856 0.23277368545532226
26 0.20784 0.21690 0.9138583660125732 0.23343496322631835
27 0.42845 0.44144 0.9158064842224121 0.22987332344055178
28 0.88517 0.90415 0.9168127059936524 0.23018059730529783
29 1.82972 1.86796 0.9208312034606934 0.2157127857208252
30 3.78194 3.86051 0.9319783210754395 0.19238963127136222
qv
nqubits | simulation time (cupy) | simulation time (numba_gpu) | dry run overhead (cupy) | dry run overhead (numba_gpu)
3 0.00043 0.00121 0.918164074420929 0.23456774950027465
4 0.00077 0.00213 0.9135611414909363 0.19359103441238404
5 0.00080 0.00217 0.9190158128738404 0.1949317216873169
6 0.00116 0.00314 0.918745243549347 0.1980982542037964
7 0.00122 0.00320 0.9178489446640015 0.201424241065979
8 0.00154 0.00417 0.917098069190979 0.20440617799758912
9 0.00163 0.00410 0.9172372579574585 0.20315908193588256
10 0.00202 0.00507 0.9184269189834595 0.20798860788345336
11 0.00217 0.00518 0.9266609072685241 0.2086155652999878
12 0.00252 0.00611 0.9215991616249084 0.21337292194366456
13 0.00261 0.00634 0.9198028922080994 0.21415339708328246
14 0.00305 0.00716 0.9234309792518616 0.21626044511795045
15 0.00313 0.00728 0.9193844795227051 0.219775390625
16 0.00346 0.00828 0.9214813590049744 0.22230437994003296
17 0.00350 0.00819 0.9205222845077514 0.22081170082092286
18 0.00436 0.00969 0.9216543436050415 0.22643284797668456
19 0.00564 0.01031 0.9230634093284606 0.22503520250320436
20 0.00882 0.01366 0.922971248626709 0.22885727882385254
21 0.01343 0.01895 0.9241061687469483 0.22834982872009277
22 0.02547 0.03212 0.9267126083374023 0.32801294326782227
23 0.04640 0.05347 0.9259225368499756 0.32662138938903806
24 0.09506 0.10307 0.9291239261627198 0.3327006816864014
25 0.18773 0.19687 0.9264794826507569 0.33083815574645997
26 0.39634 0.40834 0.9251072406768799 0.33890252113342284
27 0.79671 0.81265 0.9206322193145752 0.332155704498291
28 1.70813 1.72997 0.9240665435791016 0.3364196300506592
29 3.37052 3.41272 0.9188987255096435 0.323496437072754
30 7.29080 7.37778 0.9191365242004395 0.29671454429626465

@scarrazza (Member)
@mlazzarin thanks for this. If I understand correctly, Numba is 2x slower than Cupy in the low-qubit regime, even if the dry run is smaller, correct? Another possible issue is ROCm; I don't believe Numba supports it.

@mlazzarin (Contributor, Author)

If I understand correctly, Numba is 2x slower than Cupy in the low-qubit regime, even if the dry run is smaller, correct?

Yes, though I'm not sure whether my implementation is optimal. The problem is that some operations, like moving arrays from host to device while casting to a different type, are a bit impractical, so maybe there is some overhead in my implementation caused by memory management rather than the actual kernels. Also, some features available in Cupy are not supported, e.g. operations like device_array + float, device_array * float, device_array ** int, flatten operations, etc., so a fully working implementation would require additional work.

Also, I'm not sure how to port the multi-qubit kernels, as array creation and array methods are unsupported in device code: https://numba.readthedocs.io/en/stable/cuda/cudapysupported.html#numpy-support

Regarding the dry run: with Cupy we compile all kernels, but only a few are actually used. My guess is that Numba compiles only those that are actually used, so the dry run overhead is smaller. We may still fix this with Cupy by using many RawKernels instead of one RawModule.
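The RawKernel-per-kernel idea can be sketched abstractly: compile each kernel lazily on first use and cache it, so a dry run only pays for the kernels the circuit actually launches. A minimal, library-agnostic sketch (the compile step is a stand-in for e.g. `cupy.RawKernel(source, name)`; all names here are hypothetical):

```python
class LazyKernelCache:
    """Compile each kernel on first use instead of compiling the whole
    module up front, so a dry run only pays for the kernels it launches."""

    def __init__(self, sources):
        self.sources = sources       # kernel name -> source string
        self.compiled = {}           # kernel name -> compiled object
        self.compile_calls = 0       # bookkeeping, for illustration only

    def _compile(self, name):
        # Stand-in for a real compilation, e.g. cupy.RawKernel(src, name).
        self.compile_calls += 1
        return ("compiled", name, self.sources[name])

    def get(self, name):
        if name not in self.compiled:
            self.compiled[name] = self._compile(name)
        return self.compiled[name]

sources = {"apply_gate": "...", "apply_x": "...", "apply_swap": "..."}
cache = LazyKernelCache(sources)
k1 = cache.get("apply_gate")   # compiled on first use
k2 = cache.get("apply_gate")   # cached: no recompilation
```

With a single RawModule the whole source is compiled in one shot during the dry run; with one entry per kernel, only `apply_gate` above would be compiled.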

Another possible issues if rocm, I dont' believe numba supports it.

It seems that support was dropped: https://numba.readthedocs.io/en/stable/reference/deprecation.html?highlight=rocm#dropping-support-for-the-rocm-target

@scarrazza (Member)

Thanks. Indeed, my next question is which container we use for the arrays on GPU.
I assume these numbers allocate all objects with NumPy and then Numba takes care of the copy process.

@mlazzarin (Contributor, Author)

Thanks. Indeed, my next question is which container we use for the arrays on GPU.

I used the numba.cuda.cudadrv.devicearray.DeviceNDArray class (https://numba.readthedocs.io/en/stable/cuda/memory.html#device-arrays).

I assume these numbers allocate all objects with NumPy and then Numba takes care of the copy process.

The state vector is directly allocated in device memory, and the same should be true for gates.
By the way, I implemented a manual casting operation from host to device and vice versa, in analogy with our Cupy implementation.
If the manual casting is not performed, Numba will take care of copying host arrays, but it will move data back and forth on each kernel launch.

Actually, I noticed that I forgot to manually cast some arrays from host to device (there were some Numba performance warnings). I fixed it in the last commit, following the Cupy backend, but the results are a bit worse than before (maybe due to a sub-optimal implementation of the cast() method).

codecov bot commented Nov 8, 2021

Codecov Report

Merging #38 (ff705b1) into main (f0778c3) will decrease coverage by 27.15%.
The diff coverage is 6.29%.


@@             Coverage Diff              @@
##              main      #38       +/-   ##
============================================
- Coverage   100.00%   72.84%   -27.16%     
============================================
  Files            9       11        +2     
  Lines          983     1370      +387     
============================================
+ Hits           983      998       +15     
- Misses           0      372      +372     
Flag | Coverage | Δ
unittests | 72.84% <6.29%> | (-27.16%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files | Coverage | Δ
src/qibojit/custom_operators/cu_gates.py | 0.00% <0.00%> | (ø)
src/qibojit/custom_operators/cu_ops.py | 0.00% <0.00%> | (ø)
src/qibojit/tests/conftest.py | 100.00% <ø> | (ø)
src/qibojit/tests/test_ops.py | 100.00% <ø> | (ø)
src/qibojit/custom_operators/backends.py | 49.47% <12.72%> | (-50.53%) ⬇️
src/qibojit/custom_operators/__init__.py | 97.76% <73.33%> | (-2.24%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f0778c3...ff705b1.

Labels: do not merge, experimental (Experimental implementation)