doc(gpu): add how to page about running on GPU
1 parent
96da25c
commit 33a7e9f
Showing 2 changed files with 217 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,216 @@ | ||
TFHE-rs now includes a GPU backend, featuring a CUDA implementation for performing integer arithmetic on encrypted data. | ||
This tutorial shows how to update an existing program to use GPU acceleration, or how to start a new one on the GPU. | ||
|
||
# Prerequisites | ||
- CUDA version >= 10 | ||
- Compute Capability >= 3.0 | ||
- [gcc](https://gcc.gnu.org/) >= 8.0 - check this [page](https://gist.github.com/ax3l/9489132) for more details about nvcc/gcc compatible versions | ||
- [cmake](https://cmake.org/) >= 3.24 | ||
- Rust version - check this [page](rust_configuration.md) | ||
|
||
# Importing to your project | ||
To use the `TFHE-rs GPU backend` in your project, you first need to add it as a dependency in your `Cargo.toml`. | ||
|
||
If you are using an `x86` machine: | ||
```toml | ||
tfhe = { version = "0.5.0", features = [ "boolean", "shortint", "integer", "x86_64-unix", "gpu" ] } | ||
``` | ||
|
||
If you are using an `ARM` machine: | ||
```toml | ||
tfhe = { version = "0.5.0", features = [ "boolean", "shortint", "integer", "aarch64-unix", "gpu" ] } | ||
``` | ||
|
||
|
||
{% hint style="success" %} | ||
When running code that uses `TFHE-rs`, it is highly recommended to run in release mode with cargo's `--release` flag to get the best possible performance. | ||
{% endhint %} | ||
|
||
## Supported platforms | ||
|
||
The TFHE-rs GPU backend is supported on Linux (x86, aarch64). | ||
|
||
| OS | x86 | aarch64 | | ||
| ------- |---------------|------------------| | ||
| Linux | `x86_64-unix` | `aarch64-unix`\* | | ||
| macOS | Unsupported | Unsupported\* | | ||
| Windows | Unsupported | Unsupported | | ||
|
||
|
||
# A first example | ||
## Configuring and creating keys | ||
Compared with the [CPU example](../getting_started/quick_start), the only difference lies in the key creation, which is detailed [here](#setting-the-keys). | ||
|
||
Here is a full example (combining the client and server parts): | ||
|
||
```rust | ||
use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey}; | ||
use tfhe::prelude::*; | ||
|
||
fn main() { | ||
|
||
let config = ConfigBuilder::default().build(); | ||
|
||
let client_key = ClientKey::generate(config); | ||
let compressed_server_key = CompressedServerKey::new(&client_key); | ||
|
||
let gpu_key = compressed_server_key.decompress_to_gpu(); | ||
|
||
let clear_a = 27u8; | ||
let clear_b = 128u8; | ||
|
||
let a = FheUint8::encrypt(clear_a, &client_key); | ||
let b = FheUint8::encrypt(clear_b, &client_key); | ||
|
||
//Server-side | ||
|
||
set_server_key(gpu_key); | ||
let result = a + b; | ||
|
||
//Client-side | ||
let decrypted_result: u8 = result.decrypt(&client_key); | ||
|
||
let clear_result = clear_a + clear_b; | ||
|
||
assert_eq!(decrypted_result, clear_result); | ||
} | ||
``` | ||
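
The final assertion of the example compares the decrypted value with `clear_a + clear_b`, which works here because 27 + 128 = 155 fits in a `u8`. With larger inputs, the plain `+` panics in a debug build. The sketch below is a clear-data reference only: it assumes that `FheUint8` addition wraps modulo 2^8, and `expected_sum_u8` is a hypothetical helper name, not part of TFHE-rs.

```rust
/// Clear-data reference for an 8-bit homomorphic addition.
/// Assumption: the encrypted `FheUint8` sum wraps modulo 2^8, so the reference wraps too.
/// `expected_sum_u8` is a hypothetical helper, not a TFHE-rs function.
fn expected_sum_u8(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

fn main() {
    // Values from the example above: 27 + 128 = 155 fits in a u8.
    assert_eq!(expected_sum_u8(27, 128), 155);
    // 200 + 100 = 300 does not fit: the wrapping reference gives 300 - 256 = 44.
    assert_eq!(expected_sum_u8(200, 100), 44);
    println!("reference sums: {} and {}", expected_sum_u8(27, 128), expected_sum_u8(200, 100));
}
```

Under that assumption, replacing `clear_a + clear_b` with `clear_a.wrapping_add(clear_b)` keeps the comparison valid for any pair of inputs.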
|
||
|
||
|
||
|
||
## Setting the keys | ||
Key configuration differs from the CPU case. Both the client key and the server key are still generated by the client (which is assumed to run on a CPU), but the server must then decompress the server key to convert it into the format expected by the GPU. | ||
To do so, the server calls `decompress_to_gpu()` on the compressed server key. | ||
From then on, the code is the same on the CPU and on the GPU. | ||
|
||
|
||
## Encrypting data | ||
On the client side, data is encrypted exactly as on the CPU: | ||
```Rust | ||
let clear_a = 27u8; | ||
let clear_b = 128u8; | ||
|
||
let a = FheUint8::encrypt(clear_a, &client_key); | ||
let b = FheUint8::encrypt(clear_b, &client_key); | ||
``` | ||
|
||
## Computation | ||
The server must first set up its key, as on the CPU, with `set_server_key(gpu_key);`. | ||
Homomorphic computations then use the same code as the one described [here](../getting_started/operations). | ||
|
||
```Rust | ||
//Server-side | ||
set_server_key(gpu_key); | ||
let result = a + b; | ||
|
||
//Client-side | ||
let decrypted_result: u8 = result.decrypt(&client_key); | ||
|
||
let clear_result = clear_a + clear_b; | ||
|
||
assert_eq!(decrypted_result, clear_result); | ||
``` | ||
|
||
## Decryption | ||
Finally, the client decrypts the result with: | ||
|
||
```Rust | ||
let decrypted_result: u8 = result.decrypt(&client_key); | ||
``` | ||
## Improving performance | ||
TFHE-rs can leverage the large number of threads available on a GPU. | ||
To do so, build the configuration with the `PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS` parameter set: `let config = ConfigBuilder::with_custom_parameters(PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build();` | ||
The complete example becomes: | ||
|
||
```rust | ||
use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey}; | ||
use tfhe::prelude::*; | ||
use tfhe::shortint::parameters::PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS; | ||
|
||
fn main() { | ||
|
||
let config = ConfigBuilder::with_custom_parameters(PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build(); | ||
|
||
let client_key = ClientKey::generate(config); | ||
let compressed_server_key = CompressedServerKey::new(&client_key); | ||
|
||
let gpu_key = compressed_server_key.decompress_to_gpu(); | ||
|
||
let clear_a = 27u8; | ||
let clear_b = 128u8; | ||
|
||
let a = FheUint8::encrypt(clear_a, &client_key); | ||
let b = FheUint8::encrypt(clear_b, &client_key); | ||
|
||
//Server-side | ||
|
||
set_server_key(gpu_key); | ||
let result = a + b; | ||
|
||
//Client-side | ||
let decrypted_result: u8 = result.decrypt(&client_key); | ||
|
||
let clear_result = clear_a + clear_b; | ||
|
||
assert_eq!(decrypted_result, clear_result); | ||
} | ||
``` | ||
|
||
|
||
# List of available operations | ||
|
||
The GPU backend includes the following operations: | ||
| name | symbol | `Enc`/`Enc` | `Enc`/`Int` | | ||
|-----------------------|----------------|--------------------------|--------------------------| | ||
| Neg | `-` | :heavy_check_mark: | N/A | | ||
| Add | `+` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Sub | `-` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Mul | `*` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Div | `/` | :heavy_multiplication_x: | :heavy_multiplication_x: | | ||
| Rem | `%` | :heavy_multiplication_x: | :heavy_multiplication_x: | | ||
| Not | `!` | :heavy_check_mark: | N/A | | ||
| BitAnd | `&` | :heavy_check_mark: | :heavy_check_mark: | | ||
| BitOr | `\|` | :heavy_check_mark: | :heavy_check_mark: | | ||
| BitXor | `^` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Shr | `>>` | :heavy_multiplication_x: | :heavy_check_mark: | | ||
| Shl | `<<` | :heavy_multiplication_x: | :heavy_check_mark: | | ||
| Rotate right | `rotate_right` | :heavy_multiplication_x: | :heavy_check_mark: | | ||
| Rotate left | `rotate_left` | :heavy_multiplication_x: | :heavy_check_mark: | | ||
| Min | `min` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Max | `max` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Greater than | `gt` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Greater than or equal | `ge` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Less than | `lt` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Less than or equal | `le` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Equal | `eq` | :heavy_check_mark: | :heavy_check_mark: | | ||
| Cast (into dest type) | `cast_into` | :heavy_multiplication_x: | N/A | | ||
| Cast (from src type) | `cast_from` | :heavy_multiplication_x: | N/A | | ||
| Ternary operator | `if_then_else` | :heavy_check_mark: | :heavy_multiplication_x: | | ||
|
||
{% hint style="info" %} | ||
All operations follow the same syntax as the one described [here](../getting_started/operations.md). | ||
{% endhint %} | ||
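
When testing code that uses the operations above, it can help to compute the expected values on clear data first. The sketch below does this with plain `u8` arithmetic for a handful of table entries. It assumes that decrypted 8-bit results follow the same wrapping semantics as Rust's `u8`. `reference_u8` is a hypothetical helper, and encoding the comparison as 0 or 1 is a choice made for this illustration only.

```rust
/// Clear-data reference values for a few operations from the table above.
/// Assumption: decrypted 8-bit results match these wrapping `u8` semantics.
/// `reference_u8` is a hypothetical helper, not part of TFHE-rs.
fn reference_u8(a: u8, b: u8, shift: u32) -> [u8; 5] {
    [
        a.wrapping_mul(b),    // Mul, modulo 2^8
        a & b,                // BitAnd
        a.rotate_left(shift), // Rotate left by a clear amount
        a.min(b),             // Min
        u8::from(a < b),      // "lt", encoded here as 0 or 1
    ]
}

fn main() {
    // Same clear inputs as in the tutorial example.
    let refs = reference_u8(27, 128, 3);
    println!("{refs:?}");
}
```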
|
||
|
||
|
||
### Benchmarks | ||
The table below contains benchmarks for homomorphic operations running on a single V100 GPU from AWS (p3.2xlarge machines), with the default parameters: | ||
|
||
| Operation \ Size | FheUint8 | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 | | ||
|------------------|-----------|-----------|-----------|-----------|------------|------------| | ||
| cuda_add | 103.33 ms | 129.26 ms | 156.83 ms | 186.99 ms | 320.96 ms | 528.15 ms | | ||
| cuda_bitand | 26.11 ms | 26.21 ms | 26.63 ms | 27.24 ms | 43.07 ms | 65.01 ms | | ||
| cuda_bitor | 26.1 ms | 26.21 ms | 26.57 ms | 27.23 ms | 43.05 ms | 65.0 ms | | ||
| cuda_bitxor | 26.08 ms | 26.21 ms | 26.57 ms | 27.25 ms | 43.06 ms | 65.07 ms | | ||
| cuda_eq | 52.82 ms | 53.0 ms | 79.4 ms | 79.58 ms | 96.37 ms | 145.25 ms | | ||
| cuda_ge | 104.7 ms | 130.23 ms | 156.19 ms | 183.2 ms | 213.43 ms | 288.76 ms | | ||
| cuda_gt | 104.93 ms | 130.2 ms | 156.33 ms | 183.38 ms | 213.47 ms | 288.8 ms | | ||
| cuda_le | 105.14 ms | 130.47 ms | 156.48 ms | 183.44 ms | 213.33 ms | 288.75 ms | | ||
| cuda_lt | 104.73 ms | 130.23 ms | 156.2 ms | 183.14 ms | 213.33 ms | 288.74 ms | | ||
| cuda_max | 156.7 ms | 182.65 ms | 210.74 ms | 251.78 ms | 316.9 ms | 442.71 ms | | ||
| cuda_min | 156.85 ms | 182.67 ms | 210.39 ms | 252.02 ms | 316.96 ms | 442.95 ms | | ||
| cuda_mul | 219.73 ms | 302.11 ms | 465.91 ms | 955.66 ms | 2.71 s | 9.15 s | | ||
| cuda_ne | 52.72 ms | 52.91 ms | 79.28 ms | 79.59 ms | 96.37 ms | 145.36 ms | | ||
| cuda_neg | 103.26 ms | 129.4 ms | 157.19 ms | 187.09 ms | 321.27 ms | 530.11 ms | | ||
| cuda_sub | 103.34 ms | 129.42 ms | 156.87 ms | 187.01 ms | 321.04 ms | 528.13 ms | |
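
To read the table in terms of scaling, one can compare the `FheUint8` and `FheUint256` columns for a given operation. The short sketch below does this arithmetic for `cuda_add` and `cuda_mul`, using timings copied from the table. `slowdown` is a hypothetical helper introduced only for this illustration.

```rust
/// Ratio between two timings expressed in milliseconds.
/// `slowdown` is a hypothetical helper used only for this illustration.
fn slowdown(small_ms: f64, large_ms: f64) -> f64 {
    large_ms / small_ms
}

fn main() {
    // Timings from the table above: FheUint8 vs FheUint256, in milliseconds.
    let add = slowdown(103.33, 528.15);
    let mul = slowdown(219.73, 9150.0); // 9.15 s expressed in ms
    println!("cuda_add: x{add:.1}, cuda_mul: x{mul:.1}");
}
```

On these figures, addition is roughly 5 times slower on 256 bits than on 8 bits, while multiplication is roughly 42 times slower.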