doc(gpu): add how to page about running on GPU

zama-ai · Jan 22, 2024 · 33a7e9f · 33a7e9f
1 parent 96da25c
commit 33a7e9f
Showing 2 changed files with 217 additions and 0 deletions.
diff --git a/tfhe/docs/SUMMARY.md b/tfhe/docs/SUMMARY.md
@@ -14,6 +14,7 @@
 * [Homomorphic Case Changing on Ascii String](tutorials/ascii_fhe_string.md)
 
 ## How To
+* [Run on GPU](how_to/run_on_gpu.md)
 * [Configure Rust](how_to/rust_configuration.md)
 * [Detect Overflow](how_to/overflow_operations.md)
 * [Serialize/Deserialize](how_to/serialization.md)

diff --git a/tfhe/docs/how_to/run_on_gpu.md b/tfhe/docs/how_to/run_on_gpu.md
@@ -0,0 +1,216 @@
+TFHE-rs now includes a GPU backend: a CUDA implementation for performing integer arithmetic on encrypted data.
+This short tutorial shows how to update an existing program to use GPU acceleration, or how to start a new one that runs on GPU.
+
+# Prerequisites
+- CUDA version >= 10
+- Compute Capability >= 3.0
+- [gcc](https://gcc.gnu.org/) >= 8.0 - check this [page](https://gist.github.com/ax3l/9489132) for more details about nvcc/gcc compatible versions
+- [cmake](https://cmake.org/) >= 3.24
+- Rust version - check this [page](rust_configuration.md)
+
+# Importing to your project
+To use the `TFHE-rs GPU backend` in your project, you first need to add it as a dependency in your `Cargo.toml`.
+
+If you are using an `x86` machine:
+```toml
+tfhe = { version = "0.5.0", features = [ "boolean", "shortint", "integer", "x86_64-unix", "gpu" ] }
+```
+
+If you are using an `ARM` machine:
+```toml
+tfhe = { version = "0.5.0", features = [ "boolean", "shortint", "integer", "aarch64-unix", "gpu" ] }
+```
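+
+For a crate that is built on both architectures, a possible variant is to let Cargo pick the feature set through target-specific dependency tables. This is only a sketch: it assumes Cargo's feature resolver version 2 (the default with the 2021 edition), so that features declared for the non-matching target are not enabled.
+
+```toml
+# Selected when compiling for x86_64
+[target.'cfg(target_arch = "x86_64")'.dependencies]
+tfhe = { version = "0.5.0", features = [ "boolean", "shortint", "integer", "x86_64-unix", "gpu" ] }
+
+# Selected when compiling for aarch64
+[target.'cfg(target_arch = "aarch64")'.dependencies]
+tfhe = { version = "0.5.0", features = [ "boolean", "shortint", "integer", "aarch64-unix", "gpu" ] }
+```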
+
+
+{% hint style="success" %}
+When running code that uses `TFHE-rs`, it is highly recommended to run in release mode with cargo's `--release` flag to get the best possible performance.
+{% endhint %}
+
+## Supported platforms
+
+TFHE-rs GPU backend is supported on Linux (x86, aarch64).
+
+| OS      | x86           | aarch64          |
+| ------- |---------------|------------------|
+| Linux   | `x86_64-unix` | `aarch64-unix`\* |
+| macOS   | Unsupported   | Unsupported\*    |
+| Windows | Unsupported   | Unsupported      |
+
+
+# A first example
+## Configuring and creating keys
+In comparison with the [CPU example](../getting_started/quick_start), the only difference lies in the key creation, which is detailed [here](#Setting-the-keys).
+
+Here is a full example (combining the client and server parts):
+
+```rust
+use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
+use tfhe::prelude::*;
+
+fn main() {
+
+    let config = ConfigBuilder::default().build();
+
+    let client_key = ClientKey::generate(config);
+    let compressed_server_key = CompressedServerKey::new(&client_key);
+
+    let gpu_key = compressed_server_key.decompress_to_gpu();
+
+    let clear_a = 27u8;
+    let clear_b = 128u8;
+
+    let a = FheUint8::encrypt(clear_a, &client_key);
+    let b = FheUint8::encrypt(clear_b, &client_key);
+
+    //Server-side
+
+    set_server_key(gpu_key);
+    let result = a + b;
+
+    //Client-side
+    let decrypted_result: u8 = result.decrypt(&client_key);
+
+    let clear_result = clear_a + clear_b;
+
+    assert_eq!(decrypted_result, clear_result);
+}
+```
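+
+The assertion above relies on the clear sum fitting in a `u8` (27 + 128 = 155). For inputs whose sum exceeds 255, plain `+` on `u8` panics in debug builds, so the clear-side reference needs wrapping arithmetic. The sketch below uses only the Rust standard library and assumes that `FheUint8` addition wraps modulo 256, like Rust's `wrapping_add`; `expected_sum` is a hypothetical helper, not part of TFHE-rs.
+
+```rust
+/// Clear-side reference for an 8-bit addition that wraps modulo 256.
+/// Hypothetical helper for illustration only (not a TFHE-rs function).
+fn expected_sum(clear_a: u8, clear_b: u8) -> u8 {
+    clear_a.wrapping_add(clear_b)
+}
+
+fn main() {
+    // Same inputs as the example above: no overflow occurs.
+    assert_eq!(expected_sum(27, 128), 155);
+
+    // 200 + 100 = 300, which wraps to 300 - 256 = 44.
+    assert_eq!(expected_sum(200, 100), 44);
+
+    println!("27 + 128 -> {}", expected_sum(27, 128));
+    println!("200 + 100 -> {}", expected_sum(200, 100));
+}
+```
+
+Under that assumption, comparing `decrypted_result` against `expected_sum(clear_a, clear_b)` instead of `clear_a + clear_b` keeps the check valid for any pair of inputs.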
+
+
+
+
+## Setting the keys
+Key configuration differs from the CPU case. Both the client key and the server key are still generated by the client (which is assumed to run on a CPU), but the server must then decompress the server key to convert it into the format expected by the GPU.
+To do so, the server calls `decompress_to_gpu()` on the compressed server key.
+From then on, there is no difference between the CPU and the GPU code.
+
+
+## Encrypting data
+On the client side, data is encrypted exactly as on the CPU:
+```rust
+    let clear_a = 27u8;
+    let clear_b = 128u8;
+
+    let a = FheUint8::encrypt(clear_a, &client_key);
+    let b = FheUint8::encrypt(clear_b, &client_key);
+```
+
+## Computation
+The server must first set its key, as on the CPU, with `set_server_key(gpu_key);`.
+Homomorphic computations then use the same code as the one described [here](../getting_started/operations).
+
+```rust
+    //Server-side
+    set_server_key(gpu_key);
+    let result = a + b;
+
+    //Client-side
+    let decrypted_result: u8 = result.decrypt(&client_key);
+
+    let clear_result = clear_a + clear_b;
+
+    assert_eq!(decrypted_result, clear_result);
+```
+
+## Decryption
+Finally, the client decrypts the result with:
+
+```rust
+    let decrypted_result: u8 = result.decrypt(&client_key);
+```
+
+## Improving performance
+TFHE-rs can take advantage of the large number of threads available on a GPU.
+To do so, build the configuration with the multi-bit parameter set: `let config = ConfigBuilder::with_custom_parameters(PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build();`
+The complete example becomes:
+
+```rust
+use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
+use tfhe::prelude::*;
+use tfhe::shortint::parameters::PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS;
+
+fn main() {
+
+    let config = ConfigBuilder::with_custom_parameters(PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build();
+
+    let client_key = ClientKey::generate(config);
+    let compressed_server_key = CompressedServerKey::new(&client_key);
+
+    let gpu_key = compressed_server_key.decompress_to_gpu();
+
+    let clear_a = 27u8;
+    let clear_b = 128u8;
+
+    let a = FheUint8::encrypt(clear_a, &client_key);
+    let b = FheUint8::encrypt(clear_b, &client_key);
+
+    //Server-side
+
+    set_server_key(gpu_key);
+    let result = a + b;
+
+    //Client-side
+    let decrypted_result: u8 = result.decrypt(&client_key);
+
+    let clear_result = clear_a + clear_b;
+
+    assert_eq!(decrypted_result, clear_result);
+}
+```
+
+
+# List of available operations
+
+The GPU backend includes the following operations:
+
+| name                  | symbol         | `Enc`/`Enc`              | `Enc`/ `Int`             |
+|-----------------------|----------------|--------------------------|--------------------------|
+| Neg                   | `-`            | :heavy_check_mark:       | N/A                      |
+| Add                   | `+`            | :heavy_check_mark:       | :heavy_check_mark:       |
+| Sub                   | `-`            | :heavy_check_mark:       | :heavy_check_mark:       |
+| Mul                   | `*`            | :heavy_check_mark:       | :heavy_check_mark:       |
+| Div                   | `/`            | :heavy_multiplication_x: | :heavy_multiplication_x: |
+| Rem                   | `%`            | :heavy_multiplication_x: | :heavy_multiplication_x: |
+| Not                   | `!`            | :heavy_check_mark:       | N/A                      |
+| BitAnd                | `&`            | :heavy_check_mark:       | :heavy_check_mark:       |
+| BitOr                 | `\|`           | :heavy_check_mark:       | :heavy_check_mark:       |
+| BitXor                | `^`            | :heavy_check_mark:       | :heavy_check_mark:       |
+| Shr                   | `>>`           | :heavy_multiplication_x: | :heavy_check_mark:       |
+| Shl                   | `<<`           | :heavy_multiplication_x: | :heavy_check_mark:       |
+| Rotate right          | `rotate_right` | :heavy_multiplication_x: | :heavy_check_mark:       |
+| Rotate left           | `rotate_left`  | :heavy_multiplication_x: | :heavy_check_mark:       |
+| Min                   | `min`          | :heavy_check_mark:       | :heavy_check_mark:       |
+| Max                   | `max`          | :heavy_check_mark:       | :heavy_check_mark:       |
+| Greater than          | `gt`           | :heavy_check_mark:       | :heavy_check_mark:       |
+| Greater than or equal | `ge`           | :heavy_check_mark:       | :heavy_check_mark:       |
+| Lower than            | `lt`           | :heavy_check_mark:       | :heavy_check_mark:       |
+| Lower than or equal   | `le`           | :heavy_check_mark:       | :heavy_check_mark:       |
+| Equal                 | `eq`           | :heavy_check_mark:       | :heavy_check_mark:       |
+| Cast (into dest type) | `cast_into`    | :heavy_multiplication_x: | N/A                      |
+| Cast (from src type)  | `cast_from`    | :heavy_multiplication_x: | N/A                      |
+| Ternary operator      | `if_then_else` | :heavy_check_mark:       | :heavy_multiplication_x: |
+
+{% hint style="info" %}
+All operations follow the same syntax as the one described [here](../getting_started/operations.md).
+{% endhint %}
+
+
+
+# Benchmarks
+The table below contains benchmarks for homomorphic operations running on a single V100 GPU from AWS (a p3.2xlarge machine), with the default parameters:
+
+| Operation \ Size | FheUint8  | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 |
+|------------------|-----------|-----------|-----------|-----------|------------|------------|
+| cuda_add         | 103.33 ms | 129.26 ms | 156.83 ms | 186.99 ms | 320.96 ms  | 528.15 ms  |
+| cuda_bitand      | 26.11 ms  | 26.21 ms  | 26.63 ms  | 27.24 ms  | 43.07 ms   | 65.01 ms   |
+| cuda_bitor       | 26.1 ms   | 26.21 ms  | 26.57 ms  | 27.23 ms  | 43.05 ms   | 65.0 ms    |
+| cuda_bitxor      | 26.08 ms  | 26.21 ms  | 26.57 ms  | 27.25 ms  | 43.06 ms   | 65.07 ms   |
+| cuda_eq          | 52.82 ms  | 53.0 ms   | 79.4 ms   | 79.58 ms  | 96.37 ms   | 145.25 ms  |
+| cuda_ge          | 104.7 ms  | 130.23 ms | 156.19 ms | 183.2 ms  | 213.43 ms  | 288.76 ms  |
+| cuda_gt          | 104.93 ms | 130.2 ms  | 156.33 ms | 183.38 ms | 213.47 ms  | 288.8 ms   |
+| cuda_le          | 105.14 ms | 130.47 ms | 156.48 ms | 183.44 ms | 213.33 ms  | 288.75 ms  |
+| cuda_lt          | 104.73 ms | 130.23 ms | 156.2 ms  | 183.14 ms | 213.33 ms  | 288.74 ms  |
+| cuda_max         | 156.7 ms  | 182.65 ms | 210.74 ms | 251.78 ms | 316.9 ms   | 442.71 ms  |
+| cuda_min         | 156.85 ms | 182.67 ms | 210.39 ms | 252.02 ms | 316.96 ms  | 442.95 ms  |
+| cuda_mul         | 219.73 ms | 302.11 ms | 465.91 ms | 955.66 ms | 2.71 s     | 9.15 s     |
+| cuda_ne          | 52.72 ms  | 52.91 ms  | 79.28 ms  | 79.59 ms  | 96.37 ms   | 145.36 ms  |
+| cuda_neg         | 103.26 ms | 129.4 ms  | 157.19 ms | 187.09 ms | 321.27 ms  | 530.11 ms  |
+| cuda_sub         | 103.34 ms | 129.42 ms | 156.87 ms | 187.01 ms | 321.04 ms  | 528.13 ms  |
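+
+To read the table at a glance, the growth in latency from `FheUint8` to `FheUint256` can be computed from its first and last columns. The sketch below is plain arithmetic on figures copied from the table; `scaling_factor` is a hypothetical helper, not part of TFHE-rs.
+
+```rust
+/// Ratio between the FheUint256 and FheUint8 timings of one operation.
+/// Both arguments are in milliseconds. Hypothetical helper for illustration.
+fn scaling_factor(ms_8_bits: f64, ms_256_bits: f64) -> f64 {
+    ms_256_bits / ms_8_bits
+}
+
+fn main() {
+    // (operation, FheUint8 in ms, FheUint256 in ms), copied from the table above.
+    let rows = [
+        ("cuda_bitand", 26.11, 65.01),
+        ("cuda_add", 103.33, 528.15),
+        ("cuda_mul", 219.73, 9150.0), // 9.15 s expressed in ms
+    ];
+
+    for &(name, small, large) in rows.iter() {
+        println!("{}: x{:.1}", name, scaling_factor(small, large));
+    }
+}
+```
+
+With these inputs the ratios are roughly 2.5 for `cuda_bitand`, 5.1 for `cuda_add` and 41.6 for `cuda_mul`: bitwise operations and addition grow far more slowly than the 32-fold increase in bit width, while multiplication grows faster.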