We will demonstrate how to transform existing DL code into an FL application, step by step:
- Show a baseline training script
- How to modify an existing training script using DL2FL Client API
- How to modify a structured script using DL2FL decorator
- How to modify a PyTorch Lightning script using DL2FL Lightning Client API
If you have multiple GPUs, please refer to the following examples:
- How to modify a PyTorch DDP training script using DL2FL Client API
- How to modify a PyTorch Lightning DDP training script using DL2FL Lightning Client API
Please install the requirements first; it is suggested to install them inside a virtual environment:
pip install -r requirements.txt
Please also configure the job templates folder:
nvflare config -jt ../../../../job_templates/
nvflare job list_templates
Each example has different requirements:
Example name | minimum requirements |
---|---|
Show a baseline training script | 1 CPU or 1 GPU* |
How to modify an existing training script using DL2FL Client API | 1 CPU or 1 GPU* |
How to modify a structured script using DL2FL decorator | 1 CPU or 1 GPU* |
How to modify a PyTorch Lightning script using DL2FL Lightning Client API | 1 CPU or 1 GPU* |
How to modify a PyTorch DDP training script using DL2FL Client API | 2 GPUs |
How to modify a PyTorch Lightning DDP training script using DL2FL Lightning Client API | 2 CPUs or 2 GPUs** |
* depends on whether you use device=cpu or device=cuda
** depends on whether torch.cuda.is_available() is True or not
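For reference, the CPU/GPU choice in these scripts follows the standard PyTorch device-selection pattern (a minimal sketch; the exact variable names in each script may differ):

import torch
from net import Net

# use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = Net().to(device)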
We take a CIFAR10 example directly from the PyTorch website and perform the following cleanups to get cifar10_original.py:
- Remove the comments
- Move the definition of Convolutional Neural Network to net.py
- Wrap the whole code inside a main method (https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods)
- Add the ability to run on GPU to speed up the training process (optional)
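The main-method wrapping mentioned above follows the standard Python entry-point guard, which keeps the script safe under the spawn/forkserver start methods (a minimal sketch):

def main():
    # dataset loading, model creation, and the training loop go here
    ...

if __name__ == "__main__":
    main()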
You can run the baseline using
python3 ./code/cifar10_original.py
It will run for 2 epochs, and then we will see something like this:
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
[1, 2000] loss: 2.127
[1, 4000] loss: 1.826
[1, 6000] loss: 1.667
[1, 8000] loss: 1.568
[1, 10000] loss: 1.503
[1, 12000] loss: 1.455
[2, 2000] loss: 1.386
[2, 4000] loss: 1.362
[2, 6000] loss: 1.348
[2, 8000] loss: 1.329
[2, 10000] loss: 1.327
[2, 12000] loss: 1.275
Finished Training
Accuracy of the network on the 10000 test images: 55 %
Now that we have CIFAR10 DL training code, let's transform it to FL with the NVFlare Client API.
We made the following changes:
- Import NVFlare Client API:
import nvflare.client as flare
- Initialize NVFlare Client API:
flare.init()
- Receive aggregated/global FLModel from NVFlare side each round:
input_model = flare.receive()
- Load the received aggregated/global model weights into the model structure:
net.load_state_dict(input_model.params)
- Wrap the evaluation logic into a method so it can be reused to evaluate both the locally trained model and the received aggregated/global model
- Evaluate the received aggregated/global model to get the metrics for model selection
- Construct the FLModel to be returned to the NVFlare side:
output_model = flare.FLModel(xxx)
- Send the model back to NVFlare:
flare.send(output_model)
Optional: Change the data path to an absolute path and use ./prepare_data.sh to download data
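Putting these pieces together, a minimal sketch of the modified script could look like the following (the evaluate helper, the data loading, and the while flare.is_running() loop are illustrative; see ./code/cifar10_fl.py for the actual implementation):

import nvflare.client as flare
from net import Net

def main():
    net = Net()
    flare.init()  # initialize the NVFlare Client API

    while flare.is_running():
        input_model = flare.receive()            # aggregated/global FLModel for this round
        net.load_state_dict(input_model.params)  # load the global weights

        accuracy = evaluate(net)                 # metric on the received global model, used for model selection

        # ... local training loop on net ...

        output_model = flare.FLModel(
            params=net.state_dict(),             # locally updated weights
            metrics={"accuracy": accuracy},
        )
        flare.send(output_model)                 # send the result back to the NVFlare side

if __name__ == "__main__":
    main()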
The modified code can be found in ./code/cifar10_fl.py
After we modify our training script, we need to put it into a job structure so that the NVFlare system knows how to deploy and run the job.
Please refer to the Job CLI tutorial on how to easily generate a job from our existing job templates.
We choose the sag_pt job template and run the following command to create the job:
nvflare job create -force -j ./jobs/client_api -w sag_pt -sd ./code/ \
-f config_fed_client.conf app_script=cifar10_fl.py
Then we can run it using the NVFlare Simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/client_api -w client_api_workspace
Congratulations! You have finished an FL training run!
The above case shows how you can change existing DL code to FL.
Usually, people have already organized their code into "train", "evaluate", and "test" methods so that these can be reused. In that case, the NVFlare DL2FL decorator is the way to go.
To structure the code, we make the following changes to ./code/cifar10_original.py:
- Wrap the training logic into a "train" method
- Wrap the evaluation logic into an "evaluate" method
- Call the "train" method and the "evaluate" method
The result is ./code/cifar10_structured_original.py
To modify this structured code for FL, we made the following changes:
- Import NVFlare Client API:
import nvflare.client as flare
- Initialize NVFlare Client API:
flare.init()
- Modify the "train" method:
  - Decorate it with @flare.train
  - Take an additional argument at the beginning
  - Load the received aggregated/global model weights into the model structure:
net.load_state_dict(input_model.params)
  - Return an FLModel object
- Add an "fl_evaluate" method:
  - Decorate it with @flare.evaluate
  - The first argument is the input FLModel
  - Return the metric as a float
- Receive the aggregated/global FLModel from the NVFlare side each round:
input_model = flare.receive()
- Call the "fl_evaluate" method before training to get metrics on the received aggregated/global model
Optional: Change the data path to an absolute path and use ./prepare_data.sh to download data
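Under the same assumptions, a minimal sketch of the decorated methods could look like this (argument names and the round loop are illustrative; the actual code is in ./code/cifar10_structured_fl.py):

import nvflare.client as flare
from net import Net

net = Net()

@flare.train
def train(input_model=None):
    net.load_state_dict(input_model.params)  # start local training from the global weights
    # ... local training loop ...
    return flare.FLModel(params=net.state_dict())

@flare.evaluate
def fl_evaluate(input_model=None):
    net.load_state_dict(input_model.params)
    return evaluate(net)  # the structured "evaluate" method, returning a float metric

def main():
    flare.init()
    while flare.is_running():
        input_model = flare.receive()
        fl_evaluate(input_model)  # metrics on the received global model
        train(input_model)        # local training

if __name__ == "__main__":
    main()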
The modified code can be found in ./code/cifar10_structured_fl.py
We choose the sag_pt job template and run the following command to create the job:
nvflare job create -force -j ./jobs/decorator -w sag_pt -sd ./code/ -f config_fed_client.conf app_script=cifar10_structured_fl.py
Then we can run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/decorator -w decorator_workspace
Transform CIFAR10 PyTorch Lightning training code to FL with NVFLARE Client lightning integration API
If you are using PyTorch Lightning to write your training scripts, you can use our NVFlare lightning client API to convert it into FL.
Given the CIFAR10 PyTorch Lightning code example in ./code/cifar10_lightning_original.py, notice that we wrap the Net class into a LightningModule: the LitNet class.
You can run it using
python3 ./code/cifar10_lightning_original.py
To transform the existing code to FL training code, we made the following changes:
- Import NVFlare Lightning Client API:
import nvflare.client.lightning as flare
- Patch the PyTorch Lightning trainer
flare.patch(trainer)
- Receive aggregated/global FLModel from NVFlare side each round:
input_model = flare.receive()
- Call the trainer.validate() method to evaluate the newly received aggregated/global model. The resulting evaluation metric will be used for best model selection
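A minimal sketch of the patched Lightning flow could look like this (the data loaders and Trainer arguments are illustrative; see ./code/cifar10_lightning_fl.py for the actual code):

import nvflare.client.lightning as flare
from pytorch_lightning import Trainer
from lit_net import LitNet

def main():
    model = LitNet()
    trainer = Trainer(max_epochs=1, devices=1)
    flare.patch(trainer)  # patch the trainer so it exchanges FLModels with NVFlare

    # repeated each FL round (loop construct omitted for brevity):
    input_model = flare.receive()                       # global FLModel for this round
    trainer.validate(model, dataloaders=val_loader)     # metric used for global model selection
    trainer.fit(model, train_dataloaders=train_loader)  # local training
    # val_loader / train_loader come from your CIFAR10 data setup (illustrative names)

if __name__ == "__main__":
    main()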
The modified code can be found in ./code/cifar10_lightning_fl.py
Then we can create the job using sag_pt template:
nvflare job create -force -j ./jobs/lightning -w sag_pt -sd ./code/ \
-f config_fed_client.conf app_script=cifar10_lightning_fl.py \
-f config_fed_server.conf key_metric=val_acc_epoch model_class_path=lit_net.LitNet
Note that we pass "key_metric"="val_acc_epoch" (this metric name is defined in the LitNet code), which means the validation accuracy for that epoch.
We also use "lit_net.LitNet" instead of "net.Net" for the model class.
Then we run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/lightning -w lightning_workspace
We follow the official PyTorch documentation and write a ./code/cifar10_ddp_original.py.
Note that we wrap the evaluation logic into a method for better usability.
It can be run using the torch distributed run:
python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=2 --master_port=6666 ./code/cifar10_ddp_original.py
To modify this multi-GPU code for FL, we made the following changes:
- Import NVFlare Client API:
import nvflare.client as flare
- Initialize NVFlare Client API:
flare.init()
- Receive aggregated/global FLModel from NVFlare side each round:
input_model = flare.receive()
- Load the received aggregated/global model weights into the model structure:
net.load_state_dict(input_model.params)
- Evaluate the received aggregated/global model to get the metrics for model selection
- Construct the FLModel to be returned to the NVFlare side:
output_model = flare.FLModel(xxx)
- Send the model back to NVFlare:
flare.send(output_model)
Note that we only do the flare receive and send on the first process (rank 0), as sketched below. Because all the worker processes launched by torch distributed end up with the same model, we don't need to send duplicate models back.
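As an illustration of that rank-0 guard (a sketch only; the variable names are assumed, and the actual code is in ./code/cifar10_ddp_fl.py):

import torch.distributed as dist
import nvflare.client as flare

# only rank 0 exchanges models with the NVFlare side; DDP keeps all ranks' weights in sync
if dist.get_rank() == 0:
    output_model = flare.FLModel(
        params=net.module.state_dict(),  # unwrap the DDP-wrapped model (assumed name `net`)
        metrics={"accuracy": accuracy},
    )
    flare.send(output_model)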
The modified code can be found in ./code/cifar10_ddp_fl.py
We can create the job using the following command:
nvflare job create -force -j ./jobs/client_api_ddp -w sag_pt_deploy_map -sd ./code/ \
-f app_1/config_fed_client.conf script="python3 -m torch.distributed.run --nnodes\=1 --nproc_per_node\=2 --master_port\=7777 custom/cifar10_ddp_fl.py" \
-f app_2/config_fed_client.conf script="python3 -m torch.distributed.run --nnodes\=1 --nproc_per_node\=2 --master_port\=8888 custom/cifar10_ddp_fl.py"
Then we run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/client_api_ddp -w client_api_ddp_workspace
This will start 2 clients and each client will start 2 worker processes.
Note that you might need to change the "master_port" in the "config_fed_client.conf" if those ports are already taken on your machine.
Transform CIFAR10 PyTorch Lightning + DDP training code to FL with NVFLARE Client lightning integration API
After finishing the single-GPU case, we now show how to convert multi-GPU training as well.
We just need to change the Trainer initialization to add extra options: strategy="ddp", devices=2
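For example (the other Trainer arguments are illustrative):

from pytorch_lightning import Trainer

# enable DDP training across 2 devices
trainer = Trainer(max_epochs=1, strategy="ddp", devices=2)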
The modified Lightning + DDP code can be found in ./code/cifar10_lightning_ddp_original.py
You can execute it using:
python3 ./code/cifar10_lightning_ddp_original.py
The modified FL code can be found in ./code/cifar10_lightning_ddp_fl.py
Then we can create the job using sag_pt template:
nvflare job create -force -j ./jobs/lightning_ddp -w sag_pt -sd ./code/ \
-f config_fed_client.conf app_script=cifar10_lightning_ddp_fl.py \
-f config_fed_server.conf key_metric=val_acc_epoch model_class_path=lit_net.LitNet
Note that we pass "key_metric"="val_acc_epoch" (this metric name is defined in the LitNet code), which means the validation accuracy for that epoch.
We also use "lit_net.LitNet" instead of "net.Net" for the model class.
Then we run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/lightning_ddp -w lightning_ddp_workspace