We will demonstrate how to transform existing DL code into an FL application, step by step:
- Show a baseline training script
- How to modify an existing training script using DL2FL Client API
- How to modify a structured script using DL2FL decorator
- How to modify a PyTorch Lightning script using DL2FL Lightning Client API
If you have multiple GPUs, please refer to the following examples:
- How to modify a PyTorch DDP training script using DL2FL Client API
- How to modify a PyTorch Lightning DDP training script using DL2FL Lightning Client API
Please install the requirements first; it is suggested to install them inside a virtual environment:
pip install -r requirements.txt
Please also configure the job templates folder:
nvflare config -jt ../../../../job_templates/
nvflare job list_templates
Each example has different requirements:
Example name | minimum requirements |
---|---|
Show a baseline training script | 1 CPU or 1 GPU* |
How to modify an existing training script using DL2FL Client API | 1 CPU or 1 GPU* |
How to modify a structured script using DL2FL decorator | 1 CPU or 1 GPU* |
How to modify a PyTorch Lightning script using DL2FL Lightning Client API | 1 CPU or 1 GPU* |
How to modify a PyTorch DDP training script using DL2FL Client API | 2 GPUs |
How to modify a PyTorch Lightning DDP training script using DL2FL Lightning Client API | 2 CPUs or 2 GPUs** |
* depends on whether you use device=cpu or device=cuda
** depends on whether torch.cuda.is_available() is True or not
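For reference, the CPU/GPU choice in these scripts follows the standard PyTorch device-selection pattern (a minimal sketch; the exact variable names in each script may differ):

import torch
from net import Net

# use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = Net().to(device)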
We take a CIFAR10 example directly from the PyTorch website and perform the following cleanups to get cifar10_original.py:
- Remove the comments
- Move the definition of Convolutional Neural Network to net.py
- Wrap the whole code inside a main method (https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods)
- Add the ability to run on GPU to speed up the training process (optional)
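The main-method wrapping mentioned above follows the standard Python entry-point guard, which keeps the script safe under the spawn/forkserver start methods (a minimal sketch):

def main():
    # dataset loading, model creation, and the training loop go here
    ...

if __name__ == "__main__":
    main()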
You can run the baseline using
python3 ./code/cifar10_original.py
It will run for 2 epochs, and then we will see something like this:
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
[1, 2000] loss: 2.127
[1, 4000] loss: 1.826
[1, 6000] loss: 1.667
[1, 8000] loss: 1.568
[1, 10000] loss: 1.503
[1, 12000] loss: 1.455
[2, 2000] loss: 1.386
[2, 4000] loss: 1.362
[2, 6000] loss: 1.348
[2, 8000] loss: 1.329
[2, 10000] loss: 1.327
[2, 12000] loss: 1.275
Finished Training
Accuracy of the network on the 10000 test images: 55 %
Now that we have CIFAR10 DL training code, let's transform it to FL with the NVFlare Client API.
We made the following changes:
- Import NVFlare Client API:
import nvflare.client as flare
- Initialize NVFlare Client API:
flare.init()
- Receive aggregated/global FLModel from NVFlare side each round:
input_model = flare.receive()
- Load the received aggregated/global model weights into the model structure:
net.load_state_dict(input_model.params)
- Wrap the evaluation logic into a method so it can be reused to evaluate both the locally trained model and the received aggregated/global model
- Evaluate the received aggregated/global model to get the metrics for model selection
- Construct the FLModel to be returned to the NVFlare side:
output_model = flare.FLModel(xxx)
- Send the model back to NVFlare:
flare.send(output_model)
Optional: Change the data path to an absolute path and use ./prepare_data.sh to download data
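Putting these pieces together, a minimal sketch of the modified script could look like the following (the evaluate helper, the data loading, and the while flare.is_running() loop are illustrative; see ./code/cifar10_fl.py for the actual implementation):

import nvflare.client as flare
from net import Net

def main():
    net = Net()
    flare.init()  # initialize the NVFlare Client API

    while flare.is_running():
        input_model = flare.receive()            # aggregated/global FLModel for this round
        net.load_state_dict(input_model.params)  # load the global weights

        accuracy = evaluate(net)                 # metric on the received global model, used for model selection

        # ... local training loop on net ...

        output_model = flare.FLModel(
            params=net.state_dict(),             # locally updated weights
            metrics={"accuracy": accuracy},
        )
        flare.send(output_model)                 # send the result back to the NVFlare side

if __name__ == "__main__":
    main()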
The modified code can be found in ./code/cifar10_fl.py
After we modify our training script, we need to put it into a job structure so that the NVFlare system knows how to deploy and run the job.
Please refer to the Job CLI tutorial on how to easily generate a job from our existing job templates.
We choose the sag_pt job template and run the following command to create the job:
nvflare job create -force -j ./jobs/client_api -w sag_pt -sd ./code/ \
-f config_fed_client.conf app_script=cifar10_fl.py
Then we can run it using the NVFlare Simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/client_api -w client_api_workspace
Congratulations! You have finished an FL training run!
The above case shows how you can change existing DL code to FL.
Usually, people have already organized their code into "train", "evaluate", and "test" methods so that these can be reused. In that case, the NVFlare DL2FL decorator is the way to go.
To structure the code, we make the following changes to ./code/cifar10_original.py:
- Wrap the training logic into a "train" method
- Wrap the evaluation logic into an "evaluate" method
- Call the "train" method and the "evaluate" method
The result is ./code/cifar10_structured_original.py
To modify this structured code for FL, we made the following changes:
- Import NVFlare Client API:
import nvflare.client as flare
- Initialize NVFlare Client API:
flare.init()
- Modify the "train" method:
  - Decorate it with @flare.train
  - Take an additional argument at the beginning
  - Load the received aggregated/global model weights into the model structure:
net.load_state_dict(input_model.params)
  - Return an FLModel object
- Add an "fl_evaluate" method:
  - Decorate it with @flare.evaluate
  - The first argument is the input FLModel
  - Return the metric as a float
- Receive the aggregated/global FLModel from the NVFlare side each round:
input_model = flare.receive()
- Call the "fl_evaluate" method before training to get metrics on the received aggregated/global model
Optional: Change the data path to an absolute path and use ./prepare_data.sh to download data
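Under the same assumptions, a minimal sketch of the decorated methods could look like this (argument names and the round loop are illustrative; the actual code is in ./code/cifar10_structured_fl.py):

import nvflare.client as flare
from net import Net

net = Net()

@flare.train
def train(input_model=None):
    net.load_state_dict(input_model.params)  # start local training from the global weights
    # ... local training loop ...
    return flare.FLModel(params=net.state_dict())

@flare.evaluate
def fl_evaluate(input_model=None):
    net.load_state_dict(input_model.params)
    return evaluate(net)  # the structured "evaluate" method, returning a float metric

def main():
    flare.init()
    while flare.is_running():
        input_model = flare.receive()
        fl_evaluate(input_model)  # metrics on the received global model
        train(input_model)        # local training

if __name__ == "__main__":
    main()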
The modified code can be found in ./code/cifar10_structured_fl.py
We choose the sag_pt job template and run the following command to create the job:
nvflare job create -force -j ./jobs/decorator -w sag_pt -sd ./code/ -f config_fed_client.conf app_script=cifar10_structured_fl.py
Then we can run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/decorator -w decorator_workspace
Transform CIFAR10 PyTorch Lightning training code to FL with NVFLARE Client lightning integration API
If you are using PyTorch Lightning to write your training scripts, you can use our NVFlare lightning client API to convert it into FL.
Given the CIFAR10 PyTorch Lightning code example in ./code/cifar10_lightning_original.py, notice that we wrap the Net class into a LightningModule: the LitNet class.
You can run it using
python3 ./code/cifar10_lightning_original.py
To transform the existing code to FL training code, we made the following changes:
- Import NVFlare Lightning Client API:
import nvflare.client.lightning as flare
- Patch the PyTorch Lightning trainer
flare.patch(trainer)
- Receive aggregated/global FLModel from NVFlare side each round:
input_model = flare.receive()
- Call the trainer.validate() method to evaluate the newly received aggregated/global model. The resulting evaluation metric will be used for best model selection
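A minimal sketch of the patched Lightning flow could look like this (the data loaders and Trainer arguments are illustrative; see ./code/cifar10_lightning_fl.py for the actual code):

import nvflare.client.lightning as flare
from pytorch_lightning import Trainer
from lit_net import LitNet

def main():
    model = LitNet()
    trainer = Trainer(max_epochs=1, devices=1)
    flare.patch(trainer)  # patch the trainer so it exchanges FLModels with NVFlare

    # repeated each FL round (loop construct omitted for brevity):
    input_model = flare.receive()                       # global FLModel for this round
    trainer.validate(model, dataloaders=val_loader)     # metric used for global model selection
    trainer.fit(model, train_dataloaders=train_loader)  # local training
    # val_loader / train_loader come from your CIFAR10 data setup (illustrative names)

if __name__ == "__main__":
    main()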
The modified code can be found in ./code/cifar10_lightning_fl.py
Then we can create the job using sag_pt template:
nvflare job create -force -j ./jobs/lightning -w sag_pt -sd ./code/ \
-f config_fed_client.conf app_script=cifar10_lightning_fl.py \
-f config_fed_server.conf key_metric=val_acc_epoch model_class_path=lit_net.LitNet
Note that we pass "key_metric"="val_acc_epoch" (this metric name is defined in the LitNet code), which means the validation accuracy for that epoch.
We also use "lit_net.LitNet" instead of "net.Net" for the model class.
Then we run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/lightning -w lightning_workspace
We follow the official PyTorch documentation and write a ./code/cifar10_ddp_original.py.
Note that we wrap the evaluation logic into a method for better usability.
It can be run using the torch distributed run:
python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=2 --master_port=6666 ./code/cifar10_ddp_original.py
To modify this multi-GPU code for FL, we made the following changes:
- Import NVFlare Client API:
import nvflare.client as flare
- Initialize NVFlare Client API:
flare.init()
- Receive aggregated/global FLModel from NVFlare side each round:
input_model = flare.receive()
- Load the received aggregated/global model weights into the model structure:
net.load_state_dict(input_model.params)
- Evaluate the received aggregated/global model to get the metrics for model selection
- Construct the FLModel to be returned to the NVFlare side:
output_model = flare.FLModel(xxx)
- Send the model back to NVFlare:
flare.send(output_model)
Note that we only do the flare receive and send on the first process (rank 0), as sketched below. Because all the worker processes launched by torch distributed end up with the same model, we don't need to send duplicate models back.
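As an illustration of that rank-0 guard (a sketch only; the variable names are assumed, and the actual code is in ./code/cifar10_ddp_fl.py):

import torch.distributed as dist
import nvflare.client as flare

# only rank 0 exchanges models with the NVFlare side; DDP keeps all ranks' weights in sync
if dist.get_rank() == 0:
    output_model = flare.FLModel(
        params=net.module.state_dict(),  # unwrap the DDP-wrapped model (assumed name `net`)
        metrics={"accuracy": accuracy},
    )
    flare.send(output_model)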
The modified code can be found in ./code/cifar10_ddp_fl.py
We can create the job using the following command:
nvflare job create -force -j ./jobs/client_api_ddp -w sag_pt_deploy_map -sd ./code/ \
-f app_1/config_fed_client.conf script="python3 -m torch.distributed.run --nnodes\=1 --nproc_per_node\=2 --master_port\=7777 custom/cifar10_ddp_fl.py" \
-f app_2/config_fed_client.conf script="python3 -m torch.distributed.run --nnodes\=1 --nproc_per_node\=2 --master_port\=8888 custom/cifar10_ddp_fl.py"
Then we run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/client_api_ddp -w client_api_ddp_workspace
This will start 2 clients and each client will start 2 worker processes.
Note that you might need to change the "master_port" in the "config_fed_client.conf" if those ports are already taken on your machine.
Transform CIFAR10 PyTorch Lightning + DDP training code to FL with NVFLARE Client lightning integration API
After finishing the single-GPU case, we now show how to convert multi-GPU training as well.
We just need to change the Trainer initialization to add extra options: strategy="ddp", devices=2
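For example (the other Trainer arguments are illustrative):

from pytorch_lightning import Trainer

# enable DDP training across 2 devices
trainer = Trainer(max_epochs=1, strategy="ddp", devices=2)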
The modified Lightning + DDP code can be found in ./code/cifar10_lightning_ddp_original.py
You can execute it using:
python3 ./code/cifar10_lightning_ddp_original.py
The modified FL code can be found in ./code/cifar10_lightning_ddp_fl.py
Then we can create the job using sag_pt template:
nvflare job create -force -j ./jobs/lightning_ddp -w sag_pt -sd ./code/ \
-f config_fed_client.conf app_script=cifar10_lightning_ddp_fl.py \
-f config_fed_server.conf key_metric=val_acc_epoch model_class_path=lit_net.LitNet
Note that we pass "key_metric"="val_acc_epoch" (this metric name is defined in the LitNet code), which means the validation accuracy for that epoch.
We also use "lit_net.LitNet" instead of "net.Net" for the model class.
Then we run it using the NVFlare simulator:
bash ./prepare_data.sh
nvflare simulator -n 2 -t 2 ./jobs/lightning_ddp -w lightning_ddp_workspace