PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible, and scalable deep learning platform, originally developed by Baidu scientists and engineers to apply deep learning to many products at Baidu.
Fluid is the latest version of PaddlePaddle; it describes the model for training or inference as a "Program".
PaddlePaddle Elastic Deep Learning (EDL) is a cluster-level project that makes PaddlePaddle training jobs scalable and fault-tolerant. EDL greatly boosts parallel distributed training jobs and makes good use of cluster computing power.
EDL builds on PaddlePaddle's fault-tolerance features; it uses a Kubernetes controller to manage cluster training jobs and an auto-scaler to scale each job's computing resources.
Introduction
In the introduction session, we will cover:
- PaddlePaddle Fluid design overview.
- Fluid Distributed Training.
- Why we developed PaddlePaddle EDL and how we implemented it.
Hands-on Tutorial
A hands-on tutorial follows each introduction session so that the audience can use PaddlePaddle and ask questions while using it:
- Training models using PaddlePaddle Fluid in a Jupyter Notebook (PaddlePaddle Book).
- Launch a Distributed Training Job on your laptop.
- Launch the EDL training job on a Kubernetes cluster.
Intended audience
People who are interested in deep learning system architecture.
Prerequisites
- Install Docker
- Install Git
- Install kubectl
- A Kubernetes cluster, version 1.7.x:
  - minikube can launch a Kubernetes cluster locally.
  - kops can launch a Kubernetes cluster on AWS.
  - We will also prepare a public Kubernetes cluster in the cloud; if you don't have an AWS account, you can submit the EDL training jobs to the public cluster.
Please check out the PaddlePaddle Book for the steps to run the training process and the example output.
- Launch the PaddlePaddle Production Docker Container:
  > git clone https://github.com/PaddlePaddle/edl.git
  > cd edl/example/fluid
  > docker run --name paddle -d -it -v $PWD:/work paddlepaddle/paddle /bin/bash
- Split training data into multiple parts:
  > docker exec -it paddle /bin/bash
  > cd /work
  > python recognize_digits.py prepare
  This would split the mnist dataset into multiple parts as follows:
  ./dataset/mnist/
  ./dataset/mnist/mnist-train-00000.pickle
  ./dataset/mnist/mnist-train-00001.pickle
  ./dataset/mnist/mnist-train-00002.pickle
  ./dataset/mnist/mnist-train-00003.pickle
  ...
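As a rough illustration of what the prepare step does, the sketch below shards a list of training samples into numbered pickle files; split_dataset is a hypothetical helper written for this tutorial, not the actual recognize_digits.py code.
  import os
  import pickle

  def split_dataset(samples, out_dir, num_parts=16, prefix="mnist-train"):
      """Write the samples round-robin into num_parts pickle files."""
      if not os.path.exists(out_dir):
          os.makedirs(out_dir)
      parts = [[] for _ in range(num_parts)]
      for i, sample in enumerate(samples):
          parts[i % num_parts].append(sample)
      for part_id, part in enumerate(parts):
          path = os.path.join(out_dir, "%s-%05d.pickle" % (prefix, part_id))
          with open(path, "wb") as f:
              pickle.dump(part, f)

  # Example: shard 1000 dummy (image, label) pairs into ./dataset/mnist/.
  split_dataset([([0.0] * 784, 0)] * 1000, "./dataset/mnist")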
- Launch one PServer instance and two Trainer instances:
  Start the PServer instance:
  > docker exec -it paddle /bin/bash
  > cd /work
  > PADDLE_PSERVER_EPS=127.0.0.1:6789 \
    PADDLE_TRAINERS=2 \
    PADDLE_TRAINING_ROLE=PSERVER \
    PADDLE_CURRENT_ENDPOINT=127.0.0.1:6789 \
    python recognize_digits.py train
  Start the Trainer instance with trainer_id=0:
  > docker exec -it paddle /bin/bash
  > cd /work
  > PADDLE_PSERVER_EPS=127.0.0.1:6789 \
    PADDLE_TRAINERS=2 \
    PADDLE_TRAINING_ROLE=TRAINER \
    PADDLE_TRAINER_ID=0 \
    python recognize_digits.py train
  Start the Trainer instance with trainer_id=1:
  > docker exec -it paddle /bin/bash
  > cd /work
  > PADDLE_PSERVER_EPS=127.0.0.1:6789 \
    PADDLE_TRAINERS=2 \
    PADDLE_TRAINING_ROLE=TRAINER \
    PADDLE_TRAINER_ID=1 \
    python recognize_digits.py train
  The Trainer instance would print the training logs as follows:
  append file for current trainer: dataset/mnist/mnist-train-00000.pickle
  append file for current trainer: dataset/mnist/mnist-train-00002.pickle
  append file for current trainer: dataset/mnist/mnist-train-00004.pickle
  append file for current trainer: dataset/mnist/mnist-train-00006.pickle
  append file for current trainer: dataset/mnist/mnist-train-00008.pickle
  append file for current trainer: dataset/mnist/mnist-train-00010.pickle
  append file for current trainer: dataset/mnist/mnist-train-00012.pickle
  append file for current trainer: dataset/mnist/mnist-train-00014.pickle
  ('processing file: ', 'dataset/mnist/mnist-train-00000.pickle')
  Epoch: 0, Batch: 10, Test Loss: 0.24518635296, Acc: 0.923899995804
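The log above shows trainer 0 reading every other data part (00000, 00002, and so on). Below is a minimal sketch of how such an assignment can work, assuming a simple round-robin split by trainer id (not necessarily the exact logic in recognize_digits.py):
  import glob

  def files_for_trainer(pattern, trainer_id, num_trainers):
      """Assign data parts to one trainer, round-robin by file index."""
      all_files = sorted(glob.glob(pattern))
      return [f for i, f in enumerate(all_files) if i % num_trainers == trainer_id]

  # Trainer 0 of 2 gets mnist-train-00000.pickle, 00002, 00004, ...
  print(files_for_trainer("dataset/mnist/mnist-train-*.pickle", 0, 2))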
- Inference
  Execute the following command to load the model and run inference on the input image img/infer_3.png:
  > docker exec -it paddle /bin/bash -c "cd /work && python recognize_digits.py infer"
  The inference result is as follows:
  ('Inference result of img/infer_3.png is: ', 3)
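For reference, here is a minimal sketch of what such an inference step can look like with the Fluid API; the saved-model directory name and the exact image pre-processing are assumptions for illustration, and the actual recognize_digits.py may differ.
  import numpy as np
  from PIL import Image
  import paddle.fluid as fluid

  place = fluid.CPUPlace()
  exe = fluid.Executor(place)

  # Load the inference model saved by the training step (directory name assumed).
  [inference_program, feed_target_names, fetch_targets] = \
      fluid.io.load_inference_model("recognize_digits_model", exe)

  # Read the image as a 1x1x28x28 float tensor scaled to [0, 1].
  img = Image.open("img/infer_3.png").convert("L").resize((28, 28))
  data = np.array(img).reshape(1, 1, 28, 28).astype(np.float32) / 255.0

  results = exe.run(inference_program,
                    feed={feed_target_names[0]: data},
                    fetch_list=fetch_targets)
  print("Inference result of img/infer_3.png is:", int(np.argmax(results[0])))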
Please note that EDL currently only supports the earlier PaddlePaddle version, so the fault-tolerant model is written with the PaddlePaddle v2 API.
If you start a Kubernetes cluster with minikube or kops, the kubectl configuration will be ready once the cluster is available; otherwise, contact the cluster administrator to fetch the configuration file.
NOTE: there can be only one EDL controller in a Kubernetes cluster, so if you're using the public cluster, you can skip this step.
- Create a paddlecloud namespace to run the EDL components:
  > kubectl create namespace paddlecloud
- (Optional) Configure RBAC for the EDL controller so that it has cluster admin permission.
  If you launched your Kubernetes cluster with kops on AWS, the default authorization mode is RBAC, so this step is necessary:
  > kubectl create -f k8s/rbac_admin.yaml
- Create the ThirdPartyResource (TPR) "training-job":
  > kubectl create -f k8s/thirdpartyresource.yaml
  To verify the creation of the resource, run the following command:
  > kubectl describe ThirdPartyResource training-job
- Deploy the EDL controller:
  > kubectl create -f k8s/edl_controller.yaml
- Edit the local training program so that it can run in distributed mode
  It is easy to update your local training program to run in distributed mode:
  - Dataset
    Pre-process the dataset to convert it to RecordIO format. We have done this in the Docker image paddlepaddle/edl-example using the dataset.common.convert API as follows:
    dataset.common.convert('/data/recordio/imikolov/', dataset.imikolov.train(word_dict, 5), 5000, 'imikolov-train')
    This generates many RecordIO files in the /data/recordio/imikolov folder; we have already prepared these files in the Docker image paddlepaddle/edl-example.
  - Pass the etcd_endpoint to the Trainer object so that the Trainer knows it is a fault-tolerant distributed training job:
    trainer = paddle.trainer.SGD(cost,
                                 parameters,
                                 adam_optimizer,
                                 is_local=False,
                                 pserver_spec=etcd_endpoint,
                                 use_etcd=True)
  - Use cloud_reader, which is a master_client instance that fetches the training data from the task queue (a sketch of an event_handler used here follows this list):
    trainer.train(
        paddle.batch(cloud_reader([TRAIN_FILES_PATH], etcd_endpoint), 32),
        num_passes=30,
        event_handler=event_handler)
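The train call above references an event_handler; below is a minimal sketch of one, assuming the standard paddle.v2 event types (your training script may already define something similar):
  import paddle.v2 as paddle

  def event_handler(event):
      # Print progress every 100 batches and a summary at the end of each pass.
      if isinstance(event, paddle.event.EndIteration):
          if event.batch_id % 100 == 0:
              print("Pass %d, Batch %d, Cost %f" % (
                  event.pass_id, event.batch_id, event.cost))
      if isinstance(event, paddle.event.EndPass):
          print("Finished pass %d" % event.pass_id)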
- Run the monitor program
  Please open a new tab in your terminal and run the monitor Python script example/collector.py:
  > cd edl/example
  > docker run --rm -it -v $HOME/.kube/config:/root/.kube/config -v $PWD:/work paddlepaddle/edl-example python /work/collector.py
  You will see the following metrics:
  SUBMITTED-JOBS    PENDING-JOBS    RUNNING-TRAINERS    CPU-UTILS
  0                 0               -                   18.40%
  0                 0               -                   18.40%
  0                 0               -                   18.40%
  ...
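If you are curious how metrics like these can be collected, here is a hedged sketch that counts running trainer pods per job using the official Kubernetes Python client; it is not the actual collector.py, and the paddle-job label key is an assumption.
  from collections import Counter
  from kubernetes import client, config

  config.load_kube_config()            # reads ~/.kube/config, like kubectl does
  v1 = client.CoreV1Api()

  running = Counter()
  for pod in v1.list_namespaced_pod("paddlecloud").items:
      job = (pod.metadata.labels or {}).get("paddle-job")  # label key assumed
      if job and pod.status.phase == "Running":
          running[job] += 1

  # Print something like "example:10 | example1:8".
  print(" | ".join("%s:%d" % (job, n) for job, n in sorted(running.items())))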
- Deploy EDL Training Jobs
  > kubectl create -f example/examplejob.yaml
- Deploy Multiple Training Jobs and Check the Monitor Logs
  You can edit the YAML file and change the name field to submit multiple training jobs. For example, I submitted three jobs named example, example1, and example2; the monitor logs are as follows:
  SUBMITTED-JOBS    PENDING-JOBS    RUNNING-TRAINERS                    CPU-UTILS
  0                 0               -                                   18.40%
  0                 0               -                                   18.40%
  1                 1               example:0                           23.40%
  1                 0               example:10                          54.40%
  1                 0               example:10                          54.40%
  2                 0               example:10|example1:5               80.40%
  2                 0               example:10|example1:8               86.40%
  2                 0               example:10|example1:8               86.40%
  2                 0               example:10|example1:8               86.40%
  2                 0               example:10|example1:8               86.40%
  3                 1               example2:0|example:10|example1:8    86.40%
  3                 1               example2:0|example:10|example1:8    86.40%
  3                 1               example2:0|example:5|example1:4     68.40%
  3                 1               example2:0|example:3|example1:4     68.40%
  3                 0               example2:4|example:3|example1:4     88.40%
  3                 0               example2:4|example:3|example1:4     88.40%
- At the beginning, there is no training job in the cluster except some Kubernetes system components, so the CPU utilization is 18.40%.
- After submitting the training job example, the CPU utilization rose to 54.40%; because max-instances in the YAML file is 10, the number of running trainers is 10.
- After submitting the training job example1, the CPU utilization rose to 86.40%.
- When we submitted the training job example2, there were no more resources for it, so the EDL auto-scaler scaled down the other jobs' trainer processes; eventually the running trainers of example dropped to 3, example1 dropped to 4, and there were no pending jobs.
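To make the scale-down behavior above concrete, here is a simplified sketch of the rebalancing idea; it is my own illustration of the concept, not the EDL auto-scaler's actual algorithm. When the cluster is full and a job is pending, shrink the largest running jobs toward their minimum instance counts until enough CPU is free.
  def rebalance(jobs, free_cpu, needed_cpu, cpu_per_trainer):
      """Scale running jobs down toward their minimums until needed_cpu is free.

      jobs: {name: {"running": int, "min": int}}; returns {name: trainers_to_remove}.
      """
      to_remove = {}
      while free_cpu < needed_cpu:
          # Jobs that can still give up a trainer without going below their minimum.
          candidates = [j for j, s in jobs.items()
                        if s["running"] - to_remove.get(j, 0) > s["min"]]
          if not candidates:
              break  # nothing left to shrink; the pending job keeps waiting
          # Shrink the job that currently has the most trainers.
          victim = max(candidates,
                       key=lambda j: jobs[j]["running"] - to_remove.get(j, 0))
          to_remove[victim] = to_remove.get(victim, 0) + 1
          free_cpu += cpu_per_trainer
      return to_remove

  # Example: free 4 CPUs for the pending job example2 by shrinking the other jobs.
  print(rebalance({"example": {"running": 10, "min": 3},
                   "example1": {"running": 8, "min": 4}},
                  free_cpu=0, needed_cpu=4, cpu_per_trainer=1))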