PaddlePaddle Tutorial for BOSS Workshop 2018

PaddlePaddle (PArallel Distributed Deep LEarning) is an easy-to-use, efficient, flexible, and scalable deep learning platform, originally developed by Baidu scientists and engineers to apply deep learning to many products at Baidu.

Fluid is the latest version of PaddlePaddle; it describes a model for training or inference as a "Program".
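To give a concrete sense of what a "Program" is, here is a minimal sketch (assuming an early Fluid release where the paddle.fluid module and these layer functions are available); layer calls record operators into a default Program instead of executing immediately:

    import paddle.fluid as fluid

    # Each layer call appends operators and variables to the default
    # main Program; nothing is computed yet.
    img = fluid.layers.data(name='img', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    predict = fluid.layers.fc(input=img, size=10, act='softmax')
    loss = fluid.layers.mean(fluid.layers.cross_entropy(input=predict, label=label))

    # The Program is a plain, inspectable description of the model that
    # an Executor (or a distributed runtime) can execute later.
    print(fluid.default_main_program())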

PaddlePaddle Elastic Deep Learning (EDL) is a clustering project that makes PaddlePaddle training jobs scalable and fault-tolerant. EDL greatly improves parallel distributed training and makes good use of cluster computing power.

EDL builds on PaddlePaddle's fault-tolerance features: it uses a Kubernetes controller to manage the cluster training jobs and an auto-scaler to adjust each job's computing resources.
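The auto-scaling behavior can be pictured with a small toy simulation (illustration only: the rebalance function, the min/max fields, and the CPU numbers are hypothetical, loosely mirroring a job's min-instances/max-instances settings; the real auto-scaler lives in the EDL Kubernetes controller):

    # Toy model of EDL auto-scaling: give every job its minimum number of
    # trainers first, then grow jobs toward their maximum while free CPU
    # remains.
    def rebalance(jobs, cluster_cpu):
        for job in jobs:
            job['trainers'] = job['min']
        free = max(0, cluster_cpu - sum(job['trainers'] for job in jobs))
        for job in jobs:
            grow = min(job['max'] - job['trainers'], free)
            job['trainers'] += grow
            free -= grow
        return jobs

    # When a third job arrives, the earlier jobs give up trainers so that
    # nothing stays pending forever.
    print(rebalance([{'name': 'example', 'min': 3, 'max': 10},
                     {'name': 'example1', 'min': 4, 'max': 8},
                     {'name': 'example2', 'min': 4, 'max': 10}],
                    cluster_cpu=11))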

Tutorial Outline

  • Introduction

    In the introduction session, we will cover:

    • PaddlePaddle Fluid design overview.
    • Fluid Distributed Training.
    • Why we developed PaddlePaddle EDL and how we implemented it.
  • Hands-on Tutorial

    A hands-on tutorial follows each introduction session, so the audience can try PaddlePaddle and ask questions while using it:

    • Training models using PaddlePaddle Fluid in a Jupyter Notebook (PaddlePaddle Book).
    • Launching a distributed training job on your laptop.
    • Launching an EDL training job on a Kubernetes cluster.
  • Intended audience

    People who are interested in deep learning system architecture.

Prerequisites

Resources

Part-1: Training Models on Your Laptop Using PaddlePaddle

PaddlePaddle Book

Please check out the PaddlePaddle Book for steps to run the training process and example output.

Launch a Distributed Training Job on Your Laptop

  1. Launch the PaddlePaddle Production Docker Container:

    > git clone https://github.com/PaddlePaddle/edl.git
    > cd edl/example/fluid
    > docker run --name paddle -d -it -v $PWD:/work paddlepaddle/paddle /bin/bash
  2. Split training data into multiple parts:

    > docker exec -it paddle /bin/bash
    > cd /work
    > python recognize_digits.py prepare

    This splits the MNIST data into multiple parts as follows:

    ./dataset/mnist/
    ./dataset/mnist/mnist-train-00000.pickle
    ./dataset/mnist/mnist-train-00001.pickle
    ./dataset/mnist/mnist-train-00002.pickle
    ./dataset/mnist/mnist-train-00003.pickle
    ...
  3. Launch one PServer instance and two Trainer instances (a sketch of how a training script typically consumes the PADDLE_* environment variables below appears after this list):

    Start the PServer instance:

    > docker exec -it paddle /bin/bash
    > cd /work
    > PADDLE_PSERVER_EPS=127.0.0.1:6789 \
      PADDLE_TRAINERS=2 \
      PADDLE_TRAINING_ROLE=PSERVER \
      PADDLE_CURRENT_ENDPOINT=127.0.0.1:6789 \
      python recognize_digits.py train

    Start a Trainer instance with trainer_id=0:

    > docker exec -it paddle /bin/bash
    > cd /work
    > PADDLE_PSERVER_EPS=127.0.0.1:6789 \
      PADDLE_TRAINERS=2 \
      PADDLE_TRAINING_ROLE=TRAINER \
      PADDLE_TRAINER_ID=0 \
      python recognize_digits.py train

    Start a Trainer instance with trainer_id=1:

    > docker exec -it paddle /bin/bash
    > cd /work
    > PADDLE_PSERVER_EPS=127.0.0.1:6789 \
      PADDLE_TRAINERS=2 \
      PADDLE_TRAINING_ROLE=TRAINER \
      PADDLE_TRAINER_ID=1 \
      python recognize_digits.py train

    Each Trainer instance prints training logs like the following:

    append file for current trainer: dataset/mnist/mnist-train-00000.pickle
    append file for current trainer: dataset/mnist/mnist-train-00002.pickle
    append file for current trainer: dataset/mnist/mnist-train-00004.pickle
    append file for current trainer: dataset/mnist/mnist-train-00006.pickle
    append file for current trainer: dataset/mnist/mnist-train-00008.pickle
    append file for current trainer: dataset/mnist/mnist-train-00010.pickle
    append file for current trainer: dataset/mnist/mnist-train-00012.pickle
    append file for current trainer: dataset/mnist/mnist-train-00014.pickle
    ('processing file: ', 'dataset/mnist/mnist-train-00000.pickle')
    Epoch: 0, Batch: 10, Test Loss: 0.24518635296, Acc: 0.923899995804 
    
  4. Inference

    Execute the following command to load the model and run inference on the input image img/infer_3.png:

    > docker exec -it paddle /bin/bash -c "cd /work && python recognize_digits.py infer"

    The inference result is as follows:

    ('Inference result of img/infer_3.png is: ', 3)
    
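The PADDLE_* environment variables in step 3 are how recognize_digits.py decides which role each process plays. Below is a minimal sketch of how a Fluid training script of that era typically consumes them (assuming the old fluid.DistributeTranspiler API; the tiny network is illustrative, not the actual MNIST model in the example):

    import os
    import paddle.fluid as fluid

    # Topology and role, as set by the environment variables in step 3.
    pservers = os.getenv("PADDLE_PSERVER_EPS")            # e.g. "127.0.0.1:6789"
    trainers = int(os.getenv("PADDLE_TRAINERS", "1"))
    role = os.getenv("PADDLE_TRAINING_ROLE", "TRAINER")
    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    current_ep = os.getenv("PADDLE_CURRENT_ENDPOINT", "")

    # A stand-in network; the real script builds the MNIST model.
    img = fluid.layers.data(name='img', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    predict = fluid.layers.fc(input=img, size=10, act='softmax')
    loss = fluid.layers.mean(fluid.layers.cross_entropy(input=predict, label=label))
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

    # Rewrite the single-process Program into pserver/trainer Programs.
    t = fluid.DistributeTranspiler()
    t.transpile(trainer_id, pservers=pservers, trainers=trainers)

    exe = fluid.Executor(fluid.CPUPlace())
    if role == "PSERVER":
        pserver_prog = t.get_pserver_program(current_ep)
        exe.run(t.get_startup_program(current_ep, pserver_prog))
        exe.run(pserver_prog)     # blocks, serving parameter updates
    else:
        exe.run(fluid.default_startup_program())
        trainer_prog = t.get_trainer_program()
        # The training loop would run trainer_prog batch by batch. Each
        # trainer reads every PADDLE_TRAINERS-th data file, which is why
        # trainer 0 above picked mnist-train-00000/00002/00004/... .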

Part-2: Launch the PaddlePaddle EDL Training Jobs on a Kubernetes Cluster

Please note that EDL currently supports only the earlier PaddlePaddle version, so the fault-tolerant model is written with the PaddlePaddle v2 API.

Configure kubectl

If you start a Kubernetes cluster with minikube or kops, the kubectl configuration is ready as soon as the cluster is available; for other setups, contact your cluster administrator to fetch the configuration file.

Deploy EDL Components

NOTE: there is only one EDL controller per Kubernetes cluster, so if you're using a shared cluster that already has one, you can skip this step.

  1. Create a paddlecloud namespace to run EDL components

    > kubectl create namespace paddlecloud
  2. (Optional) Configure RBAC for the EDL controller so that it has cluster-admin permission.

    If you launched your Kubernetes cluster with kops on AWS, the default authorization mode is RBAC, so this step is necessary:

    kubectl create -f k8s/rbac_admin.yaml
  3. Create the ThirdPartyResource (TPR) "training-job"

    kubectl create -f k8s/thirdpartyresource.yaml

    To verify the creation of the resource, run the following command:

    kubectl describe ThirdPartyResource training-job
  4. Deploy the EDL controller

    kubectl create -f k8s/edl_controller.yaml

Launch the EDL Training Jobs

  1. Adapt the local training program to run in distributed mode

    It's easy to update your local training program to run in distributed mode:

    • Dataset

      Pre-process the dataset into RecordIO format. We have done this in the Docker image paddlepaddle/edl-example using the dataset.common.convert API as follows:

      dataset.common.convert('/data/recordio/imikolov/', dataset.imikolov.train(word_dict, 5), 5000, 'imikolov-train')

      This generates many RecordIO files in the /data/recordio/imikolov folder; these files are already prepared in the Docker image paddlepaddle/edl-example.

    • Pass the etcd_endpoint to the Trainer object so that the Trainer knows it's a fault-tolerant distributed training job.

      trainer = paddle.trainer.SGD(cost,
                                   parameters,
                                   adam_optimizer,
                                   is_local=False,
                                   pserver_spec=etcd_endpoint,
                                   use_etcd=True)
    • Use cloud_reader, a master_client-based reader that fetches training data from the task queue (a toy illustration of this pattern appears at the end of this tutorial).

      trainer.train(
          paddle.batch(cloud_reader([TRAIN_FILES_PATH], etcd_endpoint), 32),
          num_passes=30,
          event_handler=event_handler)
  2. Run the monitor program

    Please open a new tab in your terminal program and run the monitor Python script example/collector.py:

    > cd edl/example
    > docker run --rm -it -v $HOME/.kube/config:/root/.kube/config -v $PWD:/work paddlepaddle/edl-example python /work/collector.py

    You will see metrics like the following:

    SUBMITTED-JOBS    PENDING-JOBS    RUNNING-TRAINERS    CPU-UTILS
    0    0    -    18.40%
    0    0    -    18.40%
    0    0    -    18.40%
    ...
    
  3. Deploy EDL Training Jobs

    kubectl create -f example/examplejob.yaml
  4. Deploy Multiple Training Jobs and Check the Monitor Logs

    You can edit the YAML file and change the name field to submit multiple training jobs. For example, we submitted three jobs named example, example1, and example2; the monitor logs are as follows:

    SUBMITTED-JOBS    PENDING-JOBS    RUNNING-TRAINERS    CPU-UTILS
    0    0    -    18.40%
    0    0    -    18.40%
    1    1    example:0    23.40%
    1    0    example:10    54.40%
    1    0    example:10    54.40%
    2    0    example:10|example1:5    80.40%
    2    0    example:10|example1:8    86.40%
    2    0    example:10|example1:8    86.40%
    2    0    example:10|example1:8    86.40%
    2    0    example:10|example1:8    86.40%
    3    1    example2:0|example:10|example1:8    86.40%
    3    1    example2:0|example:10|example1:8    86.40%
    3    1    example2:0|example:5|example1:4    68.40%
    3    1    example2:0|example:3|example1:4    68.40%
    3    0    example2:4|example:3|example1:4    88.40%
    3    0    example2:4|example:3|example1:4    88.40%
    
  • At the beginning, there is no training job in the cluster except some Kubernetes system components, so the CPU utilization is 18.40%.
  • After submitting the training job example, the CPU utilization rises to 54.40%; because max-instances in the YAML file is 10, the number of running trainers is 10.
  • After submitting the training job example1, the CPU utilization rises to 86.40%.
  • When we submit the training job example2, there are no more resources for it, so the EDL auto-scaler scales down the other jobs' trainer processes; eventually the running trainers of example drop to 3, example1 drops to 4, and there are no pending jobs.
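To make the task-queue idea from step 1 concrete, here is a self-contained toy version of a cloud_reader-style reader. This is an illustration only: the real cloud_reader gets its file tasks from the EDL master service through etcd, while this toy uses an in-process queue and the pickled shards from Part-1.

    import glob
    import pickle
    try:
        import queue            # Python 3
    except ImportError:
        import Queue as queue   # Python 2, matching the v2 API era

    def make_task_queue(pattern):
        # In EDL the master service builds this queue and hands out tasks
        # through etcd; here it is a plain in-process queue.
        tasks = queue.Queue()
        for path in sorted(glob.glob(pattern)):
            tasks.put(path)
        return tasks

    def toy_cloud_reader(tasks):
        # Keep asking for the next file task and yield its records; when
        # the queue is drained, the pass is finished. In real EDL, a failed
        # trainer's unfinished task is simply re-queued by the master.
        while not tasks.empty():
            path = tasks.get()
            with open(path, 'rb') as f:
                for record in pickle.load(f):
                    yield record

    # Usage, e.g. with the shards produced in Part-1:
    # for record in toy_cloud_reader(make_task_queue('dataset/mnist/mnist-train-*.pickle')):
    #     ...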