Add distill readme and adjust the documents. (#96)
gongweibao authored May 20, 2020
1 parent 8bed522 commit c47e164
Showing 6 changed files with 59 additions and 24 deletions.
45 changes: 21 additions & 24 deletions README.md
@@ -2,7 +2,14 @@

<img src="https://github.com/elasticdeeplearning/artwork/blob/master/horizontal/color/edl-horizontal-color.png" width="500" style="display:inline;vertical-align:middle;padding:2%">

EDL is an Elastic Deep Learning framework designed to help deep learning cloud service providers build cluster cloud services using deep learning frameworks such as PaddlePaddle and TensorFlow. EDL includes a Kubernetes controller and a PaddlePaddle auto-scaler, which adjusts the number of processes of a distributed job to match the idle hardware resources in the cluster, as well as a new fault-tolerant architecture.
EDL is an Elastic Deep Learning framework designed to help deep learning cloud service providers build cluster cloud services using the deep learning framework PaddlePaddle.

EDL includes two parts:

1. A Kubernetes controller for the elastic scheduling of distributed
deep learning jobs, together with tools for manual adjustment.

1. Making PaddlePaddle a fault-tolerant deep learning framework with a usable API for job management.

EDL is an incubation-stage project of the [LF AI Foundation](https://lfai.foundation).

@@ -18,33 +25,23 @@ For more about the project EDL, please refer to this [invited blog
post](https://kubernetes.io/blog/2017/12/paddle-paddle-fluid-elastic-learning/)
on the Kubernetes official blog.

EDL includes two parts:

1. a Kubernetes controller for the elastic scheduling of distributed
deep learning jobs, and

1. making PaddlePaddle a fault-tolerant deep learning framework.
This directory contains the Kubernetes controller. For more
information about fault-tolerance, please refer to the
[design](./doc/fault_tolerance.md).

We deployed EDL on a real Kubernetes cluster, dlnel.com, open to
graduate students of Tsinghua University. The performance test report
of EDL on this cluster is available
[here](https://github.com/PaddlePaddle/cloud/blob/develop/doc/edl/experiment/README.md).

## Tutorials

- [Usage](./doc/usage.md)
- [How to Build EDL Component](./doc/build.md)
- [Run CTR Training and Deployment on Baidu Cloud](./example/ctr/deploy_ctr_on_baidu_cloud_cn.rst)
- [Run EDL distill training demo on Kubernetes or a single node](./example/distill/README.md)
- [Run Elastic Deep Learning Demo on a single node](./example/collective/README.md)

## Design Docs
- Collective communication pattern
- [Fault-Tolerant Training in PaddlePaddle](./doc/fault_tolerance.md).
- [Elastic Deep Learning Design Doc: compute engine](./doc/edl_collective_design_doc.md).
- [Elastic Deep Learning Design Doc: Scheduler](./doc/edl_design_doc.md).
- [Run Elastic Deep Learning Demo on a single node](./doc/collective_demo.md).
- A scheduler on Kubernetes:
- [Scheduler](./doc/edl_design_doc.md)
- EDL framework on PaddlePaddle:
- [Fault-Tolerant Training in PaddlePaddle](./doc/fault_tolerance.md)
- [EDL framework](./doc/edl_collective_design_doc.md)
- [EDL Distillation](./doc/edl_distill_design_doc.md)

## Experiments

- [Auto-scaling Experiment](https://github.com/PaddlePaddle/cloud/blob/develop/doc/edl/experiment/README.md)
- [Distillation training on ResNet50](./doc/experiment/distill_resnet50.md)

## FAQ

24 changes: 24 additions & 0 deletions doc/edl_distill_design_doc.md
@@ -0,0 +1,24 @@
# Introduction
Distilling the Knowledge in a Neural Network[<sup>1</sup>](#r_1) is a training technique that transfers knowledge from cumbersome models (teachers) to a small model (student) that is more suitable for deployment (a minimal sketch of the distillation loss is given at the end of this introduction).

EDL Distillation is a large-scale, general-purpose solution for knowledge distillation.

- Decouple the teacher and student models.
  - They can run on the same node or on different nodes and transfer knowledge over the network, even across heterogeneous machines.
    Take distillation of ResNet50 as an example: the teachers (e.g. ResNet101) can be deployed on P4 GPU cards since they only run the forward pass, while the student can be deployed on V100 GPU cards since it needs more GPU memory.

- It's flexible and efficient.
  - Teachers and students can be adjusted elastically during training according to resource utilization.
- Easier to use and deploy.
  - Only a few lines of code need to change.
  - End-to-end usage: we release a Kubernetes deployment solution for you.
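
The sketch below is a minimal, framework-agnostic illustration of the distillation loss from [1] that a student can be trained with: a weighted sum of a soft-target loss against the teacher's temperature-softened outputs and an ordinary cross-entropy loss against the ground-truth labels. It is not the EDL API; the function names and hyperparameters (`temperature`, `alpha`) are assumptions chosen for the example.

```python
# Minimal sketch of the knowledge-distillation loss from [1].
# NOT the EDL API; names and hyperparameters are illustrative assumptions.
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, hard_labels,
                 temperature=4.0, alpha=0.9):
    """Weighted sum of the soft-target (teacher) loss and the hard-label loss."""
    # Soft targets: cross-entropy between the teacher's and the student's
    # temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    p = softmax(student_logits)
    n = student_logits.shape[0]
    hard_loss = -np.log(p[np.arange(n), hard_labels] + 1e-12).mean()

    # The soft term is rescaled by T^2 so its gradient magnitude stays
    # comparable to the hard term (as recommended in [1]).
    return alpha * (temperature ** 2) * soft_loss + (1.0 - alpha) * hard_loss

# Toy example: a batch of 2 samples with 3 classes.
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])
student_logits = np.array([[1.5, 0.2, -0.5], [0.0, 2.0, 0.4]])
labels = np.array([0, 1])
print(distill_loss(student_logits, teacher_logits, labels))
```

In EDL Distillation, the teacher side of this computation runs in separate processes, possibly on other machines, and only the teacher's outputs travel over the network to the student.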

# Design
## Architecture
## Student
## Teacher
## Reader
## Balancer

## Reference
1. <div id="r_1">[Distilling the Knowledge in a Neural Network](https://arxiv.org/pdf/1503.02531.pdf)</div>
2 changes: 2 additions & 0 deletions doc/experiment/distill_resnet50.md
@@ -0,0 +1,2 @@
# Distillation experiment on ResNet50
TBD
File renamed without changes.
File renamed without changes.
12 changes: 12 additions & 0 deletions example/distill/README.md
@@ -0,0 +1,12 @@
# Purpose
This article illustrates how to run the distillation demo on a Kubernetes cluster or a single machine.

## On Kubernetes

We have built the Docker images for you, so you can start a demo on Kubernetes immediately:

1. Get the YAML files from `edl/example/distill/k8s/`.
2. Use `kubectl` to create resources from them, for example `kubectl create -f student.yaml`.

## On a single node
TBD
