Add distill readme and adjust the documents. (#96)
gongweibao authored May 20, 2020
1 parent 8bed522 commit c47e164
Showing 6 changed files with 59 additions and 24 deletions.
45 changes: 21 additions & 24 deletions README.md
@@ -2,7 +2,14 @@

<img src="https://github.com/elasticdeeplearning/artwork/blob/master/horizontal/color/edl-horizontal-color.png" width="500" style="display:inline;vertical-align:middle;padding:2%">

EDL is an Elastic Deep Learning framework designed to help deep learning cloud service providers build cluster cloud services using deep learning frameworks such as PaddlePaddle and TensorFlow. EDL includes a Kubernetes controller and a PaddlePaddle auto-scaler, which adjusts the number of processes of a distributed job to match the idle hardware resources in the cluster, as well as a new fault-tolerant architecture.
EDL is an Elastic Deep Learning framework designed to help deep learning cloud service providers build cluster cloud services using the deep learning framework PaddlePaddle.

EDL includes two parts:

1. A Kubernetes controller for the elastic scheduling of distributed
deep learning jobs, together with tools for manual adjustment.

1. Making PaddlePaddle a fault-tolerant deep learning framework with a usable API for job management.

EDL is an incubation-stage project of the [LF AI Foundation](https://lfai.foundation).

@@ -18,33 +25,23 @@ For more about the project EDL, please refer to this [invited blog
post](https://kubernetes.io/blog/2017/12/paddle-paddle-fluid-elastic-learning/)
on the Kubernetes official blog.

EDL includes two parts:

1. a Kubernetes controller for the elastic scheduling of distributed
deep learning jobs, and

1. making PaddlePaddle a fault-tolerant deep learning framework.
This directory contains the Kubernetes controller. For more
information about fault-tolerance, please refer to the
[design](./doc/fault_tolerance.md).

We deployed EDL on a real Kubernetes cluster, dlnel.com, open to
graduate students of Tsinghua University. The performance test report
of EDL on this cluster is available
[here](https://github.com/PaddlePaddle/cloud/blob/develop/doc/edl/experiment/README.md).

## Tutorials

- [Usage](./doc/usage.md)
- [How to Build EDL Component](./doc/build.md)
- [Run CTR Training and Deployment on Baidu Cloud](./example/ctr/deploy_ctr_on_baidu_cloud_cn.rst)
- [Run EDL distill training demo on Kubernetes or a single node](./example/distill/README.md)
- [Run Elastic Deep Learning Demo on a single node](./example/collective/README.md)

## Design Docs
- Collective communication pattern
- [Fault-Tolerant Training in PaddlePaddle](./doc/fault_tolerance.md).
- [Elastic Deep Learning Design Doc: compute engine](./doc/edl_collective_design_doc.md).
- [Elastic Deep Learning Design Doc: Scheduler](./doc/edl_design_doc.md).
- [Run Elastic Deep Learning Demo on a single node](./doc/collective_demo.md).
- A scheduler on Kubernetes:
- [Scheduler](./doc/edl_design_doc.md)
- EDL framework on PaddlePaddle:
- [Fault-Tolerant Training in PaddlePaddle](./doc/fault_tolerance.md)
- [EDL framework](./doc/edl_collective_design_doc.md)
- [EDL Distillation](./doc/edl_distill_design_doc.md)

## Experiments

- [Auto-scaling Experiment](https://github.com/PaddlePaddle/cloud/blob/develop/doc/edl/experiment/README.md)
- [Distillation training on ResNet50](./doc/experiment/distill_resnet50.md)

## FAQ

24 changes: 24 additions & 0 deletions doc/edl_distill_design_doc.md
@@ -0,0 +1,24 @@
# Introduction
Distilling the Knowledge in a Neural Network[<sup>1</sup>](#r_1) is a training technique that transfers knowledge from cumbersome models (teachers) to a small model (student) that is more suitable for deployment (a minimal sketch of the distillation loss is given at the end of this introduction).

EDL Distillation is a large-scale, general-purpose solution for knowledge distillation.

- Decouple the teacher and student models.
  - They can run on the same node or on different nodes and transfer knowledge over the network, even across heterogeneous machines.
    Take distillation of ResNet50 as an example: the teachers (e.g. ResNet101) can be deployed on P4 GPU cards since they only run the forward pass, while the student can be deployed on V100 GPU cards since it needs more GPU memory.

- It's flexible and efficient.
  - Teachers and students can be adjusted elastically during training according to resource utilization.
- Easier to use and deploy.
  - Only a few lines of code need to change.
  - End-to-end usage: we release a Kubernetes deployment solution for you.
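
The sketch below is a minimal, framework-agnostic illustration of the distillation loss from [1] that a student can be trained with: a weighted sum of a soft-target loss against the teacher's temperature-softened outputs and an ordinary cross-entropy loss against the ground-truth labels. It is not the EDL API; the function names and hyperparameters (`temperature`, `alpha`) are assumptions chosen for the example.

```python
# Minimal sketch of the knowledge-distillation loss from [1].
# NOT the EDL API; names and hyperparameters are illustrative assumptions.
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, hard_labels,
                 temperature=4.0, alpha=0.9):
    """Weighted sum of the soft-target (teacher) loss and the hard-label loss."""
    # Soft targets: cross-entropy between the teacher's and the student's
    # temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    p = softmax(student_logits)
    n = student_logits.shape[0]
    hard_loss = -np.log(p[np.arange(n), hard_labels] + 1e-12).mean()

    # The soft term is rescaled by T^2 so its gradient magnitude stays
    # comparable to the hard term (as recommended in [1]).
    return alpha * (temperature ** 2) * soft_loss + (1.0 - alpha) * hard_loss

# Toy example: a batch of 2 samples with 3 classes.
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])
student_logits = np.array([[1.5, 0.2, -0.5], [0.0, 2.0, 0.4]])
labels = np.array([0, 1])
print(distill_loss(student_logits, teacher_logits, labels))
```

In EDL Distillation, the teacher side of this computation runs in separate processes, possibly on other machines, and only the teacher's outputs travel over the network to the student.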

# Design
## Architecture
## Student
## Teacher
## Reader
## Balancer

## Reference
1. <div id="r_1">[Distilling the Knowledge in a Neural Network](https://arxiv.org/pdf/1503.02531.pdf)</div>
2 changes: 2 additions & 0 deletions doc/experiment/distill_resnet50.md
@@ -0,0 +1,2 @@
# Distillation experiment on ResNet50
TBD
File renamed without changes.
File renamed without changes.
12 changes: 12 additions & 0 deletions example/distill/README.md
@@ -0,0 +1,12 @@
# Purpose
This article illustrates how to run the distillation demo on a Kubernetes cluster or a single machine.

## On Kubernetes

We have built the Docker images for you, so you can start a demo on Kubernetes immediately:

1. Get the YAML files from `edl/example/distill/k8s/`.
2. Use `kubectl` to create resources from them, for example `kubectl create -f student.yaml`.

## On a single node
TBD
