Add fault tolerance document. (#65)
gongweibao authored Apr 7, 2020
1 parent b0e13c8 commit 0634c5d
Showing 3 changed files with 128 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -20,7 +20,7 @@ EDL includes two parts:
1. making PaddlePaddle a fault-tolerable deep learning framework.
This directory contains the Kubernetes controller. For more
information about fault-tolerance, please refer to the
-[design](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/v2/design/cluster_train).
+[design](./doc/fault_tolerance.md).

We deployed EDL on a real Kubernetes cluster, dlnel.com, opened for
graduate students of Tsinghua University. The performance test report
66 changes: 66 additions & 0 deletions doc/fault_tolerance.md
@@ -0,0 +1,66 @@
# Fault tolerance for sync training
## Design
During training, one or more trainers may crash. We use checkpoints to record the training state so that training can continue after a restart.

There are several design points to consider:

1. How does Paddle itself save a checkpoint?
Paddle implements `save_persistables` to save all persistable variables.

2. How do we save the user's Python frontend state?
For example, the current epoch number, the step number within an epoch, and the data slice and offset.

3. How to save checkpoints?
- Which trainer saves the checkpoint?
If there are multiple trainers, the trainer whose `rank` == 0 saves it.

- Where do we save the checkpoint?
It can be saved to the local file system, but eventually it should be saved to a file system visible to all trainers, such as a distributed file system like HDFS.

- How to guarantee the checkpoint's integrity and correctness?
Writing a file is a process, not an atomic action, but file-system operations such as `rm`, `rename`, and `mv` are atomic. We rely on this: a checkpoint is never modified after it is written, and every checkpoint is saved with an incremented version number. The interface first writes a temporary checkpoint file and then `rename`s it to a valid name once writing has finished (a sketch of this scheme follows this list).

- When is the checkpoint saved?
Currently the trainer saves a checkpoint every epoch, so it does not need to save the data offset, which keeps things simple. Of course, this method is not friendly when an epoch takes too long; we will implement a step-level (time-limited) checkpoint interface in the next version.
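
To make the write-then-rename scheme above concrete, here is a minimal sketch. It assumes a local file system and a `checkpoint.<version>` directory naming convention; the helper name and layout are illustrative assumptions, not the actual EDL implementation.

```python
import os
import shutil


def save_checkpoint_atomically(state_dir, checkpoint_root):
    """Hypothetical helper: publish `state_dir` as the next checkpoint version.

    Existing checkpoints are never modified; the data is first written to a
    temporary directory and then renamed, so readers only ever see complete
    checkpoints.
    """
    if not os.path.isdir(checkpoint_root):
        os.makedirs(checkpoint_root)

    # Determine the next version number from the checkpoints already present.
    versions = [
        int(name.split(".")[-1])
        for name in os.listdir(checkpoint_root)
        if name.startswith("checkpoint.") and name.split(".")[-1].isdigit()
    ]
    next_version = max(versions) + 1 if versions else 0

    tmp_path = os.path.join(checkpoint_root, ".tmp.checkpoint.{}".format(next_version))
    final_path = os.path.join(checkpoint_root, "checkpoint.{}".format(next_version))

    # Write to a temporary location first ...
    shutil.copytree(state_dir, tmp_path)
    # ... then atomically rename it into place once the copy has finished.
    os.rename(tmp_path, final_path)
    return final_path
```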

## Interface
There are two interfaces, `save_check_point` and `load_check_point`, to save and load a checkpoint.
Two of their arguments deserve attention:

1. `fs`:
An abstract interface to the file system; there are two implementations, the local file system and HDFS. You can implement the member functions of this class to use the checkpoint interface with your own storage (a hypothetical sketch follows this list).

2. `train_status`:
Currently it has only one member variable, `epoch_no`; more variables will be added after version 0.2.
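
For orientation, here is a minimal sketch of what a custom file-system backend might look like. The class name `LocalFS` and the method names below are illustrative assumptions; mirror the abstract methods of the actual `fs` class when writing your own implementation.

```python
import os
import shutil


class LocalFS(object):
    """Hypothetical sketch of a local file-system backend for checkpoints."""

    def is_exist(self, path):
        return os.path.exists(path)

    def mkdir(self, path):
        if not os.path.isdir(path):
            os.makedirs(path)

    def upload(self, local_path, remote_path):
        # For a local backend, "upload" is simply a copy.
        shutil.copytree(local_path, remote_path)

    def download(self, remote_path, local_path):
        shutil.copytree(remote_path, local_path)

    def delete(self, path):
        if os.path.isdir(path):
            shutil.rmtree(path)
        elif os.path.exists(path):
            os.remove(path)
```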

## Example
1. `save_check_point`:

```python
if trainer_id == 0:
    saved_status = TrainStatus(pass_id)
    if args.checkpoint:
        if not os.path.isdir(args.checkpoint):
            os.makedirs(args.checkpoint)
        print("save_check_point:{}".format(args.checkpoint))
        fleet.save_check_point(executor=exe, train_status=saved_status,
                               path=args.checkpoint, fs=fs)  # , main_program=fleet._origin_program)
```

2. `load_check_point`:

```python
if args.checkpoint is not None:
    tmp_s = fleet.load_check_point(exe, args.checkpoint, fs=fs, trainer_id=trainer_id)
    if tmp_s is not None:
        train_status = tmp_s

for pass_id in range(train_status.next(), params["num_epochs"]):
    train()
```
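
In the loop above, `train_status.next()` is used as the first epoch to run after resuming. A minimal sketch of the semantics the example appears to assume (the real `TrainStatus` class in fleet may differ):

```python
class TrainStatus(object):
    """Hypothetical sketch of the train-status semantics assumed above."""

    def __init__(self, epoch_no=-1):
        # Index of the last completed epoch; -1 means no epoch has finished yet.
        self.epoch_no = epoch_no

    def next(self):
        # The epoch to resume training from.
        return self.epoch_no + 1
```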

# Async training
TBD
61 changes: 61 additions & 0 deletions doc/fault_tolerance_cn.md
@@ -0,0 +1,61 @@
# Fault tolerance for sync training
## Design
During training, one or more trainers may crash for various reasons. We use checkpoints to record the current state, so that the training job can resume normally after a restart.
There are several points to consider:

1. Paddle's own checkpoint
Paddle itself provides `save_persistables` to save all persistable variables.

2. Checkpointing the user's Python-side logic
Mainly the current epoch number, the data slicing method and position, and so on.

3. Saving the checkpoint
- Who saves it
If there are multiple trainer nodes, the trainer with rank = 0 is usually chosen to save the checkpoint.

- Where to save it
It can be saved locally, but ultimately it must be saved to a file system that the restarted job can see, such as a distributed file system like HDFS.

- How to ensure the checkpoint's correctness
Saving a file is a continuous process rather than an atomic one and offers no transactional guarantee, but common file-system operations such as `mv`, `rename`, and `rm` are atomic.
We take advantage of this: checkpoints that have already been saved are left unchanged, the current checkpoint's version number is incremented, and the data is first written to a temporary file and then renamed to a checkpoint with a valid file name once finished.

- When to save
The approach we currently recommend is to save once per epoch. After an epoch completes, the data of two epochs can be considered independent, so we only need to record the current epoch number, without saving the current logical file slicing and position, which reduces complexity. Of course, this approach is unfriendly when a single epoch is too large; we plan to develop a step-level (time-based) checkpoint in a later version.

## Interface
Paddle provides `save_check_point` and `load_check_point` to save and load checkpoints.
Two of their arguments deserve attention:
1. `fs`
This is our abstraction of the file system. There are currently two implementations: local and remote HDFS. You can implement your own `FS` class to save and load checkpoints.

2. `train_status`
Currently the class has only the `epoch_no` member variable; versions after 0.2 will try to add more values, such as user-defined members.

## Example
1. Example of `save_check_point`:

```python
if trainer_id == 0:
    saved_status = TrainStatus(pass_id)
    if args.checkpoint:
        if not os.path.isdir(args.checkpoint):
            os.makedirs(args.checkpoint)
        print("save_check_point:{}".format(args.checkpoint))
        fleet.save_check_point(executor=exe, train_status=saved_status,
                               path=args.checkpoint, fs=fs)  # , main_program=fleet._origin_program)
```

2. Example of `load_check_point`:

```python
if args.checkpoint is not None:
    tmp_s = fleet.load_check_point(exe, args.checkpoint, fs=fs, trainer_id=trainer_id)
    if tmp_s is not None:
        train_status = tmp_s
```


# Fault tolerance for async training
TBD
