
Commit

ADD:ConnectionException handle
scharoun committed Jun 10, 2017
1 parent 9f0a3be commit c402e12
Showing 22 changed files with 44 additions and 22 deletions.
3 changes: 2 additions & 1 deletion bin/local_optimizer.sh
@@ -8,7 +8,8 @@ echo "#########################################################################"
# thread number
thread_num=10

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/ffm/binary_classification/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/ffm/regression/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/fm/binary_classification/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/fm/regression/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbdt/binary_classification/local_optimizer.sh
@@ -6,7 +6,8 @@
# thread number
thread_num=4

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbdt/multiclass_classification/local_optimizer.sh
@@ -6,7 +6,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbdt/regression_l2/local_optimizer.sh
@@ -6,7 +6,8 @@
# thread number
thread_num=2

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbhmlr/binary_classification/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbhmlr/regression/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbhsdt/binary_classification/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbhsdt/regression/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbmlr/binary_classification/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbmlr/regression/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbsdt/binary_classification/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/gbsdt/regression/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/linear/binary_classification/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/linear/regression/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
3 changes: 2 additions & 1 deletion demo/multiclass_linear/local_optimizer.sh
@@ -3,7 +3,8 @@
# thread number
thread_num=1

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
2 changes: 1 addition & 1 deletion docs/running_guide.md
@@ -71,7 +71,7 @@ The [configurations](../config/model) for our models mainly consist of four part

Logs in ytk-learn are very useful. You can monitor task progress, see important information such as evaluation results, and find detailed error information when the program is not running as you expected.

After starting up training, you can use ```tail -f log/master.log``` to watch process, most errors and exceptions are printed in this log file. If training is blocked or nothing about error or exception can be found in master.log, you need to check slave.log or slave_error.log. In the spark/hadoop yarn, you can use ``` yarn logs -applicationId your_application_id``` command to get slave's logs.
After starting up training, you can use ```tail -f log/master.log``` to watch progress; most errors and exceptions are printed in this log file. If training is blocked, or no error or exception can be found in ```master.log```, you need to check ```slave.log``` or ```slave_error.log```; if there is a ```ConnectException```, try setting ```master_host``` to ```127.0.0.1```. On Spark/Hadoop YARN, you can use the ``` yarn logs -applicationId your_application_id``` command to get the slaves' logs.
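The fallback described above can also be scripted instead of edited by hand. A minimal sketch (not part of ytk-learn; it assumes the ```log/slave.log``` path used above, and the `pick_master_host` helper name is hypothetical):

```shell
# Sketch: choose master_host, falling back to loopback when the slave log
# shows a ConnectException (i.e. slaves could not reach the hostname-derived
# address). Not part of ytk-learn; pick_master_host is a hypothetical helper.
pick_master_host() {
    log_file="$1"
    # A ConnectException in the slave log means the slaves failed to connect
    # to the master at $(hostname); use the loopback address instead.
    if [ -f "$log_file" ] && grep -q "ConnectException" "$log_file"; then
        echo "127.0.0.1"
    else
        hostname
    fi
}

master_host=$(pick_master_host log/slave.log)
```

In a ```local_optimizer.sh```, this would take the place of the plain ```master_host=$(hostname)``` assignment.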

| log file | details | relevant scripts |
| :---------------------------- | :--------------------------------------- | :--------------------------------------- |
3 changes: 2 additions & 1 deletion experiment/higgs/local_optimizer.sh
@@ -6,7 +6,8 @@
# thread number
thread_num=16

# use current machine as master
# use current machine as master; if a ConnectException occurs in slave.log,
# try to set master_host=127.0.0.1
master_host=$(hostname)

# if you run more than one training tasks on the same host at the same time,
4 changes: 3 additions & 1 deletion src/main/java/com/fenbi/ytklearn/worker/TrainWorker.java
@@ -223,7 +223,9 @@ public void run() {
}
} finally {
try {
comm.close(errorCode);
if (comm != null) {
comm.close(errorCode);
}
} catch (Mp4jException e) {
errorCode = 1;
LOG.error("comm close exception!", e);
