Run with single-machine-and-single-GPU.py

> CUDA_VISIBLE_DEVICES=0 python single-machine-and-single-GPU.py
Using cuda:0 device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
Epoch 1
-------------------------------
loss: 2.301383  [    0/60000]
loss: 2.295246  [ 6400/60000]
loss: 2.276515  [12800/60000]
loss: 2.269020  [19200/60000]
loss: 2.255433  [25600/60000]
loss: 2.228429  [32000/60000]
loss: 2.239701  [38400/60000]
loss: 2.209971  [44800/60000]
loss: 2.211788  [51200/60000]
loss: 2.186936  [57600/60000]
Test Error: 
 Accuracy: 40.1%, Avg loss: 2.176335 

Epoch 2
-------------------------------
loss: 2.190498  [    0/60000]
loss: 2.180624  [ 6400/60000]
loss: 2.132775  [12800/60000]
loss: 2.141592  [19200/60000]
loss: 2.098609  [25600/60000]
loss: 2.047074  [32000/60000]
loss: 2.078812  [38400/60000]
loss: 2.011228  [44800/60000]
loss: 2.022019  [51200/60000]
loss: 1.951299  [57600/60000]
Test Error: 
 Accuracy: 53.9%, Avg loss: 1.944823 

Epoch 3
-------------------------------
loss: 1.983202  [    0/60000]
loss: 1.948037  [ 6400/60000]
loss: 1.844122  [12800/60000]
loss: 1.870956  [19200/60000]
loss: 1.765571  [25600/60000]
loss: 1.721192  [32000/60000]
loss: 1.745432  [38400/60000]
loss: 1.654244  [44800/60000]
loss: 1.677890  [51200/60000]
loss: 1.565888  [57600/60000]
Test Error: 
 Accuracy: 59.3%, Avg loss: 1.578995 

Epoch 4
-------------------------------
loss: 1.649577  [    0/60000]
loss: 1.605777  [ 6400/60000]
loss: 1.460091  [12800/60000]
loss: 1.516236  [19200/60000]
loss: 1.398032  [25600/60000]
loss: 1.399217  [32000/60000]
loss: 1.410613  [38400/60000]
loss: 1.345814  [44800/60000]
loss: 1.375249  [51200/60000]
loss: 1.267737  [57600/60000]
Test Error: 
 Accuracy: 62.4%, Avg loss: 1.292355 

Epoch 5
-------------------------------
loss: 1.374084  [    0/60000]
loss: 1.351198  [ 6400/60000]
loss: 1.185489  [12800/60000]
loss: 1.276066  [19200/60000]
loss: 1.152795  [25600/60000]
loss: 1.188806  [32000/60000]
loss: 1.201429  [38400/60000]
loss: 1.153891  [44800/60000]
loss: 1.187127  [51200/60000]
loss: 1.096417  [57600/60000]
Test Error: 
 Accuracy: 64.1%, Avg loss: 1.116014 

Done!
Saved PyTorch Model State to model.pth

Run with single-machine-and-multi-GPU-DataParallel.py

> CUDA_VISIBLE_DEVICES=0,1,2,3 python single-machine-and-multi-GPU-DataParallel.py
n_gpu: 4
DataParallel(
  (module): NeuralNetwork(
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (linear_relu_stack): Sequential(
      (0): Linear(in_features=784, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=512, bias=True)
      (3): ReLU()
      (4): Linear(in_features=512, out_features=10, bias=True)
    )
  )
)
Epoch 1
-------------------------------
loss: 2.309276  [    0/60000]
loss: 2.290961  [ 6400/60000]
loss: 2.278524  [12800/60000]
loss: 2.272659  [19200/60000]
loss: 2.253739  [25600/60000]
loss: 2.233879  [32000/60000]
loss: 2.235425  [38400/60000]
loss: 2.205171  [44800/60000]
loss: 2.200310  [51200/60000]
loss: 2.169638  [57600/60000]
Test Error: 
 Accuracy: 47.2%, Avg loss: 2.165897 

Epoch 2
-------------------------------
loss: 2.173666  [    0/60000]
loss: 2.161774  [ 6400/60000]
loss: 2.110973  [12800/60000]
loss: 2.131320  [19200/60000]
loss: 2.078964  [25600/60000]
loss: 2.024526  [32000/60000]
loss: 2.052748  [38400/60000]
loss: 1.970439  [44800/60000]
loss: 1.975696  [51200/60000]
loss: 1.909384  [57600/60000]
Test Error: 
 Accuracy: 57.7%, Avg loss: 1.903836 

Epoch 3
-------------------------------
loss: 1.928953  [    0/60000]
loss: 1.899612  [ 6400/60000]
loss: 1.783553  [12800/60000]
loss: 1.838050  [19200/60000]
loss: 1.723950  [25600/60000]
loss: 1.664515  [32000/60000]
loss: 1.696275  [38400/60000]
loss: 1.578931  [44800/60000]
loss: 1.612579  [51200/60000]
loss: 1.513682  [57600/60000]
Test Error: 
 Accuracy: 61.9%, Avg loss: 1.525344 

Epoch 4
-------------------------------
loss: 1.584913  [    0/60000]
loss: 1.551337  [ 6400/60000]
loss: 1.395768  [12800/60000]
loss: 1.487152  [19200/60000]
loss: 1.364975  [25600/60000]
loss: 1.348510  [32000/60000]
loss: 1.371712  [38400/60000]
loss: 1.277878  [44800/60000]
loss: 1.325432  [51200/60000]
loss: 1.229618  [57600/60000]
Test Error: 
 Accuracy: 63.4%, Avg loss: 1.253269 

Epoch 5
-------------------------------
loss: 1.329453  [    0/60000]
loss: 1.310928  [ 6400/60000]
loss: 1.138112  [12800/60000]
loss: 1.259237  [19200/60000]
loss: 1.132849  [25600/60000]
loss: 1.149312  [32000/60000]
loss: 1.177834  [38400/60000]
loss: 1.097068  [44800/60000]
loss: 1.144713  [51200/60000]
loss: 1.068396  [57600/60000]
Test Error: 
 Accuracy: 65.0%, Avg loss: 1.087378 

Done!
Saved PyTorch Model State to model.pth

Run with single-machine-and-multi-GPU-DistributedDataParallel-launch.py

This run uses two machines, 192.168.1.105 (master) and 192.168.1.106. Each machine has 4 GPUs.

Note: To make NCCL print more detailed information, set the following environment variables before launching. The extra output helps track down bugs when writing DDP code.

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
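
The same variables can also be set from inside the training script, as long as this happens before the NCCL communicator is created (that is, before the process group is initialized). A minimal sketch, assuming the script is free to modify its own environment; the helper name `enable_nccl_debug` is hypothetical and not part of the scripts in this repository:

```python
import os


def enable_nccl_debug(subsys: str = "ALL") -> None:
    """Request verbose NCCL logs; call before the process group is initialized."""
    # setdefault keeps any value that was already exported in the shell.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("NCCL_DEBUG_SUBSYS", subsys)


enable_nccl_debug()
print(os.environ["NCCL_DEBUG"], os.environ["NCCL_DEBUG_SUBSYS"])
```

Using `setdefault` means a value exported on the command line still takes priority over the one in the script.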

Machine 0 (master, IP: 192.168.1.105):

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py 
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using device: cuda:3
local rank: 3, global rank: 3, world size: 8
Using device: cuda:2
Using device: cuda:1
local rank: 1, global rank: 1, world size: 8
local rank: 2, global rank: 2, world size: 8
Using device: cuda:0
local rank: 0, global rank: 0, world size: 8
tesla-105:1475:1475 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1475:1475 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1475:1475 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1
tesla-105:1477:1477 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1477:1477 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1477:1477 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO Using network Socket
tesla-105:1481:1481 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1481:1481 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1481:1481 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO Using network Socket
tesla-105:1480:1480 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-105:1480:1480 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1480:1480 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO Using network Socket
tesla-105:1481:2165 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1481:2165 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
tesla-105:1481:2165 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
tesla-105:1477:2150 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1477:2150 [1] NCCL INFO Trees [0] 2/4/-1->1->0|0->1->2/4/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
tesla-105:1477:2150 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-105:1475:2146 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
tesla-105:1475:2146 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
tesla-105:1480:2169 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->5|5->0->1/-1/-1
tesla-105:1475:2146 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [receive] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1481:2165 [3] NCCL INFO comm 0x7fd634001060 rank 3 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [send] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1477:2150 [1] NCCL INFO comm 0x7f30b4001060 rank 1 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE
tesla-105:1480:2169 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1480:2169 [2] NCCL INFO comm 0x7f37d4001060 rank 2 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1475:2146 [0] NCCL INFO comm 0x7f9e54001060 rank 0 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-105:1475:1475 [0] NCCL INFO Launch mode Parallel
DistributedDataParallel(
  (module): NeuralNetwork(
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (linear_relu_stack): Sequential(
      (0): Linear(in_features=784, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=512, bias=True)
      (3): ReLU()
      (4): Linear(in_features=512, out_features=10, bias=True)
    )
  )
)
Epoch 1
-------------------------------
loss: 2.294374  [    0/60000]
loss: 2.301075  [  800/60000]
loss: 2.315739  [ 1600/60000]
loss: 2.299692  [ 2400/60000]
loss: 2.258646  [ 3200/60000]
loss: 2.252302  [ 4000/60000]
loss: 2.218223  [ 4800/60000]
loss: 2.126724  [ 5600/60000]
loss: 2.174220  [ 6400/60000]
loss: 2.177455  [ 7200/60000]
Test Error: 
 Accuracy: 4.1%, Avg loss: 2.166388 

Epoch 2
-------------------------------
loss: 2.136480  [    0/60000]
loss: 2.127040  [  800/60000]
loss: 2.118551  [ 1600/60000]
loss: 2.051364  [ 2400/60000]
loss: 2.076279  [ 3200/60000]
loss: 2.002108  [ 4000/60000]
loss: 2.075573  [ 4800/60000]
loss: 1.959522  [ 5600/60000]
loss: 1.861534  [ 6400/60000]
loss: 1.872814  [ 7200/60000]
Test Error: 
 Accuracy: 7.2%, Avg loss: 1.908959 

Epoch 3
-------------------------------
loss: 2.081742  [    0/60000]
loss: 1.841850  [  800/60000]
loss: 1.939971  [ 1600/60000]
loss: 1.684577  [ 2400/60000]
loss: 1.648371  [ 3200/60000]
loss: 1.774270  [ 4000/60000]
loss: 1.552769  [ 4800/60000]
loss: 1.508346  [ 5600/60000]
loss: 1.516589  [ 6400/60000]
loss: 1.481997  [ 7200/60000]
Test Error: 
 Accuracy: 7.8%, Avg loss: 1.533547 

Epoch 4
-------------------------------
loss: 1.625404  [    0/60000]
loss: 1.543570  [  800/60000]
loss: 1.428792  [ 1600/60000]
loss: 1.446484  [ 2400/60000]
loss: 1.841029  [ 3200/60000]
loss: 1.320562  [ 4000/60000]
loss: 1.511142  [ 4800/60000]
loss: 1.444456  [ 5600/60000]
loss: 1.570060  [ 6400/60000]
loss: 1.482602  [ 7200/60000]
Test Error: 
 Accuracy: 8.0%, Avg loss: 1.256674 

Epoch 5
-------------------------------
loss: 1.064455  [    0/60000]
loss: 1.233810  [  800/60000]
loss: 1.168940  [ 1600/60000]
loss: 1.227281  [ 2400/60000]
loss: 1.437644  [ 3200/60000]
loss: 1.195065  [ 4000/60000]
loss: 1.305991  [ 4800/60000]
loss: 1.258441  [ 5600/60000]
loss: 0.970569  [ 6400/60000]
loss: 1.698888  [ 7200/60000]
Test Error: 
 Accuracy: 8.2%, Avg loss: 1.083617 

Done!
Saved PyTorch Model State to model.pth
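
The sample counter in this log stops at 7200, while the single-GPU run reaches 57600. That is consistent with each of the 8 processes iterating over its own 1/8 shard of the 60000 training samples with a per-process batch of 8. The script itself is not shown here, so the sketch below makes three assumptions: a global batch size of 64 divided by the world size, one log line every 100 batches, and a counter equal to the batch index times the per-process batch size. `logged_positions` is a hypothetical helper.

```python
import math


def logged_positions(dataset_size, global_batch, world_size, log_every=100):
    """Return the per-process sample counters printed during one epoch."""
    # Each process sees roughly dataset_size / world_size samples (its shard).
    per_rank_samples = math.ceil(dataset_size / world_size)
    # Assumption: the global batch is split evenly across processes.
    per_rank_batch = global_batch // world_size
    n_batches = math.ceil(per_rank_samples / per_rank_batch)
    return [b * per_rank_batch for b in range(0, n_batches, log_every)]


print(logged_positions(60000, 64, 1)[-1])  # single GPU
print(logged_positions(60000, 64, 8)[-1])  # 2 machines x 4 GPUs
```

Under these assumptions the last logged counter is 57600 for one process and 7200 for eight, which matches the logs above.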

Machine 1 (IP: 192.168.1.106):

> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using device: cuda:0
Using device: cuda:1

local rank: 1, global rank: 5, world size: 8
local rank: 0, global rank: 4, world size: 8
Using device: cuda:2
local rank: 2, global rank: 6, world size: 8
Using device: cuda:3
local rank: 3, global rank: 7, world size: 8
tesla-106:1942:1942 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1942:1942 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1942:1942 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO Using network Socket
tesla-106:1988:1988 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1988:1988 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1988:1988 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO Using network Socket
tesla-106:1943:1943 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1943:1943 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1943:1943 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO Using network Socket
tesla-106:1940:1940 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:1940:1940 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1940:1940 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO Using network Socket
tesla-106:1988:2787 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1988:2787 [3] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
tesla-106:1988:2787 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-106:1943:2821 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1943:2821 [2] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
tesla-106:1943:2821 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-106:1942:2786 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1942:2786 [1] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/0/-1->5->4|4->5->6/0/-1
tesla-106:1942:2786 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-106:1940:2831 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1940:2831 [0] NCCL INFO Trees [0] 5/-1/-1->4->1|1->4->5/-1/-1 [1] 5/-1/-1->4->-1|-1->4->5/-1/-1
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1988:2787 [3] NCCL INFO comm 0x7fbb14001060 rank 7 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [receive] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1943:2821 [2] NCCL INFO comm 0x7f6fec001060 rank 6 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1940:2831 [0] NCCL INFO comm 0x7f5550001060 rank 4 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1942:2786 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1942:2786 [1] NCCL INFO comm 0x7f75d4001060 rank 5 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE
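
The `local rank`, `global rank`, and `world size` values printed on both machines follow from the launch arguments. A small sketch of that arithmetic, with hypothetical function names chosen for illustration:

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of processes across all machines."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Rank of a process among all processes, given its machine and local index."""
    return node_rank * nproc_per_node + local_rank


# Machine 1 (--node_rank 1), local rank 1, 4 processes per node.
print(global_rank(node_rank=1, local_rank=1, nproc_per_node=4))
print(world_size(nnodes=2, nproc_per_node=4))
```

With `--nnodes 2 --nproc_per_node 4`, machine 0 holds global ranks 0 to 3 and machine 1 holds global ranks 4 to 7, as in the logs above.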

Run with single-machine-and-multi-GPU-DistributedDataParallel-mp.py

> CUDA_VISIBLE_DEVICES=0,1,2,3 python single-machine-and-multi-GPU-DistributedDataParallel-mp.py --nodes 1 --ngpus_per_node 4
Using device: cuda:3
local rank: 3, global rank: 3, world size: 4
Using device: cuda:1
Using device: cuda:2
local rank: 2, global rank: 2, world size: 4
Using device: cuda:0
local rank: 1, global rank: 1, world size: 4
local rank: 0, global rank: 0, world size: 4
tesla-106:13395:13395 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:13395:13395 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:13395:13395 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:13395:13395 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:13395:13395 [0] NCCL INFO Using network Socket
tesla-106:13398:13398 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:13397:13397 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:13397:13397 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:13397:13397 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:13397:13397 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:13398:13398 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:13397:13397 [2] NCCL INFO Using network Socket
tesla-106:13398:13398 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:13398:13398 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:13398:13398 [3] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1
tesla-106:13396:13396 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:13396:13396 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

tesla-106:13396:13396 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:13396:13396 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:13396:13396 [1] NCCL INFO Using network Socket
tesla-106:13395:14211 [0] NCCL INFO Channel 00/04 :    0   1   2   3
tesla-106:13395:14211 [0] NCCL INFO Channel 01/04 :    0   3   2   1
tesla-106:13395:14211 [0] NCCL INFO Channel 02/04 :    0   1   2   3
tesla-106:13395:14211 [0] NCCL INFO Channel 03/04 :    0   3   2   1
tesla-106:13395:14211 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
tesla-106:13395:14211 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1|-1->0->2/-1/-1 [1] 2/-1/-1->0->1|1->0->2/-1/-1 [2] 2/-1/-1->0->-1|-1->0->2/-1/-1 [3] 2/-1/-1->0->1|1->0->2/-1/-1
tesla-106:13395:14211 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-106:13398:14208 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
tesla-106:13398:14208 [3] NCCL INFO Trees [0] 1/-1/-1->3->2|2->3->1/-1/-1 [1] 1/-1/-1->3->-1|-1->3->1/-1/-1 [2] 1/-1/-1->3->2|2->3->1/-1/-1 [3] 1/-1/-1->3->-1|-1->3->1/-1/-1
tesla-106:13398:14208 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-106:13397:14207 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
tesla-106:13397:14207 [2] NCCL INFO Trees [0] 3/-1/-1->2->0|0->2->3/-1/-1 [1] -1/-1/-1->2->0|0->2->-1/-1/-1 [2] 3/-1/-1->2->0|0->2->3/-1/-1 [3] -1/-1/-1->2->0|0->2->-1/-1/-1
tesla-106:13397:14207 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-106:13396:14213 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
tesla-106:13398:14208 [3] NCCL INFO Channel 00 : 3[b1000] -> 0[18000] via direct shared memory
tesla-106:13397:14207 [2] NCCL INFO Channel 00 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-106:13396:14213 [1] NCCL INFO Trees [0] -1/-1/-1->1->3|3->1->-1/-1/-1 [1] 0/-1/-1->1->3|3->1->0/-1/-1 [2] -1/-1/-1->1->3|3->1->-1/-1/-1 [3] 0/-1/-1->1->3|3->1->0/-1/-1
tesla-106:13396:14213 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-106:13396:14213 [1] NCCL INFO Channel 00 : 1[1a000] -> 2[af000] via direct shared memory
tesla-106:13397:14207 [2] NCCL INFO Channel 00 : 2[af000] -> 0[18000] via direct shared memory
tesla-106:13395:14211 [0] NCCL INFO Channel 00 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-106:13396:14213 [1] NCCL INFO Channel 00 : 1[1a000] -> 3[b1000] via direct shared memory
tesla-106:13398:14208 [3] NCCL INFO Channel 00 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-106:13398:14208 [3] NCCL INFO Channel 00 : 3[b1000] -> 1[1a000] via direct shared memory
tesla-106:13398:14208 [3] NCCL INFO Channel 01 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-106:13396:14213 [1] NCCL INFO Channel 01 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-106:13395:14211 [0] NCCL INFO Channel 00 : 0[18000] -> 2[af000] via direct shared memory
tesla-106:13397:14207 [2] NCCL INFO Channel 01 : 2[af000] -> 1[1a000] via direct shared memory
tesla-106:13397:14207 [2] NCCL INFO Channel 01 : 2[af000] -> 0[18000] via direct shared memory
tesla-106:13395:14211 [0] NCCL INFO Channel 01 : 0[18000] -> 3[b1000] via direct shared memory
tesla-106:13396:14213 [1] NCCL INFO Channel 01 : 1[1a000] -> 3[b1000] via direct shared memory
tesla-106:13398:14208 [3] NCCL INFO Channel 01 : 3[b1000] -> 1[1a000] via direct shared memory
tesla-106:13395:14211 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-106:13395:14211 [0] NCCL INFO Channel 01 : 0[18000] -> 2[af000] via direct shared memory
tesla-106:13398:14208 [3] NCCL INFO Channel 02 : 3[b1000] -> 0[18000] via direct shared memory
tesla-106:13396:14213 [1] NCCL INFO Channel 02 : 1[1a000] -> 2[af000] via direct shared memory
tesla-106:13395:14211 [0] NCCL INFO Channel 02 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-106:13397:14207 [2] NCCL INFO Channel 02 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-106:13396:14213 [1] NCCL INFO Channel 02 : 1[1a000] -> 3[b1000] via direct shared memory
tesla-106:13397:14207 [2] NCCL INFO Channel 02 : 2[af000] -> 0[18000] via direct shared memory
tesla-106:13398:14208 [3] NCCL INFO Channel 02 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-106:13398:14208 [3] NCCL INFO Channel 02 : 3[b1000] -> 1[1a000] via direct shared memory
tesla-106:13395:14211 [0] NCCL INFO Channel 02 : 0[18000] -> 2[af000] via direct shared memory
tesla-106:13398:14208 [3] NCCL INFO Channel 03 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-106:13395:14211 [0] NCCL INFO Channel 03 : 0[18000] -> 3[b1000] via direct shared memory
tesla-106:13397:14207 [2] NCCL INFO Channel 03 : 2[af000] -> 1[1a000] via direct shared memory
tesla-106:13396:14213 [1] NCCL INFO Channel 03 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-106:13397:14207 [2] NCCL INFO Channel 03 : 2[af000] -> 0[18000] via direct shared memory
tesla-106:13395:14211 [0] NCCL INFO Channel 03 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-106:13395:14211 [0] NCCL INFO Channel 03 : 0[18000] -> 2[af000] via direct shared memory
tesla-106:13396:14213 [1] NCCL INFO Channel 03 : 1[1a000] -> 3[b1000] via direct shared memory
tesla-106:13398:14208 [3] NCCL INFO Channel 03 : 3[b1000] -> 1[1a000] via direct shared memory
tesla-106:13395:14211 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
tesla-106:13395:14211 [0] NCCL INFO comm 0x7fdc40001060 rank 0 nranks 4 cudaDev 0 busId 18000 - Init COMPLETE
tesla-106:13395:13395 [0] NCCL INFO Launch mode Parallel
DistributedDataParallel(
  (module): NeuralNetwork(
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (linear_relu_stack): Sequential(
      (0): Linear(in_features=784, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=512, bias=True)
      (3): ReLU()
      (4): Linear(in_features=512, out_features=10, bias=True)
    )
  )
)
Epoch 1
-------------------------------
tesla-106:13397:14207 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
tesla-106:13397:14207 [2] NCCL INFO comm 0x7f04c8001060 rank 2 nranks 4 cudaDev 2 busId af000 - Init COMPLETE
tesla-106:13396:14213 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
tesla-106:13398:14208 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
tesla-106:13398:14208 [3] NCCL INFO comm 0x7fd174001060 rank 3 nranks 4 cudaDev 3 busId b1000 - Init COMPLETE
tesla-106:13396:14213 [1] NCCL INFO comm 0x7f3330001060 rank 1 nranks 4 cudaDev 1 busId 1a000 - Init COMPLETE
loss: 2.301109  [    0/60000]
loss: 2.293916  [ 1600/60000]
loss: 2.302621  [ 3200/60000]
loss: 2.259469  [ 4800/60000]
loss: 2.252404  [ 6400/60000]
loss: 2.239343  [ 8000/60000]
loss: 2.185202  [ 9600/60000]
loss: 2.140033  [11200/60000]
loss: 2.158100  [12800/60000]
loss: 2.119617  [14400/60000]
Test Error: 
 Accuracy: 12.0%, Avg loss: 2.138575 

Epoch 2
-------------------------------
loss: 2.120426  [    0/60000]
loss: 2.120949  [ 1600/60000]
loss: 2.135741  [ 3200/60000]
loss: 2.021280  [ 4800/60000]
loss: 2.076435  [ 6400/60000]
loss: 1.989785  [ 8000/60000]
loss: 2.057000  [ 9600/60000]
loss: 1.840006  [11200/60000]
loss: 1.772224  [12800/60000]
loss: 1.738061  [14400/60000]
Test Error: 
 Accuracy: 13.5%, Avg loss: 1.840712 

Epoch 3
-------------------------------
loss: 1.917364  [    0/60000]
loss: 1.730065  [ 1600/60000]
loss: 1.906000  [ 3200/60000]
loss: 1.718702  [ 4800/60000]
loss: 1.486567  [ 6400/60000]
loss: 1.610462  [ 8000/60000]
loss: 1.431992  [ 9600/60000]
loss: 1.478280  [11200/60000]
loss: 1.497222  [12800/60000]
loss: 1.386750  [14400/60000]
Test Error: 
 Accuracy: 15.1%, Avg loss: 1.478247 

Epoch 4
-------------------------------
loss: 1.452221  [    0/60000]
loss: 1.571878  [ 1600/60000]
loss: 1.406897  [ 3200/60000]
loss: 1.460781  [ 4800/60000]
loss: 1.586754  [ 6400/60000]
loss: 1.300083  [ 8000/60000]
loss: 1.295014  [ 9600/60000]
loss: 1.321493  [11200/60000]
loss: 1.395649  [12800/60000]
loss: 1.349784  [14400/60000]
Test Error: 
 Accuracy: 15.9%, Avg loss: 1.227023 

Epoch 5
-------------------------------
loss: 1.091690  [    0/60000]
loss: 1.106918  [ 1600/60000]
loss: 1.163208  [ 3200/60000]
loss: 1.215325  [ 4800/60000]
loss: 1.357648  [ 6400/60000]
loss: 1.262445  [ 8000/60000]
loss: 1.171132  [ 9600/60000]
loss: 1.208320  [11200/60000]
loss: 0.778282  [12800/60000]
loss: 1.311920  [14400/60000]
Test Error: 
 Accuracy: 16.4%, Avg loss: 1.068742 

Done!
Saved PyTorch Model State to model.pth
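
Note on the reported accuracy: the DistributedDataParallel runs end at 8.2% (8 processes) and 16.4% (4 processes), while the average loss is close to the single-GPU run, which ends at 64.1%. Both values are roughly the single-GPU accuracy divided by the world size. One plausible explanation, which is an assumption because the test function is not shown here, is that each process counts correct predictions on its own shard of the test set but divides by the size of the full test set. The sketch below illustrates the arithmetic in plain Python; the function names and the test-set size of 10000 are hypothetical. In a real DDP script the per-process counts would typically be summed across processes, for example with `torch.distributed.all_reduce`, before dividing.

```python
def local_accuracy(correct_on_rank: int, full_test_size: int) -> float:
    """Accuracy as one process would report it without aggregation."""
    return correct_on_rank / full_test_size


def global_accuracy(correct_per_rank, samples_per_rank) -> float:
    """Accuracy after summing counts from all processes."""
    return sum(correct_per_rank) / sum(samples_per_rank)


# Hypothetical numbers: 10000 test samples split over 4 processes,
# each process classifying 1640 of its 2500 samples correctly.
correct = [1640, 1640, 1640, 1640]
samples = [2500, 2500, 2500, 2500]

print(f"{local_accuracy(correct[0], sum(samples)):.1%}")
print(f"{global_accuracy(correct, samples):.1%}")
```

If this explanation holds, the model quality in the DDP runs is comparable to the other runs, and only the reporting differs.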