This repository was archived by the owner on Aug 28, 2021. It is now read-only.
Hi,
First of all, thanks for your nice work.
I was trying to run your Go engine on my server, which has about 120 GiB of memory.
Everything went fine until I tried to train with the provided dataset.
The output is as follows:
[root@localhost darkforestGo]# ./train.sh
{
nstep = 3,
optim = "supervised",
loss = "policy",
progress = false,
nthread = 4,
model_name = "model-12-parallel-384-n-output-bn",
data_augmentation = true,
actor = "policy",
nGPU = 1,
sampling = "replay",
intermediate_step = 50,
userank = true,
alpha = 0.05,
num_forward_models = 2048,
batchsize = 256,
epoch_size_test = 128000,
feature_type = "extended",
epoch_size = 128000,
datasource = "kgs"
}
fm_init: function: 0x4076e7c8
fm_gen: function: 0x410f4a58
fm_postprocess: nil
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 9 module of nn.Sequential:
/root/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'v'
/root/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'BatchNormalization_updateOutput'
/root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:124: in function </root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:113>
[C]: in function 'xpcall'
/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0
I ran the "free" command before training, and it reported:
[root@localhost darkforestGo]# free
total used free shared buff/cache available
Mem: 115383448 1317128 112506528 10744 1559792 113786336
Swap: 67108860 0 67108860
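The failing check is in THCStorage.cu, i.e. the allocation happens on the GPU, so the host memory shown by "free" above is probably not the limiting factor. As a rough illustration only (it is not the actual allocator behaviour, and the layer shapes are assumptions based on the model name model-12-parallel-384-n-output-bn and a 19x19 board), a back-of-the-envelope sketch of forward-activation memory suggests why batchsize = 256 can exhaust a smaller GPU:

```python
# Back-of-the-envelope estimate of forward-activation memory for a
# 12-layer, 384-channel convnet on a 19x19 Go board (float32 only).
# Illustrative assumptions: every layer keeps a full 384 x 19 x 19
# activation map; gradients, parameters, and cuDNN workspaces are
# ignored, and all of those add considerably more in practice.
def activation_mib(batchsize, layers=12, channels=384, board=19):
    floats = batchsize * layers * channels * board * board
    return floats * 4 / 1024 ** 2  # 4 bytes per float32, reported in MiB

for bs in (256, 64):
    print(f"batchsize {bs}: ~{activation_mib(bs):.0f} MiB of activations")
```

Under these assumptions, activations alone at batchsize 256 already run to well over a gigabyte before gradients and workspaces are counted, so lowering batchsize (and possibly num_forward_models) is a reasonable first thing to try when "cuda runtime error (2) : out of memory" appears during training.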
It seems that I'm facing an "out of memory" issue.
May I ask how much memory I need in order to train?
Or is there something wrong elsewhere?
Thanks in advance.