This repository was archived by the owner on Aug 28, 2021. It is now read-only.

How much memory do I need to train #37

Open
HongweiQin opened this issue Feb 21, 2017 · 1 comment

@HongweiQin

Hi,

First of all, thanks for your nice work.

I was trying to run your Go engine on my server, which has about 120 GiB of memory.

Everything went fine until I tried to train with the provided dataset.

The output is as follows:

[root@localhost darkforestGo]# ./train.sh
{
  nstep = 3,
  optim = "supervised",
  loss = "policy",
  progress = false,
  nthread = 4,
  model_name = "model-12-parallel-384-n-output-bn",
  data_augmentation = true,
  actor = "policy",
  nGPU = 1,
  sampling = "replay",
  intermediate_step = 50,
  userank = true,
  alpha = 0.05,
  num_forward_models = 2048,
  batchsize = 256,
  epoch_size_test = 128000,
  feature_type = "extended",
  epoch_size = 128000,
  datasource = "kgs"
}	
fm_init: function: 0x4076e7c8	
fm_gen: function: 0x410f4a58	
fm_postprocess: nil	
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 9 module of nn.Sequential:
/root/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
	[C]: in function 'v'
	/root/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'BatchNormalization_updateOutput'
	/root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:124: in function </root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:113>
	[C]: in function 'xpcall'
	/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
	[C]: in function 'xpcall'
	/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./train/rl_framework/infra/bundle.lua:161: in function 'forward'
	./train/rl_framework/infra/agent.lua:46: in function 'optimize'
	./train/rl_framework/infra/engine.lua:114: in function 'train'
	./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
	train.lua:155: in main chunk
	[C]: in function 'dofile'
	/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x004064f0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	/root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./train/rl_framework/infra/bundle.lua:161: in function 'forward'
	./train/rl_framework/infra/agent.lua:46: in function 'optimize'
	./train/rl_framework/infra/engine.lua:114: in function 'train'
	./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
	train.lua:155: in main chunk
	[C]: in function 'dofile'
	/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x004064f0

I ran the "free" command before training. It turns out like this:

[root@localhost darkforestGo]# free
              total        used        free      shared  buff/cache   available
Mem:      115383448     1317128   112506528       10744     1559792   113786336
Swap:      67108860           0    67108860

It seems that I'm facing an "out of memory" issue.
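From the trace, the failure is a CUDA out-of-memory error (THCudaCheck FAIL ... error=2 in THCStorage.cu), so the memory that ran out appears to be GPU memory rather than the system RAM reported by free. A minimal sketch for checking GPU memory from the same Torch install (cutorch.getMemoryUsage returns free and total bytes for a device; the formatting below is only illustrative):

-- minimal sketch: report free/total memory on the current GPU
require 'cutorch'
local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
print(('GPU memory: %.2f GiB free of %.2f GiB total'):format(free / 2^30, total / 2^30))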

May I ask how much memory I need to train?

Or is there anything wrong elsewhere?

Thanks in advance

@HongweiQin
Author

I tried modifying train.sh by changing the nthread parameter from 4 to 1, but it didn't help.
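The trace fails inside BatchNormalization_updateOutput during the network's forward pass on the GPU, so nthread (which controls the CPU-side data-loading threads) may not affect the allocation that fails; batchsize looks like the more likely lever. A rough, back-of-the-envelope sketch of the per-layer activation footprint, assuming a 19x19 board and reading "12" layers and "384" feature maps from the printed model name model-12-parallel-384-n-output-bn (the exact layer shapes are an assumption):

-- rough estimate of per-layer activation memory at the settings printed above
local batchsize, filters, board = 256, 384, 19
local bytes_per_layer = batchsize * filters * board * board * 4  -- float32
print(('about %.0f MiB of activations per conv layer'):format(bytes_per_layer / 2^20))
-- roughly 135 MiB per layer; across ~12 layers, plus gradients and BatchNorm
-- buffers, this grows to several GiB of GPU memory, so lowering the batchsize
-- option shown in the printout (however train.sh sets it) seems worth trying.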
