I am having a problem running a GPU job on the MARCC cluster.
I am wondering if someone has tried this setup.
I am using ESPnet (https://github.com/espnet/espnet.git) to train a voxforge model on the GPU.
ESPnet uses Kaldi-style recipes, so I submit the job with slurm.pl.
The slurm.pl command for the GPU job is:
slurm.pl --mem 3G --gpu 1 --num_threads 6 --config conf/gpu.conf
And the conf/gpu.conf file is:
command sbatch --export=PATH,LD_LIBRARY_PATH,LIBRARY_PATH
option gpu=* -p gpu --gres=gpu:$0 --time 24:0:0
option mem=* --mem-per-cpu $0
option num_threads=* --cpus-per-task $0
option ntasks-per-node=*
option mail-type=end
option mail-user=nya...@jhu.edu
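If I understand slurm.pl and this config correctly, the options above should expand into an sbatch call roughly like the following (this is only my reading of the config, not the exact command, and the wrapper script name is just a placeholder for the script slurm.pl generates):

sbatch --export=PATH,LD_LIBRARY_PATH,LIBRARY_PATH \
       -p gpu --gres=gpu:1 --time 24:0:0 \
       --mem-per-cpu 3G --cpus-per-task 6 \
       ./queue_script.sh   # placeholder for the generated wrapper script

That seems consistent with the SLURM_* variables in the log below (SLURM_CPUS_PER_TASK=6, SLURM_MEM_PER_CPU=3072, one GPU allocated).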
When I ran the job I got the following output:
# Running on gpu026
# Started at Mon Aug 20 13:32:42 EDT 2018
# SLURMD_NODENAME=gpu026
# SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
# SLURM_CLUSTER_NAME=marcc
# SLURM_CPUS_ON_NODE=6
# SLURM_CPUS_PER_TASK=6
# SLURM_EXPORT_ENV=PATH,LD_LIBRARY_PATH,LIBRARY_PATH
# SLURM_GET_USER_ENV=1
# SLURM_GTIDS=0
# SLURM_JOBID=28679557
# SLURM_JOB_ACCOUNT=swatana4
# SLURM_JOB_CPUS_PER_NODE=6
# SLURM_JOB_GID=1370
# SLURM_JOB_GPUS=3
# SLURM_JOB_ID=28679557
# SLURM_JOB_NAME=train.sh
# SLURM_JOB_NODELIST=gpu026
# SLURM_JOB_NUM_NODES=1
# SLURM_JOB_PARTITION=gpuk80
# SLURM_JOB_QOS=normal
# SLURM_JOB_UID=3068
# SLURM_JOB_USER=nya...@jhu.edu
# SLURM_LOCALID=0
# SLURM_MEM_PER_CPU=3072
# SLURM_NNODES=1
# SLURM_NODEID=0
# SLURM_NODELIST=gpu026
# SLURM_NODE_ALIASES='(null)'
# SLURM_OPEN_MODE=a
# SLURM_PROCID=0
# SLURM_SUBMIT_DIR=/scratch/groups/swatana4/nelson/espnet/egs/voxforge/asr1
# SLURM_SUBMIT_HOST=login-node01
# SLURM_TASKS_PER_NODE=1
# SLURM_TASK_PID=46202
# SLURM_TOPOLOGY_ADDR=ibswitch-b7-01.ibsw71-L10.ibswA17.gpu026
# SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
# SLURM_WORKING_CLUSTER=marcc:mgmt1:6817:8192
...
  File "cupy/cuda/memory.pyx", line 468, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 972, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 993, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 768, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 836, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/scratch/groups/swatana4/nelson/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 233, in <module>
    main()
  File "/scratch/groups/swatana4/nelson/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 224, in main
    train(args)
  File "/scratch/groups/swatana4/nelson/espnet/src/asr/asr_chainer.py", line 485, in train
    trainer.run()
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/scratch/groups/swatana4/nelson/espnet/src/asr/asr_chainer.py", line 81, in update_core
    loss.backward()  # Backprop
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/variable.py", line 966, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/variable.py", line 1095, in _backward_main
    target_input_indexes, out_grad, in_grad)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/function_node.py", line 548, in backward_accumulate
    gxs = self.backward(target_input_indexes, grad_outputs)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/functions/connection/linear.py", line 83, in backward
    gW, = LinearGradWeight(W.dtype).apply((x, gy))
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/function_node.py", line 258, in apply
    outputs = self.forward(in_data)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/functions/connection/linear.py", line 162, in forward
    gW = gy.T.dot(x).astype(self._w_dtype, copy=False)
  File "cupy/core/core.pyx", line 1656, in cupy.core.core.ndarray.dot
  File "cupy/core/core.pyx", line 3476, in cupy.core.core.dot
  File "cupy/core/core.pyx", line 3814, in cupy.core.core.tensordot_core
  File "cupy/core/core.pyx", line 114, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 468, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 972, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 993, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 768, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 836, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 1638400 bytes (total 11783007232 bytes)
I am wondering if there is any additional flag that I should set. The training may need around 10 GB of GPU memory, so I would like to know whether there is an additional GPU memory flag I need to pass.
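For reference, the "total 11783007232 bytes" in the error already looks close to the full memory of one K80, so I am not sure an sbatch flag alone would help. To double-check what the job actually gets, I was thinking of adding something like this at the top of the generated job script, right before asr_train.py starts (just a debugging idea, not part of the recipe):

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv

That should show which GPU was assigned and whether anything else is already using its memory before training begins.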