MARCC slurm.pl GPU


nyalta21

Aug 20, 2018, 3:16:28 PM
to kaldi-help
Hello there,

I am having a problem running a GPU job on the MARCC cluster.

I am wondering if anyone has tried this setup.

I am using ESPnet (https://github.com/espnet/espnet.git) to train a voxforge model on the GPU.

ESPnet uses Kaldi recipes, so I submit the job with slurm.pl.

The slurm.pl command for the GPU job is:


slurm.pl --mem 3G --gpu 1 --num_threads 6 --config conf/gpu.conf


And the conf/gpu.conf file is:


command sbatch --export=PATH,LD_LIBRARY_PATH,LIBRARY_PATH
option gpu=* -p gpu --gres=gpu:$0 --time 24:0:0
option mem=* --mem-per-cpu $0
option num_threads=* --cpus-per-task $0
option ntasks-per-node=*
option mail-type=end
option mail-user=nya...@jhu.edu
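
(For reference, slurm.pl matches each "option name=*" line against the options it is given and substitutes $0, so with --mem 3G --gpu 1 --num_threads 6 the generated submission should look roughly like the line below; this is an illustration of the expansion, not captured output:)

sbatch --export=PATH,LD_LIBRARY_PATH,LIBRARY_PATH --mem-per-cpu 3G --cpus-per-task 6 -p gpu --gres=gpu:1 --time 24:0:0 ...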


When I executed the program, I got the following output:


# Running on gpu026
# Started at Mon Aug 20 13:32:42 EDT 2018
# SLURMD_NODENAME=gpu026
# SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
# SLURM_CLUSTER_NAME=marcc
# SLURM_CPUS_ON_NODE=6
# SLURM_CPUS_PER_TASK=6
# SLURM_EXPORT_ENV=PATH,LD_LIBRARY_PATH,LIBRARY_PATH
# SLURM_GET_USER_ENV=1
# SLURM_GTIDS=0
# SLURM_JOBID=28679557
# SLURM_JOB_ACCOUNT=swatana4
# SLURM_JOB_CPUS_PER_NODE=6
# SLURM_JOB_GID=1370
# SLURM_JOB_GPUS=3
# SLURM_JOB_ID=28679557
# SLURM_JOB_NAME=train.sh
# SLURM_JOB_NODELIST=gpu026
# SLURM_JOB_NUM_NODES=1
# SLURM_JOB_PARTITION=gpuk80
# SLURM_JOB_QOS=normal
# SLURM_JOB_UID=3068
# SLURM_JOB_USER=nya...@jhu.edu
# SLURM_LOCALID=0
# SLURM_MEM_PER_CPU=3072
# SLURM_NNODES=1
# SLURM_NODEID=0
# SLURM_NODELIST=gpu026
# SLURM_NODE_ALIASES='(null)'
# SLURM_OPEN_MODE=a
# SLURM_PROCID=0
# SLURM_SUBMIT_DIR=/scratch/groups/swatana4/nelson/espnet/egs/voxforge/asr1
# SLURM_SUBMIT_HOST=login-node01
# SLURM_TASKS_PER_NODE=1
# SLURM_TASK_PID=46202
# SLURM_TOPOLOGY_ADDR=ibswitch-b7-01.ibsw71-L10.ibswA17.gpu026
# SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
# SLURM_WORKING_CLUSTER=marcc:mgmt1:6817:8192

...

  File "cupy/cuda/memory.pyx", line 468, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 972, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 993, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 768, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 836, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/scratch/groups/swatana4/nelson/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 233, in <module>
    main()
  File "/scratch/groups/swatana4/nelson/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 224, in main
    train(args)
  File "/scratch/groups/swatana4/nelson/espnet/src/asr/asr_chainer.py", line 485, in train
    trainer.run()
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/scratch/groups/swatana4/nelson/espnet/src/asr/asr_chainer.py", line 81, in update_core
    loss.backward()  # Backprop
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/variable.py", line 966, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/variable.py", line 1095, in _backward_main
    target_input_indexes, out_grad, in_grad)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/function_node.py", line 548, in backward_accumulate
    gxs = self.backward(target_input_indexes, grad_outputs)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/functions/connection/linear.py", line 83, in backward
    gW, = LinearGradWeight(W.dtype).apply((x, gy))
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/function_node.py", line 258, in apply
    outputs = self.forward(in_data)
  File "/scratch/groups/swatana4/nelson/espnet/tools/venv/lib/python2.7/site-packages/chainer/functions/connection/linear.py", line 162, in forward
    gW = gy.T.dot(x).astype(self._w_dtype, copy=False)
  File "cupy/core/core.pyx", line 1656, in cupy.core.core.ndarray.dot
  File "cupy/core/core.pyx", line 3476, in cupy.core.core.dot
  File "cupy/core/core.pyx", line 3814, in cupy.core.core.tensordot_core
  File "cupy/core/core.pyx", line 114, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 468, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 972, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 993, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 768, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 836, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 1638400 bytes (total 11783007232 bytes)

 

I am wondering if there is an additional flag that I should set. The training may need around 10 GB of GPU memory, so I would like to know whether there is a GPU-memory flag I should be setting.
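
(For what it's worth, the "total 11783007232 bytes" in the error is about 11.8 GB, which is close to the full capacity of a single K80 device; the job ran on the gpuk80 partition. So the process appears to have exhausted the whole card rather than being limited by a flag. A minimal sketch for checking this from inside the training environment, assuming CuPy is importable there; cupy.cuda.runtime.memGetInfo() and the memory-pool accessors exist in recent CuPy versions, but where you place this in asr_train.py is up to you:)

import cupy

# Free and total memory (in bytes) on the current CUDA device.
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
print('device memory: %.2f GB free / %.2f GB total'
      % (free_bytes / 1e9, total_bytes / 1e9))

# What CuPy's default memory pool is currently holding.
pool = cupy.get_default_memory_pool()
print('memory pool: %d bytes in use, %d bytes held'
      % (pool.used_bytes(), pool.total_bytes()))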

Daniel Povey

Aug 20, 2018, 3:18:12 PM
to kaldi-help
Looks to me like more of an ESPnet question than a Kaldi question.
The GPUs may not even have that much memory; check with nvidia-smi.
Any flags would be specific to the SLURM queue; Kaldi's wrappers have
no concept of requesting GPU memory.
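
(For example, this stock nvidia-smi query, run on the compute node, reports per-GPU capacity; the flags below are part of nvidia-smi's standard interface:)

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv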