Job training array quota with Slurm


Hugo Stephens

Sep 4, 2019, 4:44:53 AM9/4/19
to kaldi-help
Hi,

Running the LibriSpeech recipe, how can I reduce the number of jobs submitted to a Slurm cluster so that I stay within my QOS limit? I can't see how to reduce the job splitting or how to alter the --array argument.

steps/nnet3/chain/get_egs.sh: recombining and shuffling order of archives on disk
sh: -c: line 12: unexpected EOF while looking for matching `''
sh: -c: line 13: syntax error: unexpected end of file
/home/user/kaldi_rep/kaldi/egs/librispeech/s5/utils/slurm.pl: error submitting jobs to queue (return status was 256)
queue log file is exp/chain_cleaned/tdnn_1d_sp/egs/q/shuffle.log, command was sbatch --export=PATH  --ntasks-per-node=1 --ntasks=1 --gres=gpu  -p general --mem-per-cpu 8G  --open-mode=append -e exp/chain_cleaned/tdnn_1d_sp/egs/q/shuffle.log -o exp/chain_cleaned/tdnn_1d_sp/egs/q/shuffle.log --array 1-414%6 /home/user/kaldi_rep/kaldi/egs/librispeech/s5/exp/chain_cleaned/tdnn_1d_sp/egs/q/shuffle.sh >>exp/chain_cleaned/tdnn_1d_sp/egs/q/shuffle.log 2>&1
sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Traceback (most recent call last):
  File "steps/nnet3/chain/train.py", line 644, in main
    train(args, run_opts)
  File "steps/nnet3/chain/train.py", line 405, in train
    stage=args.egs_stage)
  File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 118, in generate_chain_egs
    egs_opts=egs_opts if egs_opts is not None else ''))
  File "steps/libs/common.py", line 158, in execute_command
    p.returncode, command))
Exception: Command exited with status 1: steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --constrained false                 --cmd "slurm.pl"                 --cmvn-opts "--norm-means=false --norm-vars=false"                 --online-ivector-dir "exp/nnet3_cleaned/ivectors_train_960_cleaned_sp_hires"                 --left-context 41                 --right-context 41                 --left-context-initial -1                 --right-context-final -1                 --left-tolerance '5'                 --right-tolerance '5'                 --frame-subsampling-factor 3                 --alignment-subsampling-factor 3                 --stage -10                 --frames-per-iter 2500000                 --frames-per-eg 150,110,100                 --srand 0                 data/train_960_cleaned_sp_hires exp/chain_cleaned/tdnn_1d_sp exp/chain_cleaned/tri6b_cleaned_train_960_cleaned_sp_lats exp/chain_cleaned/tdnn_1d_sp/egs
steps/nnet3/chain/train.py --stage -10 --cmd slurm.pl --feat.online-ivector-dir exp/nnet3_cleaned/ivectors_train_960_cleaned_sp_hires --feat.cmvn-opts --norm-means=false --norm-vars=false --chain.xent-regularize 0.1 --chain.leaky-hmm-coefficient 0.1 --chain.l2-regularize 0.0 --chain.apply-deriv-weights false --chain.lm-opts=--num-extra-lm-states=2000 --egs.dir  --egs.stage -10 --egs.stage -10 --egs.opts --frames-overlap-per-eg 0 --constrained false --egs.chunk-width 150,110,100 --trainer.dropout-schedule 0,0...@0.20,0...@0.50,0 --trainer.add-option=--optimization.memory-compression-level=2 --trainer.num-chunk-per-minibatch 64 --trainer.frames-per-iter 2500000 --trainer.num-epochs 4 --trainer.optimization.num-jobs-initial 3 --trainer.optimization.num-jobs-final 16 --trainer.optimization.initial-effective-lrate 0.00015 --trainer.optimization.final-effective-lrate 0.000015 --trainer.max-param-change 2.0 --cleanup.remove-egs true --feat-dir data/train_960_cleaned_sp_hires --tree-dir exp/chain_cleaned/tree_sp --lat-dir exp/chain_cleaned/tri6b_cleaned_train_960_cleaned_sp_lats --dir exp/chain_cleaned/tdnn_1d_sp
['steps/nnet3/chain/train.py', '--stage', '-10', '--cmd', 'slurm.pl', '--feat.online-ivector-dir', 'exp/nnet3_cleaned/ivectors_train_960_cleaned_sp_hires', '--feat.cmvn-opts', '--norm-means=false --norm-vars=false', '--chain.xent-regularize', '0.1', '--chain.leaky-hmm-coefficient', '0.1', '--chain.l2-regularize', '0.0', '--chain.apply-deriv-weights', 'false', '--chain.lm-opts=--num-extra-lm-states=2000', '--egs.dir', '', '--egs.stage', '-10', '--egs.stage', '-10', '--egs.opts', '--frames-overlap-per-eg 0 --constrained false', '--egs.chunk-width', '150,110,100', '--trainer.dropout-schedule', '0,0...@0.20,0...@0.50,0', '--trainer.add-option=--optimization.memory-compression-level=2', '--trainer.num-chunk-per-minibatch', '64', '--trainer.frames-per-iter', '2500000', '--trainer.num-epochs', '4', '--trainer.optimization.num-jobs-initial', '3', '--trainer.optimization.num-jobs-final', '16', '--trainer.optimization.initial-effective-lrate', '0.00015', '--trainer.optimization.final-effective-lrate', '0.000015', '--trainer.max-param-change', '2.0', '--cleanup.remove-egs', 'true', '--feat-dir', 'data/train_960_cleaned_sp_hires', '--tree-dir', 'exp/chain_cleaned/tree_sp', '--lat-dir', 'exp/chain_cleaned/tri6b_cleaned_train_960_cleaned_sp_lats', '--dir', 'exp/chain_cleaned/tdnn_1d_sp']
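For reference: the failing sbatch call above already caps concurrency with Slurm's array syntax (--array 1-414%6 runs at most 6 of the 414 tasks at a time), but QOSMaxSubmitJobPerUserLimit counts every submitted array task, not just the running ones, so the %6 cap does not help here; the array itself has to get smaller. A minimal sketch of how that array spec is built:

```shell
#!/bin/bash
# Sketch: build a Slurm --array spec with a concurrency cap.
# N tasks with at most M running at once. Note the %M suffix limits
# *running* tasks only -- all N tasks still count against per-user
# QOS submit limits such as MaxSubmitJobsPerUser.
array_spec() {
  local n_tasks=$1 max_run=$2
  echo "--array 1-${n_tasks}%${max_run}"
}

array_spec 414 6   # prints: --array 1-414%6
```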


Thank you for your help

Hugo

Jan Trmal

Sep 4, 2019, 4:52:20 AM9/4/19
to kaldi-help
There should be a parameter --egs.nj (added recently) to train.py -- run train.py without arguments to see the available options.
y.
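A minimal sketch of what that might look like (the --egs.nj flag is taken from the reply above; the value 6 and the elided options are placeholders -- verify the flag exists on your version with train.py --help):

```shell
# Hypothetical invocation: cap the number of parallel egs-generation
# jobs so the sbatch arrays that get_egs.sh submits stay within the
# per-user QOS submit limit. "6" is only an example value.
steps/nnet3/chain/train.py \
    --cmd slurm.pl \
    --egs.nj 6 \
    ...   # remaining options unchanged from the original command
```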
