queue.pl: Error submitting jobs to queue; training rm example


Ali

Aug 10, 2016, 6:23:21 PM
to kaldi-help
Hi,

I was trying to train the rm example from kaldi/egs/rm/s5/run.sh on a GPU. After nearly an hour of running, training stopped with an error. The last few lines are:


Using fMLLR transforms from exp/tri3b/decode
queue.pl: Error submitting jobs to queue (return status was 32512)
queue log file is exp/ubm4a/q/cluster.log, command was qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64* -o exp/ubm4a/q/cluster.log    /home/audio-dnn/Projects/Kaldi/kaldi/egs/rm/s5/exp/ubm4a/q/cluster.sh >>exp/ubm4a/q/cluster.log 2>&1
Output of qsub was: sh: qsub: command not found


Would you please help me understand why this is happening?

Thank you,
Ali

Daniel Povey

Aug 10, 2016, 6:27:04 PM
to kaldi-help
Please read http://kaldi-asr.org/doc/queue.html carefully.
Dan

Ali

Aug 10, 2016, 8:31:25 PM
to kaldi-help, dpo...@gmail.com
Hi Dan,

Thanks for the note. I read the page you suggested and I am a little confused. I am using a GPU to run the /egs/rm example and I don't use GridEngine. Should I create a /conf/queue.conf file that contains the following?

# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0          # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1  # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q




Also, where is queue.pl located? I assume that since I don't have GridEngine I don't have queue.pl; is that correct?

Also please let me know if it is necessary to install and use GridEngine in this case.

Thank you so much,
Ali

Daniel Povey

Aug 10, 2016, 9:07:06 PM
to Ali, kaldi-help
You don't need to use GridEngine, but make sure that the script doesn't
attempt to run more than one GPU job at a time (it depends on the script,
but maybe set num-jobs-initial and num-jobs-final to 1).

Also, you need to change the relevant '$cmd' string, e.g. cuda_cmd, to
run.pl. You obviously changed other $cmd strings to run.pl, or it
wouldn't have run to that point.

Dan
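
A minimal sketch of the cmd.sh change described above, for a single machine without GridEngine. cuda_cmd and decode_cmd appear in this thread; train_cmd is an assumed name, so check it against your own cmd.sh. run.pl runs each job on the local machine, so qsub is never called.

```shell
# cmd.sh sketch for a single machine without GridEngine.
# train_cmd is an assumed variable name; match it to your recipe's cmd.sh.
export train_cmd=run.pl
export decode_cmd=run.pl
export cuda_cmd=run.pl

# Optional sanity check: warn if any command still points at queue.pl
# while qsub is not installed on this machine.
for c in "$train_cmd" "$decode_cmd" "$cuda_cmd"; do
  case "$c" in
    queue.pl*)
      command -v qsub >/dev/null 2>&1 || echo "warning: $c needs qsub, which was not found"
      ;;
  esac
done
```

With every command set to run.pl the check prints nothing; it only matters if one of the variables is later switched back to queue.pl.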

Ali

Aug 11, 2016, 2:06:05 PM
to kaldi-help, bostonb...@gmail.com, dpo...@gmail.com
Thanks Dan. I edited cmd.sh and replaced all instances of queue.pl with run.pl. However, after running ./run.sh for a long time, I got the following errors (last few lines):

add-self-loops --self-loop-scale=0.1 --reorder=true exp/sgmm2_4a_ali/final.mdl
steps/make_denlats_sgmm2.sh: feature type is lda
steps/make_denlats_sgmm2.sh: using fMLLR transforms from exp/tri3b
run.pl: 20 / 20 failed, log is in exp/sgmm2_4a_denlats/log/1/decode_den.*.log
steps/make_denlats_sgmm2.sh: error generating denominator lattices
steps/train_mmi_sgmm2.sh --cmd run.pl --transform-dir exp/tri3b --boost 0.2 data/train data/lang exp/sgmm2_4a_ali exp/sgmm2_4a_denlats exp/sgmm2_4a_mmi_b0.2
steps/train_mmi_sgmm2.sh: no such file exp/sgmm2_4a_denlats/lat.1.gz
steps/train_mmi_sgmm2.sh --cmd run.pl --transform-dir exp/tri3b --boost 0.2 --drop-frames true data/train data/lang exp/sgmm2_4a_ali exp/sgmm2_4a_denlats exp/sgmm2_4a_mmi_b0.2_x
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 3 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2/decode_it3
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 4 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2/decode_it4
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 1 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2/decode_it1
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 2 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2/decode_it2
steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2/2.mdl
steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2/1.mdl
steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2/3.mdl
steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2/4.mdl
steps/train_mmi_sgmm2.sh: no such file exp/sgmm2_4a_denlats/lat.1.gz
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 4 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2_x/decode_it4
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 2 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2_x/decode_it2
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 3 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2_x/decode_it3
steps/decode_sgmm2_rescore.sh --cmd run.pl --iter 1 --transform-dir exp/tri3b/decode data/lang data/test exp/sgmm2_4a/decode exp/sgmm2_4a_mmi_b0.2_x/decode_it1
steps/decode_combine.sh data/test data/lang exp/tri1/decode exp/tri2a/decode exp/combine_1_2a/decode
steps/decode_combine.sh: no such file exp/tri2a/decode/lat.1.gz
audio-dnn@sjlab213:~/Projects/Kaldi/kaldi/egs/rm/s5$ steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2_x/3.mdl
steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2_x/1.mdl
steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2_x/4.mdl
steps/decode_sgmm2_rescore.sh: no such file exp/sgmm2_4a_mmi_b0.2_x/2.mdl
run.pl: 20 / 20 failed, log is in exp/sgmm2_4a_denlats/log/2/decode_den.*.log


Would you please take a look and let me know what the problem might be?

Also, is it normal for this example to take a long time on a GPU (several hours before the error)?


Thank you,
Ali

Daniel Povey

Aug 11, 2016, 2:19:28 PM
to Ali, kaldi-help
The first error was this:
20 / 20 failed, log is in exp/sgmm2_4a_denlats/log/1/decode_den.*.log
It was probably caused by running out of memory.
You are not supposed to just run the entire run.sh, you are supposed
to run stage by stage with only the parts that you need.
It's normal for things to take quite a long time. Try to search for
the answers online; many of these questions will have been asked
before.
Dan
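
To find the underlying cause, it can help to print the first error line from each failed job log. The sketch below is one way to do that. The helper name first_errors is hypothetical, and the patterns other than ERROR (bad_alloc, Killed) are guesses at how an out-of-memory failure might show up, not confirmed Kaldi output.

```shell
# Print the first error-like line of each job log passed as an argument.
# The search patterns are assumptions; adjust them to what your logs contain.
first_errors() {
  for log in "$@"; do
    # Skip arguments that are not files (e.g. a glob that matched nothing).
    [ -f "$log" ] || continue
    printf '%s: ' "$log"
    grep -m 1 -E 'ERROR|bad_alloc|Killed' "$log" || echo "(no error marker found)"
  done
}

# Log location taken from the run.pl message above.
first_errors exp/sgmm2_4a_denlats/log/1/decode_den.*.log
```

Running the failing step on its own, rather than the whole run.sh, keeps these logs small and makes the first failure easier to spot.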

Ali

Aug 13, 2016, 10:02:46 PM
to kaldi-help, bostonb...@gmail.com, dpo...@gmail.com
Hi Dan,

I was running run.sh line by line until I reached the following line in run_sgmm2.sh:

steps/make_denlats_sgmm2.sh --nj 8 --sub-split 20 --cmd "$decode_cmd" --transform-dir exp/tri3b \
   data/train data/lang exp/sgmm2_4a_ali exp/sgmm2_4a_denlats


This gave me the following error:


run.pl: 20 / 20 failed, log is in exp/sgmm2_4a_denlats/log/1/decode_den.*.log

In the log file decode_den.1.log, I have these error messages:

ERROR (apply-cmvn:SequentialTableReader():util/kaldi-table-inl.h:888) Error constructing TableReader: rspecifier is scp:data/train/split8/1/split20/1/feats.scp

ERROR (transform-feats:main():transform-feats.cc:69) Problem opening transforms with rspecifier "ark:exp/tri3b/trans.1" and utt2spk rspecifier "ark:data/train/split8/1/split20/1/utt2spk"

ERROR (sgmm2-latgen-faster:~SequentialTableReaderArchiveImpl():util/kaldi-table-inl.h:690) TableReader: error detected closing archive 'apply-cmvn --utt2spk=ark:data/train/split8/1/split20/1/utt2spk scp:data/train/split8/1/split20/1/cmvn.scp scp:data/train/split8/1/split20/1/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/sgmm2_4a_ali/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split8/1/split20/1/utt2spk ark:exp/tri3b/trans.1 ark:- ark:- |'



Would you please let me know what the problem is?

Thank you,
Ali

Daniel Povey

Aug 13, 2016, 10:09:53 PM
to Ali, kaldi-help
I suspect you are running an out-of-date copy of Kaldi; I think there
was a bug in that script at some point recently.
Do 'git pull' to update.
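
The log messages name specific input files, so a quick existence check may help narrow things down after updating. This is a minimal sketch that uses the paths from the log above; the helper name check_inputs is hypothetical.

```shell
# Report every listed file that is missing or empty.
# Returns 0 if all files are present and non-empty, 1 otherwise.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Paths copied from the decode_den.1.log errors quoted above.
check_inputs \
  exp/tri3b/trans.1 \
  exp/sgmm2_4a_ali/final.mat \
  data/train/split8/1/split20/1/feats.scp \
  data/train/split8/1/split20/1/cmvn.scp \
  data/train/split8/1/split20/1/utt2spk \
  || echo "some inputs are missing; see the list above"
```

If a file is still reported missing after the update, that suggests the step that should have produced it did not complete.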