I want to train a model on a server that has 3 GPUs.
I am following the WSJ 1i chain TDNN-F example.
I set GPU 2 to exclusive compute mode, set "--use-gpu=wait", and exported CUDA_VISIBLE_DEVICES=2,
but the training process fails. There is no ERROR line in the log files. I have attached the logs below.
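To be concrete, this is roughly how I set the GPU up before launching training (a sketch of the commands I mean; the exact sudo/driver invocation may differ on other systems, but the device index 2 matches the nvidia-smi output below):

$ sudo nvidia-smi -c EXCLUSIVE_PROCESS -i 2   # put only GPU 2 into exclusive-process compute mode
$ export CUDA_VISIBLE_DEVICES=2               # expose only GPU 2 to the training job

Note that with CUDA_VISIBLE_DEVICES=2 exported, GPU 2 is the only device visible to the training binaries and shows up as device 0 inside the process.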
$ nvidia-smi
Mon Feb 22 12:07:20 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 206... Off | 00000000:03:00.0 Off | N/A |
| 45% 75C P2 201W / 215W | 5748MiB / 7982MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 206... Off | 00000000:05:00.0 Off | N/A |
| 39% 69C P2 186W / 215W | 5878MiB / 7982MiB | 91% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 206... Off | 00000000:09:00.0 Off | N/A |
| 33% 56C P8 18W / 215W | 53MiB / 7975MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1468 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2221079 C python 5739MiB |
| 1 N/A N/A 1468 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2250548 C python 5869MiB |
| 2 N/A N/A 1468 G /usr/lib/xorg/Xorg 36MiB |
| 2 N/A N/A 1834 G /usr/bin/gnome-shell 15MiB |
+-----------------------------------------------------------------------------+
./local/chain/run_tdnn.sh --train_set train --test_sets " " --gmm tri3 --stage 16 --train_stage 0
2021-02-22 12:02:23,605 [steps/nnet3/chain/train.py:35 - <module> - INFO ] Starting chain model trainer (train.py)
steps/nnet3/chain/train.py --stage=0 --egs.cmd=
run.pl --max-jobs-run 10 --mem 16G --num-threads 20 --cmd=
run.pl --mem 10G --max-jobs-run 10 --feat.online-ivector-dir=exp/nnet3_online_cmn/ivectors_train_sp_hires --feat.cmvn-opts=--config=conf/online_cmvn.conf --chain.xent-regularize 0.1 --chain.leaky-hmm-coefficient=0.1 --chain.l2-regularize=0.0 --chain.apply-deriv-weights=false --chain.lm-opts=--num-extra-lm-states=2000 --trainer.dropout-schedule 0,0...@0.20,0...@0.50,0 --trainer.add-option=--optimization.memory-compression-level=2 --trainer.srand=0 --trainer.max-param-change=2.0 --trainer.num-epochs=10 --trainer.frames-per-iter=5000000 --trainer.optimization.num-jobs-initial=2 --trainer.optimization.num-jobs-final=8 --trainer.optimization.initial-effective-lrate=0.0005 --trainer.optimization.final-effective-lrate=0.00005 --trainer.num-chunk-per-minibatch=128,64 --trainer.optimization.momentum=0.0 --egs.chunk-width=140,100,160 --egs.chunk-left-context=0 --egs.chunk-right-context=0 --egs.dir= --egs.opts=--frames-overlap-per-eg 0 --online-cmvn true --cleanup.remove-egs=true --use-gpu=wait --reporting.email= --feat-dir=data/train_sp_hires --tree-dir=exp/chain_online_cmn/tree_a_sp --lat-dir=exp/chain_online_cmn/tri3_train_sp_lats --dir=exp/chain_online_cmn/tdnn1i_sp
['steps/nnet3/chain/train.py', '--stage=0', '--egs.cmd=
run.pl --max-jobs-run 10 --mem 16G --num-threads 20', '--cmd=
run.pl --mem 10G --max-jobs-run 10', '--feat.online-ivector-dir=exp/nnet3_online_cmn/ivectors_train_sp_hires', '--feat.cmvn-opts=--config=conf/online_cmvn.conf', '--chain.xent-regularize', '0.1', '--chain.leaky-hmm-coefficient=0.1', '--chain.l2-regularize=0.0', '--chain.apply-deriv-weights=false', '--chain.lm-opts=--num-extra-lm-states=2000', '--trainer.dropout-schedule', '0,0...@0.20,0...@0.50,0', '--trainer.add-option=--optimization.memory-compression-level=2', '--trainer.srand=0', '--trainer.max-param-change=2.0', '--trainer.num-epochs=10', '--trainer.frames-per-iter=5000000', '--trainer.optimization.num-jobs-initial=2', '--trainer.optimization.num-jobs-final=8', '--trainer.optimization.initial-effective-lrate=0.0005', '--trainer.optimization.final-effective-lrate=0.00005', '--trainer.num-chunk-per-minibatch=128,64', '--trainer.optimization.momentum=0.0', '--egs.chunk-width=140,100,160', '--egs.chunk-left-context=0', '--egs.chunk-right-context=0', '--egs.dir=', '--egs.opts=--frames-overlap-per-eg 0 --online-cmvn true', '--cleanup.remove-egs=true', '--use-gpu=wait', '--reporting.email=', '--feat-dir=data/train_sp_hires', '--tree-dir=exp/chain_online_cmn/tree_a_sp', '--lat-dir=exp/chain_online_cmn/tri3_train_sp_lats', '--dir=exp/chain_online_cmn/tdnn1i_sp']
2021-02-22 12:02:23,610 [steps/nnet3/chain/train.py:281 - train - INFO ] Arguments for the experiment
{'alignment_subsampling_factor': 3,
'apply_deriv_weights': False,
'backstitch_training_interval': 1,
'backstitch_training_scale': 0.0,
'chunk_left_context': 0,
'chunk_left_context_initial': -1,
'chunk_right_context': 0,
'chunk_right_context_final': -1,
'chunk_width': '140,100,160',
'cleanup': True,
'cmvn_opts': '--config=conf/online_cmvn.conf',
'combine_sum_to_one_penalty': 0.0,
'command': '
run.pl --mem 10G --max-jobs-run 10',
'compute_per_dim_accuracy': False,
'deriv_truncate_margin': None,
'dir': 'exp/chain_online_cmn/tdnn1i_sp',
'do_final_combination': True,
'dropout_schedule': '0,0...@0.20,0...@0.50,0',
'egs_command': '
run.pl --max-jobs-run 10 --mem 16G --num-threads 20',
'egs_dir': None,
'egs_nj': 0,
'egs_opts': '--frames-overlap-per-eg 0 --online-cmvn true',
'egs_stage': 0,
'email': None,
'exit_stage': None,
'feat_dir': 'data/train_sp_hires',
'final_effective_lrate': 5e-05,
'frame_subsampling_factor': 3,
'frames_per_iter': 5000000,
'initial_effective_lrate': 0.0005,
'input_model': None,
'l2_regularize': 0.0,
'lat_dir': 'exp/chain_online_cmn/tri3_train_sp_lats',
'leaky_hmm_coefficient': 0.1,
'left_deriv_truncate': None,
'left_tolerance': 5,
'lm_opts': '--num-extra-lm-states=2000',
'max_lda_jobs': 10,
'max_models_combine': 20,
'max_objective_evaluations': 30,
'max_param_change': 2.0,
'momentum': 0.0,
'num_chunk_per_minibatch': '128,64',
'num_epochs': 10.0,
'num_jobs_final': 8,
'num_jobs_initial': 2,
'num_jobs_step': 1,
'online_ivector_dir': 'exp/nnet3_online_cmn/ivectors_train_sp_hires',
'preserve_model_interval': 100,
'presoftmax_prior_scale_power': -0.25,
'proportional_shrink': 0.0,
'rand_prune': 4.0,
'remove_egs': True,
'reporting_interval': 0.1,
'right_tolerance': 5,
'samples_per_iter': 400000,
'shrink_saturation_threshold': 0.4,
'shrink_value': 1.0,
'shuffle_buffer_size': 5000,
'srand': 0,
'stage': 0,
'train_opts': ['--optimization.memory-compression-level=2'],
'tree_dir': 'exp/chain_online_cmn/tree_a_sp',
'use_gpu': 'wait',
'xent_regularize': 0.1}
2021-02-22 12:02:23,728 [steps/nnet3/chain/train.py:428 - train - INFO ] Copying the properties from exp/chain_online_cmn/tdnn1i_sp/egs to exp/chain_online_cmn/tdnn1i_sp
2021-02-22 12:02:23,729 [steps/nnet3/chain/train.py:484 - train - INFO ] Training will run for 10.0 epochs = 396 iterations
2021-02-22 12:02:23,729 [steps/nnet3/chain/train.py:523 - train - INFO ] Iter: 0/395 Jobs: 2 Epoch: 0.00/10.0 (0.0% complete) lr: 0.001000
bash: line 1: 2569944 Segmentation fault (core dumped) ( nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=0.0 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain_online_cmn/tdnn1i_sp/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.414213562373095 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --optimization.memory-compression-level=2 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.001 --scale=1.0 exp/chain_online_cmn/tdnn1i_sp/0.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain_online_cmn/tdnn1i_sp/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain_online_cmn/tdnn1i_sp/egs/cegs.1.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64,32 ark:- ark:- |" exp/chain_online_cmn/tdnn1i_sp/1.1.raw ) 2>> exp/chain_online_cmn/tdnn1i_sp/log/train.0.1.log >> exp/chain_online_cmn/tdnn1i_sp/log/train.0.1.log
run.pl: job failed, log is in exp/chain_online_cmn/tdnn1i_sp/log/train.0.1.log
2021-02-22 12:03:44,181 [steps/libs/common.py:207 - background_command_waiter - ERROR ] Command exited with status 1:
run.pl --mem 10G --max-jobs-run 10 --gpu 1 exp/chain_online_cmn/tdnn1i_sp/log/train.0.1.log nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=0.0 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain_online_cmn/tdnn1i_sp/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.414213562373095 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=0.5 --optimization.memory-compression-level=2 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.001 --scale=1.0 exp/chain_online_cmn/tdnn1i_sp/0.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain_online_cmn/tdnn1i_sp/den.fst "ark,bg:nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain_online_cmn/tdnn1i_sp/egs/cegs.1.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:- | nnet3-chain-merge-egs --minibatch-size=64,32 ark:- ark:- |" exp/chain_online_cmn/tdnn1i_sp/1.1.raw