Cholesky decomposition failed. Maybe matrix is not positive definite


Palash Jain

Mar 23, 2023, 6:07:14 AM
to kaldi-help
Hi,
When I run the chain model training, I get this error in the log file (found in exp/chain/tdnn1a_sp/log): "Cholesky decomposition failed. Maybe matrix is not positive definite". Execution stops as soon as the epochs start.
I am attaching the log file and an NVIDIA picture.
Please help me out.
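
For reference, here is a minimal sketch in plain Python of how a Cholesky factorization can end up reporting this kind of failure. It is not Kaldi's implementation: the function names `cholesky` and `try_cholesky` and the example matrices are made up for illustration. The routine takes the square root of each pivot, so it has to give up when a pivot is not a positive, finite number.

```python
import math

def cholesky(a):
    """Return lower-triangular l with l * l^T == a, or raise ValueError."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Pivot: diagonal entry minus the squares already accounted for.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if not math.isfinite(d) or d <= 0.0:
            # Message text copied from the log; the check itself is illustrative.
            raise ValueError("Cholesky decomposition failed. "
                             "Maybe matrix is not positive definite.")
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def try_cholesky(a):
    """Hypothetical helper: report whether the factorization succeeds."""
    try:
        cholesky(a)
        return "ok"
    except ValueError:
        return "failed"

print(try_cholesky([[4.0, 2.0], [2.0, 3.0]]))           # positive definite: ok
print(try_cholesky([[1.0, 2.0], [2.0, 1.0]]))           # indefinite: failed
print(try_cholesky([[float("nan"), 0.0], [0.0, 1.0]]))  # contains NaN: failed
```

The later assertion in the log mentions a matrix "that is too large or has NaNs", so the third case (non-finite values reaching the factorization) may be the closer analogy here, though that is only a guess from the messages.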

./runn.sh
local/chain/run_tdnn.sh --stage 0 --nj 48 --decode_nj 24
local/nnet3/run_ivector_common.sh: preparing directory for low-resolution speed-perturbed data (for alignment)
fix_data_dir.sh: kept all 26606 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
utils/data/perturb_data_dir_speed_3way.sh: making sure the utt2dur and the reco2dur files are present
... in data/train, because obtaining it after speed-perturbing
... would be very slow, and you might need them.
utils/data/get_utt2dur.sh: data/train/utt2dur already exists with the expected length.  We won't recompute it.
utils/data/get_reco2dur.sh: obtaining durations from recordings
utils/data/get_reco2dur.sh: could not get recording lengths from sphere-file headers, using wav-to-duration
utils/data/get_reco2dur.sh: computed data/train/reco2dur
utils/data/perturb_data_dir_speed.sh: generated speed-perturbed version of data in data/train, in data/train_sp_speed0.9
fix_data_dir.sh: kept all 26606 utterances.
fix_data_dir.sh: old files are kept in data/train_sp_speed0.9/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp_speed0.9
utils/data/perturb_data_dir_speed.sh: generated speed-perturbed version of data in data/train, in data/train_sp_speed1.1
fix_data_dir.sh: kept all 26606 utterances.
fix_data_dir.sh: old files are kept in data/train_sp_speed1.1/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp_speed1.1
utils/data/combine_data.sh data/train_sp data/train data/train_sp_speed0.9 data/train_sp_speed1.1
utils/data/combine_data.sh: combined utt2uniq
utils/data/combine_data.sh: combined segments
utils/data/combine_data.sh: combined utt2spk
utils/data/combine_data.sh [info]: not combining utt2lang as it does not exist
utils/data/combine_data.sh: combined utt2dur
utils/data/combine_data.sh [info]: **not combining utt2num_frames as it does not exist everywhere**
utils/data/combine_data.sh: combined reco2dur
utils/data/combine_data.sh [info]: **not combining feats.scp as it does not exist everywhere**
utils/data/combine_data.sh: combined text
utils/data/combine_data.sh [info]: **not combining cmvn.scp as it does not exist everywhere**
utils/data/combine_data.sh [info]: not combining vad.scp as it does not exist
utils/data/combine_data.sh [info]: not combining reco2file_and_channel as it does not exist
utils/data/combine_data.sh: combined wav.scp
utils/data/combine_data.sh [info]: not combining spk2gender as it does not exist
fix_data_dir.sh: kept all 79818 utterances.
fix_data_dir.sh: old files are kept in data/train_sp/.backup
utils/data/perturb_data_dir_speed_3way.sh: generated 3-way speed-perturbed version of data in data/train, in data/train_sp
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp
local/nnet3/run_ivector_common.sh: making MFCC features for low-resolution speed-perturbed data
steps/make_mfcc.sh --cmd run.pl --nj 48 data/train_sp
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp
steps/make_mfcc.sh [info]: segments file exists: using that.
steps/make_mfcc.sh: Succeeded creating MFCC features for train_sp
steps/compute_cmvn_stats.sh data/train_sp
Succeeded creating CMVN stats for train_sp
fix_data_dir.sh: kept all 79818 utterances.
fix_data_dir.sh: old files are kept in data/train_sp/.backup
local/nnet3/run_ivector_common.sh: aligning with the perturbed low-resolution data
steps/align_fmllr.sh --nj 48 --cmd run.pl data/train_sp data/lang exp/tri4b exp/tri4b_ali_train_sp
steps/align_fmllr.sh: feature type is lda
steps/align_fmllr.sh: compiling training graphs
steps/align_fmllr.sh: aligning data in data/train_sp using exp/tri4b/final.alimdl and speaker-independent features.
steps/align_fmllr.sh: computing fMLLR transforms
steps/align_fmllr.sh: doing final alignment.
steps/align_fmllr.sh: done aligning data.
steps/diagnostic/analyze_alignments.sh --cmd run.pl data/lang exp/tri4b_ali_train_sp
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri4b_ali_train_sp/log/analyze_alignments.log
27255 warnings in exp/tri4b_ali_train_sp/log/align_pass1.*.log
9784 warnings in exp/tri4b_ali_train_sp/log/fmllr.*.log
28602 warnings in exp/tri4b_ali_train_sp/log/align_pass2.*.log
local/nnet3/run_ivector_common.sh: creating high-resolution MFCC features
utils/copy_data_dir.sh: copied data from data/train_sp to data/train_sp_hires
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp_hires
utils/copy_data_dir.sh: copied data from data/test to data/test_hires
utils/validate_data_dir.sh: Successfully validated data-directory data/test_hires
utils/data/perturb_data_dir_volume.sh: data/train_sp_hires/feats.scp exists; moving it to data/train_sp_hires/.backup/ as it wouldn't be valid any more.
utils/data/perturb_data_dir_volume.sh: added volume perturbation to the data in data/train_sp_hires
steps/make_mfcc.sh --nj 48 --mfcc-config conf/mfcc_hires.conf --cmd run.pl data/train_sp_hires
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp_hires
steps/make_mfcc.sh [info]: segments file exists: using that.
steps/make_mfcc.sh: Succeeded creating MFCC features for train_sp_hires
steps/compute_cmvn_stats.sh data/train_sp_hires
Succeeded creating CMVN stats for train_sp_hires
fix_data_dir.sh: kept all 79818 utterances.
fix_data_dir.sh: old files are kept in data/train_sp_hires/.backup
steps/make_mfcc.sh --nj 48 --mfcc-config conf/mfcc_hires.conf --cmd run.pl data/test_hires
steps/make_mfcc.sh: moving data/test_hires/feats.scp to data/test_hires/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/test_hires
steps/make_mfcc.sh [info]: segments file exists: using that.
steps/make_mfcc.sh: Succeeded creating MFCC features for test_hires
steps/compute_cmvn_stats.sh data/test_hires
Succeeded creating CMVN stats for test_hires
fix_data_dir.sh: kept all 4275 utterances.
fix_data_dir.sh: old files are kept in data/test_hires/.backup
local/nnet3/run_ivector_common.sh: computing a subset of data to train the diagonal UBM.
utils/data/subset_data_dir.sh: reducing #utt from 79818 to 19954
local/nnet3/run_ivector_common.sh: computing a PCA transform from the hires data.
steps/online/nnet2/get_pca_transform.sh --cmd run.pl --splice-opts --left-context=3 --right-context=3 --max-utts 10000 --subsample 2 exp/nnet3/diag_ubm/train_sp_hires_subset exp/nnet3/pca_transform
Done estimating PCA transform in exp/nnet3/pca_transform
local/nnet3/run_ivector_common.sh: training the diagonal UBM.
steps/online/nnet2/train_diag_ubm.sh --cmd run.pl --nj 30 --num-frames 700000 --num-threads 8 exp/nnet3/diag_ubm/train_sp_hires_subset 512 exp/nnet3/pca_transform exp/nnet3/diag_ubm
steps/online/nnet2/train_diag_ubm.sh: Directory exp/nnet3/diag_ubm already exists. Backing up diagonal UBM in exp/nnet3/diag_ubm/backup.ru5
steps/online/nnet2/train_diag_ubm.sh: initializing model from E-M in memory,
steps/online/nnet2/train_diag_ubm.sh: starting from 256 Gaussians, reaching 512;
steps/online/nnet2/train_diag_ubm.sh: for 20 iterations, using at most 700000 frames of data
Getting Gaussian-selection info
steps/online/nnet2/train_diag_ubm.sh: will train for 4 iterations, in parallel over
steps/online/nnet2/train_diag_ubm.sh: 30 machines, parallelized with 'run.pl'
steps/online/nnet2/train_diag_ubm.sh: Training pass 0
steps/online/nnet2/train_diag_ubm.sh: Training pass 1
steps/online/nnet2/train_diag_ubm.sh: Training pass 2
steps/online/nnet2/train_diag_ubm.sh: Training pass 3
local/nnet3/run_ivector_common.sh: training the iVector extractor
steps/online/nnet2/train_ivector_extractor.sh --cmd run.pl --nj 48 data/train_sp_hires exp/nnet3/diag_ubm exp/nnet3/extractor
steps/online/nnet2/train_ivector_extractor.sh: doing Gaussian selection and posterior computation
Accumulating stats (pass 0)
Summing accs (pass 0)
Updating model (pass 0)
Accumulating stats (pass 1)
Summing accs (pass 1)
Updating model (pass 1)
Accumulating stats (pass 2)
Summing accs (pass 2)
Updating model (pass 2)
Accumulating stats (pass 3)
Summing accs (pass 3)
Updating model (pass 3)
Accumulating stats (pass 4)
Summing accs (pass 4)
Updating model (pass 4)
Accumulating stats (pass 5)
Summing accs (pass 5)
Updating model (pass 5)
Accumulating stats (pass 6)
Summing accs (pass 6)
Updating model (pass 6)
Accumulating stats (pass 7)
Summing accs (pass 7)
Updating model (pass 7)
Accumulating stats (pass 8)
Summing accs (pass 8)
Updating model (pass 8)
Accumulating stats (pass 9)
Summing accs (pass 9)
Updating model (pass 9)
utils/data/modify_speaker_info.sh: copied data from data/train_sp_hires to exp/nnet3/ivectors_train_sp_hires/train_sp_hires_max2, number of speakers changed from 801 to 40113
utils/validate_data_dir.sh: Successfully validated data-directory exp/nnet3/ivectors_train_sp_hires/train_sp_hires_max2
steps/online/nnet2/extract_ivectors_online.sh --cmd run.pl --nj 48 exp/nnet3/ivectors_train_sp_hires/train_sp_hires_max2 exp/nnet3/extractor exp/nnet3/ivectors_train_sp_hires
filter_scps.pl: warning: some input lines were output to multiple files [OK if splitting per utt]
steps/online/nnet2/extract_ivectors_online.sh: extracting iVectors
steps/online/nnet2/extract_ivectors_online.sh: combining iVectors across jobs
steps/online/nnet2/extract_ivectors_online.sh: done extracting (online) iVectors to exp/nnet3/ivectors_train_sp_hires using the extractor in exp/nnet3/extractor.
steps/online/nnet2/extract_ivectors_online.sh --cmd run.pl --nj 24 data/test_hires exp/nnet3/extractor exp/nnet3/ivectors_test_hires
steps/online/nnet2/extract_ivectors_online.sh: extracting iVectors
steps/online/nnet2/extract_ivectors_online.sh: combining iVectors across jobs
steps/online/nnet2/extract_ivectors_online.sh: done extracting (online) iVectors to exp/nnet3/ivectors_test_hires using the extractor in exp/nnet3/extractor.
local/chain/run_tdnn.sh: creating lang directory data/lang_chain with chain-type topology
steps/align_fmllr_lats.sh --nj 48 --cmd run.pl data/train_sp data/lang exp/tri4b exp/chain/tri4b_train_sp_lats
steps/align_fmllr_lats.sh: feature type is lda
steps/align_fmllr_lats.sh: compiling training graphs
steps/align_fmllr_lats.sh: aligning data in data/train_sp using exp/tri4b/final.alimdl and speaker-independent features.
steps/align_fmllr_lats.sh: computing fMLLR transforms
steps/align_fmllr_lats.sh: generating lattices containing alternate pronunciations.
steps/align_fmllr_lats.sh: done generating lattices from training transcripts.
4431 warnings in exp/chain/tri4b_train_sp_lats/log/generate_lattices.*.log
27363 warnings in exp/chain/tri4b_train_sp_lats/log/align_pass1.*.log
9810 warnings in exp/chain/tri4b_train_sp_lats/log/fmllr.*.log
steps/nnet3/chain/build_tree.sh --frame-subsampling-factor 3 --context-opts --context-width=2 --central-position=1 --cmd run.pl 4200 data/train_sp data/lang_chain exp/tri4b_ali_train_sp exp/chain/tree_sp
steps/nnet3/chain/build_tree.sh: feature type is lda
steps/nnet3/chain/build_tree.sh: Using transforms from exp/tri4b_ali_train_sp
steps/nnet3/chain/build_tree.sh: Initializing monophone model (for alignment conversion, in case topology changed)
steps/nnet3/chain/build_tree.sh: Accumulating tree stats
steps/nnet3/chain/build_tree.sh: Getting questions for tree clustering.
steps/nnet3/chain/build_tree.sh: Building the tree
steps/nnet3/chain/build_tree.sh: Initializing the model
WARNING (gmm-init-model[5.5.1068~1-59299]:InitAmGmm():gmm-init-model.cc:55) Tree has pdf-id 109 with no stats; corresponding phone list: 439 440 441 442
This is a bad warning.
steps/nnet3/chain/build_tree.sh: Converting alignments from exp/tri4b_ali_train_sp to use current tree
steps/nnet3/chain/build_tree.sh: Done building tree
local/chain/run_tdnn.sh: creating neural net configs using the xconfig parser
tree-info exp/chain/tree_sp/tree
steps/nnet3/xconfig_to_configs.py --xconfig-file exp/chain/tdnn1a_sp/configs/network.xconfig --config-dir exp/chain/tdnn1a_sp/configs/
nnet3-init exp/chain/tdnn1a_sp/configs//init.config exp/chain/tdnn1a_sp/configs//init.raw
LOG (nnet3-init[5.5.1068~1-59299]:main():nnet3-init.cc:80) Initialized raw neural net and wrote it to exp/chain/tdnn1a_sp/configs//init.raw
nnet3-info exp/chain/tdnn1a_sp/configs//init.raw
nnet3-init exp/chain/tdnn1a_sp/configs//ref.config exp/chain/tdnn1a_sp/configs//ref.raw
LOG (nnet3-init[5.5.1068~1-59299]:main():nnet3-init.cc:80) Initialized raw neural net and wrote it to exp/chain/tdnn1a_sp/configs//ref.raw
nnet3-info exp/chain/tdnn1a_sp/configs//ref.raw
nnet3-init exp/chain/tdnn1a_sp/configs//ref.config exp/chain/tdnn1a_sp/configs//ref.raw
LOG (nnet3-init[5.5.1068~1-59299]:main():nnet3-init.cc:80) Initialized raw neural net and wrote it to exp/chain/tdnn1a_sp/configs//ref.raw
nnet3-info exp/chain/tdnn1a_sp/configs//ref.raw
2023-03-23 14:21:56,946 [steps/nnet3/chain/train.py:35 - <module> - INFO ] Starting chain model trainer (train.py)
steps/nnet3/chain/train.py --stage=-10 --cmd=run.pl --feat.online-ivector-dir=exp/nnet3/ivectors_train_sp_hires --feat.cmvn-opts=--norm-means=false --norm-vars=false --chain.xent-regularize 0.1 --chain.leaky-hmm-coefficient=0.1 --chain.l2-regularize=0.00005 --chain.apply-deriv-weights=false --chain.lm-opts=--num-extra-lm-states=2000 --trainer.srand=123 --trainer.max-param-change=2.0 --trainer.num-epochs=4 --trainer.frames-per-iter=1500000 --trainer.optimization.num-jobs-initial=1 --trainer.optimization.num-jobs-final=1 --trainer.optimization.initial-effective-lrate=0.001 --trainer.optimization.final-effective-lrate=0.0001 --trainer.optimization.shrink-value=1.0 --trainer.num-chunk-per-minibatch=128,64 --trainer.optimization.momentum=0.0 --egs.chunk-width=140,100,160 --egs.chunk-left-context=0 --egs.chunk-right-context=0 --egs.chunk-left-context-initial=0 --egs.chunk-right-context-final=0 --egs.dir= --egs.opts=--frames-overlap-per-eg 0 --cleanup.remove-egs=false --use-gpu=true --reporting.email= --feat-dir=data/train_sp_hires --tree-dir=exp/chain/tree_sp --lat-dir=exp/chain/tri4b_train_sp_lats --dir=exp/chain/tdnn1a_sp
['steps/nnet3/chain/train.py', '--stage=-10', '--cmd=run.pl', '--feat.online-ivector-dir=exp/nnet3/ivectors_train_sp_hires', '--feat.cmvn-opts=--norm-means=false --norm-vars=false', '--chain.xent-regularize', '0.1', '--chain.leaky-hmm-coefficient=0.1', '--chain.l2-regularize=0.00005', '--chain.apply-deriv-weights=false', '--chain.lm-opts=--num-extra-lm-states=2000', '--trainer.srand=123', '--trainer.max-param-change=2.0', '--trainer.num-epochs=4', '--trainer.frames-per-iter=1500000', '--trainer.optimization.num-jobs-initial=1', '--trainer.optimization.num-jobs-final=1', '--trainer.optimization.initial-effective-lrate=0.001', '--trainer.optimization.final-effective-lrate=0.0001', '--trainer.optimization.shrink-value=1.0', '--trainer.num-chunk-per-minibatch=128,64', '--trainer.optimization.momentum=0.0', '--egs.chunk-width=140,100,160', '--egs.chunk-left-context=0', '--egs.chunk-right-context=0', '--egs.chunk-left-context-initial=0', '--egs.chunk-right-context-final=0', '--egs.dir=', '--egs.opts=--frames-overlap-per-eg 0', '--cleanup.remove-egs=false', '--use-gpu=true', '--reporting.email=', '--feat-dir=data/train_sp_hires', '--tree-dir=exp/chain/tree_sp', '--lat-dir=exp/chain/tri4b_train_sp_lats', '--dir=exp/chain/tdnn1a_sp']
2023-03-23 14:21:56,985 [steps/nnet3/chain/train.py:284 - train - INFO ] Arguments for the experiment
{'alignment_subsampling_factor': 3,
 'apply_deriv_weights': False,
 'backstitch_training_interval': 1,
 'backstitch_training_scale': 0.0,
 'chain_opts': '',
 'chunk_left_context': 0,
 'chunk_left_context_initial': 0,
 'chunk_right_context': 0,
 'chunk_right_context_final': 0,
 'chunk_width': '140,100,160',
 'cleanup': True,
 'cmvn_opts': '--norm-means=false --norm-vars=false',
 'combine_sum_to_one_penalty': 0.0,
 'command': 'run.pl',
 'compute_per_dim_accuracy': False,
 'deriv_truncate_margin': None,
 'dir': 'exp/chain/tdnn1a_sp',
 'do_final_combination': True,
 'dropout_schedule': None,
 'egs_command': None,
 'egs_dir': None,
 'egs_nj': 0,
 'egs_opts': '--frames-overlap-per-eg 0',
 'egs_stage': 0,
 'email': None,
 'exit_stage': None,
 'feat_dir': 'data/train_sp_hires',
 'final_effective_lrate': 0.0001,
 'frame_subsampling_factor': 3,
 'frames_per_iter': 1500000,
 'initial_effective_lrate': 0.001,
 'input_model': None,
 'l2_regularize': 5e-05,
 'lat_dir': 'exp/chain/tri4b_train_sp_lats',
 'leaky_hmm_coefficient': 0.1,
 'left_deriv_truncate': None,
 'left_tolerance': 5,
 'lm_opts': '--num-extra-lm-states=2000',
 'max_lda_jobs': 10,
 'max_models_combine': 20,
 'max_objective_evaluations': 30,
 'max_param_change': 2.0,
 'momentum': 0.0,
 'num_chunk_per_minibatch': '128,64',
 'num_epochs': 4.0,
 'num_jobs_final': 1,
 'num_jobs_initial': 1,
 'num_jobs_step': 1,
 'online_ivector_dir': 'exp/nnet3/ivectors_train_sp_hires',
 'preserve_model_interval': 100,
 'presoftmax_prior_scale_power': -0.25,
 'proportional_shrink': 0.0,
 'rand_prune': 4.0,
 'remove_egs': False,
 'reporting_interval': 0.1,
 'right_tolerance': 5,
 'samples_per_iter': 400000,
 'shrink_saturation_threshold': 0.4,
 'shrink_value': 1.0,
 'shuffle_buffer_size': 5000,
 'srand': 123,
 'stage': -10,
 'train_opts': [],
 'tree_dir': 'exp/chain/tree_sp',
 'use_gpu': 'yes',
 'xent_regularize': 0.1}
2023-03-23 14:21:58,703 [steps/nnet3/chain/train.py:341 - train - INFO ] Creating phone language-model
2023-03-23 14:22:00,389 [steps/nnet3/chain/train.py:346 - train - INFO ] Creating denominator FST
copy-transition-model exp/chain/tree_sp/final.mdl exp/chain/tdnn1a_sp/0.trans_mdl
LOG (copy-transition-model[5.5.1068~1-59299]:main():copy-transition-model.cc:62) Copied transition model.
2023-03-23 14:22:01,232 [steps/nnet3/chain/train.py:353 - train - INFO ] Initializing a basic network for estimating preconditioning matrix
2023-03-23 14:22:01,310 [steps/nnet3/chain/train.py:382 - train - INFO ] Generating egs
steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --cmd run.pl --cmvn-opts --norm-means=false --norm-vars=false --online-ivector-dir exp/nnet3/ivectors_train_sp_hires --left-context 17 --right-context 11 --left-context-initial 17 --right-context-final 11 --left-tolerance 5 --right-tolerance 5 --frame-subsampling-factor 3 --alignment-subsampling-factor 3 --stage 0 --frames-per-iter 1500000 --frames-per-eg 140,100,160 --srand 123 data/train_sp_hires exp/chain/tdnn1a_sp exp/chain/tri4b_train_sp_lats exp/chain/tdnn1a_sp/egs
steps/nnet3/chain/get_egs.sh: File data/train_sp_hires/utt2uniq exists, so ensuring the hold-out set includes all perturbed versions of the same source utterance.
steps/nnet3/chain/get_egs.sh: Holding out 51 utterances in validation set and 50 in training diagnostic set, out of total 79818.
steps/nnet3/chain/get_egs.sh: creating egs.  To ensure they are not deleted later you can do:  touch exp/chain/tdnn1a_sp/egs/.nodelete
steps/nnet3/chain/get_egs.sh: feature type is raw, with 'apply-cmvn'
tree-info exp/chain/tdnn1a_sp/tree
feat-to-dim scp:exp/nnet3/ivectors_train_sp_hires/ivector_online.scp -
steps/nnet3/chain/get_egs.sh: working out number of frames of training data
steps/nnet3/chain/get_egs.sh: working out feature dim
steps/nnet3/chain/get_egs.sh: creating 34 archives, each with 9182 egs, with
steps/nnet3/chain/get_egs.sh:   140,100,160 labels per example, and (left,right) context = (17,11)
steps/nnet3/chain/get_egs.sh:   ... and (left-context-initial,right-context-final) = (17,11)
steps/nnet3/chain/get_egs.sh: Getting validation and training subset examples in background.
steps/nnet3/chain/get_egs.sh: Generating training examples on disk
steps/nnet3/chain/get_egs.sh: Getting subsets of validation examples for diagnostics and combination.
steps/nnet3/chain/get_egs.sh: recombining and shuffling order of archives on disk
steps/nnet3/chain/get_egs.sh: Removing temporary archives, alignments and lattices
steps/nnet3/chain/get_egs.sh: Finished preparing training examples
2023-03-23 14:23:59,333 [steps/nnet3/chain/train.py:431 - train - INFO ] Copying the properties from exp/chain/tdnn1a_sp/egs to exp/chain/tdnn1a_sp
2023-03-23 14:23:59,334 [steps/nnet3/chain/train.py:445 - train - INFO ] Computing the preconditioning matrix for input features
2023-03-23 14:24:11,710 [steps/nnet3/chain/train.py:454 - train - INFO ] Preparing the initial acoustic model.
2023-03-23 14:24:14,359 [steps/nnet3/chain/train.py:488 - train - INFO ] Training will run for 4.0 epochs = 408 iterations
2023-03-23 14:24:14,359 [steps/nnet3/chain/train.py:535 - train - INFO ] Iter: 0/407   Jobs: 1   Epoch: 0.00/4.0 (0.0% complete)   lr: 0.001000  
bash: line 1: 2389547 Aborted                 (core dumped) ( nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain/tdnn1a_sp/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.414213562373095 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=123 "nnet3-am-copy --raw=true --learning-rate=0.001 --scale=1.0 exp/chain/tdnn1a_sp/0.mdl - |" exp/chain/tdnn1a_sp/den.fst "ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn1a_sp/egs/cegs.1.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=123 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64,32 ark:- ark:- |" exp/chain/tdnn1a_sp/1.1.raw ) 2>> exp/chain/tdnn1a_sp/log/train.0.1.log >> exp/chain/tdnn1a_sp/log/train.0.1.log
run.pl: job failed, log is in exp/chain/tdnn1a_sp/log/train.0.1.log
2023-03-23 14:24:50,312 [steps/libs/common.py:207 - background_command_waiter - ERROR ] Command exited with status 1: run.pl --gpu 1 exp/chain/tdnn1a_sp/log/train.0.1.log                     nnet3-chain-train --use-gpu=yes                      --apply-deriv-weights=False                     --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1                      --write-cache=exp/chain/tdnn1a_sp/cache.1  --xent-regularize=0.1                                          --print-interval=10 --momentum=0.0                     --max-param-change=1.414213562373095                     --backstitch-training-scale=0.0                     --backstitch-training-interval=1                     --l2-regularize-factor=1.0                       --srand=123                     "nnet3-am-copy --raw=true --learning-rate=0.001 --scale=1.0 exp/chain/tdnn1a_sp/0.mdl - |" exp/chain/tdnn1a_sp/den.fst                     "ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn1a_sp/egs/cegs.1.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=123 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64,32 ark:- ark:- |"                     exp/chain/tdnn1a_sp/1.1.raw
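
Since the console output only points at the log file, a small helper along these lines can summarize the message levels in a Kaldi-style log and pick out the first hard failure. This is a sketch under assumptions: the regular expression is inferred from the log lines quoted in this post, and `summarize_log` is a hypothetical helper, not part of Kaldi.

```python
import re
from collections import Counter

# Line shape inferred from this post, e.g.
# "WARNING (nnet3-chain-train[5.5.1068~1-59299]:Func():file.cc:241) message"
LINE_RE = re.compile(r"^(LOG|WARNING|ERROR|ASSERTION_FAILED) \(([^\[\s]+)\[")

def summarize_log(lines):
    """Count messages per level and return the first ERROR/ASSERTION_FAILED line."""
    counts = Counter()
    first_problem = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # stack-trace frames and command echoes are skipped
        level = m.group(1)
        counts[level] += 1
        if first_problem is None and level in ("ERROR", "ASSERTION_FAILED"):
            first_problem = line.rstrip("\n")
    return counts, first_problem

sample = [
    "WARNING (nnet3-chain-train[5.5.1068~1-59299]:ReorthogonalizeRt1():"
    "natural-gradient-online.cc:241) Cholesky out of expected range, "
    "reorthogonalizing with Gram-Schmidt",
    "ERROR (nnet3-chain-train[5.5.1068~1-59299]:Cholesky():tp-matrix.cc:110) "
    "Cholesky decomposition failed. Maybe matrix is not positive definite.",
    "nnet3-chain-train(main+0x791) [0x56073717549a]",
]
counts, first = summarize_log(sample)
print(dict(counts))
print(first)
```

To run it on the real file, pass an open file object instead of `sample`, for example the log named in the `run.pl` message above.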



Output in the log file (exp/chain/tdnn1a_sp/log/train.0.1.log):

# nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain/tdnn1a_sp/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.414213562373095 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=123 "nnet3-am-copy --raw=true --learning-rate=0.001 --scale=1.0 exp/chain/tdnn1a_sp/0.mdl - |" exp/chain/tdnn1a_sp/den.fst "ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn1a_sp/egs/cegs.1.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=123 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64,32 ark:- ark:- |" exp/chain/tdnn1a_sp/1.1.raw
# Started at Thu Mar 23 14:24:17 IST 2023
#
nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain/tdnn1a_sp/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.414213562373095 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=123 'nnet3-am-copy --raw=true --learning-rate=0.001 --scale=1.0 exp/chain/tdnn1a_sp/0.mdl - |' exp/chain/tdnn1a_sp/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn1a_sp/egs/cegs.1.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=123 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64,32 ark:- ark:- |' exp/chain/tdnn1a_sp/1.1.raw
WARNING (nnet3-chain-train[5.5.1068~1-59299]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5.1068~1-59299]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.5.1068~1-59299]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): Quadro P2200 free:4756M, used:288M, total:5045M, free/total:0.94289
LOG (nnet3-chain-train[5.5.1068~1-59299]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.94289
LOG (nnet3-chain-train[5.5.1068~1-59299]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG (nnet3-chain-train[5.5.1068~1-59299]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.94289
LOG (nnet3-chain-train[5.5.1068~1-59299]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: Quadro P2200 free:4572M, used:472M, total:5045M, free/total:0.906418 version 6.1
nnet3-am-copy --raw=true --learning-rate=0.001 --scale=1.0 exp/chain/tdnn1a_sp/0.mdl -
LOG (nnet3-am-copy[5.5.1068~1-59299]:main():nnet3-am-copy.cc:153) Copied neural net from exp/chain/tdnn1a_sp/0.mdl to raw format as -
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=123 ark:- ark:-
nnet3-chain-merge-egs --minibatch-size=64,32 ark:- ark:-
nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain/tdnn1a_sp/egs/cegs.1.ark ark:-
WARNING (nnet3-chain-train[5.5.1068~1-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~1-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~1-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
LOG (nnet3-chain-train[5.5.1068~1-59299]:UpdateNnetWithMaxChange():nnet-utils.cc:2205) Per-component max-change active on 8 / 12 Updatable Components. (Smallest factor=8.72335e-06 on tdnn2.affine with max-change=0.75). Global max-change factor was 0.54361 with max-change=1.41421.
LOG (nnet3-chain-train[5.5.1068~1-59299]:UpdateNnetWithMaxChange():nnet-utils.cc:2205) Per-component max-change active on 8 / 12 Updatable Components. (Smallest factor=1.48141e-06 on tdnn2.affine with max-change=0.75). Global max-change factor was 0.563213 with max-change=1.41421.
ERROR (nnet3-chain-train[5.5.1068~1-59299]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite.

[ Stack-Trace: ]
nnet3-chain-train(kaldi::MessageLogger::LogMessage() const+0x793) [0x56073751ca69]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x5607371766cb]
nnet3-chain-train(kaldi::TpMatrix<float>::Cholesky(kaldi::SpMatrix<float> const&)+0x1b1) [0x560737504275]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::ReorthogonalizeRt1(kaldi::VectorBase<float> const&, float, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x473) [0x5607372296c9]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0xf5a) [0x56073722ab0e]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1ef) [0x56073722b81b]
nnet3-chain-train(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x227) [0x5607371e540b]
nnet3-chain-train(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0xba) [0x5607371e2cdc]
nnet3-chain-train(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x8ae) [0x560737247778]
nnet3-chain-train(kaldi::nnet3::NnetComputer::Run()+0x14b) [0x560737248531]
nnet3-chain-train(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x7d) [0x5607371cab8f]
nnet3-chain-train(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0xe8) [0x5607371caf1e]
nnet3-chain-train(main+0x791) [0x56073717549a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f4e3666b083]
nnet3-chain-train(_start+0x2e) [0x560737174c4e]

WARNING (nnet3-chain-train[5.5.1068~1-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:248) Cholesky or Invert() failed while re-orthogonalizing R_t. Re-orthogonalizing on CPU.
WARNING (nnet3-chain-train[5.5.1068~1-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
ERROR (nnet3-chain-train[5.5.1068~1-59299]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite.

[ Stack-Trace: ]
nnet3-chain-train(kaldi::MessageLogger::LogMessage() const+0x793) [0x56073751ca69]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x5607371766cb]
nnet3-chain-train(kaldi::TpMatrix<float>::Cholesky(kaldi::SpMatrix<float> const&)+0x1b1) [0x560737504275]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::ReorthogonalizeRt1(kaldi::VectorBase<float> const&, float, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x473) [0x5607372296c9]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0xf5a) [0x56073722ab0e]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1ef) [0x56073722b81b]
nnet3-chain-train(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x227) [0x5607371e540b]
nnet3-chain-train(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0xba) [0x5607371e2cdc]
nnet3-chain-train(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x8ae) [0x560737247778]
nnet3-chain-train(kaldi::nnet3::NnetComputer::Run()+0x14b) [0x560737248531]
nnet3-chain-train(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x7d) [0x5607371cab8f]
nnet3-chain-train(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0xe8) [0x5607371caf1e]
nnet3-chain-train(main+0x791) [0x56073717549a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f4e3666b083]
nnet3-chain-train(_start+0x2e) [0x560737174c4e]

WARNING (nnet3-chain-train[5.5.1068~1-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:248) Cholesky or Invert() failed while re-orthogonalizing R_t. Re-orthogonalizing on CPU.
ASSERTION_FAILED (nnet3-chain-train[5.5.1068~1-59299]:HouseBackward():qr.cc:123) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")

[ Stack-Trace: ]
nnet3-chain-train(kaldi::MessageLogger::LogMessage() const+0x793) [0x56073751ca69]
nnet3-chain-train(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x72) [0x56073751d46a]
nnet3-chain-train(void kaldi::HouseBackward<float>(int, float const*, float*, float*)+0x16b) [0x5607375049c5]
nnet3-chain-train(kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)+0x324) [0x560737504eec]
nnet3-chain-train(kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const+0x6c) [0x5607375065e0]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x987) [0x56073722a53b]
nnet3-chain-train(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1ef) [0x56073722b81b]
nnet3-chain-train(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x227) [0x5607371e540b]
nnet3-chain-train(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0xba) [0x5607371e2cdc]
nnet3-chain-train(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x8ae) [0x560737247778]
nnet3-chain-train(kaldi::nnet3::NnetComputer::Run()+0x14b) [0x560737248531]
nnet3-chain-train(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x7d) [0x5607371cab8f]
nnet3-chain-train(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0xe8) [0x5607371caf1e]
nnet3-chain-train(main+0x791) [0x56073717549a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f4e3666b083]
nnet3-chain-train(_start+0x2e) [0x560737174c4e]

# Accounting: time=33 threads=1
# Ended (code 134) at Thu Mar 23 14:24:50 IST 2023, elapsed time 33 seconds
train.0.1.log
NVIDIA.png

Daniel Povey

Mar 24, 2023, 11:46:56 AM3/24/23
to kaldi...@googlegroups.com
That Cholesky code runs on the CPU, not the GPU, so it could in principle be an issue with your BLAS or LAPACK libraries.
Make sure that the tests in src/matrix/ run correctly ("make test").
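
To illustrate what the error message is reporting, here is a minimal pure-Python sketch of a textbook Cholesky factorization. It is not Kaldi's implementation (that lives in tp-matrix.cc), and the names `cholesky` and `is_positive_definite` are made up for this example. It only shows that the factorization stops as soon as a pivot is non-positive or not finite, which is what happens when the input matrix is not positive definite or already contains NaNs.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a, or raise ValueError.

    `a` is a symmetric matrix given as a list of rows (hypothetical helper,
    not part of Kaldi).
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Pivot: what is left of the diagonal entry after removing the
        # contribution of the columns already computed.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        # `not (d > 0.0)` is also true when d is NaN, so NaNs are caught here.
        if not (d > 0.0) or math.isinf(d):
            raise ValueError(
                "Cholesky failed: matrix is not positive definite or not finite"
            )
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def is_positive_definite(a):
    """True if the Cholesky factorization of `a` succeeds."""
    try:
        cholesky(a)
        return True
    except ValueError:
        return False


print(is_positive_definite([[4.0, 2.0], [2.0, 3.0]]))           # well-conditioned
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))           # indefinite
print(is_positive_definite([[float("nan"), 0.0], [0.0, 1.0]]))  # contains NaN
```

If the library tests pass, a failure like this during training usually means the matrix handed to the factorization was already bad, rather than the factorization routine itself being broken.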


Joseph Brightly

Apr 25, 2023, 7:17:33 AM4/25/23
to kaldi...@googlegroups.com
Googling the failed assertion
Tridiagonalizing matrix that is too large or has NaNs

On Tue, Apr 25, 2023 at 12:28 AM tao li <litao...@gmail.com> wrote:

I seem to be having the same problem.
The tests in src/matrix/ seem to pass:

test@DESKTOP-TC8Q0VO:~/kaldi/src/matrix$ make test
Running matrix-lib-test ... 1s... SUCCESS matrix-lib-test
Running sparse-matrix-test ... 0s... SUCCESS sparse-matrix-test
Running numpy-array-test ... 0s... SUCCESS numpy-array-test

The log is as follows:


# nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain/tdnn_1b_all_sp/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.414213562373095 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.0001 --scale=1.0 exp/chain/tdnn_1b_all_sp/0.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain/tdnn_1b_all_sp/den.fst "ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn_1b_all_sp/egs/cegs.1.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=0 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |" exp/chain/tdnn_1b_all_sp/1.1.raw
# Started at Sun Apr 23 21:26:48 CST 2023
#
nnet3-chain-train --use-gpu=wait --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain/tdnn_1b_all_sp/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.414213562373095 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=0 "nnet3-am-copy --raw=true --learning-rate=0.0001 --scale=1.0 exp/chain/tdnn_1b_all_sp/0.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain/tdnn_1b_all_sp/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn_1b_all_sp/egs/cegs.1.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=0 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |' exp/chain/tdnn_1b_all_sp/1.1.raw
WARNING (nnet3-chain-train[5.5.1068~3-59299]:SelectGpuId():cu-device.cc:229) Waited 0 seconds before creating CUDA context
WARNING (nnet3-chain-train[5.5.1068~3-59299]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5.1068~3-59299]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.5.1068~3-59299]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): NVIDIA GeForce MX250 free:1655M, used:392M, total:2047M, free/total:0.808187
LOG (nnet3-chain-train[5.5.1068~3-59299]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.808187
LOG (nnet3-chain-train[5.5.1068~3-59299]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
LOG (nnet3-chain-train[5.5.1068~3-59299]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.808187
LOG (nnet3-chain-train[5.5.1068~3-59299]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA GeForce MX250 free:1471M, used:576M, total:2047M, free/total:0.718337 version 6.1
nnet3-am-copy --raw=true --learning-rate=0.0001 --scale=1.0 exp/chain/tdnn_1b_all_sp/0.mdl -
nnet3-copy '--edits=set-dropout-proportion name=* proportion=0.0' - -
LOG (nnet3-am-copy[5.5.1068~3-59299]:main():nnet3-am-copy.cc:153) Copied neural net from exp/chain/tdnn_1b_all_sp/0.mdl to raw format as -
LOG (nnet3-copy[5.5.1068~3-59299]:ReadEditConfig():nnet-utils.cc:1413) Set dropout proportions for 11 components.
LOG (nnet3-copy[5.5.1068~3-59299]:main():nnet3-copy.cc:123) Copied raw neural net from - to -
nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain/tdnn_1b_all_sp/egs/cegs.1.ark ark:-
nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:-
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:-
LOG (nnet3-chain-train[5.5.1068~3-59299]:AllocateNewRegion():cu-allocator.cc:478) About to allocate new memory region of 385875968 bytes; current memory info is: free:735M, used:1312M, total:2047M, free/total:0.35894
LOG (nnet3-chain-train[5.5.1068~3-59299]:AllocateNewRegion():cu-allocator.cc:478) About to allocate new memory region of 192937984 bytes; current memory info is: free:367M, used:1680M, total:2047M, free/total:0.179242
LOG (nnet3-chain-train[5.5.1068~3-59299]:AllocateNewRegion():cu-allocator.cc:478) About to allocate new memory region of 96468992 bytes; current memory info is: free:183M, used:1864M, total:2047M, free/total:0.0893928
LOG (nnet3-chain-train[5.5.1068~3-59299]:AllocateNewRegion():cu-allocator.cc:478) About to allocate new memory region of 48234496 bytes; current memory info is: free:91M, used:1956M, total:2047M, free/total:0.0444682
LOG (nnet3-chain-train[5.5.1068~3-59299]:AllocateNewRegion():cu-allocator.cc:478) About to allocate new memory region of 161480704 bytes; current memory info is: free:45M, used:2002M, total:2047M, free/total:0.0220059
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
LOG (nnet3-chain-train[5.5.1068~3-59299]:UpdateNnetWithMaxChange():nnet-utils.cc:2205) Per-component max-change active on 11 / 28 Updatable Components. (Smallest factor=0.000214834 on tdnn2l with max-change=0.75). Global max-change factor was 0.554058 with max-change=1.41421.
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
ERROR (nnet3-chain-train[5.5.1068~3-59299]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite.

[ Stack-Trace: ]
/home/test/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb42) [0x7fa07c61f732]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x55b122611e8d]
/home/test/kaldi/src/lib/libkaldi-matrix.so(kaldi::TpMatrix<float>::Cholesky(kaldi::SpMatrix<float> const&)+0x1ae) [0x7fa07c889d66]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::ReorthogonalizeRt1(kaldi::VectorBase<float> const&, float, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x461) [0x7fa07ecd80af]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x103c) [0x7fa07ecd95ba]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1e3) [0x7fa07ecda357]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x222) [0x7fa07ec91ce4]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0x92) [0x7fa07ec8f2a8]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x8d5) [0x7fa07ed2d7fd]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x178) [0x7fa07ed2e74e]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x79) [0x7fa07ed86577]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0xe4) [0x7fa07ed86908]
nnet3-chain-train(main+0x767) [0x55b122610ee1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fa07b6f6c87]
nnet3-chain-train(_start+0x2a) [0x55b12261069a]

WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:248) Cholesky or Invert() failed while re-orthogonalizing R_t. Re-orthogonalizing on CPU.
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
ERROR (nnet3-chain-train[5.5.1068~3-59299]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite.

[ Stack-Trace: ]
/home/test/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb42) [0x7fa07c61f732]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x55b122611e8d]
/home/test/kaldi/src/lib/libkaldi-matrix.so(kaldi::TpMatrix<float>::Cholesky(kaldi::SpMatrix<float> const&)+0x1ae) [0x7fa07c889d66]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::ReorthogonalizeRt1(kaldi::VectorBase<float> const&, float, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x461) [0x7fa07ecd80af]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x103c) [0x7fa07ecd95ba]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1e3) [0x7fa07ecda357]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x222) [0x7fa07ec91ce4]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0x92) [0x7fa07ec8f2a8]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x8d5) [0x7fa07ed2d7fd]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x178) [0x7fa07ed2e74e]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x79) [0x7fa07ed86577]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0xe4) [0x7fa07ed86908]
nnet3-chain-train(main+0x767) [0x55b122610ee1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fa07b6f6c87]
nnet3-chain-train(_start+0x2a) [0x55b12261069a]

WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:248) Cholesky or Invert() failed while re-orthogonalizing R_t. Re-orthogonalizing on CPU.
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
WARNING (nnet3-chain-train[5.5.1068~3-59299]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
ASSERTION_FAILED (nnet3-chain-train[5.5.1068~3-59299]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")

[ Stack-Trace: ]
/home/test/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb42) [0x7fa07c61f732]
/home/test/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x6e) [0x7fa07c62042e]
/home/test/kaldi/src/lib/libkaldi-matrix.so(void kaldi::HouseBackward<float>(int, float const*, float*, float*)+0x16c) [0x7fa07c88e6f8]
/home/test/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)+0x32b) [0x7fa07c88ec11]
/home/test/kaldi/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const+0x6a) [0x7fa07c8902b8]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0xa52) [0x7fa07ecd8fd0]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1e3) [0x7fa07ecda357]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x222) [0x7fa07ec91ce4]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0x92) [0x7fa07ec8f2a8]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x8d5) [0x7fa07ed2d7fd]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x178) [0x7fa07ed2e74e]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x79) [0x7fa07ed86577]
/home/test/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0xe4) [0x7fa07ed86908]
nnet3-chain-train(main+0x767) [0x55b122610ee1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fa07b6f6c87]
nnet3-chain-train(_start+0x2a) [0x55b12261069a]

# Accounting: time=10 threads=1
# Ended (code 134) at Sun Apr 23 21:26:58 CST 2023, elapsed time 10 seconds

Daniel Povey

Apr 25, 2023, 7:59:17 AM4/25/23
to kaldi...@googlegroups.com
In this case it looks like instability in the model training; if the script has not been changed
(e.g. the model topology), then the easiest fix would be to, say, halve the learning rate.
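
As a rough sketch of what "halve the learning rate" could look like in practice: chain recipes typically pass the initial and final effective learning rates to steps/nnet3/chain/train.py through the `--trainer.optimization.initial-effective-lrate` and `--trainer.optimization.final-effective-lrate` options. The helper name and the starting values below are hypothetical; read the real values from your own run_tdnn.sh and check how it passes them.

```python
def halved_lrate_flags(initial_lrate, final_lrate):
    """Return train.py-style options with both effective learning rates halved.

    Hypothetical helper for illustration only; it is not part of Kaldi.
    """
    return [
        "--trainer.optimization.initial-effective-lrate={:g}".format(initial_lrate / 2),
        "--trainer.optimization.final-effective-lrate={:g}".format(final_lrate / 2),
    ]


# Assumed starting values, not taken from the logs in this thread.
for flag in halved_lrate_flags(0.001, 0.0001):
    print(flag)
```

If training is still unstable after one halving, the same step can be repeated, at the cost of slower convergence.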

