My new MT engine crashes every 5K steps during training. If I restart it manually, it runs for another 5K steps and then crashes again. These are the validation-related options in my training command:
--valid-freq 5000 --save-freq 5000 --disp-freq 500 \
--valid-metrics ce-mean-words perplexity translation \
--valid-sets $DATA_DIR/$TESTCORPUS.bpe.$SRC $DATA_DIR/$TESTCORPUS.bpe.$TRG \
--valid-script-path $WORKING_DIR/validate.sh \
--valid-translation-output $MODEL_DIR/$TESTCORPUS.bpe.$SRC.output --quiet-translation \
--beam-size 12 --normalize=1 \
--valid-mini-batch 64 \
This is the log output leading up to the crash:
[2021-03-02 17:21:24] Saving model weights and runtime parameters to model/model.npz.orig.npz
[2021-03-02 17:21:38] Saving model weights and runtime parameters to model/model.npz
[2021-03-02 17:21:50] Saving Adam parameters to model/model.npz.optimizer.npz
[2021-03-02 17:22:10] Saving model weights and runtime parameters to model/model.npz.best-ce-mean-words.npz
[2021-03-02 17:22:14] [valid] Ep. 1 : Up. 10000 : ce-mean-words : 2.59641 : new best
[2021-03-02 17:22:22] Saving model weights and runtime parameters to model/model.npz.best-perplexity.npz
[2021-03-02 17:22:26] [valid] Ep. 1 : Up. 10000 : perplexity : 13.4155 : new best
tcmalloc: large alloc 7381975040 bytes == 0x7f66f5de8000 @
tcmalloc: large alloc 7381975040 bytes == (nil) @
[2021-03-02 17:25:11] Caught std::exception in sub-thread: std::bad_alloc
Aborted from marian::ThreadPool::enqueue(F&&, Args&& ...)::<lambda()> [with F = marian::TranslationValidator::validate(const std::vector<std::shared_ptr<marian::ExpressionGraph> >&)::<lambda(size_t)>&; Args = {long unsigned int&}; return_type = void] in marian/src/3rd_party/threadpool.h: 144
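
For scale, here is a minimal Python sketch that converts the size from the two tcmalloc lines into GiB. The helper name bytes_to_gib is my own and not part of Marian. The second tcmalloc line shows (nil) as the address, so this appears to be the request that fails with std::bad_alloc. The ce-mean-words and perplexity validators had already finished by then, and the abort message names the translation validator.

```python
# Size reported by the two "tcmalloc: large alloc" lines in the log above.
FAILED_ALLOC_BYTES = 7381975040


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes).

    Hypothetical helper for illustration only; not a Marian function.
    """
    return n_bytes / 2**30


failed_alloc_gib = bytes_to_gib(FAILED_ALLOC_BYTES)
print(f"{failed_alloc_gib:.3f} GiB")  # prints "6.875 GiB"
```

So a single request of roughly 6.9 GiB is failing on the host side while the translation validator is running.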