Training crashes in validation

Chatzitheodorou Konstantinos

5 Mar 2021, 06:35:12
to marian-nmt
Hi all

My new MT engine crashes every 5K steps. If I restart it manually, it runs for another 5K steps and crashes again.

Here are the parameters that I'm using:

    --mini-batch-fit -w 7000 --mini-batch 900 --maxi-batch 900 \
    --valid-freq 5000 --save-freq 5000 --disp-freq 500 \
    --valid-metrics ce-mean-words perplexity translation \
    --valid-sets $DATA_DIR/$TESTCORPUS.bpe.$SRC $DATA_DIR/$TESTCORPUS.bpe.$TRG \
    --valid-script-path $WORKING_DIR/validate.sh \
    --valid-translation-output $MODEL_DIR/$TESTCORPUS.bpe.$SRC.output --quiet-translation \
    --beam-size 12 --normalize=1 \
    --valid-mini-batch 64 \

Here is the log I'm getting:
[2021-03-02 17:21:24] Saving model weights and runtime parameters to model/model.npz.orig.npz
[2021-03-02 17:21:38] Saving model weights and runtime parameters to model/model.npz
[2021-03-02 17:21:50] Saving Adam parameters to model/model.npz.optimizer.npz
[2021-03-02 17:22:10] Saving model weights and runtime parameters to model/model.npz.best-ce-mean-words.npz
[2021-03-02 17:22:14] [valid] Ep. 1 : Up. 10000 : ce-mean-words : 2.59641 : new best
[2021-03-02 17:22:22] Saving model weights and runtime parameters to model/model.npz.best-perplexity.npz
[2021-03-02 17:22:26] [valid] Ep. 1 : Up. 10000 : perplexity : 13.4155 : new best
tcmalloc: large alloc 7381975040 bytes == 0x7f66f5de8000 @ 
tcmalloc: large alloc 7381975040 bytes == (nil) @ 
[2021-03-02 17:25:11] Caught std::exception in sub-thread: std::bad_alloc
Aborted from marian::ThreadPool::enqueue(F&&, Args&& ...)::<lambda()> [with F = marian::TranslationValidator::validate(const std::vector<std::shared_ptr<marian::ExpressionGraph> >&)::<lambda(size_t)>&; Args = {long unsigned int&}; return_type = void] in marian/src/3rd_party/threadpool.h: 144
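
For scale, the size reported in the two `tcmalloc` lines can be converted to MiB with plain shell arithmetic. This is only a small sketch, and the helper name `bytes_to_mib` is made up for illustration:

```shell
# Hypothetical helper: convert a byte count to whole MiB (integer division).
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

bytes_to_mib 7381975040   # prints 7040
```

So a single block of roughly 7 GiB was requested twice during translation validation, and the second request came back as `(nil)`, which is consistent with the `std::bad_alloc` reported on the following line.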

Thanks

Roman Grundkiewicz

10 Mar 2021, 03:48:51
to marian-nmt
Hi,
I would double-check whether the validation set contains some very long sentences and fix that, and/or test with a smaller `--valid-mini-batch`.
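
One possible way to do the length check, as a minimal sketch: it assumes one sentence per line, whitespace-separated BPE tokens, and no tab characters inside a line. The function names `long_lines` and `filter_parallel` and the threshold of 200 tokens are made up for illustration; they are not part of Marian.

```shell
# Hypothetical helpers, not part of Marian.

# Print "line_number<TAB>token_count" for every line with more than
# max_tokens whitespace-separated tokens (default threshold: 200).
long_lines() {
    awk -v max="${2:-200}" 'NF > max { print NR "\t" NF }' "$1"
}

# Write to out_src/out_trg only the sentence pairs in which both sides
# have at most max_tokens tokens, so the two files stay parallel.
# Assumes no line contains a tab character.
filter_parallel() {
    src=$1; trg=$2; max=$3; out_src=$4; out_trg=$5
    : > "$out_src"
    : > "$out_trg"
    paste "$src" "$trg" | awk -F'\t' -v max="$max" -v os="$out_src" -v ot="$out_trg" '
        {
            ns = split($1, a, " ")
            nt = split($2, b, " ")
            if (ns <= max && nt <= max) {
                print $1 > os
                print $2 > ot
            }
        }'
}

# Example calls, reusing the variables from the training command:
# long_lines "$DATA_DIR/$TESTCORPUS.bpe.$SRC" 200
# long_lines "$DATA_DIR/$TESTCORPUS.bpe.$TRG" 200
```

If `long_lines` reports a few outliers, `filter_parallel` can write a trimmed copy of both files to pass to `--valid-sets`. The other option is to keep the data as it is and lower the value given to `--valid-mini-batch` (64 in the command above).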

Chatzitheodorou Konstantinos

11 Mar 2021, 10:31:47
to maria...@googlegroups.com
Hi Roman

I am using a smaller `--valid-mini-batch` now and it works fine.

Best
Konstantinos


