Marian keeps failing for large vocabulary size


charan n.a

Jul 2, 2020, 3:03:26 PM
to marian-nmt
I have vocabulary sizes of around 388597 and 304290. I went through them and most entries seem fine, and even with subword segmentation they shouldn't shrink by a large amount. This is for a translation model, and training crashes repeatedly after lines like the one below appear in the log.

tcmalloc: large alloc 1555718144 bytes == 0x56440fc24000 @

The command I am running is:
nohup marian --model /home/ubuntu/model.s2s/model.npz --type s2s --train-sets ~/lang1.txt ~/lang2.txt --max-length 150 --mini-batch-fit -w 7000 --maxi-batch 1000 --save-freq 1000 --disp-freq 1000 --overwrite --keep-best --early-stopping 5 --after-epochs 10 --cost-type=ce-mean-words --log /home/ubuntu/model.s2s/train.log --tied-embeddings --layer-normalization --seed 0 --exponential-smoothing --devices 0

and the log I end up with is:
[2020-07-02 18:29:07] [marian] Marian v1.9.0 ba94c5b9 2020-05-17 10:42:17 +0100
[2020-07-02 18:29:07] [marian] Running on ip-172-31-28-255 as process 5525 with command line:
[2020-07-02 18:29:07] [marian] /home/ubuntu/marian/build/marian --model /home/ubuntu/model.s2s/model.npz --type s2s --train-sets /home/ubuntu/lang1.txt /home/ubuntu/lang2.txt --max-length 150 --mini-batch-fit -w 5000 --maxi-batch 1000 --save-freq 1000 --disp-freq 1000 --overwrite --keep-best --early-stopping 5 --after-epochs 10 --cost-type=ce-mean-words --log /home/ubuntu/model.s2s/train.log --tied-embeddings --layer-normalization --seed 0 --exponential-smoothing --devices 0
[2020-07-02 18:29:07] [config] after-batches: 0
[2020-07-02 18:29:07] [config] after-epochs: 10
[2020-07-02 18:29:07] [config] all-caps-every: 0
[2020-07-02 18:29:07] [config] allow-unk: false
[2020-07-02 18:29:07] [config] authors: false
[2020-07-02 18:29:07] [config] beam-size: 12
[2020-07-02 18:29:07] [config] bert-class-symbol: "[CLS]"
[2020-07-02 18:29:07] [config] bert-mask-symbol: "[MASK]"
[2020-07-02 18:29:07] [config] bert-masking-fraction: 0.15
[2020-07-02 18:29:07] [config] bert-sep-symbol: "[SEP]"
[2020-07-02 18:29:07] [config] bert-train-type-embeddings: true
[2020-07-02 18:29:07] [config] bert-type-vocab-size: 2
[2020-07-02 18:29:07] [config] build-info: ""
[2020-07-02 18:29:07] [config] cite: false
[2020-07-02 18:29:07] [config] clip-gemm: 0
[2020-07-02 18:29:07] [config] clip-norm: 1
[2020-07-02 18:29:07] [config] cost-scaling:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] cost-type: ce-mean-words
[2020-07-02 18:29:07] [config] cpu-threads: 0
[2020-07-02 18:29:07] [config] data-weighting: ""
[2020-07-02 18:29:07] [config] data-weighting-type: sentence
[2020-07-02 18:29:07] [config] dec-cell: gru
[2020-07-02 18:29:07] [config] dec-cell-base-depth: 2
[2020-07-02 18:29:07] [config] dec-cell-high-depth: 1
[2020-07-02 18:29:07] [config] dec-depth: 1
[2020-07-02 18:29:07] [config] devices:
[2020-07-02 18:29:07] [config]   - 0
[2020-07-02 18:29:07] [config] dim-emb: 512
[2020-07-02 18:29:07] [config] dim-rnn: 1024
[2020-07-02 18:29:07] [config] dim-vocabs:
[2020-07-02 18:29:07] [config]   - 0
[2020-07-02 18:29:07] [config]   - 0
[2020-07-02 18:29:07] [config] disp-first: 0
[2020-07-02 18:29:07] [config] disp-freq: 1000
[2020-07-02 18:29:07] [config] disp-label-counts: false
[2020-07-02 18:29:07] [config] dropout-rnn: 0
[2020-07-02 18:29:07] [config] dropout-src: 0
[2020-07-02 18:29:07] [config] dropout-trg: 0
[2020-07-02 18:29:07] [config] dump-config: ""
[2020-07-02 18:29:07] [config] early-stopping: 5
[2020-07-02 18:29:07] [config] embedding-fix-src: false
[2020-07-02 18:29:07] [config] embedding-fix-trg: false
[2020-07-02 18:29:07] [config] embedding-normalization: false
[2020-07-02 18:29:07] [config] embedding-vectors:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] enc-cell: gru
[2020-07-02 18:29:07] [config] enc-cell-depth: 1
[2020-07-02 18:29:07] [config] enc-depth: 1
[2020-07-02 18:29:07] [config] enc-type: bidirectional
[2020-07-02 18:29:07] [config] english-title-case-every: 0
[2020-07-02 18:29:07] [config] exponential-smoothing: 0.0001
[2020-07-02 18:29:07] [config] factor-weight: 1
[2020-07-02 18:29:07] [config] grad-dropping-momentum: 0
[2020-07-02 18:29:07] [config] grad-dropping-rate: 0
[2020-07-02 18:29:07] [config] grad-dropping-warmup: 100
[2020-07-02 18:29:07] [config] gradient-checkpointing: false
[2020-07-02 18:29:07] [config] guided-alignment: none
[2020-07-02 18:29:07] [config] guided-alignment-cost: mse
[2020-07-02 18:29:07] [config] guided-alignment-weight: 0.1
[2020-07-02 18:29:07] [config] ignore-model-config: false
[2020-07-02 18:29:07] [config] input-types:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] interpolate-env-vars: false
[2020-07-02 18:29:07] [config] keep-best: true
[2020-07-02 18:29:07] [config] label-smoothing: 0
[2020-07-02 18:29:07] [config] layer-normalization: true
[2020-07-02 18:29:07] [config] learn-rate: 0.0001
[2020-07-02 18:29:07] [config] lemma-dim-emb: 0
[2020-07-02 18:29:07] [config] log: /home/ubuntu/model.s2s/train.log
[2020-07-02 18:29:07] [config] log-level: info
[2020-07-02 18:29:07] [config] log-time-zone: ""
[2020-07-02 18:29:07] [config] lr-decay: 0
[2020-07-02 18:29:07] [config] lr-decay-freq: 50000
[2020-07-02 18:29:07] [config] lr-decay-inv-sqrt:
[2020-07-02 18:29:07] [config]   - 0
[2020-07-02 18:29:07] [config] lr-decay-repeat-warmup: false
[2020-07-02 18:29:07] [config] lr-decay-reset-optimizer: false
[2020-07-02 18:29:07] [config] lr-decay-start:
[2020-07-02 18:29:07] [config]   - 10
[2020-07-02 18:29:07] [config]   - 1
[2020-07-02 18:29:07] [config] lr-decay-strategy: epoch+stalled
[2020-07-02 18:29:07] [config] lr-report: false
[2020-07-02 18:29:07] [config] lr-warmup: 0
[2020-07-02 18:29:07] [config] lr-warmup-at-reload: false
[2020-07-02 18:29:07] [config] lr-warmup-cycle: false
[2020-07-02 18:29:07] [config] lr-warmup-start-rate: 0
[2020-07-02 18:29:07] [config] max-length: 150
[2020-07-02 18:29:07] [config] max-length-crop: false
[2020-07-02 18:29:07] [config] max-length-factor: 3
[2020-07-02 18:29:07] [config] maxi-batch: 1000
[2020-07-02 18:29:07] [config] maxi-batch-sort: trg
[2020-07-02 18:29:07] [config] mini-batch: 64
[2020-07-02 18:29:07] [config] mini-batch-fit: true
[2020-07-02 18:29:07] [config] mini-batch-fit-step: 10
[2020-07-02 18:29:07] [config] mini-batch-track-lr: false
[2020-07-02 18:29:07] [config] mini-batch-warmup: 0
[2020-07-02 18:29:07] [config] mini-batch-words: 0
[2020-07-02 18:29:07] [config] mini-batch-words-ref: 0
[2020-07-02 18:29:07] [config] model: /home/ubuntu/model.s2s/model.npz
[2020-07-02 18:29:07] [config] multi-loss-type: sum
[2020-07-02 18:29:07] [config] multi-node: false
[2020-07-02 18:29:07] [config] multi-node-overlap: true
[2020-07-02 18:29:07] [config] n-best: false
[2020-07-02 18:29:07] [config] no-nccl: false
[2020-07-02 18:29:07] [config] no-reload: false
[2020-07-02 18:29:07] [config] no-restore-corpus: false
[2020-07-02 18:29:07] [config] normalize: 0
[2020-07-02 18:29:07] [config] normalize-gradient: false
[2020-07-02 18:29:07] [config] num-devices: 0
[2020-07-02 18:29:07] [config] optimizer: adam
[2020-07-02 18:29:07] [config] optimizer-delay: 1
[2020-07-02 18:29:07] [config] optimizer-params:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] overwrite: true
[2020-07-02 18:29:07] [config] precision:
[2020-07-02 18:29:07] [config]   - float32
[2020-07-02 18:29:07] [config]   - float32
[2020-07-02 18:29:07] [config]   - float32
[2020-07-02 18:29:07] [config] pretrained-model: ""
[2020-07-02 18:29:07] [config] quiet: false
[2020-07-02 18:29:07] [config] quiet-translation: false
[2020-07-02 18:29:07] [config] relative-paths: false
[2020-07-02 18:29:07] [config] right-left: false
[2020-07-02 18:29:07] [config] save-freq: 1000
[2020-07-02 18:29:07] [config] seed: 0
[2020-07-02 18:29:07] [config] shuffle: data
[2020-07-02 18:29:07] [config] shuffle-in-ram: false
[2020-07-02 18:29:07] [config] skip: false
[2020-07-02 18:29:07] [config] sqlite: ""
[2020-07-02 18:29:07] [config] sqlite-drop: false
[2020-07-02 18:29:07] [config] sync-sgd: false
[2020-07-02 18:29:07] [config] tempdir: /tmp
[2020-07-02 18:29:07] [config] tied-embeddings: true
[2020-07-02 18:29:07] [config] tied-embeddings-all: false
[2020-07-02 18:29:07] [config] tied-embeddings-src: false
[2020-07-02 18:29:07] [config] train-sets:
[2020-07-02 18:29:07] [config]   - /home/ubuntu/hin_all.txt
[2020-07-02 18:29:07] [config]   - /home/ubuntu/eng_all.txt
[2020-07-02 18:29:07] [config] transformer-aan-activation: swish
[2020-07-02 18:29:07] [config] transformer-aan-depth: 2
[2020-07-02 18:29:07] [config] transformer-aan-nogate: false
[2020-07-02 18:29:07] [config] transformer-decoder-autoreg: self-attention
[2020-07-02 18:29:07] [config] transformer-depth-scaling: false
[2020-07-02 18:29:07] [config] transformer-dim-aan: 2048
[2020-07-02 18:29:07] [config] transformer-dim-ffn: 2048
[2020-07-02 18:29:07] [config] transformer-dropout: 0
[2020-07-02 18:29:07] [config] transformer-dropout-attention: 0
[2020-07-02 18:29:07] [config] transformer-dropout-ffn: 0
[2020-07-02 18:29:07] [config] transformer-ffn-activation: swish
[2020-07-02 18:29:07] [config] transformer-ffn-depth: 2
[2020-07-02 18:29:07] [config] transformer-guided-alignment-layer: last
[2020-07-02 18:29:07] [config] transformer-heads: 8
[2020-07-02 18:29:07] [config] transformer-no-projection: false
[2020-07-02 18:29:07] [config] transformer-postprocess: dan
[2020-07-02 18:29:07] [config] transformer-postprocess-emb: d
[2020-07-02 18:29:07] [config] transformer-preprocess: ""
[2020-07-02 18:29:07] [config] transformer-tied-layers:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] transformer-train-position-embeddings: false
[2020-07-02 18:29:07] [config] type: s2s
[2020-07-02 18:29:07] [config] ulr: false
[2020-07-02 18:29:07] [config] ulr-dim-emb: 0
[2020-07-02 18:29:07] [config] ulr-dropout: 0
[2020-07-02 18:29:07] [config] ulr-keys-vectors: ""
[2020-07-02 18:29:07] [config] ulr-query-vectors: ""
[2020-07-02 18:29:07] [config] ulr-softmax-temperature: 1
[2020-07-02 18:29:07] [config] ulr-trainable-transformation: false
[2020-07-02 18:29:07] [config] unlikelihood-loss: false
[2020-07-02 18:29:07] [config] valid-freq: 10000
[2020-07-02 18:29:07] [config] valid-log: ""
[2020-07-02 18:29:07] [config] valid-max-length: 1000
[2020-07-02 18:29:07] [config] valid-metrics:
[2020-07-02 18:29:07] [config]   - cross-entropy
[2020-07-02 18:29:07] [config] valid-mini-batch: 32
[2020-07-02 18:29:07] [config] valid-reset-stalled: false
[2020-07-02 18:29:07] [config] valid-script-args:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] valid-script-path: ""
[2020-07-02 18:29:07] [config] valid-sets:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] valid-translation-output: ""
[2020-07-02 18:29:07] [config] vocabs:
[2020-07-02 18:29:07] [config]   []
[2020-07-02 18:29:07] [config] word-penalty: 0
[2020-07-02 18:29:07] [config] word-scores: false
[2020-07-02 18:29:07] [config] workspace: 5000
[2020-07-02 18:29:07] [config] Model is being created with Marian v1.9.0 ba94c5b9 2020-05-17 10:42:17 +0100
[2020-07-02 18:29:07] Using single-device training
[2020-07-02 18:29:07] No vocabulary files given, trying to find or build based on training data. Vocabularies will be built separately for each file.
[2020-07-02 18:29:07] No vocabulary path given; trying to find default vocabulary based on data path /home/ubuntu/hin_all.txt
[2020-07-02 18:29:07] [data] Loading vocabulary from JSON/Yaml file /home/ubuntu/hin_all.txt.yml
[2020-07-02 18:29:11] [data] Setting vocabulary size for input 0 to 388597
[2020-07-02 18:29:11] No vocabulary path given; trying to find default vocabulary based on data path /home/ubuntu/eng_all.txt
[2020-07-02 18:29:11] [data] Loading vocabulary from JSON/Yaml file /home/ubuntu/eng_all.txt.yml
[2020-07-02 18:29:13] [data] Setting vocabulary size for input 1 to 304290
[2020-07-02 18:29:13] Compiled without MPI support. Falling back to FakeMPIWrapper
[2020-07-02 18:29:13] [batching] Collecting statistics for batch fitting with step size 10
[2020-07-02 18:29:13] [memory] Extending reserved space to 5120 MB (device gpu0)
[2020-07-02 18:29:14] [logits] applyLossFunction() for 1 factors
[2020-07-02 18:29:14] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:14] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2020-07-02 18:29:14] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:30] [batching] Done. Typical MB size is 1014 target words
[2020-07-02 18:29:30] [memory] Extending reserved space to 5120 MB (device gpu0)
[2020-07-02 18:29:30] Training started
[2020-07-02 18:29:30] [data] Shuffling data
[2020-07-02 18:29:30] [data] Done reading 1566840 sentences
[2020-07-02 18:29:37] [data] Done shuffling 1566840 sentences to temp files
[2020-07-02 18:29:38] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:38] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:39] [memory] Reserving 2967 MB, device gpu0
[2020-07-02 18:29:39] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:39:20] Ep. 1 : Up. 1000 : Sen. 64,389 : Cost 7.17595577 : Time 590.50s : 1461.74 words/s
[2020-07-02 18:39:20] Saving model weights and runtime parameters to /home/ubuntu/model.s2s/model.npz.orig.npz
[2020-07-02 18:39:26] Saving model weights and runtime parameters to /home/ubuntu/model.s2s/model.npz
[2020-07-02 18:39:31] Saving Adam parameters to /home/ubuntu/model.s2s/model.npz.optimizer.npz
tcmalloc: large alloc 1555718144 bytes == 0x55ae4c5ce000 @
tcmalloc: large alloc 1555718144 bytes == 0x55aea9174000 @
tcmalloc: large alloc 1555718144 bytes == 0x55af25768000 @
tcmalloc: large alloc 1555718144 bytes == 0x55af8230e000 @

I have tried increasing my workspace, reducing maxi-batch (down to 1000), and saving more frequently, but nothing seems to work. Can someone help me figure out what exactly is wrong?
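For reference, some back-of-envelope arithmetic on that tcmalloc line (my own rough numbers, assuming float32 parameters and the dim-emb 512 from the config dump above):

```python
# Rough estimate of where the ~1.5 GB allocations come from.
# Assumptions (mine, not confirmed by Marian): float32 parameters,
# dim-emb 512 as in the config dump above.

BYTES_PER_FLOAT = 4

# The repeated tcmalloc allocation, expressed in float32 values:
alloc_bytes = 1555718144
alloc_floats = alloc_bytes // BYTES_PER_FLOAT
print(alloc_floats)  # 388929536 floats -- one full copy of a ~389M-parameter model

# Embedding tables alone, given the logged vocabulary sizes:
dim_emb = 512
src_emb = 388597 * dim_emb   # ~199M parameters
trg_emb = 304290 * dim_emb   # ~156M parameters (tied with the output layer)
print((src_emb + trg_emb) * BYTES_PER_FLOAT / 2**30)  # ~1.32 GiB just for embeddings

# Adam keeps two extra float32 buffers (first and second moments) per
# parameter, and --exponential-smoothing keeps another smoothed copy,
# so saving the optimizer state materializes several model-sized
# buffers on the host at once.
adam_state_bytes = 2 * alloc_bytes
print(adam_state_bytes / 2**30)  # ~2.9 GiB of Adam state alone
```

So each tcmalloc line looks like one full model-sized buffer, and the Adam save needs several of them at the same time.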

I am planning to implement models with more layers and dropout and wanted to use this as a starting point, but this lack of success is stopping me from experimenting with more complex models.

charan n.a

Jul 2, 2020, 7:12:28 PM
to marian-nmt
I also just found that switching to SGD instead of the default Adam optimizer lets it run without errors. Is there any way I can prevent the saving of the Adam optimizer weights from failing?

Marcin Junczys-Dowmunt

Jul 2, 2020, 7:14:54 PM
to maria...@googlegroups.com

Hi,

What’s the size of the RAM you have on your GPU? You are only using one GPU, right?

Also, double-check whether you actually expect to have more than 300 thousand vocabulary items. Is that a setting that makes sense?
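If the huge vocabularies are not intentional, one untested option is to cap the automatically built vocabularies with `--dim-vocabs` (the 50000 below is an arbitrary placeholder, not a recommendation):

```shell
# Sketch only: truncate each automatically built vocabulary to its
# N most frequent entries instead of keeping all ~390k/300k types.
marian --model /home/ubuntu/model.s2s/model.npz --type s2s \
       --train-sets ~/lang1.txt ~/lang2.txt \
       --dim-vocabs 50000 50000 \
       ...   # rest of the options unchanged
```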


charan n.a

Jul 2, 2020, 8:51:52 PM
to marian-nmt
Hey, I am running a single NVIDIA Titan T4 with 16 GB of GPU memory, and I also have 16 GB of system RAM. I might trim the numbers down further, but there are no basic issues like comma separation or repeated words; I have reviewed it for that. My training data is 1,566,840 sentences and might grow. This is still initial testing, but I am guessing my vocab file will be big. Even so, the optimizer save is what is killing the process. As I mentioned, after changing to the SGD optimizer it runs without issues.
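In case it helps anyone later, here is a rough, untested sketch of how the vocabularies could be shrunk with SentencePiece before training (sizes and paths are placeholders; as far as I know, Marian treats `--vocabs` files with an `.spm` suffix as SentencePiece models):

```shell
# Sketch: train ~32k-entry subword vocabularies instead of the ~390k/300k
# word-level ones, then hand the .spm models to Marian directly.
spm_train --input=/home/ubuntu/lang1.txt --model_prefix=lang1 --vocab_size=32000
spm_train --input=/home/ubuntu/lang2.txt --model_prefix=lang2 --vocab_size=32000
mv lang1.model lang1.spm
mv lang2.model lang2.spm

marian ... --vocabs lang1.spm lang2.spm   # remaining options as before
```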
