I have vocabulary sizes of around 388,597 (source) and 304,290 (target). I went through them and most entries seem fine; even if I used subword segmentation, the counts shouldn't shrink by a large amount. This is for a translation model, and training crashes repeatedly after lines like the ones below appear in the log.
[2020-07-02 18:29:07] [marian] Marian v1.9.0 ba94c5b9 2020-05-17 10:42:17 +0100
[2020-07-02 18:29:07] [marian] Running on ip-172-31-28-255 as process 5525 with command line:
[2020-07-02 18:29:07] [marian] /home/ubuntu/marian/build/marian --model /home/ubuntu/model.s2s/model.npz --type s2s --train-sets /home/ubuntu/lang1.txt /home/ubuntu/lang2.txt --max-length 150 --mini-batch-fit -w 5000 --maxi-batch 1000 --save-freq 1000 --disp-freq 1000 --overwrite --keep-best --early-stopping 5 --after-epochs 10 --cost-type=ce-mean-words --log /home/ubuntu/model.s2s/train.log --tied-embeddings --layer-normalization --seed 0 --exponential-smoothing --devices 0
[2020-07-02 18:29:07] [config] after-batches: 0
[2020-07-02 18:29:07] [config] after-epochs: 10
[2020-07-02 18:29:07] [config] all-caps-every: 0
[2020-07-02 18:29:07] [config] allow-unk: false
[2020-07-02 18:29:07] [config] authors: false
[2020-07-02 18:29:07] [config] beam-size: 12
[2020-07-02 18:29:07] [config] bert-class-symbol: "[CLS]"
[2020-07-02 18:29:07] [config] bert-mask-symbol: "[MASK]"
[2020-07-02 18:29:07] [config] bert-masking-fraction: 0.15
[2020-07-02 18:29:07] [config] bert-sep-symbol: "[SEP]"
[2020-07-02 18:29:07] [config] bert-train-type-embeddings: true
[2020-07-02 18:29:07] [config] bert-type-vocab-size: 2
[2020-07-02 18:29:07] [config] build-info: ""
[2020-07-02 18:29:07] [config] cite: false
[2020-07-02 18:29:07] [config] clip-gemm: 0
[2020-07-02 18:29:07] [config] clip-norm: 1
[2020-07-02 18:29:07] [config] cost-scaling:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] cost-type: ce-mean-words
[2020-07-02 18:29:07] [config] cpu-threads: 0
[2020-07-02 18:29:07] [config] data-weighting: ""
[2020-07-02 18:29:07] [config] data-weighting-type: sentence
[2020-07-02 18:29:07] [config] dec-cell: gru
[2020-07-02 18:29:07] [config] dec-cell-base-depth: 2
[2020-07-02 18:29:07] [config] dec-cell-high-depth: 1
[2020-07-02 18:29:07] [config] dec-depth: 1
[2020-07-02 18:29:07] [config] devices:
[2020-07-02 18:29:07] [config] - 0
[2020-07-02 18:29:07] [config] dim-emb: 512
[2020-07-02 18:29:07] [config] dim-rnn: 1024
[2020-07-02 18:29:07] [config] dim-vocabs:
[2020-07-02 18:29:07] [config] - 0
[2020-07-02 18:29:07] [config] - 0
[2020-07-02 18:29:07] [config] disp-first: 0
[2020-07-02 18:29:07] [config] disp-freq: 1000
[2020-07-02 18:29:07] [config] disp-label-counts: false
[2020-07-02 18:29:07] [config] dropout-rnn: 0
[2020-07-02 18:29:07] [config] dropout-src: 0
[2020-07-02 18:29:07] [config] dropout-trg: 0
[2020-07-02 18:29:07] [config] dump-config: ""
[2020-07-02 18:29:07] [config] early-stopping: 5
[2020-07-02 18:29:07] [config] embedding-fix-src: false
[2020-07-02 18:29:07] [config] embedding-fix-trg: false
[2020-07-02 18:29:07] [config] embedding-normalization: false
[2020-07-02 18:29:07] [config] embedding-vectors:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] enc-cell: gru
[2020-07-02 18:29:07] [config] enc-cell-depth: 1
[2020-07-02 18:29:07] [config] enc-depth: 1
[2020-07-02 18:29:07] [config] enc-type: bidirectional
[2020-07-02 18:29:07] [config] english-title-case-every: 0
[2020-07-02 18:29:07] [config] exponential-smoothing: 0.0001
[2020-07-02 18:29:07] [config] factor-weight: 1
[2020-07-02 18:29:07] [config] grad-dropping-momentum: 0
[2020-07-02 18:29:07] [config] grad-dropping-rate: 0
[2020-07-02 18:29:07] [config] grad-dropping-warmup: 100
[2020-07-02 18:29:07] [config] gradient-checkpointing: false
[2020-07-02 18:29:07] [config] guided-alignment: none
[2020-07-02 18:29:07] [config] guided-alignment-cost: mse
[2020-07-02 18:29:07] [config] guided-alignment-weight: 0.1
[2020-07-02 18:29:07] [config] ignore-model-config: false
[2020-07-02 18:29:07] [config] input-types:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] interpolate-env-vars: false
[2020-07-02 18:29:07] [config] keep-best: true
[2020-07-02 18:29:07] [config] label-smoothing: 0
[2020-07-02 18:29:07] [config] layer-normalization: true
[2020-07-02 18:29:07] [config] learn-rate: 0.0001
[2020-07-02 18:29:07] [config] lemma-dim-emb: 0
[2020-07-02 18:29:07] [config] log: /home/ubuntu/model.s2s/train.log
[2020-07-02 18:29:07] [config] log-level: info
[2020-07-02 18:29:07] [config] log-time-zone: ""
[2020-07-02 18:29:07] [config] lr-decay: 0
[2020-07-02 18:29:07] [config] lr-decay-freq: 50000
[2020-07-02 18:29:07] [config] lr-decay-inv-sqrt:
[2020-07-02 18:29:07] [config] - 0
[2020-07-02 18:29:07] [config] lr-decay-repeat-warmup: false
[2020-07-02 18:29:07] [config] lr-decay-reset-optimizer: false
[2020-07-02 18:29:07] [config] lr-decay-start:
[2020-07-02 18:29:07] [config] - 10
[2020-07-02 18:29:07] [config] - 1
[2020-07-02 18:29:07] [config] lr-decay-strategy: epoch+stalled
[2020-07-02 18:29:07] [config] lr-report: false
[2020-07-02 18:29:07] [config] lr-warmup: 0
[2020-07-02 18:29:07] [config] lr-warmup-at-reload: false
[2020-07-02 18:29:07] [config] lr-warmup-cycle: false
[2020-07-02 18:29:07] [config] lr-warmup-start-rate: 0
[2020-07-02 18:29:07] [config] max-length: 150
[2020-07-02 18:29:07] [config] max-length-crop: false
[2020-07-02 18:29:07] [config] max-length-factor: 3
[2020-07-02 18:29:07] [config] maxi-batch: 1000
[2020-07-02 18:29:07] [config] maxi-batch-sort: trg
[2020-07-02 18:29:07] [config] mini-batch: 64
[2020-07-02 18:29:07] [config] mini-batch-fit: true
[2020-07-02 18:29:07] [config] mini-batch-fit-step: 10
[2020-07-02 18:29:07] [config] mini-batch-track-lr: false
[2020-07-02 18:29:07] [config] mini-batch-warmup: 0
[2020-07-02 18:29:07] [config] mini-batch-words: 0
[2020-07-02 18:29:07] [config] mini-batch-words-ref: 0
[2020-07-02 18:29:07] [config] model: /home/ubuntu/model.s2s/model.npz
[2020-07-02 18:29:07] [config] multi-loss-type: sum
[2020-07-02 18:29:07] [config] multi-node: false
[2020-07-02 18:29:07] [config] multi-node-overlap: true
[2020-07-02 18:29:07] [config] n-best: false
[2020-07-02 18:29:07] [config] no-nccl: false
[2020-07-02 18:29:07] [config] no-reload: false
[2020-07-02 18:29:07] [config] no-restore-corpus: false
[2020-07-02 18:29:07] [config] normalize: 0
[2020-07-02 18:29:07] [config] normalize-gradient: false
[2020-07-02 18:29:07] [config] num-devices: 0
[2020-07-02 18:29:07] [config] optimizer: adam
[2020-07-02 18:29:07] [config] optimizer-delay: 1
[2020-07-02 18:29:07] [config] optimizer-params:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] overwrite: true
[2020-07-02 18:29:07] [config] precision:
[2020-07-02 18:29:07] [config] - float32
[2020-07-02 18:29:07] [config] - float32
[2020-07-02 18:29:07] [config] - float32
[2020-07-02 18:29:07] [config] pretrained-model: ""
[2020-07-02 18:29:07] [config] quiet: false
[2020-07-02 18:29:07] [config] quiet-translation: false
[2020-07-02 18:29:07] [config] relative-paths: false
[2020-07-02 18:29:07] [config] right-left: false
[2020-07-02 18:29:07] [config] save-freq: 1000
[2020-07-02 18:29:07] [config] seed: 0
[2020-07-02 18:29:07] [config] shuffle: data
[2020-07-02 18:29:07] [config] shuffle-in-ram: false
[2020-07-02 18:29:07] [config] skip: false
[2020-07-02 18:29:07] [config] sqlite: ""
[2020-07-02 18:29:07] [config] sqlite-drop: false
[2020-07-02 18:29:07] [config] sync-sgd: false
[2020-07-02 18:29:07] [config] tempdir: /tmp
[2020-07-02 18:29:07] [config] tied-embeddings: true
[2020-07-02 18:29:07] [config] tied-embeddings-all: false
[2020-07-02 18:29:07] [config] tied-embeddings-src: false
[2020-07-02 18:29:07] [config] train-sets:
[2020-07-02 18:29:07] [config] - /home/ubuntu/hin_all.txt
[2020-07-02 18:29:07] [config] - /home/ubuntu/eng_all.txt
[2020-07-02 18:29:07] [config] transformer-aan-activation: swish
[2020-07-02 18:29:07] [config] transformer-aan-depth: 2
[2020-07-02 18:29:07] [config] transformer-aan-nogate: false
[2020-07-02 18:29:07] [config] transformer-decoder-autoreg: self-attention
[2020-07-02 18:29:07] [config] transformer-depth-scaling: false
[2020-07-02 18:29:07] [config] transformer-dim-aan: 2048
[2020-07-02 18:29:07] [config] transformer-dim-ffn: 2048
[2020-07-02 18:29:07] [config] transformer-dropout: 0
[2020-07-02 18:29:07] [config] transformer-dropout-attention: 0
[2020-07-02 18:29:07] [config] transformer-dropout-ffn: 0
[2020-07-02 18:29:07] [config] transformer-ffn-activation: swish
[2020-07-02 18:29:07] [config] transformer-ffn-depth: 2
[2020-07-02 18:29:07] [config] transformer-guided-alignment-layer: last
[2020-07-02 18:29:07] [config] transformer-heads: 8
[2020-07-02 18:29:07] [config] transformer-no-projection: false
[2020-07-02 18:29:07] [config] transformer-postprocess: dan
[2020-07-02 18:29:07] [config] transformer-postprocess-emb: d
[2020-07-02 18:29:07] [config] transformer-preprocess: ""
[2020-07-02 18:29:07] [config] transformer-tied-layers:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] transformer-train-position-embeddings: false
[2020-07-02 18:29:07] [config] type: s2s
[2020-07-02 18:29:07] [config] ulr: false
[2020-07-02 18:29:07] [config] ulr-dim-emb: 0
[2020-07-02 18:29:07] [config] ulr-dropout: 0
[2020-07-02 18:29:07] [config] ulr-keys-vectors: ""
[2020-07-02 18:29:07] [config] ulr-query-vectors: ""
[2020-07-02 18:29:07] [config] ulr-softmax-temperature: 1
[2020-07-02 18:29:07] [config] ulr-trainable-transformation: false
[2020-07-02 18:29:07] [config] unlikelihood-loss: false
[2020-07-02 18:29:07] [config] valid-freq: 10000
[2020-07-02 18:29:07] [config] valid-log: ""
[2020-07-02 18:29:07] [config] valid-max-length: 1000
[2020-07-02 18:29:07] [config] valid-metrics:
[2020-07-02 18:29:07] [config] - cross-entropy
[2020-07-02 18:29:07] [config] valid-mini-batch: 32
[2020-07-02 18:29:07] [config] valid-reset-stalled: false
[2020-07-02 18:29:07] [config] valid-script-args:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] valid-script-path: ""
[2020-07-02 18:29:07] [config] valid-sets:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] valid-translation-output: ""
[2020-07-02 18:29:07] [config] vocabs:
[2020-07-02 18:29:07] [config] []
[2020-07-02 18:29:07] [config] word-penalty: 0
[2020-07-02 18:29:07] [config] word-scores: false
[2020-07-02 18:29:07] [config] workspace: 5000
[2020-07-02 18:29:07] [config] Model is being created with Marian v1.9.0 ba94c5b9 2020-05-17 10:42:17 +0100
[2020-07-02 18:29:07] Using single-device training
[2020-07-02 18:29:07] No vocabulary files given, trying to find or build based on training data. Vocabularies will be built separately for each file.
[2020-07-02 18:29:07] No vocabulary path given; trying to find default vocabulary based on data path /home/ubuntu/hin_all.txt
[2020-07-02 18:29:07] [data] Loading vocabulary from JSON/Yaml file /home/ubuntu/hin_all.txt.yml
[2020-07-02 18:29:11] [data] Setting vocabulary size for input 0 to 388597
[2020-07-02 18:29:11] No vocabulary path given; trying to find default vocabulary based on data path /home/ubuntu/eng_all.txt
[2020-07-02 18:29:11] [data] Loading vocabulary from JSON/Yaml file /home/ubuntu/eng_all.txt.yml
[2020-07-02 18:29:13] [data] Setting vocabulary size for input 1 to 304290
[2020-07-02 18:29:13] Compiled without MPI support. Falling back to FakeMPIWrapper
[2020-07-02 18:29:13] [batching] Collecting statistics for batch fitting with step size 10
[2020-07-02 18:29:13] [memory] Extending reserved space to 5120 MB (device gpu0)
[2020-07-02 18:29:14] [logits] applyLossFunction() for 1 factors
[2020-07-02 18:29:14] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:14] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2020-07-02 18:29:14] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:30] [batching] Done. Typical MB size is 1014 target words
[2020-07-02 18:29:30] [memory] Extending reserved space to 5120 MB (device gpu0)
[2020-07-02 18:29:30] Training started
[2020-07-02 18:29:30] [data] Shuffling data
[2020-07-02 18:29:30] [data] Done reading 1566840 sentences
[2020-07-02 18:29:37] [data] Done shuffling 1566840 sentences to temp files
[2020-07-02 18:29:38] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:38] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:29:39] [memory] Reserving 2967 MB, device gpu0
[2020-07-02 18:29:39] [memory] Reserving 1483 MB, device gpu0
[2020-07-02 18:39:20] Ep. 1 : Up. 1000 : Sen. 64,389 : Cost 7.17595577 : Time 590.50s : 1461.74 words/s
[2020-07-02 18:39:20] Saving model weights and runtime parameters to /home/ubuntu/model.s2s/model.npz.orig.npz
[2020-07-02 18:39:26] Saving model weights and runtime parameters to /home/ubuntu/model.s2s/model.npz
[2020-07-02 18:39:31] Saving Adam parameters to /home/ubuntu/model.s2s/model.npz.optimizer.npz
I have tried increasing my workspace, reducing maxi-batch (down to 1000), and saving more frequently, but nothing seems to work. Can someone help me pinpoint what exactly is wrong?
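For what it's worth, here is my own back-of-the-envelope estimate (not Marian's actual allocator, just plain float32 matrix sizes) of how much memory the embedding tables alone take at these vocabulary sizes, using `dim-emb: 512` and `float32` precision as shown in the config above:

```python
def embedding_mb(vocab_size: int, dim_emb: int = 512, bytes_per_param: int = 4) -> float:
    """Approximate size in MB of one vocab_size x dim_emb float32 matrix."""
    return vocab_size * dim_emb * bytes_per_param / (1024 ** 2)

# Vocabulary sizes reported in the log above.
print(round(embedding_mb(388597)))  # source embeddings: ~759 MB
print(round(embedding_mb(304290)))  # target embeddings (tied with the output layer): ~594 MB
```

On top of these parameters, every training step materializes activations for a ~304k-way output softmax, which scale with vocabulary size times the number of target words in the batch, so a 5000 MB workspace can plausibly run out of memory on unlucky batches. The usual remedy is a much smaller shared subword vocabulary (e.g. SentencePiece/BPE in the 32k range) rather than word-level vocabularies of this size.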
I am planning to implement models with more layers and dropout and wanted to use this setup as a starting point, but the lack of success here is keeping me from experimenting with more complex models.