Error using guided-alignment and fp16 together

Jeremiah Chow

Apr 15, 2022, 10:16:53 AM
to marian-nmt
Hi, happy holidays

I am using Marian 1.11.0 on Pop!_OS (Ubuntu 21) with --guided-alignment and --fp16, on CUDA 11 with an NVIDIA RTX 3090.

It gives:
[2022-04-15 22:06:35] [logits] Applying loss function for 1 factor(s)
[2022-04-15 22:06:35] Error: Child 1 has different type (first: float32 != child: float16)
[2022-04-15 22:06:35] Error: Aborted from static marian::Type marian::NaryNodeOp::commonType(const std::vector<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase> > > >&) in /home/bobafett/marian/marian/src/graph/node.h:207

This does not happen when I remove --fp16.
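
For readers unfamiliar with this kind of error: the message suggests that an operation combining several inputs expects all of them to share one element type, and that one input stayed in float32 while another was float16. The following is a toy Python illustration of such a check, not Marian's actual C++ code; `common_type` is a hypothetical name modeled on the function named in the log.

```python
def common_type(child_types):
    """Return the element type shared by all children, or raise on a mismatch.

    Toy illustration only: `child_types` is a non-empty list of type names
    such as "float32" or "float16".
    """
    first = child_types[0]
    for index, child in enumerate(child_types):
        if child != first:
            raise TypeError(
                f"Child {index} has different type "
                f"(first: {first} != child: {child})"
            )
    return first
```

Calling `common_type(["float32", "float16"])` raises an error with the same wording as the log line above, while a list of identical types passes.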

The settings were pretty much copied from OPUS-MT en-zho, whose authors used guided-alignment with fp16, seemingly without problems.

I also attach the training log for more info.  Any help is much appreciated, thanks!

JC


enzhv4train.log

Marcin Junczys-Dowmunt

Apr 15, 2022, 5:55:26 PM
to maria...@googlegroups.com

Hi,

Yes, this will be fixed in the next release, which should arrive in a couple of days (over the weekend most likely). It’s fixed internally, but not yet pushed to the public.

 


--
You received this message because you are subscribed to the Google Groups "marian-nmt" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marian-nmt+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/marian-nmt/9f04cbc9-dd12-4805-8bd7-26b99e9d7d7fn%40googlegroups.com.

 

Jeremiah Chow

Apr 16, 2022, 3:13:05 AM
to marian-nmt
That's good to know! Thanks for all the hard work and for pushing the release early :). I will test it and report back.

Jeremiah Chow

Apr 16, 2022, 9:40:19 AM
to marian-nmt
Potential bug in the latest marian-dev version, log attached

I git cloned the latest version from marian-dev about 20 hours ago and used it to train two models (Eng-Zho, Zho-Eng).  I stopped their training, restarted, and they both show
[2022-04-16 21:30:31] Loading Adam parameters
[2022-04-16 21:30:31] [warn] Adam parameters not found in .npz file
[2022-04-16 21:30:31] No master parameters found in checkpoint, parameters reloaded from last inference model

It seems the latest marian-dev version has trouble saving or reloading the parameters. I tried marian-decoder from the latest marian-dev and marian-server from the stable release (v1.11.0), and both have trouble loading the parameters.

Thought I would share it here in case it helps, cheers. I can upload the files to my Dropbox (about 5.1 GB per model) if needed.
JC
enzhv4train.log

Marcin Junczys-Dowmunt

Apr 16, 2022, 11:18:38 AM
to maria...@googlegroups.com

You are reloading from the model npz only, if I am not wrong. You also need the *.optimizer.npz, that’s where the optimizer parameters sit.
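
A quick way to confirm that both files are where a restart would look for them is sketched below in Python. It assumes the optimizer state sits next to the model under the name "<model file name>.optimizer.npz", as the file names in this thread suggest; `checkpoint_report` is a hypothetical helper, not part of Marian.

```python
from pathlib import Path


def checkpoint_report(model_path):
    """Describe the model file and the optimizer file expected beside it."""
    model = Path(model_path)
    # Assumption: optimizer state is stored as "<model file name>.optimizer.npz"
    # in the same directory as the model.
    optimizer = model.with_name(model.name + ".optimizer.npz")
    report = {}
    for label, path in (("model", model), ("optimizer", optimizer)):
        size = path.stat().st_size if path.is_file() else None
        report[label] = {"path": str(path), "bytes": size}
    return report
```

For example, `checkpoint_report("../enzhv4/enzhv4.npz")` returns the path and size of each file, with `bytes` set to `None` for a file that is missing.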

Jeremiah Chow

Apr 16, 2022, 12:07:21 PM
to marian-nmt
Yes, I understand that is the case. That's why I added the --relative-paths option, but Marian does not see the npz.optimizer.npz file, which is in the same folder. Is there any file I need to modify (such as one of the .yml files) to point Marian to the optimizer file? Thanks and cheers

Screenshot from 2022-04-17 00-04-53.png

Marcin Junczys-Dowmunt

Apr 16, 2022, 12:15:55 PM
to maria...@googlegroups.com

Hm, there is no way this doesn’t work, we have tons of automatic regression tests for that. At least for the default case.

What are the exact commands you ran for training and restarted training?

 


Jeremiah Chow

Apr 16, 2022, 12:23:07 PM
to marian-nmt
Hi,

I tried this in a .sh bash script
marian --guided-alignment zhen.align2 --early-stopping 15 --beam-size 12 --normalize 1 --allow-unk --overwrite --keep-best --model ../zhenv4/zhenv4.npz --type transformer --train-sets zhcleaned.txt encleaned.txt --max-length 500 --vocabs ../vocab.yml ../vocab.yml --mini-batch-fit -w 4096 --maxi-batch 500 --save-freq 10000 --disp-freq 500 --log ../zhenv4/zhenv4train.log --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0001 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings-all --devices 1 --sync-sgd --seed 1111 --exponential-smoothing --after 15e --valid-freq 70000 --valid-mini-batch 4 --valid-metrics perplexity --valid-log ../zhenv4/zhenv4valid.log --valid-sets zh1Mx50.txt en1Mx50.txt --valid-translation-output ../zhenv4/testout.txt --relative-paths true

and this
marian --guided-alignment enzh_clean.align --early-stopping 15 --beam-size 12 --normalize 1 --allow-unk --overwrite --keep-best --model ../enzhv4/enzhv4.npz --type transformer --train-sets encleaned.txt zhcleaned.txt --max-length 500 --vocabs ../vocab.yml ../vocab.yml --mini-batch-fit -w 4096 --maxi-batch 500 --save-freq 10000 --disp-freq 500 --log ../enzhv4/enzhv4train.log --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings-all --devices 0 --sync-sgd --seed 1111 --exponential-smoothing --after 15e --valid-freq 70000 --valid-mini-batch 4 --valid-metrics perplexity --valid-log ../enzhv4/enzhv4valid.log --valid-sets en1Mx50.txt zh1Mx50.txt --valid-translation-output ../enzhv4/testout.txt

To restart, I just press the up arrow and re-run the same bash {train.sh} command.

Hope this helps.

Marcin Junczys-Dowmunt

Apr 16, 2022, 12:32:13 PM
to maria...@googlegroups.com

Hm, I looked at the code. These messages appear when the checkpoint file is found but does not contain the expected data. I would say *.optimizer.npz is corrupted. Is it possible you interrupted training while *.optimizer.npz was not yet fully written to disk?
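
One way to test for a truncated or damaged file is sketched below in Python. It relies only on the fact that a .npz file is a zip archive of .npy members, so the standard zipfile module can open it and verify each member's checksum. `inspect_npz` is a hypothetical helper; it says nothing about which arrays Marian expects inside the file, only whether the archive itself is readable.

```python
import zipfile


def inspect_npz(path):
    """Return (ok, details) for a .npz file, treated as a zip archive.

    On success, details is a list of (member name, uncompressed size) pairs.
    On failure, details is a short description of the problem.
    """
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the first name whose
            # checksum does not match, or None if all members are intact.
            bad_member = archive.testzip()
            if bad_member is not None:
                return False, f"CRC check failed for member {bad_member!r}"
            members = [(info.filename, info.file_size) for info in archive.infolist()]
            return True, members
    except (zipfile.BadZipFile, OSError) as err:
        return False, f"cannot open archive: {err}"
```

A file that was cut off mid-write typically loses the zip directory stored at its end, so it fails to open at all. A file that opens cleanly but lacks the expected arrays would point to a format or content problem instead.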

Jeremiah Chow

Apr 16, 2022, 12:44:01 PM
to marian-nmt
Hmm, that does not seem likely to me. I have Dropbox running in the background, which handles uploads, but I have used the same setup and machine for the last 14 days and nothing similar has happened.

I did modify the decoder.yml file to launch a marian-server to test the models (that is how I found out that the missing model weights are not specific to marian, but also affect marian-decoder and marian-server), but again this should not affect the files.

Overall, I would say it is not very likely that the problem would hit both trainings right after I switched to marian-dev.

If you prefer I can send you a link to my dropbox for both model folders, plus the binaries I used. 

I saved the compile log of marian-dev in journey.log and AFAIK no compilation error was thrown.

What would be the appropriate email to send to?  Can't seem to see your full email address here.  Cheers

Marcin Junczys-Dowmunt

Apr 16, 2022, 12:49:29 PM
to maria...@googlegroups.com

Oh wait. You “switched to marian-dev”: do you mean from an older release of Marian? We did change the optimizer.npz format at some point, so old checkpoints would not be expected to work. Old model files are still compatible.

Jeremiah Chow

Apr 16, 2022, 12:56:19 PM
to marian-nmt
Ah, let me clarify.

I compiled the latest marian-dev about 24 hours ago (1 am) and started the enzh training with dev's marian immediately. At around 9:30 am, after generating the word-alignment file, I started the zhen training with dev's marian.

I started tinkering with dev's marian-decoder and the original release's marian-server at around 6 pm and discovered the problem. Then I stopped training to check the files. I could not restart either training; both gave the same "missing weights" error.

I am giving the times so that it is easier to read the log file I attached to my previous message.

To clarify, both trainings were "brand new" from marian-dev :)

Marcin Junczys-Dowmunt

Apr 16, 2022, 1:03:49 PM
to maria...@googlegroups.com

Then my bet is still a corrupted optimizer file. It’s finding the file, otherwise you would get a different error message, but then it’s not finding the right fields inside the file. There isn’t really any other option than a bad optimizer file, however that might have happened.

Could Dropbox syncing be responsible? I have lost data due to bad syncing via Dropbox in the past.

Marcin Junczys-Dowmunt

Apr 16, 2022, 1:09:30 PM
to maria...@googlegroups.com

Ah, email address: marc...@microsoft.com

I am curious to take a look at the optimizer.npz

Jeremiah Chow

Apr 16, 2022, 1:44:23 PM
to marian-nmt
I have sent it to your email, thanks!
