Help for training Akkadian language for Tesseract 4 needed


Wincent Balin

unread,
Feb 16, 2020, 3:16:30 PM2/16/20
to tesseract-ocr
Hello all,

after preparing ground-truth files for the Akkadian language, I started training using the tesstrain Makefile, but more than 4,000,000 iterations later the output looks like the following:

At iteration 4437804/4478900/4478900, Mean rms=1.453%, delta=9.455%, char train=121.423%, word train=87.461%, skip ratio=0%,  wrote checkpoint.

Does char train=121% mean a CER of 121%? What could cause such high values even after more than 10 days of training?

Yours truly,

Wincent

Shree Devi Kumar

unread,
Feb 17, 2020, 2:23:38 AM2/17/20
to tesseract-ocr
Try lstmtraining again for 1000 iterations with --debug_level -1.




--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/79acb8ca-cb51-4e23-8853-ca4b3405a718%40googlegroups.com.

Shree Devi Kumar

unread,
Feb 17, 2020, 2:53:26 AM2/17/20
to tesseract-ocr
I did a test training for Akkadian some time back. I will see if I still have the files.

Shree Devi Kumar

unread,
Feb 17, 2020, 11:30:26 PM2/17/20
to tesseract-ocr
Please see https://github.com/Shreeshrii/tessdata_akk where I have uploaded the traineddata files.
I don't have notes on the different versions, but you can unpack them with `combine_tessdata -u`.
I vaguely remember a problem where the model was not converging. It worked better after I removed some fonts (this training was done using tesstrain.sh).
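As a sketch, unpacking those files could look like this (the file names are examples, assuming the traineddata files from the repo have been downloaded into the current directory; `combine_tessdata` ships with Tesseract):

```shell
# Unpack each downloaded traineddata file into its own prefix so the
# component files (unicharset, LSTM model, dawgs, ...) can be compared.
for td in akk*.traineddata; do
  [ -e "$td" ] || continue            # skip if nothing downloaded yet
  prefix="${td%.traineddata}."        # e.g. akk.traineddata -> "akk."
  combine_tessdata -u "$td" "$prefix"
done
```

Each run then produces files such as `akk.unicharset` next to the original traineddata.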


Wincent Balin

unread,
Feb 22, 2020, 4:22:19 AM2/22/20
to tesseract-ocr
Hello Shree,

I tried that. The command was

lstmtraining \
  --traineddata data/akk/akk.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/akk-1m.traineddata \
  --continue_from data/akk-1m/akk.lstm \
  --model_output data/akk/checkpoints/akk \
  --train_listfile data/akk/list.train \
  --eval_listfile data/akk/list.eval \
  --max_iterations 1000 \
  --debug_level -1

and the output started with

Loaded file data/akk/checkpoints/akk_checkpoint, unpacking...
Successfully restored trainer from data/akk/checkpoints/akk_checkpoint
Loaded 1/1 pages (1-1) of document data/akk-ground-truth/P336598.000347.CuneiformComposite.exp0.lstmf
Loaded 1/1 pages (1-1) of document data/akk-ground-truth/P238121.000012.CuneiformNAOutline_Medium.exp0.lstmf

and ended with

Loaded 1/1 pages (1-1) of document data/akk-ground-truth/Q005388.000005.Segoe_UI_Historic.exp0.lstmf
At iteration 4716762/4760600/4760600, Mean rms=1.436%, delta=8.366%, char train=105.86%, word train=86.31%, skip ratio=0%,  wrote checkpoint.

Finished! Error rate = 88.246

Do I have to retrain completely from scratch, meaning without loading the previous checkpoint?

Maybe I should also try another approach of yours and train with one font excluded, so that the LSTM converges.

Another thought: I tried training Akkadian with Tesseract 4 once before, but with ground truth consisting of short text files with multiple lines of text, not one-liners. Accordingly, I used PSM 6 rather than PSM 11. Is there anything wrong with this approach?



Shree Devi Kumar

unread,
Feb 22, 2020, 4:56:46 AM2/22/20
to tesseract-ocr
Try with the following, i.e. with a new output name, so that training starts again from iteration 0. The debug output for each iteration (line of text) will show you whether any particular font is not aligning or whether there are other issues.

lstmtraining \
  --traineddata data/akk/akk.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/akk-1m.traineddata \
  --continue_from data/akk-1m/akk.lstm \
  --model_output data/akk/checkpoints/akkNEW \
  --train_listfile data/akk/list.train \
  --eval_listfile data/akk/list.eval \
  --max_iterations 1000 \
  --debug_level -1




Wincent Balin

unread,
Mar 22, 2020, 4:37:57 PM3/22/20
to tesser...@googlegroups.com
Hello Shree,

I used the Makefile, which does almost the same thing, with this command:

make MODEL_NAME=akk PSM=11 MAX_ITERATIONS=1000 DEBUG_INTERVAL=-1 training

I attached the log file. I cut out most of the ground-truth conversion to lstmf/box files at line 15.

How come all the characters that appear are Unicode replacement characters? Did I misconfigure something?

Is the warning at line 75 important?

What does null char=374 at line 93 mean?

training.log

Shree Devi Kumar

unread,
Mar 24, 2020, 5:58:49 AM3/24/20
to tesseract-ocr
How come all the characters that appear are Unicode replacement characters? Did I misconfigure something?

This could be a locale or encoding issue. The ground truth needs to be Unicode text files; I open the files in Notepad++ on Windows 10 and encode them as UTF-8. I run the training on an Ubuntu machine remotely.
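A quick way to spot a mis-encoded transcription is to run it through iconv. A minimal sketch, assuming the `*.gt.txt` naming convention used by tesstrain (adjust the directory to your own layout):

```shell
# check_utf8 DIR: list every *.gt.txt transcription in DIR that is not
# valid UTF-8 (iconv exits non-zero on the first invalid byte sequence).
check_utf8() {
  for f in "$1"/*.gt.txt; do
    [ -e "$f" ] || continue
    if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
      echo "Not valid UTF-8: $f"
    fi
  done
}

# e.g. check_utf8 data/akk-ground-truth
```

Any file it reports is a candidate for the replacement-character problem.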

Is the warning at line 75 important?

No. I usually give a 0 in the network spec, and then it uses the number of characters in the unicharset.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from data/eng/eng.lstm
Appending a new network to an old one!!Warning: given outputs 1 not equal to unicharset of 130.
Num outputs,weights in Series:
  Lfx96:96, 74112
  Fc130:130, 12610
Total weights = 86722
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx96Fc130] from request [Lfx 96 O1c1]
Training parameters:
  Debug interval = -1, weights = 0.1, learning rate = 0.001, momentum=0.5
null char=2

What does null char=374 at line 93 mean?

I don't know. Please look at the unicharset files; they usually have a line related to NULL right near the top.
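For example, a hypothetical helper to locate that entry (the path depends on your tesstrain layout; in a unicharset the first line is the character count, and the NULL entry normally sits close to the top):

```shell
# null_line UNICHARSET: print the 1-based line number of the NULL entry,
# which should normally appear near the top of the file.
null_line() {
  grep -n '^NULL' "$1" | head -n 1 | cut -d: -f1
}

# e.g. null_line data/akk/unicharset
```

If the NULL entry is missing or sits unusually far down, that would be worth investigating.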

Shree Devi Kumar

unread,
Mar 26, 2020, 6:51:13 AM3/26/20
to tesseract-ocr
Please see https://github.com/Shreeshrii/tesstrain-akk which has the LSTM training input, training steps and resulting traineddata files.

You can change the training text and fonts to customize and further finetune the models.

Shree Devi Kumar

unread,
Mar 26, 2020, 6:54:48 AM3/26/20
to tesseract-ocr
Wincent,

FYI, I use a combination of a bash script and the Makefile for running training, since I am not able to fully control the processing via the Makefile alone.


