Training Tesseract 5.0.0 to recognize digital handwriting


Fabio Lugli

Jan 14, 2020, 11:43:40 AM
to tesseract-ocr
Hello everyone, I'm trying to train Tesseract on handwriting, knowing that it's not the best option, using the latest version available for Windows. I have access to a huge number of .tif files, each a line of handwritten text, and I'm able to obtain the .box files, which I then edit to comply with the latest requirements (boxes spanning the whole line, spaces between words, a tab at the end). After that, I did not understand how to improve eng.traineddata or how to create my own .traineddata file, even after following the instructions at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
So what are the next steps to obtain a correct training dataset?

Shree Devi Kumar

Jan 15, 2020, 9:29:57 AM
to tesseract-ocr
Take a look at tesseract-ocr/tesstrain


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b736f06c-0627-41ad-bd2a-6dcad01b4576%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Fabio Lugli

Jan 15, 2020, 9:32:58 AM
to tesseract-ocr
After some work I am able to:
- Use the lstmbox config of tesseract.exe to obtain the .box files for my .tif images
- Use the third-party software jTessBoxEditor to correct the recognized characters, leaving boxes that span the full line of text
- Use the lstm.train config of tesseract.exe to obtain the .lstmf files from the .box files
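
The steps above can be sketched as two shell functions. This is only a sketch under stated assumptions: `tesseract` is on the PATH, every .tif holds a single text line, and `--psm 13` (raw line) is my choice rather than something stated in the thread. The function names `make_boxes` and `make_lstmf` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes `tesseract` is on PATH and each .tif holds one text line.
# --psm 13 (treat the image as a single raw line) is an assumption.

# Step 1: write a draft <name>.box next to every <name>.tif in directory $1.
make_boxes() {
  local dir=$1 img
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue          # no .tif files: the glob stayed literal
    tesseract "$img" "${img%.tif}" --psm 13 lstmbox
  done
}

# Step 3 (after correcting the boxes by hand): write <name>.lstmf per pair.
make_lstmf() {
  local dir=$1 img
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue
    if [ ! -f "${img%.tif}.box" ]; then
      echo "skipping $img: no matching .box file" >&2
      continue
    fi
    tesseract "$img" "${img%.tif}" --psm 13 lstm.train
  done
}
```

Running `make_boxes ./lines`, correcting the boxes, and then running `make_lstmf ./lines` would leave one .lstmf per image in the same folder (`./lines` is a placeholder directory).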

Now, when I try to use lstmtraining.exe with eng.traineddata as the starting traineddata, I get this error:

Deserialize header failed: [myfile1].lstmf
Deserialize header failed: [myfile2].lstmf
Deserialize header failed: [myfile3].lstmf
Loaded 1/1 lines (1-1) of document [myfile4].lstmf
Load of images failed!!

From this I gather that the error is either in how I create the .lstmf files or in the images I have selected. Any suggestion is welcome.

Fabio Lugli

Jan 15, 2020, 9:35:15 AM
to tesseract-ocr
Thanks for the suggestion. I already tried this one, but I will try again!

Shree Devi Kumar

Jan 15, 2020, 9:38:23 AM
to tesseract-ocr
Please share a couple of lstmf files for testing.


Fabio Lugli

Jan 15, 2020, 9:45:40 AM
to tesseract-ocr
Yes, I forgot to do it in my last post. I am sharing a couple of the images and their corresponding .box and .lstmf files. The others that I have tried so far are very similar to these.
eng.test.pro1.box
eng.test.pro1.lstmf
eng.test.pro1.tif
eng.test.pro5.box
eng.test.pro5.lstmf
eng.test.pro5.tif

Fabio Lugli

Jan 15, 2020, 11:04:53 AM
to tesseract-ocr
I tried this route again, not remembering where I had got stuck. After following all the instructions and running make training, the terminal hangs at the first step:

unicharset_extractor --output_unicharset "data/eng/unicharset" --norm_mode 2 "data/eng/all-gt"

From here it does nothing, even after leaving the computer running all night. Other targets such as make lists do not hang.
For information, I'm running this procedure on Ubuntu 16.04 on Windows, through the WSL that you can download from the Microsoft Store. I don't know if this may be the issue.

Shree Devi Kumar

Jan 16, 2020, 4:45:59 AM
to tesseract-ocr
Are you sure you have the files in the right places? It seems to work for me...

ubuntu@tesseract-ocr:~/tesseract$ cd ../TEST/lstmf
ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack  eng.test.pro1.lstmf
Extracting eng.test.pro1.lstmf...
Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
ubuntu@tesseract-ocr:~/TEST/lstmf$ ls
eng.test.pro1_0.gt.txt  eng.test.pro1_0.png  eng.test.pro1.box  eng.test.pro1.lstmf  eng.test.pro1.tif  eng.test.pro5.box  eng.test.pro5.lstmf  eng.test.pro5.tif  fabio
ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack  eng.test.pro5.lstmf
Extracting eng.test.pro5.lstmf...
Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
ubuntu@tesseract-ocr:~/TEST/lstmf$ ls -1 *.lstmf > all-lstmf
ubuntu@tesseract-ocr:~/TEST/lstmf$
ubuntu@tesseract-ocr:~/TEST/lstmf$  rm -rf ./lowercase_cursive
ubuntu@tesseract-ocr:~/TEST/lstmf$  mkdir -p ./lowercase_cursive
ubuntu@tesseract-ocr:~/TEST/lstmf$  #
ubuntu@tesseract-ocr:~/TEST/lstmf$  combine_tessdata -e ~/tessdata_best/eng.traineddata \
>  ./lowercase_cursive/eng.lstm
Extracting tessdata components from /home/ubuntu/tessdata_best/eng.traineddata
Wrote ./lowercase_cursive/eng.lstm
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
ubuntu@tesseract-ocr:~/TEST/lstmf$ #
ubuntu@tesseract-ocr:~/TEST/lstmf$ time lstmtraining \
>   --debug_interval  -1 \
>   --model_output ./lowercase_cursive/impact \
>   --continue_from ./lowercase_cursive/eng.lstm \
>   --train_listfile /home/ubuntu/TEST/lstmf/all-lstmf \
>   --traineddata ~/tessdata_best/eng.traineddata \
>   --max_iterations 400
Loaded file ./lowercase_cursive/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from ./lowercase_cursive/eng.lstm
Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
Iteration 0: GROUND  TRUTH : nominating any more Labour life Peers
Iteration 0: ALIGNED TRUTH : nominating any moree Labour life Peers
Iteration 0: BEST OCR TEXT : wominadng  ang wow.  Lobowr Lfe_ "Paoro
File eng.test.pro1.lstmf line 0 :
Mean rms=3.82%, delta=18.848%, train=75.676%(100%), skip ratio=0%
Iteration 1: GROUND  TRUTH : Griffiths, MP for Mancheste Exchange
Iteration 1: ALIGNED TRUTH : Griiffiths, MP for Mancheste Exchanngee
Iteration 1: BEST OCR TEXT : Galbhtha , UP Roe Mowomadl) Cxerlaomqre
File eng.test.pro5.lstmf line 0 :
Mean rms=3.908%, delta=20.581%, train=86.449%(100%), skip ratio=0%
Iteration 2: GROUND  TRUTH : nominating any more Labour life Peers
Iteration 2: BEST OCR TEXT : wominading any wone. Lobowr Lfe. "Paoro
File eng.test.pro1.lstmf line 0 :
Mean rms=3.74%, delta=19.305%, train=75.651%(94.444%), skip ratio=0%
Iteration 3: GROUND  TRUTH : Griffiths, MP for Mancheste Exchange
Iteration 3: ALIGNED TRUTH : Griffiths, MP for Mancheste Exchanngee
Iteration 3: BEST OCR TEXT : Galbhtha , MUP foe Manomadl) Cxclaomgle
File eng.test.pro5.lstmf line 0 :
Mean rms=3.708%, delta=18.921%, train=78.266%(95.833%), skip ratio=0%
Iteration 4: GROUND  TRUTH : nominating any more Labour life Peers
Iteration 4: BEST OCR TEXT : wominading any wone Loabour Lfe. "Paro


Fabio Lugli

Jan 16, 2020, 5:29:54 AM
to tesseract-ocr
The command tesseract unpack is not recognized by my version of tesseract. Is it a utility of your own, or is it already available in a release?
Anyway, does it only extract the .box, .gt.txt and .tif files? If so, can I simply copy those files into the folder?

Shree Devi Kumar

Jan 16, 2020, 6:04:50 AM
to tesseract-ocr
tesseract unpack is a new feature by @stweil - not yet in the master branch. I was testing to see that your lstmf files are read correctly and they are.

For tesstrain, all you need are single line images and their gt.txt.
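
As a minimal sketch of that requirement, the hypothetical helper below lists line images that have no matching transcription. It assumes the tesstrain convention that `name.tif` (or `name.png`) is paired with `name.gt.txt` in the same folder. The folder name itself depends on the tesstrain version, so check its README.

```shell
# Sketch only: list line images in directory $1 that lack a <name>.gt.txt file.
missing_gt() {
  local dir=$1 img base
  for img in "$dir"/*.tif "$dir"/*.png; do
    [ -e "$img" ] || continue        # skip patterns that matched nothing
    base=${img%.*}                   # strip the image extension
    [ -f "$base.gt.txt" ] || echo "$img"
  done
}
```

If `missing_gt <ground-truth-folder>` prints nothing, every image has its transcription and the folder should be ready for `make training`.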

I ran lstmtraining using your lstmf files, which worked fine.

If you want to test, try the following in a directory where you have the two sample lstmf files.
Change  ~/tessdata_best to wherever you have the best traineddata file.

ls -1 *.lstmf > all-lstmf
mkdir -p ./testdir
combine_tessdata -e ~/tessdata_best/eng.traineddata   ./testdir/eng.lstm

time lstmtraining \
   --debug_interval  -1 \
   --model_output ./testdir/impact \
   --continue_from ./testdir/eng.lstm \
   --train_listfile all-lstmf \
   --traineddata ~/tessdata_best/eng.traineddata \
   --max_iterations 400

Fabio Lugli

Jan 16, 2020, 7:13:49 AM
to tesseract-ocr
I still get the error, but I now understand that it comes from how I write the all-lstmf file, from which lstmtraining cannot find the images. Right now I write this into it:

[FULL PATH TO MY FILE]/eng.test.pro0.lstmf
[FULL PATH TO MY FILE]/eng.test.pro1.lstmf
[FULL PATH TO MY FILE]/eng.test.pro2.lstmf
etc.

Am I correct in saying that this is not what I should have inside all-lstmf?

Shree Devi Kumar

Jan 16, 2020, 7:28:06 AM
to tesseract-ocr
Full path should work.
Are you using Windows? Check the EOL character. It needs to be in Unix format.
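
A minimal sketch of that check, using only standard tools: the hypothetical helpers `has_crlf` and `to_unix_eol` detect and strip carriage returns from a list file such as all-lstmf. Where it is installed, `dos2unix` does the same job.

```shell
# Sketch only: detect and strip Windows carriage returns in a list file.

# Succeeds (exit 0) if file $1 contains at least one carriage return.
has_crlf() {
  [ -n "$(tr -cd '\r' < "$1")" ]
}

# Rewrite file $1 in place with LF-only line endings.
to_unix_eol() {
  local tmp status=0
  tmp=$(mktemp) || return 1
  tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1" || status=1
  rm -f "$tmp"
  return $status
}
```

For example, `has_crlf all-lstmf && to_unix_eol all-lstmf` converts the file only when it needs it. A trailing carriage return makes each listed path end in an invisible extra character, which would explain files that exist but cannot be opened.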


Fabio Lugli

Jan 16, 2020, 8:00:00 AM
to tesseract-ocr
Thank you very much, now the files are loaded. But, of course, after one small step forward I hit another wall:

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from ./tessdata/unpacked/eng.lstm
Loaded 1/1 lines (1-1) of document eng.test.pro0.lstmf
Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
Loaded 1/1 lines (1-1) of document eng.test.pro3.lstmf
Loaded 2/2 lines (1-2) of document eng.test.pro2.lstmf
Iteration 0: GROUND  TRUTH : A MOVE to stop Mr.Gaitskell from
Iteration 0: ALIGNED TRUTH : A MOVE to stop Mr.Gaittsskell from
Iteration 0: BEST OCR TEXT : k MOVE t0 stoe Mr. GarkkeldR Prom
File eng.test.pro0.lstmf line 0 :

And then nothing: it just returns to the terminal prompt. Could running on Windows be the cause of this issue?

P.S. Thank you for all the time you spend answering me.


Shree Devi Kumar

Jan 16, 2020, 8:26:50 AM
to tesseract-ocr
I haven't trained on Windows. If you want to do training, it is better to use Linux.


Fabio Lugli

Jan 16, 2020, 8:42:15 AM
to tesseract-ocr
Yes, I am switching to Linux to restart the whole process. But have you ever encountered the specific error I described in my previous message?

Shree Devi Kumar

Jan 16, 2020, 10:34:40 AM
to tesseract-ocr
The number of messages depends on --debug_interval.

With -1 you get a message for every iteration.

With 0, one for every 100 iterations.

With 1 or higher, it starts ScrollView and shows additional windows with visual information.

Please check the wiki for details. I usually use only -1.
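
With -1 the log grows quickly, so a small filter can help. The sketch below is a hypothetical helper that relies on the line format seen earlier in this thread (`Iteration N: GROUND TRUTH ...` followed by `Mean rms=..., train=...%`). I take the `train=` figure to be the character error rate, but that reading is an assumption.

```shell
# Sketch only: condense lstmtraining debug output (read from stdin) to one
# "<iteration> <train error %>" pair per iteration.
summarize_training_log() {
  awk '
    # Remember the iteration number from the "Iteration N: GROUND ..." line.
    /^Iteration [0-9]+: GROUND/ { iter = $2; sub(/:$/, "", iter) }
    # Pull the number out of "train=NN.NNN%" on the "Mean rms=" line.
    /^Mean rms=/ {
      if (match($0, /train=[0-9.]+%/))
        print iter, substr($0, RSTART + 6, RLENGTH - 7)
    }
  '
}
```

It can be attached to a run with `lstmtraining ... --debug_interval -1 2>&1 | summarize_training_log`. The `2>&1` is there because I believe lstmtraining writes these messages to standard error.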


Fabio Lugli

Jan 16, 2020, 11:06:35 AM
to tesseract-ocr
Yes, I was setting debug_interval to -1, but on Windows no error was shown after iteration 0. After installing Ubuntu under WSL and repeating the whole process, it turned out that the eng.traineddata I was mistakenly using was not from tessdata_best, so the familiar integer-model error appeared. After correcting this, lstmtraining finally started correctly.
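
Fine-tuning needs a float model, which is why the starting traineddata has to come from tessdata_best. A minimal sketch of preparing it follows. The helper names are hypothetical, and the download URL assumes the current layout of the tessdata_best repository on GitHub (branch `main`). The `combine_tessdata -e` call is the same one used earlier in the thread.

```shell
# Sketch only: fetch the float ("best") model for a language and extract its LSTM.
# The URL assumes the tessdata_best repository layout on GitHub (branch "main").
best_model_url() {
  echo "https://github.com/tesseract-ocr/tessdata_best/raw/main/$1.traineddata"
}

# $1 = language code (e.g. eng), $2 = directory to hold the model files.
prepare_start_model() {
  local lang=$1 dir=$2
  mkdir -p "$dir" || return 1
  # Download only if the file is not already there.
  [ -f "$dir/$lang.traineddata" ] ||
    curl -fL -o "$dir/$lang.traineddata" "$(best_model_url "$lang")" || return 1
  combine_tessdata -e "$dir/$lang.traineddata" "$dir/$lang.lstm"
}
```

After `prepare_start_model eng ~/tessdata_best`, the two files can be passed to lstmtraining as `--traineddata` and `--continue_from`.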

Fabio Lugli

Jan 20, 2020, 3:22:26 AM
to tesseract-ocr
After working on my dataset for a couple of days, I have seen that the model fine-tuned on handwritten text gets better on some lines of text but worse on others. I trained again and the results didn't change. Is it normal for the model to get better on some text but worse on other text with each training run? Another question: the .traineddata file has the same size after every training run. Shouldn't it grow, at least by a few kilobytes, each time?

Shree Devi Kumar

Jan 20, 2020, 4:07:44 AM
to tesseract-ocr

Sometimes using multiple models (last three) from training gives better results.
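
One way to act on this, sketched under assumptions, is to convert the newest few checkpoints into separate .traineddata files and compare them on the lines that got worse. The helper name is hypothetical. It assumes the checkpoints end in `.checkpoint` and uses `lstmtraining --stop_training` to do the conversion.

```shell
# Sketch only: turn the newest N checkpoints in directory $1 into .traineddata files.
# $2 is the starting traineddata used for training; $3 is N (default 3).
export_recent_checkpoints() {
  local dir=$1 start=$2 n=${3:-3} ckpt
  # ls -t sorts newest first; paths containing newlines are not supported.
  ls -t "$dir"/*.checkpoint 2>/dev/null | head -n "$n" |
  while IFS= read -r ckpt; do
    lstmtraining --stop_training \
      --continue_from "$ckpt" \
      --traineddata "$start" \
      --model_output "${ckpt%.checkpoint}.traineddata" < /dev/null
  done
}
```

For the run shown earlier this would be `export_recent_checkpoints ./testdir ~/tessdata_best/eng.traineddata 3`. Each resulting file can then be tried with `tesseract --tessdata-dir` on a held-out set of lines.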
