Trying to add chars to tesseract 4.0


J Klein

Dec 6, 2017, 1:37:59 AM
to tesseract-ocr

[this might be a repost; the first attempt didn't show up]

I'm using the C API of tesseract 4.0 on OS X, and I tried to add some more characters.   (4.0 seems much better than 3.x, I should add - thanks to everyone who made this possible!)


I used the training instructions as a guide to construct the following script: https://pastebin.com/4n2mRSpq

Before running, I modified  langdata/eng/eng.training_text with the extra chars, maybe 15 instances of each, as instructed.

I'm using only a subset of the original training fonts, but I figure it is OK, since I'm adding only a few distinctive chars.   

The NN optimizer lstmtraining ran and produced a series of checkpoints and a final file, $train_output_dir/eng/eng.traineddata.

But this eng.traineddata was 5MB, whereas the original one was 15.4MB. And when I copied it over the pre-loaded 'best' eng.traineddata and ran tesseract, it failed in TessBaseAPIInit3 with error=-1.

Does anyone know 1) why my eng.traineddata is so much smaller and 2) why it fails to even load in API init()?

Thanks for any tips!

J Klein

Dec 7, 2017, 7:48:51 PM
to tesseract-ocr
As an addendum, is there an easy way to diagnose why my eng.traineddata won't load? All I have is a -1 error in API Init3.

I put it here:    https://filebin.ca/3jvP3FKuvp4G/eng.traineddata  in case anyone knows how to diagnose a bad eng.traineddata

Thanks in advance for any tips!

ShreeDevi Kumar

Dec 7, 2017, 9:02:11 PM
to tesser...@googlegroups.com
Re smaller traineddata size, it could possibly be related to the word list dictionary size.

You can unpack the original traineddata and compare the word list size with the one you used.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/50c6b233-602e-4479-a518-3bfd6baa10c9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

J Klein

Dec 7, 2017, 9:34:20 PM
to tesseract-ocr


On Thursday, December 7, 2017 at 9:02:11 PM UTC-5, shree wrote:
> Re smaller traineddata size, it could possibly be related to the word list dictionary size.
>
> You can unpack the original traineddata and compare the word list size with the one you used.


Thank you for the hint.

I ran the following (-u is 'unpack all' I think), 

  combine_tessdata  -u /usr/local/share/tessdata/eng.traineddata eng.

and I got:

-rw-r--r--  1 klein  staff  11689099 Dec  7 21:22 eng.lstm
-rw-r--r--  1 klein  staff      4738 Dec  7 21:22 eng.lstm-number-dawg
-rw-r--r--  1 klein  staff      4322 Dec  7 21:22 eng.lstm-punc-dawg
-rw-r--r--  1 klein  staff      1012 Dec  7 21:22 eng.lstm-recoder
-rw-r--r--  1 klein  staff      6360 Dec  7 21:22 eng.lstm-unicharset
-rw-r--r--  1 klein  staff   3694794 Dec  7 21:22 eng.lstm-word-dawg
-rw-r--r--  1 klein  staff        80 Dec  7 21:22 eng.version

The content of eng.version is: 4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]


Now I tried to unpack the one I created by adding the characters, and I get


xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx eng.lstm is missing!
-rw-r--r--  1 klein  staff     3506 Dec  7 21:26 eng.lstm-number-dawg
-rw-r--r--  1 klein  staff     4322 Dec  7 21:26 eng.lstm-punc-dawg
-rw-r--r--  1 klein  staff     1030 Dec  7 21:26 eng.lstm-recoder
-rw-r--r--  1 klein  staff     9379 Dec  7 21:26 eng.lstm-unicharset
-rw-r--r--  1 klein  staff  4153402 Dec  7 21:26 eng.lstm-word-dawg
-rw-r--r--  1 klein  staff       12 Dec  7 21:26 eng.version

The content of eng.version is: 4.00.00alpha


So you're right that the word-list is different. 

But more importantly, it seems that eng.lstm isn't in the final eng.traineddata. Do I not understand something about how the process works? Is this my mistake, or a glitch?
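A check like this can be run before copying a file into tessdata. The following is a minimal sketch, assuming the components were unpacked with combine_tessdata -u as above; has_lstm_component is a hypothetical helper name, not a Tesseract tool.

```shell
#!/bin/sh
# has_lstm_component PREFIX
# PREFIX is the output prefix that was given to "combine_tessdata -u",
# e.g. "unpacked/eng." (hypothetical path).
# Succeeds only if PREFIXlstm exists and is not empty.
has_lstm_component() {
    prefix=$1
    if [ -s "${prefix}lstm" ]; then
        echo "OK: ${prefix}lstm is present"
    else
        echo "MISSING: ${prefix}lstm (no LSTM model was unpacked)" >&2
        return 1
    fi
}

# Example (assumes combine_tessdata is installed; paths are placeholders):
#   mkdir -p unpacked
#   combine_tessdata -u path/to/eng.traineddata unpacked/eng.
#   has_lstm_component unpacked/eng.
```

The function only inspects the unpacked files, so it says nothing about whether the model itself is any good, only whether it is there at all.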

Thanks for helping me to make progress.




ShreeDevi Kumar

Dec 7, 2017, 11:55:53 PM
to tesser...@googlegroups.com
Please check the last section of the training wiki page, regarding combining files, to learn the correct syntax for building the new traineddata file.



ShreeDevi Kumar

Dec 7, 2017, 11:59:57 PM
to tesser...@googlegroups.com
It is possible that you are treating the 'starter' traineddata file as the final one. Please read the training wiki page fully as the training process has been changed by Ray in his last update.

Fahad Al-Saidi

Dec 8, 2017, 7:16:01 AM
to tesseract-ocr

On Wednesday, December 6, 2017 at 10:37:59 AM UTC+4, J Klein wrote:

> But this eng.traineddata was 5MB when the original one was 15.4MB.

I have the same problem. Why doesn't the new fine-tuned traineddata include the old wordlist? It is supposed to. I followed the instructions in the wiki but got the same issue. Any help?

ShreeDevi Kumar

Dec 8, 2017, 7:29:51 AM
to tesser...@googlegroups.com
The langdata repository has not been updated by Ray for 4.0alpha. If you want the same word list, unpack the traineddata from the tessdata repositories.

Also read the last section of training wiki page re combining files.


Fahad Al-Saidi

Dec 8, 2017, 7:44:57 AM
to tesser...@googlegroups.com
Great, then how do I combine the wordlist into the new traineddata? The wiki page isn't clear about that.

J Klein

Dec 11, 2017, 10:54:22 PM
to tesseract-ocr

On Thursday, December 7, 2017 at 11:55:53 PM UTC-5, shree wrote:
 
Thank you for this tip. I'm getting farther than before. I thought --traineddata was my final traineddata output file.
I have now made the final eng.traineddata with 'lstmtraining --stop_training ...' as follows:

    # Note on --U: not sure if this is necessary; doesn't make a difference.
    # (A comment after a trailing backslash would break the line continuation,
    # so it is kept up here instead.)
    $tesstrain_dir/lstmtraining \
        --stop_training \
        --continue_from "$train_output_dir/pluschars_checkpoint" \
        --traineddata "$train_output_dir/eng/eng.traineddata" \
        --U "$train_output_dir/eng/eng.unicharset" \
        --model_output "$final_trained_data_file"

And I get a $final_trained_data_file that I can use to replace /usr/local/share/tessdata/eng.traineddata and it doesn't fail on init3() any more.  But it doesn't recognize any of the new chars either.    However, in running
  
  /usr/local/bin/tesseract-training/lstmeval \
    --model ./trained_plus_chars/pluschars_checkpoint  \
    --traineddata ./trained_plus_chars/eng/eng.traineddata \
    --eval_listfile ./trained_plus_chars/eng.training_files.txt 

it DID recognize the new chars most of the time. So I think there may still be something wrong with the construction of the --model_output $final_trained_data_file.

My entire training sequence bash script is here:  https://pastebin.com/gNLvXkiM

Can you tell if there is anything obviously wrong?


Thanks



ShreeDevi Kumar

Dec 11, 2017, 11:41:45 PM
to tesser...@googlegroups.com
Your script seems to look ok.

> --U  $train_output_dir/eng/eng.unicharset \   # not sure if this is necessary; doesn't make a difference

The --U option is NOT required.

I suggest that you remove the files from any earlier run before running the script.

Take a look at the $train_output_dir/eng directory and review the unicharset there to see whether your new characters are included.

Take a look at the log file, especially the initial portion, to see whether it shows an increase in the number of characters.
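The unicharset check could be scripted along these lines. This is only a sketch, and it assumes the usual unicharset text layout: the first line holds the entry count and each following line starts with the character itself. The names unicharset_count and unicharset_has_char are hypothetical helpers, not Tesseract tools.

```shell
#!/bin/sh
# Print the number of entries declared on the first line of a unicharset file.
unicharset_count() {
    head -n 1 "$1"
}

# Succeed if CHAR appears as the first field of any entry line (line 2 onward).
# Usage: unicharset_has_char FILE CHAR
unicharset_has_char() {
    # CHAR is passed through the environment so awk does not interpret
    # backslashes in it; appending "" forces a string comparison.
    CHAR=$2 awk 'NR > 1 && ($1 "") == (ENVIRON["CHAR"] "") { found = 1 }
                 END { exit !found }' "$1"
}
```

For example, unicharset_has_char "$train_output_dir/eng/eng.unicharset" '§' would test for one added character ('§' is just a placeholder here), and comparing unicharset_count for the new file against the eng.lstm-unicharset unpacked from the original traineddata shows whether the set actually grew.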

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


ShreeDevi Kumar

Dec 11, 2017, 11:44:22 PM
to tesser...@googlegroups.com
You can add 
  --debug_interval -1
to your lstmtraining command to get debug info with each training iteration on console


shree

Dec 15, 2017, 3:13:36 AM
to tesseract-ocr


On Friday, December 8, 2017 at 5:46:01 PM UTC+5:30, Fahad Al-Saidi wrote:

> I have the same problem, why not the new fine tuned traineddata include the old wordlist? It suppose to do so. I followed the instructions in the wiki but I got the same issue. Any help?

If you want the wordlist that is included in the 'old'/best traineddata, unpack it with combine_tessdata -u ... and then run dawg2wordlist to get the uncompressed wordlists from the old traineddata. Review the lists to make sure they look OK.

Replace the wordlist in langdata with this file before running training. 
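The unpack and convert steps might look like the sketch below. It assumes combine_tessdata and dawg2wordlist are on the PATH and that the component names match the unpacked listing earlier in this thread (eng.lstm-unicharset, eng.lstm-word-dawg). The function name extract_wordlist and the output file name are hypothetical.

```shell
#!/bin/sh
# extract_wordlist TRAINEDDATA WORKDIR LANG
# Unpacks TRAINEDDATA into WORKDIR and converts the LSTM word DAWG back into
# a plain word list (one word per line). Prints the path of the word list.
extract_wordlist() {
    traineddata=$1   # e.g. /usr/local/share/tessdata/eng.traineddata
    workdir=$2       # scratch directory for the unpacked components
    lang=$3          # e.g. eng

    mkdir -p "$workdir" || return 1

    # Unpack every component as $workdir/$lang.<component>
    combine_tessdata -u "$traineddata" "$workdir/$lang." || return 1

    # Argument order: unicharset, DAWG file, output word list
    dawg2wordlist "$workdir/$lang.lstm-unicharset" \
                  "$workdir/$lang.lstm-word-dawg" \
                  "$workdir/$lang.wordlist" || return 1

    echo "$workdir/$lang.wordlist"
}

# Example (paths are placeholders):
#   extract_wordlist /usr/local/share/tessdata/eng.traineddata ./unpacked eng
```

The same pattern should apply to the number and punctuation DAWGs (eng.lstm-number-dawg, eng.lstm-punc-dawg) if those lists are wanted as well.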

Fahad Al-Saidi

Dec 15, 2017, 3:26:15 AM
to tesser...@googlegroups.com
Thanks. I have read that the new tesseract-ocr 4.0 doesn't use the wordlist anymore. Is it meant only for older versions? Is that right?



ShreeDevi Kumar

Dec 15, 2017, 4:50:27 AM
to tesser...@googlegroups.com
>>Thanks, I have read that new tesseract-ocr 4.0 doesn't use wordlist anymore. It meat for older version? is that right?

New 4.0alpha version does not REQUIRE the wordlist, but uses it, if available, and the accuracy is improved based on the wordlist.

So, basically, 4.0alpha will work without wordlist, but OCR results will be better with it.