Trying to add chars to tesseract 4.0

J Klein

Dec 6, 2017, 1:37:59 AM

to tesseract-ocr

[this might be a repost; the first attempt didn't show up]

I'm using the C API of tesseract 4.0 on OS X, and I tried to add some more characters. (4.0 seems much better than 3.x, I should add - thanks to everyone who made this possible!)

I used this manual section: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

as a guide to construct the following script: https://pastebin.com/4n2mRSpq

Before running, I modified langdata/eng/eng.training_text with the extra chars, maybe 15 instances of each, as instructed.

I'm using only a subset of the original training fonts, but I figure it is OK, since I'm adding only a few distinctive chars.

lstmtraining (the NN optimizer) ran and produced a number of checkpoints and a final file, $train_output_dir/eng/eng.traineddata.

But this eng.traineddata was 5MB, whereas the original one was 15.4MB. And when I copied it over the pre-installed 'best' eng.traineddata and ran tesseract, it failed in TessBaseAPIInit3 with error=-1.

Does anyone know 1) why my eng.traineddata is so much smaller and 2) why it fails to even load in API init()?

Thanks for any tips!

J Klein

Dec 7, 2017, 7:48:51 PM

to tesseract-ocr

As an addendum, is there an easy way to diagnose why my eng.traineddata won't load? All I have is a -1 error from API Init3.

I put it here: https://filebin.ca/3jvP3FKuvp4G/eng.traineddata in case anyone knows how to diagnose a bad eng.traineddata

Thanks in advance for any tips!

ShreeDevi Kumar

Dec 7, 2017, 9:02:11 PM

to tesser...@googlegroups.com

Re smaller traineddata size, it could possibly be related to the word list dictionary size.

You can unpack the original traineddata and compare the word list size with the one you used.
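
One way to make that comparison, as a rough sketch: once both files have been unpacked with combine_tessdata -u under separate prefixes, a small shell function can print the size of each component side by side. The function name and the paths in the usage note are made up for illustration; only combine_tessdata -u is an actual Tesseract tool invocation.

```shell
# compare_component_sizes PREFIX_A PREFIX_B
# PREFIX_A and PREFIX_B are the output prefixes that were given to
# `combine_tessdata -u` for the two traineddata files (hypothetical paths).
# Prints "component<TAB>bytes_in_A<TAB>bytes_in_B" for every component
# found under PREFIX_A; a component absent under PREFIX_B shows "missing".
compare_component_sizes() {
  a=$1
  b=$2
  for path in "$a"*; do
    [ -f "$path" ] || continue
    part=${path#"$a"}                       # strip the prefix, keep e.g. "lstm-word-dawg"
    size_a=$(wc -c < "$path" | tr -d ' ')   # tr removes padding some wc versions add
    if [ -f "$b$part" ]; then
      size_b=$(wc -c < "$b$part" | tr -d ' ')
    else
      size_b=missing
    fi
    printf '%s\t%s\t%s\n' "$part" "$size_a" "$size_b"
  done
}
```

For example, after unpacking into two existing directories with "combine_tessdata -u original/eng.traineddata orig/eng." and "combine_tessdata -u mine/eng.traineddata new/eng." (placeholder paths), running "compare_component_sizes orig/eng. new/eng." prints one line per component, so a much smaller or absent component stands out.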

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/50c6b233-602e-4479-a518-3bfd6baa10c9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

J Klein

Dec 7, 2017, 9:34:20 PM

to tesseract-ocr

On Thursday, December 7, 2017 at 9:02:11 PM UTC-5, shree wrote:

Re smaller traineddata size, it could possibly be related to the word list dictionary size.

You can unpack the original traineddata and compare the word list size with the one you used.

Thank you for the hint.

I ran the following (-u is 'unpack all' I think),

combine_tessdata -u /usr/local/share/tessdata/eng.traineddata eng.

and I got:

-rw-r--r-- 1 klein staff 11689099 Dec 7 21:22 eng.lstm
-rw-r--r-- 1 klein staff     4738 Dec 7 21:22 eng.lstm-number-dawg
-rw-r--r-- 1 klein staff     4322 Dec 7 21:22 eng.lstm-punc-dawg
-rw-r--r-- 1 klein staff     1012 Dec 7 21:22 eng.lstm-recoder
-rw-r--r-- 1 klein staff     6360 Dec 7 21:22 eng.lstm-unicharset
-rw-r--r-- 1 klein staff  3694794 Dec 7 21:22 eng.lstm-word-dawg
-rw-r--r-- 1 klein staff       80 Dec 7 21:22 eng.version

The content of eng.version is 4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]

Now I tried to unpack the one I created by adding the characters, and I get

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx eng.lstm is missing!
-rw-r--r-- 1 klein staff    3506 Dec 7 21:26 eng.lstm-number-dawg
-rw-r--r-- 1 klein staff    4322 Dec 7 21:26 eng.lstm-punc-dawg
-rw-r--r-- 1 klein staff    1030 Dec 7 21:26 eng.lstm-recoder
-rw-r--r-- 1 klein staff    9379 Dec 7 21:26 eng.lstm-unicharset
-rw-r--r-- 1 klein staff 4153402 Dec 7 21:26 eng.lstm-word-dawg
-rw-r--r-- 1 klein staff      12 Dec 7 21:26 eng.version

The content of eng.version is '4.00.00alpha'

So you're right that the word-list is different.

But more importantly, it seems that eng.lstm isn't in the final eng.traineddata. Am I misunderstanding something about how the process works? Is this my mistake, or a glitch?

Thanks for helping me to make progress.
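
A check like the one done by eye above can be scripted. The following is only a sketch: the function name is invented, and the component list is simply copied from the listing of the original eng.traineddata above, so it is an assumption that all seven are expected in every model.

```shell
# list_missing_components PREFIX
# PREFIX is the output prefix previously passed to `combine_tessdata -u`,
# for example "unpacked/eng." (a hypothetical path).
# Prints the name of every component from the listing above that is absent
# or empty under PREFIX; prints nothing when all of them are present.
list_missing_components() {
  prefix=$1
  for part in lstm lstm-number-dawg lstm-punc-dawg lstm-recoder \
              lstm-unicharset lstm-word-dawg version; do
    if [ ! -s "${prefix}${part}" ]; then
      echo "$part"
    fi
  done
}
```

Run against the unpacked files from the second listing above, this would print "lstm", which matches the observation that the network itself is not in that file.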

ShreeDevi Kumar

Dec 7, 2017, 11:55:53 PM

to tesser...@googlegroups.com

Please check the last section, on combining files, of

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

for the correct syntax for building the new traineddata file.

ShreeDevi Kumar

Dec 7, 2017, 11:59:57 PM

to tesser...@googlegroups.com

It is possible that you are treating the 'starter' traineddata file as the final one. Please read the training wiki page fully as the training process has been changed by Ray in his last update.

Fahad Al-Saidi

Dec 8, 2017, 7:16:01 AM

to tesseract-ocr

On Wednesday, December 6, 2017 at 10:37:59 AM UTC+4, J Klein wrote:

But this eng.traineddata was 5MB when the original one was 15.4MB.

I have the same problem. Why doesn't the new fine-tuned traineddata include the old wordlist? It is supposed to. I followed the instructions in the wiki but got the same issue. Any help?

ShreeDevi Kumar

Dec 8, 2017, 7:29:51 AM

to tesser...@googlegroups.com

The langdata repository has not been updated by Ray for 4.0alpha. If you want the same word list, unpack the traineddata from the tessdata repositories.

Also read the last section of the training wiki page, on combining files.

Fahad Al-Saidi

Dec 8, 2017, 7:44:57 AM

to tesser...@googlegroups.com

Great, then how do I combine the wordlist into the new traineddata? The wiki page isn't clear about that.

J Klein

Dec 11, 2017, 10:54:22 PM

to tesseract-ocr

On Thursday, December 7, 2017 at 11:55:53 PM UTC-5, shree wrote:

Please check the last section on
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

Thank you for this tip. I'm getting farther than before. I had thought the file given to --traineddata was my final traineddata output file.

I have now made the final eng.traineddata with 'lstmtraining --stop_training ...', as follows:

$tesstrain_dir/lstmtraining \
  --stop_training \
  --continue_from $train_output_dir/pluschars_checkpoint \
  --traineddata $train_output_dir/eng/eng.traineddata \
  --U $train_output_dir/eng/eng.unicharset \
  --model_output $final_trained_data_file

(I'm not sure the --U line is necessary; it doesn't make a difference.)

And I get a $final_trained_data_file that I can use to replace /usr/local/share/tessdata/eng.traineddata and it doesn't fail on init3() any more. But it doesn't recognize any of the new chars either. However, in running

/usr/local/bin/tesseract-training/lstmeval \
  --model ./trained_plus_chars/pluschars_checkpoint \
  --traineddata ./trained_plus_chars/eng/eng.traineddata \
  --eval_listfile ./trained_plus_chars/eng.training_files.txt

it DID recognize the new chars most of the time. So I think there may still be something wrong with the construction of the --model_output $final_trained_data_file.

My entire training sequence bash script is here: https://pastebin.com/gNLvXkiM

Can you tell if there is anything obviously wrong?

Thanks

ShreeDevi Kumar

Dec 11, 2017, 11:41:45 PM

to tesser...@googlegroups.com

Your script looks OK.

--U $train_output_dir/eng/eng.unicharset \ # not sure if this is necessary; doesn't make a difference

is NOT required

I suggest that you remove the files from any earlier run before running the script.

Take a look at the $train_output_dir/eng directory and review the unicharset there to see whether your new characters are included in it.

Take a look at the log file, especially the initial portion, to see whether it shows an increase in the number of characters.
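
The unicharset review can be done with a short shell function. This is a sketch under one assumption about the file layout: the first line holds the entry count, and each following line starts with the character itself followed by a space. The function name is invented, and the characters in the usage note are placeholders, since the thread does not say which characters were added.

```shell
# missing_from_unicharset UNICHARSET CHAR...
# Prints each CHAR that has no entry in the UNICHARSET file.
# Assumed layout: line 1 is the entry count; every later line begins with
# the character, then a space, then its properties.
missing_from_unicharset() {
  unicharset=$1
  shift
  for ch in "$@"; do
    # The character is passed through the environment so that awk does not
    # interpret backslashes in it; appending "" forces a string comparison.
    if ! CH="$ch" awk 'NR > 1 && ($1 "") == (ENVIRON["CH"] "") { found = 1 }
                       END { exit !found }' "$unicharset"; then
      echo "$ch"
    fi
  done
}
```

A call such as missing_from_unicharset "$train_output_dir/eng/eng.unicharset" '§' '±' (with your own characters in place of these two) prints nothing if every character made it into the unicharset.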

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar

Dec 11, 2017, 11:44:22 PM

to tesser...@googlegroups.com

You can add

--debug_interval -1

to your lstmtraining command to get debug info with each training iteration on console

ShreeDevi

shree

Dec 15, 2017, 3:13:36 AM

to tesseract-ocr

On Friday, December 8, 2017 at 5:46:01 PM UTC+5:30, Fahad Al-Saidi wrote:

I have the same problem. Why doesn't the new fine-tuned traineddata include the old wordlist? It is supposed to. I followed the instructions in the wiki but got the same issue. Any help?

If you want the wordlist that is included in the 'old'/best traineddata, please unpack that file with combine_tessdata -u ... and then run dawg2wordlist to get the uncompressed wordlists out of it. Review the lists to make sure they look OK.

Replace the wordlist in langdata with this file before running training.
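
The two steps above can be sketched as a small shell function. The function name, the output directory and the eng.wordlist output name are choices made for this illustration; the component names (lstm-unicharset, lstm-word-dawg) are the ones that combine_tessdata -u produced earlier in this thread, and dawg2wordlist is called with its arguments in the order unicharset, dawg, output wordlist.

```shell
# extract_wordlist TRAINEDDATA LANG OUTDIR
# Unpacks TRAINEDDATA into OUTDIR and converts its LSTM word dawg back into
# a plain-text wordlist at OUTDIR/LANG.wordlist. Prints the number of words.
extract_wordlist() {
  traineddata=$1
  lang=$2
  outdir=$3
  mkdir -p "$outdir" || return 1
  # Writes OUTDIR/LANG.lstm, OUTDIR/LANG.lstm-word-dawg, and so on.
  combine_tessdata -u "$traineddata" "$outdir/$lang." || return 1
  dawg2wordlist "$outdir/$lang.lstm-unicharset" \
                "$outdir/$lang.lstm-word-dawg" \
                "$outdir/$lang.wordlist" || return 1
  wc -l < "$outdir/$lang.wordlist"
}
```

For example, extract_wordlist /usr/local/share/tessdata/eng.traineddata eng ./unpacked would leave the list in ./unpacked/eng.wordlist, ready to be reviewed and then copied over the wordlist in langdata/eng before training (the exact file name expected there is an assumption to verify against your langdata checkout).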

Fahad Al-Saidi

Dec 15, 2017, 3:26:15 AM

to tesser...@googlegroups.com

Thanks. I have read that the new tesseract-ocr 4.0 doesn't use a wordlist anymore. Is it meant only for older versions? Is that right?

ShreeDevi Kumar

Dec 15, 2017, 4:50:27 AM

to tesser...@googlegroups.com

>>Thanks. I have read that the new tesseract-ocr 4.0 doesn't use a wordlist anymore. Is it meant only for older versions? Is that right?

The new 4.0alpha version does not REQUIRE the wordlist, but it uses one if available, and accuracy improves when it does.

So, basically, 4.0alpha will work without wordlist, but OCR results will be better with it.
