Training for a specific wordlist and font

Daniel Ferenc

Jan 28, 2019, 11:07:32 AM

to tesseract-ocr

Hi,

I need to train Tesseract for only a specific wordlist (about 13600 words) and one specific font. I tried following the training tutorial on the Wiki, but I'm not sure whether I'm doing something wrong: the traineddata file is about 7 megabytes, and I had to combine it with eng.traineddata to get any traineddata file, because after finishing the training I had no traineddata file at all. Can anyone please help me?

Shree Devi Kumar

Jan 29, 2019, 12:40:14 PM

to tesser...@googlegroups.com

Fine-tune with your specific font. See the example below, which uses the Impact font.

#!/bin/bash

# Render training data in the target font from the standard training text.
time ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang eng --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata \
  --fontlist "Impact Condensed" \
  --training_text ~/langdata/eng/eng.training_text \
  --workspace_dir ~/tmp/ \
  --save_box_tiff \
  --output_dir ~/tesstutorial/engtrainfont

# Render evaluation data in the same font from the custom wordlist text.
time ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang eng --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata \
  --fontlist "Impact Condensed" \
  --training_text ~/langdata/eng/eng.mywordlist.training_text \
  --workspace_dir ~/tmp/ \
  --save_box_tiff \
  --output_dir ~/tesstutorial/engevalwordlist

# https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

echo -e "\n****** Finetune one of the fully-trained existing models: ***********\n"

mkdir -p ~/tesstutorial/impact_from_full

# Extract the LSTM model from the existing best traineddata.
combine_tessdata -e ~/tessdata_best/eng.traineddata \
  ~/tesstutorial/impact_from_full/eng.lstm

# Continue training from the extracted model on the font-specific data.
time ~/tesseract/src/training/lstmtraining \
  --model_output ~/tesstutorial/impact_from_full/impact \
  --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
  --traineddata ~/tessdata_best/eng.traineddata \
  --train_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt \
  --debug_interval -1 \
  --max_iterations 400

echo -e "\n*********** eval on training data ******\n"

time ~/tesseract/src/training/lstmeval \
  --model ~/tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata ~/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt

echo -e "\n*********** eval on eval data ******\n"

time ~/tesseract/src/training/lstmeval \
  --model ~/tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata ~/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engevalwordlist/eng.training_files.txt

echo -e "\n*********** convert to traineddata ******\n"

# Convert the latest checkpoint into a usable traineddata file.
time ~/tesseract/src/training/lstmtraining \
  --stop_training \
  --continue_from ~/tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata ~/tessdata_best/eng.traineddata \
  --model_output ~/tesstutorial/engtrainfont/eng.traineddata

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Daniel Ferenc

Jan 30, 2019, 9:41:09 AM

to tesseract-ocr

So, I have figured out what I was doing wrong:

- I am using the tesseract packages I got from apt on Ubuntu 18.04 LTS, and they were missing some langdata files, which I downloaded from the repository.
- I also needed to get the Latin.unicharset file.
- Finally, I didn't notice an error in one of the late steps saying that radical-stroke.txt was missing, which resulted in no traineddata being generated by my tesstrain.sh run.
- Since the last step required a traineddata file and I didn't have one, I used the eng.traineddata that came with the package, and it all resulted in very poor recognition performance.
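A pre-flight check along these lines might catch the missing files before a long tesstrain.sh run. This is only a sketch: the function name check_langdata is made up for illustration, and the file locations (radical-stroke.txt and Latin.unicharset at the top level of the langdata directory, the training text under the language subdirectory) are assumptions based on the paths used in this thread.

```shell
#!/bin/bash
# check_langdata LANGDATA_DIR LANG   (hypothetical helper, not part of Tesseract)
# Prints every expected file that is absent and returns non-zero if any is missing.
# The list of files is an assumption drawn from the errors described in this thread.
check_langdata() {
  local langdata_dir="$1" lang="$2" missing=0 f
  for f in \
    "$langdata_dir/radical-stroke.txt" \
    "$langdata_dir/Latin.unicharset" \
    "$langdata_dir/$lang/$lang.training_text"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be called as `check_langdata ~/langdata eng && echo "langdata looks complete"` before starting tesstrain.sh.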

At the moment I'm running the training with a wordlist of the ~13600 words that can appear, rendered in the ~100 fonts that can be used. I'm waiting for 175000 iterations to finish, because at 150k I still had an error rate of ~2.4.

(I'm creating a piece of software that should recognize Magic: the Gathering card names. I have a database of all currently existing English cards and created a list of the unique words that can appear in their names. I'm training Tesseract on these words with all the fonts that were ever used for these cards. I will let you know how this worked out once the training is done.)
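Building such a list of unique words could look roughly like the sketch below. It assumes the card names have already been exported from the database to plain text, one name per line; the function name unique_words and the file names in the usage note are hypothetical.

```shell
#!/bin/bash
# unique_words: reads card names on stdin (one per line) and writes each
# distinct whitespace-separated word once, sorted bytewise.
# Punctuation stays attached to its word (e.g. "Jace,"), which is a choice
# made for this sketch, not something the thread specifies.
unique_words() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}
```

For example, `unique_words < card_names.txt > wordlist.txt` would produce the wordlist from an exported name file.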

Thank you for your support.

Lorenzo Bolzani

Jan 30, 2019, 9:53:43 AM

to tesser...@googlegroups.com

If you have images of the cards with the corresponding text you could train it on the cropped/cleaned text directly.

Daniel Ferenc

Jan 30, 2019, 10:08:37 AM

to tesseract-ocr

I'm not sure how exactly I would set that up (regarding tesseract training), BUT there are about 44000 English cards at this time, and a high-resolution image of each is about 2 megs (at least from the source I can get them from). Also, not every card has the same format, so a generic crop function would not work. Over 90% of the cards would be OK like this, but the rest would cause issues. It's easier for me to try to teach tesseract this way and then have the software try different rotations/crops if the default one doesn't return anything meaningful in terms of OCR. Just preparing the images for this is a massive task, while retrieving the word list from the database took about 20 seconds, downloading the fonts took a minute, and training takes ~4 hours for a result that will, hopefully, be good enough.
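The "try different rotations if the default one fails" idea might be sketched as follows. Everything here is an assumption for illustration: "meaningful" is taken to mean that the OCR text exactly matches a line in a file of known card names, and the command that rotates and recognizes the image is passed in by the caller rather than being any specific Tesseract invocation.

```shell
#!/bin/bash
# first_known_name NAMES_FILE OCR_CMD IMAGE ANGLE...   (hypothetical helper)
# Calls OCR_CMD IMAGE ANGLE for each angle in turn and prints the first result
# that exactly matches a line of NAMES_FILE. Returns 1 if no angle yields a match.
# OCR_CMD is a caller-supplied command or function that prints the recognized text.
first_known_name() {
  local names="$1" ocr_cmd="$2" image="$3" angle text
  shift 3
  for angle in "$@"; do
    text=$("$ocr_cmd" "$image" "$angle") || continue
    if [ -n "$text" ] && grep -Fxq -- "$text" "$names"; then
      printf '%s\n' "$text"
      return 0
    fi
  done
  return 1
}
```

A call such as `first_known_name card_names.txt my_ocr card.png 0 90 180 270` would try the unrotated image first and fall back to the other orientations, where my_ocr stands for whatever rotate-and-recognize wrapper the software uses.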

Daniel Ferenc

Jan 30, 2019, 10:10:05 AM

to tesseract-ocr

Oh, and one more thing: the same card with the same name can appear in different editions of Magic, so recognition by name alone is not enough. I'm also training my software to recognize the edition of the card by other means, so the combination of the two should be sufficient.

Lorenzo Bolzani

Jan 31, 2019, 5:18:35 AM

to tesser...@googlegroups.com

Yes, generating text is faster and easier.

But the real extracted and cleaned text you are going to eventually recognize is going to be different from this, more or less depending on a lot of factors:

- how similar your training font actually is

- how good your cleanup will be (test this in advance)

- difference in text size, border, rotations, shearing from the generated text (for example you train with 0px border and later provide text with 4px border).

Using the real data, in general, should be better, unless you have very little data.

If the real images differ from the generated ones, you may try to add some corruption that mimics the real images before training: noise, perspective deformations, small rotations, etc.

And/or you can try to mix real and generated samples in the training.

You say 90% of the samples are easy to process: these can be enough if you can isolate these easily. Consider that real life samples will not be much better than these (I suppose).

As for the rotations, you can do perspective correction with OpenCV's findHomography or with Hough lines.

I realize this is A LOT of work as I'm doing this right now.

If you have time, try different ways and see what works best.

Bye

Lorenzo

Daniel Ferenc

Jan 31, 2019, 5:35:18 AM

to tesseract-ocr

Is there a guide somewhere on how to set up training like this? How to pair the images and text, etc.? And thank you for the insight, it really is helpful.

Lorenzo Bolzani

Jan 31, 2019, 7:16:01 AM

to tesser...@googlegroups.com

You can have a look at ocrd-train

https://github.com/OCR-D/ocrd-train

You just have to prepare cropped tiff and txt files with the same name containing a single line of text.
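Before training, a quick consistency check of the image/text pairs might look like the sketch below. The function name list_unpaired is hypothetical, and the default transcription suffix .gt.txt is an assumption based on the ocrd-train README; pass a different suffix as the second argument if your setup pairs the files differently.

```shell
#!/bin/bash
# list_unpaired DIR [SUFFIX]   (hypothetical helper, not part of ocrd-train)
# For every DIR/*.tif, checks that a transcription with the same base name and
# the given suffix (default .gt.txt, an assumption) exists and has at most one line.
# Prints each problem found and returns non-zero if there was any.
list_unpaired() {
  local dir="$1" suffix="${2:-.gt.txt}" img txt status=0
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue            # the glob matched nothing
    txt="${img%.tif}$suffix"
    if [ ! -f "$txt" ]; then
      echo "no transcription: $img"
      status=1
    elif [ "$(wc -l < "$txt")" -gt 1 ]; then
      echo "more than one line: $txt"
      status=1
    fi
  done
  return "$status"
}
```

Running `list_unpaired data/ground-truth` on the prepared folder should print nothing when every line image has a single-line transcription; the folder name here is only an example.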

At the same time, if you already set up everything for the font based training, I'd give it a try (time permitting): you get something working today, you can make a comparison with different methods, etc.

Lorenzo
