Reduce the weight of eng.traineddata using only one font

Brais Gabín Moreira

Sep 11, 2016, 8:02:54 AM

to tesseract-ocr

I'm using Tesseract to recognize text in some screenshots. I'm building this into an Android app, so ~20MB of traineddata is a lot of weight. I know which font those screenshots use.

How can I reproduce the steps to generate the eng.traineddata? I want to use the same data: text, dictionary, patterns, etc. Once I have that, I'll strip out all the "useless" fonts and add the one I want.

Quan Nguyen

Sep 12, 2016, 8:18:50 AM

to tesseract-ocr

You might consider using one of the older versions of the eng.traineddata file; one of them is only 3MB.

https://sourceforge.net/projects/tesseract-ocr-alt/files/

Brais Gabín Moreira

Sep 12, 2016, 11:22:11 AM

to tesseract-ocr

Wow! This file works as well as the 20MB one! (at least in my case)

Anyway, it would be great to know the steps to generate one of those files.

Quan Nguyen

Sep 12, 2016, 8:33:28 PM

to tesseract-ocr

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

Some of the training source data for English is here:

https://github.com/tesseract-ocr/langdata/tree/master/eng
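Before retraining, it may help to see which components of an existing eng.traineddata take up the space. The following is only a rough sketch, not part of the official training steps: it assumes the `combine_tessdata` tool that ships with Tesseract is on your PATH and uses its `-u` option to unpack a traineddata file into its component files. The paths `tessdata/eng.traineddata` and `unpacked` and both helper function names are hypothetical placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def unpack_command(traineddata: str, out_prefix: str) -> list[str]:
    """Build the `combine_tessdata -u` call that unpacks every component."""
    return ["combine_tessdata", "-u", traineddata, out_prefix]


def component_sizes(directory: str, lang: str = "eng") -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for unpacked components, largest first."""
    files = [p for p in Path(directory).glob(f"{lang}.*") if p.is_file()]
    # Skip a combined .traineddata file if one happens to sit in the same folder.
    sizes = [(p.name, p.stat().st_size) for p in files if p.suffix != ".traineddata"]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever your tessdata lives.
    src, out_dir = "tessdata/eng.traineddata", "unpacked"
    if shutil.which("combine_tessdata") and Path(src).is_file():
        Path(out_dir).mkdir(exist_ok=True)
        subprocess.run(unpack_command(src, f"{out_dir}/eng."), check=True)
        for name, size in component_sizes(out_dir):
            print(f"{size:>10}  {name}")
    else:
        print("combine_tessdata or eng.traineddata not found; nothing unpacked")
```

The size listing only shows where the bytes go; whether a given component can be dropped or has to be regenerated for a single font depends on the Tesseract version and is covered by the training wiki linked above.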
