Modifying existing traineddata

Devon Yoo

Feb 22, 2016, 1:43:33 PM
to tesseract-ocr
I don't know why my previous post was rejected, but I'm reposting my question anyway.

I have a test set that contains only uppercase English letters and numbers, but the provided eng.traineddata sometimes returns symbols and lowercase letters. Is there a way to modify the existing traineddata file so that it reads only uppercase letters and numbers?


Thanks in advance.

Nick White

Feb 23, 2016, 4:00:22 AM
to tesser...@googlegroups.com
Hi Devon,

On Mon, Feb 22, 2016 at 10:43:33AM -0800, Devon Yoo wrote:
> I have test set that only has "uppercase English alphabets" and "numbers". But
> the provided eng.traineddata returns symbols and lower case alphabets
> sometimes. Is there a way to modify the existing traineddata file so that it
> only reads upper case alphabets and numbers?

Use the 'tessedit_char_whitelist' config variable. You can create a
config file like the 'digits' one;
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-recognize-only-digits
or just use 'tesseract -c tessedit_char_whitelist=ABC...123...' on
the command line.
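
As a minimal sketch of the command-line route, the snippet below builds the whitelist of uppercase letters and digits and assembles the corresponding tesseract invocation. The function name `build_tesseract_command` and the file names `plate.png` and `out` are placeholders made up for illustration. The option is placed after the image and output base, which is one common ordering.

```python
import string

# Characters the OCR engine is allowed to return: A-Z followed by 0-9.
WHITELIST = string.ascii_uppercase + string.digits


def build_tesseract_command(image_path, output_base, whitelist=WHITELIST):
    """Return an argv list for the tesseract CLI with a character whitelist.

    Hypothetical helper for illustration; it only builds the command and
    does not run it.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


if __name__ == "__main__":
    # 'plate.png' and 'out' are placeholder names, not files from this thread.
    cmd = build_tesseract_command("plate.png", "out")
    print(" ".join(cmd))
    # To actually run it (requires tesseract on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

The config-file route should be equivalent: a plain-text file containing the variable name, a space, and the same character string, modeled on the 'digits' config linked above.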

Nick

Devon Yoo

Feb 23, 2016, 5:40:33 PM
to tesseract-ocr
Hi Nick,

Thanks for your reply. It helped a lot! I now have some idea of how to set the variables, which are listed in

I would also like to ask a quick follow-up question.
Is there a way to give TesseractEngine a hint about the expected text format? For example, can I set a format like 00XXX00 XX-000, where 0 represents a digit and X represents a letter?

Tom Morris

Feb 24, 2016, 1:08:13 PM
to tesseract-ocr
On Tuesday, February 23, 2016 at 5:40:33 PM UTC-5, Devon Yoo wrote:

> Is there a way to give TesseractEngine a hint of expected text format? For example, can I set a format like 00XXX00 XX-000 where 0 represents number and X represents alphabet?
