Modifying existing traineddata

Devon Yoo

Feb 22, 2016, 1:43:33 PM

to tesseract-ocr

I don't know why my previous post was rejected, but I am reposting my question anyway.

My test set contains only uppercase English letters and digits, but the provided eng.traineddata sometimes returns symbols and lowercase letters. Is there a way to modify the existing traineddata file so that it recognizes only uppercase letters and digits?

Thanks in advance.

Nick White

Feb 23, 2016, 4:00:22 AM

to tesser...@googlegroups.com

Hi Devon,

On Mon, Feb 22, 2016 at 10:43:33AM -0800, Devon Yoo wrote:
> I have test set that only has "uppercase English alphabets" and "numbers". But
> the provided eng.traineddata returns symbols and lower case alphabets
> sometimes. Is there a way to modify the existing traineddata file so that it
> only reads upper case alphabets and numbers?

Use the 'tessedit_char_whitelist' config variable. You can create a
config file like the 'digits' one;
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-recognize-only-digits
or just use 'tesseract -c tessedit_char_whitelist=ABC...123...' on
the command line.
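
A minimal sketch of the command-line form, written in Python so the full whitelist does not have to be typed by hand. The file names `input.png` and `out` are hypothetical placeholders, and the helper `build_command` is only for illustration. The script builds and prints the command; it does not run Tesseract itself.

```python
import shlex
import string

# Uppercase A-Z followed by the digits 0-9.
WHITELIST = string.ascii_uppercase + string.digits


def build_command(image_path, output_base):
    """Return the tesseract argument list with the whitelist set via -c.

    image_path and output_base are placeholders supplied by the caller.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_char_whitelist=" + WHITELIST,
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace with your own image and output base.
    command = build_command("input.png", "out")
    print(" ".join(shlex.quote(arg) for arg in command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` if Tesseract is installed.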

Nick

Devon Yoo

Feb 23, 2016, 5:40:33 PM

to tesseract-ocr

Hi Nick,

Thanks for your reply. It helped a lot! I now have some idea of how to set variables; the available ones are listed at

http://stackoverflow.com/questions/13087252/where-i-can-find-the-list-of-available-property-name-for-tesseract-setvariable

I would also like to ask a quick follow-up question.

Is there a way to give TesseractEngine a hint about the expected text format? For example, can I set a format like 00XXX00 XX-000, where 0 represents a digit and X represents a letter?

Tom Morris

Feb 24, 2016, 1:08:13 PM

to tesseract-ocr

On Tuesday, February 23, 2016 at 5:40:33 PM UTC-5, Devon Yoo wrote:

> Is there a way to give TesseractEngine a hint of expected text format? For example, can I set a format like 00XXX00 XX-000 where 0 represents number and X represents alphabet?

See the answer to this question: http://stackoverflow.com/questions/14858514/tesseract-ocr-is-it-possible-to-force-a-specific-pattern
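
Whatever Tesseract itself can be configured to do, the recognized text can also be checked against the expected format after OCR. The following is a rough post-processing sketch, not a Tesseract feature: it turns a mask such as 00XXX00 XX-000 into a regular expression and rejects results that do not fit. The function names are hypothetical, and it assumes that 0 means an ASCII digit and X means an uppercase letter A-Z.

```python
import re


def mask_to_regex(mask):
    """Compile a format mask into a regex.

    Assumed convention: '0' is one digit 0-9, 'X' is one uppercase
    letter A-Z, and every other character must match literally.
    """
    parts = []
    for ch in mask:
        if ch == "0":
            parts.append("[0-9]")
        elif ch == "X":
            parts.append("[A-Z]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches_mask(text, mask):
    """Return True if the whole text (ignoring outer whitespace) fits the mask."""
    return mask_to_regex(mask).fullmatch(text.strip()) is not None


if __name__ == "__main__":
    mask = "00XXX00 XX-000"
    # Made-up sample strings standing in for OCR output.
    for candidate in ["12ABC34 DE-567", "12ABC34 DE-56O"]:
        print(candidate, matches_mask(candidate, mask))
```

A result that fails the check can be flagged for manual review or re-run with different settings; the sketch does not try to correct it.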

Tom
