Detect only alphanumeric characters

John Nilson

Sep 3, 2014, 11:50:56 AM

to tesser...@googlegroups.com

Any help would be greatly appreciated.

I would like to do something fairly simple: restrict the characters Tesseract looks for to alphanumerics only (0-9, a-z, A-Z). I'm using the latest version, 3.02.02. I want to do this because Tesseract is confusing characters, for example reading M as |'U'| (note the pipes and single quotes). Removing punctuation like that from consideration should reduce these errors.

My first attempt was to:

1) Edit the default eng.cube.lm and eng.cube.lm_ files in the tessdata directory.

2) Remove the non-alphanumeric punctuation characters.

3) Run combine_tessdata to generate a new eng.traineddata.

Unfortunately this isn't working. Here are the directory listing and the output I get:

C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>dir
 Volume in drive C has no label.
 Volume Serial Number is F0DD-A475

 Directory of C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation

09/03/2014  11:43 AM    <DIR>          .
09/03/2014  11:43 AM    <DIR>          ..
09/03/2014  10:30 AM    <DIR>          configs
02/03/2012  02:47 AM        21,876,572 eng - Copy.jpg
02/03/2012  03:15 AM           171,918 eng.cube.bigrams
02/03/2012  03:15 AM                38 eng.cube.fold
09/03/2014  10:38 AM               137 eng.cube.lm
09/03/2014  10:38 AM               137 eng.cube.lm_
02/03/2012  03:15 AM           857,304 eng.cube.nn
02/03/2012  03:15 AM               254 eng.cube.params
02/03/2012  03:15 AM        13,020,078 eng.cube.size
02/03/2012  03:15 AM         2,444,187 eng.cube.word-freq
02/03/2012  03:15 AM               996 eng.tesseract_cube.nn
09/03/2014  11:46 AM                 0 eng.traineddata
09/03/2014  11:44 AM                 0 lang.traineddata
02/03/2012  03:15 AM        10,562,727 osd.traineddata
09/03/2014  10:30 AM    <DIR>          tessconfigs
              13 File(s)     48,934,348 bytes
               4 Dir(s)  666,501,136,384 bytes free

C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>combine_tessdata eng.
Combining tessdata files
Error opening unicharset file
Error combining tessdata files into eng.traineddata

C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>

Shree Devi Kumar

Sep 3, 2014, 9:23:08 PM

to tesser...@googlegroups.com

Did you unpack eng.traineddata first to get all the files?

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b2356fd1-be8c-45c7-9f18-afa2a459eef9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

John Nilson

Sep 3, 2014, 10:19:30 PM

to tesser...@googlegroups.com

"did you unpack the eng.traineddata first to get all the files?"

No. How do I do that?

Shree Devi Kumar

Sep 4, 2014, 5:30:19 AM

to tesser...@googlegroups.com

http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html

Use combine_tessdata -u to unpack the traineddata file and get all of its component files. The unicharset will be among them.

I am not familiar with the cube files that you are changing, so I can't comment on those.
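
For anyone following along: the "Error opening unicharset file" message earlier in the thread is consistent with the directory listing, which contains no eng.unicharset. Below is a minimal Python sketch that checks for that file before combining and builds the unpack command. The helper names (missing_components, unpack_command) and the default component list are illustrative assumptions, not part of Tesseract.

```python
from pathlib import Path


def missing_components(directory, lang="eng", components=("unicharset",)):
    """Return the expected component files (e.g. eng.unicharset) that are absent.

    The default list only covers the file named in the error message; which
    other components are required is not something this sketch asserts.
    """
    base = Path(directory)
    return [
        f"{lang}.{name}"
        for name in components
        if not (base / f"{lang}.{name}").is_file()
    ]


def unpack_command(traineddata="eng.traineddata", prefix="eng."):
    """Build the argument list for: combine_tessdata -u <traineddata> <prefix>"""
    return ["combine_tessdata", "-u", traineddata, prefix]
```

Calling missing_components(".") in the working directory before running combine_tessdata would flag the missing file up front; the list from unpack_command() can be passed to a process runner of your choice.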

John Nilson

Sep 4, 2014, 11:44:24 AM

to tesser...@googlegroups.com

Thanks. That did the trick. I was able to switch to alphanumeric only. Here are the steps I took:

1) Copied the eng.* files into a new "Unpacked" directory I created, then ran combine_tessdata -u to unpack:

...\tessdata\Unpacked>combine_tessdata -u eng.traineddata ./eng.
Extracting tessdata components from eng.traineddata
Wrote ./eng.config
Wrote ./eng.unicharset
Wrote ./eng.unicharambigs
Wrote ./eng.inttemp
Wrote ./eng.pffmtable
Wrote ./eng.normproto
Wrote ./eng.punc-dawg
Wrote ./eng.word-dawg
Wrote ./eng.number-dawg
Wrote ./eng.freq-dawg
Wrote ./eng.cube-unicharset
Wrote ./eng.cube-word-dawg
Wrote ./eng.shapetable
Wrote ./eng.bigram-dawg

2) Edited eng.config and added the line:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

3) Created a new eng.traineddata file using the following command:

...\tessdata\Unpacked>combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is 140
Offset for type 1 is 358
Offset for type 2 is 7643
Offset for type 3 is 8690
Offset for type 4 is 980283
Offset for type 5 is 981099
Offset for type 6 is 997382
Offset for type 7 is 1001704
Offset for type 8 is 2085898
Offset for type 9 is 2112548
Offset for type 10 is -1
Offset for type 11 is 2113958
Offset for type 12 is 2115469
Offset for type 13 is 3177575
Offset for type 14 is 3240921
Offset for type 15 is -1
Offset for type 16 is -1

4) Ran Tesseract on the image file from which I wanted to extract only alphanumeric characters, and it worked!
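
The config edit and recombine steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the config file is assumed to be UTF-8 text with one "variable value" pair per line, and the unpack step is assumed to have already produced eng.config.

```python
import string
from pathlib import Path

# Same character set as the whitelist in step 2: a-z, A-Z, 0-9.
ALNUM = string.ascii_lowercase + string.ascii_uppercase + string.digits


def add_whitelist(config_path, chars=ALNUM):
    """Append a tessedit_char_whitelist line unless the file already has one.

    Returns True if the file was modified, False if it was left alone.
    """
    path = Path(config_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    if "tessedit_char_whitelist" in text:
        return False
    if text and not text.endswith("\n"):
        text += "\n"  # keep the new setting on its own line
    path.write_text(text + f"tessedit_char_whitelist {chars}\n", encoding="utf-8")
    return True


def combine_command(prefix="eng."):
    """Build the argument list for: combine_tessdata <prefix>"""
    return ["combine_tessdata", prefix]
```

Building the whitelist from the string module avoids hand-typing the 62 characters, which is where an omitted letter would otherwise go unnoticed.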

Shree Devi Kumar

Sep 4, 2014, 10:02:04 PM

to tesser...@googlegroups.com

You may also be able to do this by passing a config file as a parameter at runtime, without rebuilding eng.traineddata. I haven't tried it with a whitelist, though.
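
A minimal Python sketch of that runtime approach, for illustration only. The config file name "alnum", the image name, and both helper functions are hypothetical. It assumes a config file is plain text with "variable value" lines, as in the eng.config edit earlier in the thread; the digits config that ships in tessdata/configs sets the same tessedit_char_whitelist variable, so a whitelist config of this kind seems likely to work, but it is untested here. Placing the file under tessdata/configs is the safer assumption about where Tesseract will look for it.

```python
from pathlib import Path


def write_runtime_config(path, chars):
    """Write a one-line config file that sets tessedit_char_whitelist."""
    Path(path).write_text(f"tessedit_char_whitelist {chars}\n", encoding="utf-8")
    return str(path)


def tesseract_command(image, output_base, config):
    """Build the argument list for: tesseract <image> <output_base> <config>"""
    return ["tesseract", image, output_base, config]
```

For example, after writing a config named alnum, tesseract_command("scan.png", "out", "alnum") yields the arguments for a run that leaves the stock eng.traineddata untouched.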
