I would like to do something fairly simple: restrict the characters Tesseract looks for to alphanumerics only (0-9, a-z, A-Z). I'm using the latest version, 3.02.02. I want to do this because Tesseract is doing things like confusing M with |'U'| (notice the pipes and single quotes). I'd like to remove punctuation like that to reduce errors.
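To make the goal concrete, here is a minimal Python sketch of the character set I have in mind. The `tessedit_char_whitelist` line is an assumption on my part: it is a Tesseract config variable intended for this kind of restriction, but I have not confirmed it behaves the way I want in 3.02.02. The helper names are hypothetical, and `strip_non_alnum` is only a post-processing fallback, not something Tesseract does itself.

```python
import string

# The 62 characters I want Tesseract restricted to: 0-9, a-z, A-Z.
ALNUM = string.digits + string.ascii_lowercase + string.ascii_uppercase


def whitelist_config_line(chars: str = ALNUM) -> str:
    """Format a one-line config entry for a character whitelist.

    Assumption: the config file format is "<variable> <value>" per line.
    """
    return "tessedit_char_whitelist " + chars


def strip_non_alnum(text: str, allowed: str = ALNUM) -> str:
    """Drop every character outside the allowed set (post-processing fallback)."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set)


if __name__ == "__main__":
    print(whitelist_config_line())
    print(strip_non_alnum("|'U'|"))
```

If the variable works as I assume, the printed line could be saved as a file under tessdata\configs (say `alnum`, a hypothetical name) and passed as the trailing config argument, e.g. `tesseract image.tif out alnum`.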
Here's what I tried:
1) Edit the default eng.cube.lm and eng.cube.lm_ files in the tessdata directory.
2) Remove the non-AlphaNumeric punctuation characters.
Unfortunately, this isn't working. Here's the directory listing and the output I get:
C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>dir
Volume in drive C has no label.
Volume Serial Number is F0DD-A475
Directory of C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation
09/03/2014 11:43 AM <DIR> .
09/03/2014 11:43 AM <DIR> ..
09/03/2014 10:30 AM <DIR> configs
02/03/2012 02:47 AM 21,876,572 eng - Copy.jpg
02/03/2012 03:15 AM 171,918 eng.cube.bigrams
02/03/2012 03:15 AM 38 eng.cube.fold
09/03/2014 10:38 AM 137 eng.cube.lm
09/03/2014 10:38 AM 137 eng.cube.lm_
02/03/2012 03:15 AM 857,304 eng.cube.nn
02/03/2012 03:15 AM 254 eng.cube.params
02/03/2012 03:15 AM 13,020,078 eng.cube.size
02/03/2012 03:15 AM 2,444,187 eng.cube.word-freq
02/03/2012 03:15 AM 996 eng.tesseract_cube.nn
09/03/2014 11:46 AM 0 eng.traineddata
09/03/2014 11:44 AM 0 lang.traineddata
02/03/2012 03:15 AM 10,562,727 osd.traineddata
09/03/2014 10:30 AM <DIR> tessconfigs
13 File(s) 48,934,348 bytes
4 Dir(s) 666,501,136,384 bytes free
C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>combine_tessdata eng.
Combining tessdata files
Error opening unicharset file
Error combining tessdata files into eng.traineddata
C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>
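Looking at the listing, there is no eng.unicharset in the directory, and eng.traineddata is 0 bytes, which may be what the error is complaining about. Below is a minimal Python sketch to check for it. It assumes (my guess from the error message, not from the documentation) that combine_tessdata expects a file named eng.unicharset next to the other eng. components; `missing_components` is a hypothetical helper, not part of Tesseract.

```python
from pathlib import Path


def missing_components(directory, lang="eng", required=("unicharset",)):
    """Return names such as 'eng.unicharset' that are absent from directory.

    Assumption: components are named "<lang>.<suffix>", matching the
    "eng." prefix passed to combine_tessdata.
    """
    base = Path(directory)
    names = [f"{lang}.{suffix}" for suffix in required]
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # Run from inside the tessdata_NoPunctuation directory.
    missing = missing_components(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required components found.")
```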