Advice needed on effective hexadecimal recognition

scott...@gmail.com

Jun 28, 2014, 12:39:21 AM

to tesser...@googlegroups.com

Hi all. Firstly, let me say I am totally blown away by Tesseract; it vastly exceeded my expectations for an open source OCR project. I have an application (http://hackaday.io/project/1569-NSA-Away) that involves OCR of hexadecimal information from a computer screen using a handheld Android device. I've been able to use the tess-two API wrapper to successfully run Tesseract OCR in an Android emulator and am developing various unit tests to better tune my Tesseract configuration. The data I am OCR'ing will look something like:

2C B7 CF 07 1F C6 62 1C 8E 53 10 B1 75 06 06 C9 01 6A 08 DA

D4 B5 F9 CF 71 0E 7A DB 04 F3 8B 2A 0D 8E EC 41 50 83 CB E4

Where each pair of hex digits represents one byte of information. I can include error correction if that will be needed.
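To make the format concrete, here is a minimal sketch of decoding one such line into bytes. The class and method names (HexLineDecoder, decode) are hypothetical and not part of tess-two or Tesseract; the sketch assumes the OCR output has already been obtained as a plain string.

```java
class HexLineDecoder {
    /**
     * Decodes whitespace-separated two-digit hex tokens into bytes.
     * Throws IllegalArgumentException if any token is not exactly two hex digits.
     */
    static byte[] decode(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new byte[0];
        }
        String[] tokens = trimmed.split("\\s+");
        byte[] out = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].matches("[0-9A-Fa-f]{2}")) {
                throw new IllegalArgumentException("Not a hex byte: " + tokens[i]);
            }
            // parseInt yields 0..255; the cast wraps values above 127 into a signed byte.
            out[i] = (byte) Integer.parseInt(tokens[i], 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = decode("2C B7 CF 07");
        System.out.println(bytes.length + " bytes decoded");
    }
}
```

A strict decoder like this fails on any token that is not exactly two hex digits, so a single extra character in the OCR output is caught right away.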

Steps I have taken so far (the code snippets are the tess-two Java wrapper):

1) Constrained the character whitelist to just the hex digits:

baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "0123456789ABCDEF");

This helped a lot.

2) Created a custom dictionary with only 256 words, the possible 00 to FF hex "words", and using the instructions at:

http://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract

used the combine_tessdata and wordlist2dawg programs to replace the existing eng.word-dawg in an eng.traineddata file.

3) Attempted to increase the strength of dictionary matches as discussed on the FAQ (https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?), both via API calls to setVariable and via a configuration file (tess-two uses tesseract 3.0.3):

language_model_penalty_non_freq_dict_word 1

language_model_penalty_non_dict_word 1

However, I still occasionally get words that are three characters long and not in the dictionary, e.g. "C9" will be recognized as "129". When this happens it wreaks havoc with the base 16 decoding, as there is an odd number of hex digits. Since I can include additional error correction data, I'd be fine with dictionary words being hallucinated, but having three characters returned causes a problem.

This makes me wonder if I am properly following the instructions to increase the strength of dictionary matches. In this case, I'd be happy to constrain results to strictly only dictionary words.
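For reference, the 256-entry word list described in step 2 could be generated with a sketch like the one below. The class name HexWordList is hypothetical, and the one-word-per-line output is an assumption about what wordlist2dawg expects, so it is worth checking against that tool's documentation.

```java
import java.util.ArrayList;
import java.util.List;

class HexWordList {
    /** Returns the 256 two-character uppercase hex "words", 00 through FF. */
    static List<String> hexWords() {
        List<String> words = new ArrayList<>(256);
        for (int b = 0; b < 256; b++) {
            // %02X pads to two digits and uses uppercase A-F.
            words.add(String.format("%02X", b));
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints one word per line; redirect to a file to use as word-list input.
        for (String word : hexWords()) {
            System.out.println(word);
        }
    }
}
```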

I'm also wondering if people have advice about this use case in particular. Would you recommend upper or lower case hex digits (lower seemed worse in my unit testing), two spaces between words, etc.?

Thanks in advance,

Scott

Tom Morris

Jun 28, 2014, 12:42:18 PM

to tesser...@googlegroups.com

On Saturday, June 28, 2014 12:39:21 AM UTC-4, scott...@gmail.com wrote:

I have an application (http://hackaday.io/project/1569-NSA-Away) that involves OCR of hexadecimal information from a computer screen using a handheld Android device. I've been able to use the tess-two API wrapper to successfully run Tesseract OCR in an Android emulator and am developing various unit tests to better tune my Tesseract configuration. The data I am OCR'ing will look something like:

2C B7 CF 07 1F C6 62 1C 8E 53 10 B1 75 06 06 C9 01 6A 08 DA
D4 B5 F9 CF 71 0E 7A DB 04 F3 8B 2A 0D 8E EC 41 50 83 CB E4

Where each pair of hex digits represents one byte of information. I can include error correction if that will be needed.

...

I'm also wondering if people have advice about this use case in particular. Would you recommend upper or lower case hex digits (lower seemed worse in my unit testing), two spaces between words, etc.

I'm wondering if you wouldn't be better off with a different symbol set / code alphabet. If it's just 'C' that's giving you fits, replace with 'Y' or 'K' or some other letter that Tess can easily distinguish. Alternatively, pick 256 (or more) short dictionary words to represent your code points. Or encode the cipher text in a PNG of a 1-D or 2-D bar code. Or use something like OpenCV where you can more tightly control how the symbol recognition is done.
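As a rough illustration of the first suggestion, a substituted alphabet could be applied just before display and reversed just after OCR. This is only a sketch: the choice of 'K' as the stand-in for 'C' and the names HexAlphabet, toDisplay, and fromOcr are hypothetical, and the whitelist and custom dictionary would need to be rebuilt for the substituted alphabet.

```java
class HexAlphabet {
    // Hypothetical substitution: show 'K' on screen wherever the hex data has 'C'.
    private static final char ORIGINAL = 'C';
    private static final char SUBSTITUTE = 'K';

    /** Converts standard uppercase hex text into the on-screen alphabet. */
    static String toDisplay(String hex) {
        return hex.replace(ORIGINAL, SUBSTITUTE);
    }

    /** Converts OCR output in the on-screen alphabet back to standard hex. */
    static String fromOcr(String text) {
        return text.replace(SUBSTITUTE, ORIGINAL);
    }

    public static void main(String[] args) {
        String shown = toDisplay("2C B7 CF 07");
        System.out.println(shown + " -> " + fromOcr(shown));
    }
}
```

Because 'K' never appears in ordinary hex, the mapping is reversible and the byte values are unchanged.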

Tom

Tom Morris

Jun 28, 2014, 12:48:39 PM

to tesser...@googlegroups.com

p.s.

On Saturday, June 28, 2014 12:39:21 AM UTC-4, scott...@gmail.com wrote:

3) Attempted to increase the strength of dictionary matches as discussed on the FAQ (https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?), both via API calls to setVariable and via a configuration file (tess-two uses tesseract 3.0.3):

language_model_penalty_non_freq_dict_word 1
language_model_penalty_non_dict_word 1

However, I still occasionally get words that are three characters long and not in the dictionary, e.g. "C9" will be recognized as "129". When this happens it wreaks havoc with the base 16 decoding, as there is an odd number of hex digits. Since I can include additional error correction data, I'd be fine with dictionary words being hallucinated, but having three characters returned causes a problem.

This makes me wonder if I am properly following the instructions to increase the strength of dictionary matches. In this case, I'd be happy to constrain results to strictly only dictionary words.

Since these are doubles, you might want to try 0.9 (or even 0.5) to make sure that you're not running into some type of boundary condition. I haven't played with them myself, so I'm not sure how they're handled internally.

Tom

scott...@gmail.com

Jun 28, 2014, 1:30:44 PM

to tesser...@googlegroups.com

I tried 0.9 for both, with the same result: words not in the dictionary are still returned. I'll do some more study to see if, for instance, it is always one or a few characters causing problems. Your idea of using a different character would be workable. I want to avoid bar codes in order to keep the data easily human-verifiable. Using 256 short dictionary words is a good idea as well, and it is somewhat reminiscent of a phonetic alphabet, which fits with the spirit of the project.

It still seems that for my use case, forcing dictionary words, if it worked, would be a preferable solution. Are there any known defects or test cases where configuring the documented variables actually does work?

I was able to get more reliable results by using two space characters between "words" and then iterating the results word-by-word looking for words containing three letters, setting those to "FF" and letting the error correction take care of it, but again it seems like constraining the results to dictionary words would be more elegant.
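The workaround amounts to something like the sketch below. It is slightly more general than what I described: any word that is not exactly two uppercase hex digits is replaced, not only three-character words. The names HexWordRepair and repair are hypothetical, and the input is assumed to be the recognized text of one line as a plain string.

```java
import java.util.StringJoiner;

class HexWordRepair {
    // Placeholder byte substituted for unreadable words; error correction is
    // expected to fix it afterwards.
    static final String PLACEHOLDER = "FF";

    /** Replaces every word that is not exactly two hex digits with the placeholder. */
    static String repair(String ocrLine) {
        String trimmed = ocrLine.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        StringJoiner out = new StringJoiner(" ");
        // Splitting on runs of whitespace handles the two-space word separator.
        for (String word : trimmed.split("\\s+")) {
            out.add(word.matches("[0-9A-F]{2}") ? word : PLACEHOLDER);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("06  129  01  6A"));
    }
}
```

This keeps the byte count of each line intact, so a misread word costs one corrupted byte instead of shifting every digit after it.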

Thanks for the advice!

Scott

Nick White

Jun 30, 2014, 1:44:36 PM

to tesser...@googlegroups.com

Hi Scott,

On Fri, Jun 27, 2014 at 09:39:21PM -0700, scott...@gmail.com wrote:
> Hi all. Firstly let me say I am totally blown away by Tesseract, it vastly
> exceeded my expectations for an open source OCR project. I have an application
> (http://hackaday.io/project/1569-NSA-Away) that involves OCR of hexadecimal
> information from a computer screen using a hand held Android device. I've been
> able to use the tess-two API wrapper to successfully run Tesseract OCR in an
> Android emulator and am developing various unit tests to better tune my
> Tesseract configuration. The data I am OCR'ing will look something like:
>
> 2C B7 CF 07 1F C6 62 1C 8E 53 10 B1 75 06 06 C9 01 6A 08 DA
> D4 B5 F9 CF 71 0E 7A DB 04 F3 8B 2A 0D 8E EC 41 50 83 CB E4
>
> Where each pair of hex digits represents one byte of information. I can
> include error correction if that will be needed.

One thing that may help would be looking into the user-patterns
stuff Tesseract has. I'm not sure how much of an effect this would
have, or whether you can increase Tesseract's reliance on it, but it
ought to help. There's a bit of information on user-patterns on
Tesseract's manpage, possibly something elsewhere, I'm not sure.

Nick
