Hi all. Firstly, let me say I am totally blown away by Tesseract; it vastly exceeded my expectations for an open source OCR project. I have an application (http://hackaday.io/project/1569-NSA-Away) that involves OCR of hexadecimal information from a computer screen using a handheld Android device. I've been able to use the tess-two API wrapper to successfully run Tesseract OCR in an Android emulator and am developing various unit tests to better tune my Tesseract configuration. The data I am OCR'ing will look something like:
2C B7 CF 07 1F C6 62 1C 8E 53 10 B1 75 06 06 C9 01 6A 08 DA
D4 B5 F9 CF 71 0E 7A DB 04 F3 8B 2A 0D 8E EC 41 50 83 CB E4
Where each pair of hex digits represents one byte of information. I can include error correction if that will be needed.
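To make the expected format concrete, here is a minimal sketch of decoding one OCR'd line back into bytes. The class and method names are hypothetical and not part of tess-two; it assumes tokens are separated by whitespace and rejects anything that is not exactly two hex digits.

```java
import java.util.Arrays;

public class HexLineDecoder {
    /**
     * Decodes a line such as "2C B7 CF" into bytes.
     * Throws IllegalArgumentException if any token is not exactly two hex digits.
     */
    public static byte[] decodeLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new byte[0];
        }
        // Split on runs of whitespace so one or two spaces both work.
        String[] tokens = trimmed.split("\\s+");
        byte[] out = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (!token.matches("[0-9A-Fa-f]{2}")) {
                throw new IllegalArgumentException("Not a hex byte: " + token);
            }
            // parseInt yields 0..255; the cast wraps values above 127 into Java's signed byte.
            out[i] = (byte) Integer.parseInt(token, 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = decodeLine("2C B7 CF 07");
        System.out.println(Arrays.toString(bytes));
    }
}
```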
Steps I have taken so far (the code snippets are the tess-two Java wrapper):
1) Constrained the character whitelist to just the hex digits:
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "0123456789ABCDEF");
This helped a lot.
2) Created a custom dictionary with only 256 words, the possible 00 to FF hex "words", and using the instructions at:
used the combine_tessdata and wordlist2dawg programs to replace the existing eng.word-dawg in an eng.traineddata file.
3) Set the following config variables to increase the strength of dictionary matches:
language_model_penalty_non_freq_dict_word 1
language_model_penalty_non_dict_word 1
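For the custom dictionary step, the 256-entry word list can be generated rather than typed by hand. The sketch below writes one uppercase hex pair per line to a plain text file; the class name and the default file name hex_words.txt are hypothetical, and the resulting file is only the input that would then be handed to wordlist2dawg.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class HexWordList {
    /** Returns the 256 two-digit uppercase hex "words" from 00 to FF, in order. */
    public static List<String> words() {
        List<String> words = new ArrayList<>(256);
        for (int i = 0; i < 256; i++) {
            // %02X pads to two digits and uses uppercase A-F.
            words.add(String.format("%02X", i));
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Output path is an assumption; pass a different one as the first argument.
        Path out = Paths.get(args.length > 0 ? args[0] : "hex_words.txt");
        Files.write(out, words(), StandardCharsets.UTF_8);
    }
}
```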
However, I still occasionally get words that are three characters long and not in the dictionary, e.g. "C9" will be recognized as "129". When this happens it wreaks havoc with the base 16 decoding, because there is then an odd number of hex digits. Since I can include additional error correction data, I'd be fine with dictionary words being hallucinated, but having three characters returned causes a problem.
This makes me wonder if I am properly following the instructions to increase the strength of dictionary matches. In this case, I'd be happy to constrain results to strictly only dictionary words.
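As a stopgap on the decoding side (it does not change what Tesseract returns), the OCR output could be checked token by token before base 16 decoding. This is a minimal sketch with hypothetical class and method names, assuming whitespace-separated tokens and uppercase hex digits; it reports the positions of tokens that are not exactly two hex digits.

```java
import java.util.ArrayList;
import java.util.List;

public class HexTokenCheck {
    /** Returns the zero-based indices of tokens that are not exactly two uppercase hex digits. */
    public static List<Integer> suspectIndices(String ocrLine) {
        List<Integer> suspects = new ArrayList<>();
        String trimmed = ocrLine.trim();
        if (trimmed.isEmpty()) {
            return suspects;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            // A token like "129" fails here because it has three characters.
            if (!tokens[i].matches("[0-9A-F]{2}")) {
                suspects.add(i);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        System.out.println(suspectIndices("06 129 01 6A"));
    }
}
```

If the error correction scheme can handle erasures, the flagged positions could be treated as unknown bytes, which keeps the remaining pairs aligned.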
I'm also wondering if people have advice about this use case in particular. Would you recommend uppercase or lowercase hex digits (lowercase seemed worse in my unit testing), two spaces between words, etc.?
Thanks in advance,
Scott