First, let me say that I am virtually brand-new to OCR. It is only within the past couple of weeks that I have developed an appreciation that today's OCR is much, much more than "character" recognition. It can probably be better described as "message" recognition, where all the nuances of language, words, dictionaries, and syntax/grammar are as important as individual characters. Still, the OCR application I am tackling is really just concerned with recognizing, as accurately as possible, strings of characters where there is no correlation between characters in a string or from one string to the next. Maybe attempting to use today's OCR for this application is analogous to using a jack-hammer to drive a tack. If so, please let me know, and if you can point me towards a technology other than OCR I'd appreciate it.
I apologize for the length of this post. I thought it best to describe as fully as I can the application driving my interest in Tesseract, and also what simple OCR experiments I've done to try to assess what should be possible to achieve with Tesseract.
First, the application. This is a real-world application that probably comes up in many contexts. I will not describe the specifics of the application, but will describe it notionally as follows. In an enterprise there are two sites -- A and B. Information can be produced at site A and needs to be utilized at site B. However, there is no means for transferring the information, electronically, from A to B. The information is in electronic form at site A, and to be utilized efficiently needs to be in electronic form at site B. To make things simple, assume that at site A the information exists in tabular form as a spreadsheet, and it needs to be re-entered into that tabular/spreadsheet form at site B. The current process consists of printing the spreadsheet at site A and physically transferring the printout to site B, where it is scanned to a PDF. Site B has the Adobe Acrobat Pro tool, enabling the scanned PDF to be subjected to OCR and the result saved in a variety of formats, one being an Excel file. However, there are numerous errors in this process, requiring painstaking and time-consuming editing of the Excel file, and there is no good way to be assured that all errors have been found and corrected.
It can be further assumed that in this application the information in the spreadsheet at site A, i.e., the contents of the spreadsheet cells, consists of a very restricted character set -- generally, the characters are upper case letters A-Z, digits 0-9, the decimal point, and possibly the comma. Moreover, in printing the spreadsheet at site A, various things are under control, such as the font used, the font size, and the printing quality. And, at site B, the scan resolution is under control. Clearly, it should be possible to configure these "control variables" so that the hard-copy printout, once scanned, is as "OCR-able as possible" with the smallest probability of character errors. This seems "obvious" to me. However, testing this out is more difficult than I ever expected. Here are some observations, and what I've done so far:
- I have done some research on what fonts are recognizable with the best accuracy, and have narrowed down to a set of about 6. The OCR A Extended font, a default font on Windows systems, seems likely to be best in this regard. It will be possible to print a quality hard-copy at site A in any of these fonts, and with a font size large enough to allow inter-character discrimination, and it will be possible to scan-to-PDF at site B at 600 dpi, which should be as good as needed.
- Unfortunately, the OCR performance of Adobe Acrobat Pro on the scanned PDF is worst with the OCR A Extended font. This is because there are no means for configuring the Adobe Acrobat OCR engine -- it is probably pre-configured to expect scanned material in a variety of more conventional fonts -- and OCR A Extended is probably a font in which few materials for actual human reading are printed (it's not a very "pleasing" font). If it were possible to configure Adobe Acrobat to expect a particular font, and even better a restricted alphabet of characters, that might be the ideal tool to use, since it's already at site B.
- In researching OCR SW, I have come to the opinion that Adobe Acrobat Pro and also ABBYY FineReader are held in the highest regard. They generally lead the pack in reviews of commercial OCR. So, I have also tested out, on a trial basis, ABBYY FineReader. Here is what I have found with ABBYY:
- It is possible to "train" ABBYY to expect input in a specific font and in a restricted character alphabet (in my case, A-Z, 0-9, decimal point, and comma). It's not as simple as just specifying, say, OCR A Extended, and a set of characters. Rather, material with the alphabet and in the chosen font must be read in, and ABBYY can be made to map each character shape in the input to a given character. Except, see 2 bullets below ...
- When I trained ABBYY to the restricted character alphabet, in the OCR A Extended font, I was able to successfully complete the training -- i.e. each character "shape" in the training input could be mapped to the appropriate character. After doing that, the "trained" ABBYY performed as well as could be expected in correctly recognizing two pages of scanned material (about 2000 random characters), printed in the OCR A Extended font. There were no errors at all. I was not able, due to restrictions in the ABBYY evaluation trial, to test the recognition performance with more than two pages, but it appears that ABBYY and the OCR A Extended font could produce the recognition accuracy I'm looking for in my application. However, there is the expense of acquiring ABBYY at site B (actually, there are many site B's), as well as other issues I won't get into.
- When I attempted to train ABBYY in one of the other fonts, I found that in the training, after having mapped a number of character shapes to correct characters, inevitably there would be a next character shape, not yet mapped, that ABBYY considered to be already mapped, on the basis of similarity to shapes that had been mapped. As an example, ABBYY might allow correct mapping of the font shapes for 0-9 and A to the characters 0-9 and A, but when the font shape for B occurred, ABBYY considered this already mapped to the character 8, and there was no way to cancel that default mapping. Thus, for all these other fonts, it was not possible to correctly train ABBYY to recognize all the character shapes.
- The last point above is partial confirmation that the OCR A Extended font is probably the best for OCR recognizability, on a per-character basis.
So, now this FINALLY gets down to my interest in Tesseract. (Sorry for all the verbiage above, but it helps put into context what I'd like to do with Tesseract.) I'm hoping that, as far as ability to recognize individual characters correctly, Tesseract should be as good as the leading commercial OCR engines -- Adobe Acrobat Pro and ABBYY FineReader. Also, from everything I've dug into about Tesseract, it appears that it should be possible to "train" or "configure" Tesseract to expect input consisting of
- Characters only in a single well-defined, specific font. Currently I am focusing on OCR A Extended, due to the positive character recognizability results I was able to produce in my experiments with ABBYY.
- Characters only in a specific subset -- currently I would want to assume just 0-9 and A-Z.
- Completely "random" characters, meaning no predetermined dictionary of words, no language, no aspects of the input that would make certain character strings more likely to be correct than others.
I know it's possible to "train" Tesseract, and probably that has to be done with the OCR A Extended font. But the documentation on how to do that is extremely difficult to follow, and frankly I don't have a lot of time to spend on becoming a relative Tesseract expert. It would be so much better/easier (for my purposes) if, when it is known a priori that the image input is all in a specific font, I could simply convey that fact to Tesseract on the command line. Possibly someone else has already done the training for the OCR A Extended font, and I could reuse what was done there?
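If such a trained data file did exist, my understanding is that using it would look something like the sketch below. Note this is an assumption on my part: the file name "ocra.traineddata" is hypothetical, and the tessdata path varies by installation.

```shell
# Hypothetical: suppose someone has shared "ocra.traineddata" trained on
# the OCR A Extended font. Copy it into Tesseract's tessdata directory
# (location varies by install), then select it with the -l option:
cp ocra.traineddata /usr/share/tesseract-ocr/tessdata/

# Run recognition using that trained data; text goes to output.txt
tesseract scanned_page.png output -l ocra
```

If anyone has such a file, or can confirm this is the right way to plug one in, that would save me a great deal of time.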
I also know that it's possible to specify "white-lists" and "black-lists" of characters, and that this should give me a way to instruct Tesseract that the characters are known to be 0-9 or A-Z. However, I have yet to find a clear description of the actual mechanics of setting everything up and properly configuring the command line to effect this definition of a restricted character set.
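From what I've pieced together so far, the variable involved is tessedit_char_whitelist, settable either directly on the command line or via a config file. A sketch of both, with the caveat that the config-file name "myalpha" is my own invention, and that I've read the whitelist may only be honored by the legacy recognition engine in newer Tesseract versions:

```shell
# Restrict recognition to A-Z, 0-9, decimal point, and comma, setting
# the variable directly on the command line with -c:
tesseract scanned_page.png output \
  -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,"

# Equivalently, create a config file (e.g. tessdata/configs/myalpha)
# containing the single line:
#   tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,
# and name it as a trailing argument:
tesseract scanned_page.png output myalpha
```

Confirmation or correction of this from those who know the mechanics would be very welcome.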
I'm more than a little "bothered" by the role that "language" plays in Tesseract, since in my application there are completely random character strings. Is there a way to specify "no language" in the execution of Tesseract, meaning no a priori set of acceptable words, and nothing else that might cause Tesseract to choose one character string over another, aside from simply the best choice on an individual, character-by-character basis?
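My best guess, from poking around, is that the dictionary influence can be switched off through the load_system_dawg and load_freq_dawg config variables. A sketch of what I believe the invocation would be (again, an assumption I'd like the experts to confirm):

```shell
# Disable the word and frequent-word dictionaries so that no character
# string is favored over another on linguistic grounds:
tesseract scanned_page.png output \
  -c load_system_dawg=F \
  -c load_freq_dawg=F
```

Whether this truly removes all language bias from the character choices, or whether other variables are also involved, is exactly the kind of thing I'm hoping someone here can clarify.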
The above describes what motivates what I'm trying to do with Tesseract. I'm hoping to get, from all of you Tesseract experts, some of the following:
- Anything that's already been done that I could reuse. Specifically, if Tesseract has already been trained to expect inputs in the OCR A Extended font, having the results of that, and directions on how to utilize it, would be ideal.
- Better documentation that allows me to more fully understand everything about the Tesseract execution environment (e.g., config files, how they are constructed, where they reside, etc.) and also all the specific little options in the command line that I might need to know about.
Again, sorry about how long this turned out to be. It should be possible for subsequent posts to be much shorter and to the point of specific issues. Think of this post as providing background and definition of what I want to accomplish with Tesseract.
Thanks.