You're right. OCR is something of a black art. I have very little direct
experience with OCR but I have spent the last 20 years of my life
messing with document imaging systems. I had to be able to read bar
codes and QR codes and experimented with OCR. For what it is worth, here
are some tips that have helped me over the years.

Always scan documents in black and white. One-bit-per-pixel images are
easier to decode than colour or grey scale.
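
If your scanner only gives you grey scale, the reduction to one bit per
pixel can be done afterwards. What follows is only a minimal sketch in
plain Python, assuming the image is already in memory as rows of 8-bit
grey values. The function name and the fixed threshold of 128 are
illustrative assumptions, not something any particular tool uses.

```python
def to_bilevel(gray_rows, threshold=128):
    """Reduce 8-bit grey values (0-255) to 1-bit pixels.

    Returns 1 for ink (darker than the threshold) and 0 for paper.
    The threshold of 128 is an arbitrary midpoint, chosen for illustration.
    """
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_rows]


# A tiny made-up "scan": light paper with a few dark pixels.
scan = [
    [250, 240, 30, 20, 245],
    [255, 90, 10, 200, 251],
]
print(to_bilevel(scan))
```

A fixed threshold is the crudest possible choice. Faded or unevenly lit
pages may need a different value, so treat the number as a knob to
experiment with.
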
If your OCR tool supports TIFF images, always scan to TIFF using CCITT
Group 3 or Group 4 compression. These compression algorithms are
lossless, meaning that you get out exactly what you put in. Lossy
schemes like JPEG are another matter (PNG, for the record, is lossless).
With JPEG, what you get out is different from and "fuzzier" than what
you put in. It is good enough to fool the human eye but terrible for
bar coding or OCR.
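
To make "lossless" concrete: the CCITT schemes are built on run lengths
of white and black pixels, and run lengths can always be expanded back
to the exact original row. The sketch below is NOT a CCITT codec (it
leaves out the Huffman-style code tables and the two-dimensional
coding). It is only a toy run-length round trip, and the function names
are made up for the example.

```python
def encode_runs(row):
    """Turn a row of 0/1 pixels into alternating run lengths.

    By convention here the first run is white (0); a row that starts
    with black gets a white run of length zero in front.
    """
    runs = []
    colour = 0
    count = 0
    for pixel in row:
        if pixel == colour:
            count += 1
        else:
            runs.append(count)
            colour = pixel
            count = 1
    runs.append(count)
    return runs


def decode_runs(runs):
    """Expand alternating white/black run lengths back into pixels."""
    row = []
    colour = 0
    for length in runs:
        row.extend([colour] * length)
        colour = 1 - colour
    return row


row = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
runs = encode_runs(row)
print(runs)                      # [3, 2, 4, 1]
print(decode_runs(runs) == row)  # True: nothing was lost
```

A lossy scheme gives no such guarantee, which is why the edges of
characters and bars come back smeared.
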
In general, higher resolution images are better than lower. Having said
that, I always had good luck reading bar codes from images at 300 DPI or
400 DPI. Anything less was chancy at best.
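
The arithmetic behind the resolution advice is simple enough to check.
This is a rough sketch, assuming a US letter page and 10-point type,
neither of which comes from any actual listing. A point is 1/72 of an
inch, and the point size is the nominal type height, so real glyphs come
out somewhat smaller than the figure computed here.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size scanned at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def type_height_px(point_size, dpi):
    """Nominal height in pixels of type of point_size scanned at dpi."""
    return point_size / 72 * dpi


for dpi in (150, 200, 300, 400):
    width, height = page_pixels(8.5, 11, dpi)
    print(f"{dpi} DPI: page {width} x {height} px, "
          f"10 pt type about {type_height_px(10, dpi):.1f} px tall")
```

At 150 DPI the assumed 10-point type is only about 21 pixels tall, which
leaves very few pixels to tell similar characters apart.
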
I experimented with OCR to try to extract things like freight bill
numbers and bill of lading numbers from scanned documents. I never found
anything that came close to being accurate enough for my purposes but I
did learn a few things.

In general, OCR is geared to reading prose, like novels or newspapers.
So the first thing that many OCR kits do is try to find the columns in
the image so that they can lay out the recognized text the same way.
For our purposes, if your OCR kit has an option for it, I would suggest
telling it that everything is in a single column.
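
As one example of such an option, and purely as an assumption since no
particular kit is named above: the open source Tesseract engine has a
page segmentation mode flag, where mode 6 means "assume a single uniform
block of text" and mode 4 means "assume a single column of text of
variable sizes". The helper below is hypothetical. It only builds the
command line and does not run anything, and the `--psm` spelling applies
to Tesseract 4 and later.

```python
def build_tesseract_command(image_path, output_base, psm=6):
    """Build an argument list for the tesseract command line tool.

    psm=6 asks Tesseract to treat the page as one uniform block of
    text instead of hunting for newspaper-style columns.
    """
    return ["tesseract", image_path, output_base, "--psm", str(psm)]


# File names here are placeholders for illustration.
command = build_tesseract_command("listing.tif", "listing")
print(" ".join(command))
```

The list can be handed to `subprocess.run(command, check=True)` on a
machine where Tesseract is installed. Other kits will have their own
name for the same setting.
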
Many OCR kits will ask you for the language of the text. These are
usually horrible for things like program listings because they use a
heuristic algorithm to try to guess what the next character should be
based on what the previous characters were. Since assembler listings
don't follow the rules of English spelling and grammar, OCR kits like
these just get hopelessly confused. Some kits of this class use a
dictionary to help with word recognition. If your kit allows it, try
adding the assembler mnemonics to the dictionary. That might help. My
guess would be that a plain old dumb-ass OCR kit with no heuristics
might work best for us.
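
If the kit will not take a custom dictionary, a similar effect can be
approximated after the fact. Note that this is a different technique
from the one suggested above: it post-processes the OCR output rather
than guiding the recognition. It is a minimal sketch using Python's
standard `difflib`, and the mnemonic list is a small made-up sample, so
substitute the instruction set of your own assembler.

```python
import difflib

# Illustrative sample only; not the instruction set of any given listing.
MNEMONICS = ["MOV", "MVI", "LXI", "JMP", "CALL", "RET",
             "ADD", "SUB", "PUSH", "POP"]


def snap_to_mnemonic(token, cutoff=0.6):
    """Return the closest known mnemonic, or the token unchanged.

    cutoff is difflib's similarity threshold between 0 and 1.
    """
    matches = difflib.get_close_matches(
        token.upper(), MNEMONICS, n=1, cutoff=cutoff)
    return matches[0] if matches else token


for token in ["PU5H", "JNP", "RET", "LOOP1"]:
    print(token, "->", snap_to_mnemonic(token))
```

This should only be applied to the opcode column, and the results are
suggestions to review, not corrections to trust. Mnemonics that differ
by a single character (MOV and MVI in the sample) can tie, and the
function will then pick one of them arbitrarily.
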
Another problem is that OCR is notoriously bad at recognizing strings of
mixed numbers and letters, like hex. It really has no way of knowing
whether that O should be a letter O or a zero, or whether the 1 is a
one, a capital I or a lowercase L. I don't know of any way around this
particular problem.
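
There may be a partial workaround in one narrow case, offered here only
as an untested idea: when you already know that a field must be hex
(say, the address or object code column of a listing), letters that
cannot occur in hex can be mapped to the digits they resemble. The
sketch below assumes exactly that, and the function name and lookalike
table are invented for the example. It cannot help with confusions
between two valid hex characters, such as B and 8.

```python
import string

HEX_DIGITS = set(string.hexdigits)

# Letters that are not valid hex, mapped to the digit they resemble.
HEX_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def repair_hex_field(text):
    """Return (repaired_text, ok) for a field that should be pure hex.

    ok is False when the field is empty or still contains characters
    that are not hex digits, so it needs a human to look at it.
    """
    repaired = text.translate(HEX_LOOKALIKES)
    ok = bool(repaired) and all(ch in HEX_DIGITS for ch in repaired)
    return repaired, ok


for field in ["3EO1", "lA2F", "3G00"]:
    print(field, "->", repair_hex_field(field))
```

Fields that come back with `ok` set to False are the ones to check
against the paper by eye.
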
I don't know if any of the above will be of any help to you. If you
don't have any luck I would be more than happy to run your scanned
listing through Nuance. Nuance only supports PDF format.

BTW, I tried about 15 or 20 of the free, online PDF to text tools on
mscan.pdf. None of them gave anything but garbage.

Steve B