advice for OCR'ing 9-pin dot matrix BASIC code


Keith M

unread,
Dec 14, 2020, 1:41:00 AM12/14/20
to tesseract-ocr
Hi there,

I've been circling a problem with OCR'ing 90 pages of 30-year-old BASIC code. I've been optimizing my scanning settings and pre-processing, stuck in Photoshop for hours messing around. It's been a long couple of days with this stuff!

I've been through tessdoc, through the FAQ, and through Wikipedia reading about morphological operators. I installed 5.0.0-alpha-833-ga06c from the PPAs.

I'm getting OK results so far, but I need to process more images and my workflow is tedious.

Sample image here

The sample is a 150 dpi image extracted via pdftoppm -png from a 1200 dpi scan. While it's not super clear to me why, higher-resolution scans are producing WORSE OCR results.

TL;DR: What's the ideal Tesseract configuration for my application? Should I disable the dictionary? Can I add BASIC commands and keywords to eng.user-words, per the manual's "CONFIG FILES AND AUGMENTING WITH USER DATA" section?

I could use some help, thanks!

Keith

Alex Santos

unread,
Jan 1, 2021, 3:18:11 PM1/1/21
to tesseract-ocr
Hi Keith,

Interesting project.

On why hi-res scans would yield poorer results: as you know, a dot matrix printer rendered text and characters as a series of rows and columns (a grid of circles). As you scan these printouts at higher resolutions, those otherwise indistinct individual dots become isolated from one another and begin to appear as separate objects to the OCR engine.

To overcome this you might need to preprocess the scanned images with some image editing software to find a sweet spot. I would probably start with a high-contrast, medium-resolution scan, then add some Gaussian blur to effectively marry the dots into continuous shapes rather than individual dots, and then use a levels tool to tighten the soft blur around the edges and bring back some sharpness. You really need to experiment with a 150 ppi scan and from there explore a sequence of image manipulations that essentially eliminate the gaps (the white) between the dots. Blending the dots together keeps the OCR engine from interpreting those gaps as white space; a continuous path of black lets it know what is ink and what is not.
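If it helps, here's roughly what that would look like in Python with OpenCV (just a sketch, not tested against your scans; the filename, blur radius, and level endpoints are all made-up values you'd have to tune):

import cv2
import numpy as np

# Load the medium-resolution scan as grayscale (hypothetical filename).
img = cv2.imread("scan_150dpi.png", cv2.IMREAD_GRAYSCALE)

# Gaussian blur to marry neighbouring printer dots into continuous strokes.
blurred = cv2.GaussianBlur(img, (5, 5), 0)

# Levels: push everything darker than `lo` to black and everything
# lighter than `hi` to white, stretching the range in between. This
# tightens the soft edges the blur introduced.
lo, hi = 100, 200
stretched = np.clip((blurred.astype(np.float32) - lo) * 255.0 / (hi - lo), 0, 255)
cv2.imwrite("preprocessed.png", stretched.astype(np.uint8))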

Googling "ocr dot matrix prints" without quotes yielded some interesting results. Some have explored this more deeply than I. 

I attached a zip file with two tests based on the sample image you provided. I didn't get a good chance to make all the comparisons, but I created a PNG with some Gaussian blur and then contracted the levels, which gave me what appear to be decent results. I also scaled the processed image to 200% and saved it as a TIF.

I used the following command to generate the sidecar text file and PDFs.

ocrmypdf -v --output-type pdfa-3 --image-dpi 300 --optimize 0 --jpeg-quality 100 --pdfa-image-compression lossless --sidecar text.txt /Users/admin/Desktop/FNBBS/FNBBS-02_crop\ copy.png test.pdf
FNBBS.zip

Ger Hobbelt

unread,
Jan 1, 2021, 4:01:05 PM1/1/21
to tesser...@googlegroups.com
Another technique specifically for dot-matrix might be to blend multiple copies of the scan at small offsets. The idea here is that back in the old days of dot matrix, a few DTP applications had printing modes which would print dot patterns several times on the same line, but ever so slightly offset from one another to 'fill the character up'. The poor man's way to print BOLD characters that way was to print the same line multiple times at slight offsets.

Hence to simulate this sort of 'gap closing', one could scan at higher resolution, then offset the image multiple times in various directions by "half a printer dot" (or less) and blend the copies using a blending mode like Photoshop Darken.

If # is a *pixel* in the image scan (which should be much smaller than a dot matrix printer dot), possible patterns to try to fatten and fill the print are:

XXX            X X
X#X     or      #
XXX            X X

where X is the # pixel, now offset in various directions. The first example would thus take 8 copies and the second 4.
Combine this with other image processing to get a better filled-out character.
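As a quick sketch of that blend in Python (assumptions: a grayscale scan and a 1-pixel shift; np.roll wraps at the borders, which a real implementation would want to pad instead):

import cv2
import numpy as np

img = cv2.imread("scan_1200dpi.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file

# Shift the image one pixel in each of the 8 directions and keep the
# darkest value at every position -- the same thing Photoshop's Darken
# blending mode does. The shift should be well under one printer dot.
result = img.copy()
for dx in (-1, 0, 1):
    for dy in (-1, 0, 1):
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        result = np.minimum(result, shifted)

cv2.imwrite("fattened.png", result)

(Taking the per-pixel minimum over a 3x3 neighbourhood like this is, in effect, a grayscale erosion of the image, i.e. a 'fattening' of the dark dots.)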

If that doesn't work, there's also the option of training on the dot matrix font, but I haven't done that sort of thing yet, so can't help you there. Search this mailing list for multiple questions and answers about custom font training to improve results.


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------



shree

unread,
Jan 1, 2021, 10:03:37 PM1/1/21
to tesseract-ocr
Please see the old thread at https://groups.google.com/g/tesseract-ocr/c/ApM_TqwV7aE/m/z5jZV0I0AgAJ for a link to a completed dot matrix project.

Keith M

unread,
Jan 1, 2021, 11:32:40 PM1/1/21
to tesseract-ocr
Ger,

Thanks for taking the time to reply.

On 1/1/2021 4:00 PM, Ger Hobbelt wrote:
Another technique specifically for dot-matrix might be to blend multiple copies of the scan at small offsets. The idea here is that back in the old days of dot matrix, a few DTP applications had printing modes which would print dot patterns several times on the same line, but ever so slightly offset from one another to 'fill the character up'. The poor man's way to print BOLD characters that way was to print the same line multiple times at slight offsets.


The printer's manual actually details much of this internal working. Between the schematics, BOM lists, theory-of-operation descriptions, etc., I had forgotten the level of detail we used to get when we bought a multi-hundred-dollar product.

Hence to simulate this sort of 'gap closing', one could scan at higher resolution, then offset the image multiple times in various directions by "half a printer dot" (or less) and blend the copies using a blending mode like Photoshop Darken.

I *believe* that morphological dilation is similar to what you're talking about here.

"Dilation [...] adds a layer of pixels to both the inner and outer boundaries of regions."

from

https://www.cs.auckland.ac.nz/courses/compsci773s1c/lectures/ImageProcessing-html/topic4.htm
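For reference, the OpenCV version of that is nearly a one-liner (a sketch; assuming black ink on a white background, so "dilating" the ink means eroding the grayscale image, and the filename is made up):

import cv2
import numpy as np

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# cv2.erode takes the local minimum, so with dark ink on a light
# background it grows the ink regions by one layer of pixels.
kernel = np.ones((3, 3), np.uint8)
fattened = cv2.erode(img, kernel, iterations=1)
cv2.imwrite("dilated_ink.png", fattened)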

I tried a few different techniques similar to what you've mentioned. While conceptually it should help, practically speaking I saw only minimal improvement.

While it's still a work in progress, I'm describing my current best efforts/results in the other reply here.

Thanks,
Keith

Keith M

unread,
Jan 1, 2021, 11:33:37 PM1/1/21
to tesseract-ocr
Alex,

Thanks for replying, appreciate the time. Especially the command line with various options specified!

I've spent hours and hours googling, both before posting here and afterwards. There's SOME information out there, but no real smoking gun: most of the ideas in the first 10 pages of Google results have not panned out.

https://github.com/ameera3/OCR_Expiration_Date

looks pretty interesting, but it felt overly complicated to me.

other responses in-line


On 1/1/2021 3:07 PM, Alex Santos wrote:
To overcome [the fact that the dots when scanned in hi-res are individual] you might need to preprocess the scanned images with some image editing software to find a sweet spot. I would probably start by doing a high contrast medium resolution scan, then add some Gaussian blur to effectively marry the dots into a continuous shape, rather than individual dots, and then use some leveling tool to tighten the soft blur around the edges.

Spent a few hours messing around with this.

https://jeffreymorgan.io/articles/improve-dot-matrix-ocr-performance-tutorial/

I get the idea, and if I read you right, you're saying basically the same thing. However, it really didn't pan out: yes, the characters look more like traditional text, but there was no dramatic improvement in recognition. Part of the problem is that there are so many variables that it's hard to isolate minor improvements.

I attached a zip file with two tests based on the sample image you provided. I didn't get a good chance to make all the comparisons but I created a PNG with some gaussian blur and then contracting the levels gave me what appear to be decent results. I also scaled the processed image to 200% and saved it as a TIF.

Thanks for doing this.

Here's my current state process that is yielding very good results:

* Use Windows scanning software (Linux works too, but it's more cumbersome) with a Fujitsu iX500 scanner: black and white, darkness adjusted to 75%, 1200 dpi.

* Use pdftoppm with -gray option to spit out a *.pgm file at full resolution.

* Use unpaper (https://github.com/unpaper/unpaper) with default options to pre-process the scanned image. This really helps!

* Convert to *.png and resize to 50%. I'm doing this because AWS Textract can't take such a large image: 8.5x11 at 1200 dpi is 10,200 x 13,200 pixels!

* Use AWS Textract (https://aws.amazon.com/textract/) to perform the OCR. I can't recommend this service enough. It's practically free and super easy to use: about 10 lines of Python to call from Linux (sketch below). You get feedback per line/word/block/page with confidence values; my average confidence value is 98%+.
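For anyone curious, the Python in question is roughly this (a sketch; the filename is made up, and this uses the synchronous detect_document_text call, which as I understand it has a payload size limit, hence the resize step above):

import boto3

client = boto3.client("textract")

with open("page-01.png", "rb") as f:
    response = client.detect_document_text(Document={"Bytes": f.read()})

# Each block carries its type, text, confidence, and geometry.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(f'{block["Confidence"]:6.2f}  {block["Text"]}')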

I'm going to write up a more comprehensive document, but here are some basic results from LIMITED testing, compared using Text Compare in Beyond Compare 4:

Amazon Textract: Only 1 wrong character in 1,020, plus two other smaller, excusable defects: an extra : detected, and a one-letter mistake. This simply works out of the box with zero configuration.

Tesseract: With only the allowed characters whitelisted and BASIC keywords added to eng.user-words, I definitely can't get under about 12 lines' worth of mistakes: approximately 80% accuracy on this one test document. I feel like there's room for optimization, but I'm not sure I'm going to chase it.

Alex test1: 25 different lines (not great)
Alex test2: 18 different lines (A little worse than my best tesseract run with any configuration)

Abbyy FineReader 15: Pretty horrible results

Abbyy Cloud OCR: Better than the application, but can't easily evaluate results.

ReadIris 17: Pretty horrible results


Without sounding too much like an Amazon commercial (no relation beyond happy customer here), Amazon Textract has a feature called A2I which routes low-confidence recognition lines to human review, for example via Amazon Mechanical Turk. I'm not using A2I, but I *am* going to manually route my results through MTurk. It's a couple of extra manual steps, and I have to pay for this human review (maybe $50 by the time I'm done), but I think it's neat, and I like learning about new technology.

Hope the group finds this info useful.
Thanks,

Keith

Alex Santos

unread,
Jan 4, 2021, 7:42:09 PM1/4/21
to tesser...@googlegroups.com, kmo...@gmail.com
Hi Keith

I read your reply with great interest because your case appears to be rather unique in that you are trying to OCR lines and lines of dot matrix characters, and it's an interesting project to translate those old BASIC listings to a PDF or a txt file.

So I followed your links and your adventure, and I am fascinated by what you found to be the most helpful: https://aws.amazon.com/textract/. If it is the most frictionless and most effective option for your circumstances, then I am delighted that you found a solution that fits your OCR needs. As I understand it, that's what you eventually chose to build your process around.

If you eventually complete your OCR project, would you be willing to upload a copy to the Internet Archive (archive.org)? If it's an inconvenience, I will be happy to do so on your behalf.

If you need more help in any way, please let me know, and thank you for posting the question and for the interesting conversation.

Kindest regards
—Alex


Keith M

unread,
Jan 4, 2021, 10:56:44 PM1/4/21
to Alex Santos, tesser...@googlegroups.com
Hello again Alex,

Thanks for the conversation.

I have someone who has offered to modify a similar, but slightly different, font for me. This would potentially allow some optimization of recognition: for instance, Abbyy FineReader accepts a font file, and providing a matching one is supposed to increase the accuracy. I have half-entertained the mental exercise of doing simple graphic comparisons. I'll be interested to see exactly how closely the output from, say, Microsoft Word with the font selected matches the physical printout. Obviously the Word screenshot will be much sharper, but the same dots are in the same locations relative to each other, and I'm sure I could get the size close.

I have chosen AWS Textract for the initial pass; however, I think combining multiple tools may yield better results. The overall average recognition confidence is 88% across one full document, and I have multiple docs. These numbers are tricky, because I think I can easily throw out a portion of the results, which would raise the average. I will say that a high confidence number so far DOES correlate with correctness. Currently 75% of the document has an accuracy of over 85%.

Many of the AWS errors are due to it truncating a line too early: it leaves off a closing parenthesis or double quote.

I have already played with Mechanical Turk since my last message. I am routing low-confidence results through MTurk: humans check the OCR results against an image of the line and fix them. This is working, but I'm really not leveraging the workers ideally yet.

So my strategy may be multifaceted: collect the AWS results, which also include x/y coordinates for each line, then run the sub-images through Tesseract (and heck, through Abbyy Cloud OCR too), and then have the MTurk workers review. Surely if I get agreement across multiple platforms, I have to be close.
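Something like this, conceptually (the line lists below are hypothetical stand-ins for the per-engine output):

# Route a line to human review only when the engines disagree.
textract_lines  = ['10 PRINT "HELLO"', '20 GOTO 10']   # hypothetical data
tesseract_lines = ['10 PRINT "HELLO"', '2O GOTO 1O']
abbyy_lines     = ['10 PRINT "HELLO"', '20 GOTO 10']

for i, lines in enumerate(zip(textract_lines, tesseract_lines, abbyy_lines)):
    if len(set(lines)) > 1:
        print(f"line {i}: engines disagree, route to MTurk: {lines}")
    else:
        print(f"line {i}: agreed: {lines[0]}")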

Regarding archive.org, I'm happy to submit the software, but I'm not sure why they'd want it. I'm a fan of the site and donate every year, so I'm happy to send it there. But would they want it?

I will type up a blog post detailing some of this, because there's no
sense in NOT writing this down after all the research.

Thanks,

Keith

P.S. Yes, simply typing in the 100-page document, or paying someone to do so, would be faster and cheaper. But there's no reason, given that it's 2021, that this shouldn't be a computer-solvable problem.



Ben Bongalon

unread,
Jan 5, 2021, 12:28:51 PM1/5/21
to tesseract-ocr
Hi Keith,

Interesting project. Having looked at the sample OCR results that Alex posted, I think the poor recognition from Tesseract is more likely due to the underlying language model (I'm assuming you used 'eng'?). For example, the "test1" OCR results correctly transcribe the variables "mainlen", "mainmenutext", etc., and do a reasonable job with the BASIC keywords (with some mistakes, such as 'WENL!' for 'WEND'). Where it's failing is in recognizing characters such as '$', especially when juxtaposed with '('.

Given this, I'm not sure how much improvement a better font would buy you. Have you tried training with more data containing BASIC syntax similar to your document? The standard Tesseract language models were trained on corpora (Wiki articles? not sure) with very different character frequencies and patterns from BASIC programs.

rgds,
Ben

Keith M

unread,
Jan 5, 2021, 4:53:16 PM1/5/21
to tesser...@googlegroups.com, Ben Bongalon
Ben,

Thanks for the interest and chiming in.

Yes, I used Tesseract 5.0 with eng, BASIC command keywords in eng.user-words, a whitelist of only the allowed characters, and loading/not loading the user dictionary/frequency lists.

I haven't tried training yet. I could probably find, and even generate (assuming new ink cartridges arrive in the promised condition), new sets of synthetic data (right word choice here?). Is this (https://github.com/tesseract-ocr/tesstrain) the correct resource to learn how to do this? And is it supported for version 5? Does 5 offer advantages over 4 in this respect? Is it essentially creating ground-truth TIF/PNG files, associating the correct transcriptions in .gt.txt files, and then running make training? And then referencing the new language via -l when calling tesseract?

Something pretty cool has occurred to me: I have a large number of lines (at least thousands) of high-confidence AWS Textract results and the associated PNGs. I could actually use one OCR system to train another!
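A sketch of what that could look like, assuming tesstrain's convention of pairing each line image with a .gt.txt file of the same base name (the CSV layout here is made up):

import csv

MIN_CONF = 98.0  # only trust high-confidence Textract lines

# Hypothetical spreadsheet: one row per line image, (filename, confidence, text).
with open("textract_lines.csv", newline="") as f:
    for png, conf, text in csv.reader(f):
        if float(conf) >= MIN_CONF:
            # e.g. line0042.png -> line0042.gt.txt
            with open(png.rsplit(".", 1)[0] + ".gt.txt", "w") as gt:
                gt.write(text + "\n")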

It does make me wonder how AWS gets such good results out of the box. They definitely have something trained/tailored to scanned dot-matrix printouts. And of course I don't tell it the language (English, BASIC, or otherwise), the type of document, the DPI/resolution, the font, or anything... I know I sound like a broken record. Current numbers include stats like 44% of the 100-page document at 95% or better confidence. Those lines could still be wrong, but they look pretty decent on a quick scan.

I must admit this is a pretty cool problem space.

Thanks,

Keith

Ben Bongalon

unread,
Jan 5, 2021, 11:56:17 PM1/5/21
to Keith M, tesser...@googlegroups.com
The link you cited prescribes a method where you must provide an image file for each line of text
in your groundtruth data. So if you print out pages of sample BASIC programs on
your dot-matrix printer, you would then: 1. scan the pages, 2. crop each text line,
3. save each cropped image into a separate file, 4. create the corresponding gt text.

I'm guessing many people would instead use tesstrain.sh (tutorial), which automates that process. If you go through the tesstrain tutorial, you'll see the series of low-level commands that get called in the console output. If you go this route, you need to force the text2image program to render TIFs in a font resembling your printer's output.

AFAIK v5 and v4 are functionally equivalent; the developers refactored v4 in a way that made the API incompatible, so they bumped the major version.

good luck!

kmongm

unread,
Jan 6, 2021, 12:35:33 AM1/6/21
to Ben Bongalon, tesser...@googlegroups.com
Thanks much for the links. 

Here's the best part of doing the first one: when I ran my first program through AWS, I got a ton of useful data back, which I'm parsing with Python and saving into files. Beyond the confidence data, I get x1/y1, x2/y2 pairs for a box surrounding each line of text (and each word as well).

I took those coordinates and fed them into ImageMagick's convert -crop command and generated one .png per line of text, so ~4000 .pngs. I also have a spreadsheet of filenames and lines of transcribed text. Some of them are wrong, but I've got thousands of correct lines.
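In Python terms (rather than ImageMagick), that cropping step looks roughly like this sketch; Textract's bounding boxes come back as ratios of the page dimensions, and the filenames here are made up:

import boto3
import cv2

img = cv2.imread("page-01.png")
h, w = img.shape[:2]

client = boto3.client("textract")
with open("page-01.png", "rb") as f:
    blocks = client.detect_document_text(Document={"Bytes": f.read()})["Blocks"]

lines = (b for b in blocks if b["BlockType"] == "LINE")
for n, b in enumerate(lines):
    box = b["Geometry"]["BoundingBox"]          # Left/Top/Width/Height ratios
    x0, y0 = int(box["Left"] * w), int(box["Top"] * h)
    x1, y1 = x0 + int(box["Width"] * w), y0 + int(box["Height"] * h)
    cv2.imwrite(f"line{n:04d}.png", img[y0:y1, x0:x1])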

This becomes excellent feeder material for training and it already exists!

I do have a custom font being built that matches this printer, so I can go that route too.

I used these pairs (the line-of-text image .png and the OCR guess) to build a small HTML interface that Mechanical Turk displays to workers, who correct any differences. You feed the same job to multiple workers to help eliminate human error. I've only done proof-of-concept tests, but this clearly works.

Thanks much for the pointers to resources. I'll follow up with the group if I see more success with the training. I'll also make my models publicly available, so going forward I can help the next person.

Keith

Ben Bongalon

unread,
Jan 6, 2021, 1:15:06 AM1/6/21
to kmongm, tesser...@googlegroups.com
Sounds cool, I look forward to your update Keith.
/Ben

Graham Toal

unread,
Mar 21, 2024, 12:43:23 PM3/21/24
to tesseract-ocr
I believe that for fixed-pitch listings it is preferable to segment the page into characters geometrically, rather than by locating the bounding boxes of individual glyphs as Tesseract does. Once a character is confined to a grid rectangle, it is easier to recognise. I've been working on code to identify line spacing and text width, and it works acceptably well for examples like this one:
FNBBS-02_crop@125,250@40,10.png

Once you have extracted the individual letters using the inferred grid, you can perform an initial recognition; then, having found a few examples of each letter, you can either merge them into a single exemplar of what the character should look like and do a bitmap comparison, or use them to train a neural network to recognise the character, e.g.

col006.png 

e.png
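A rough sketch of the pitch inference in Python (the filename and search range are assumptions, and real code would also search the horizontal phase of the grid, not just its period):

import cv2
import numpy as np

img = cv2.imread("line0042.png", cv2.IMREAD_GRAYSCALE)
ink = 255 - img                      # make ink the high values

profile = ink.sum(axis=0)            # column-wise ink density

# Try candidate pitches; the best one places its cut points in the
# emptiest columns of the projection profile.
best_pitch, best_score = None, np.inf
for pitch in np.arange(8.0, 16.0, 0.1):
    cuts = np.arange(0, len(profile), pitch).astype(int)
    score = profile[cuts].sum() / len(cuts)  # mean ink under the cuts
    if score < best_score:
        best_pitch, best_score = pitch, score

# Slice the line into fixed-width character cells for recognition.
cells = [img[:, int(i * best_pitch):int((i + 1) * best_pitch)]
         for i in range(int(len(profile) / best_pitch))]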

Tesseract doesn't work this way unfortunately but it would be an addition worth considering...

Graham