Dot-matrix woes


Slartybartfast

Oct 24, 2023, 12:40:20 PM
to tesseract-ocr
Hi
I am a new tesseract user, and I'm really struggling to get it to produce any kind of sensible results, especially with numerical text. I have some text that looks like this:
example_input.jpg
I've read the documentation, and looked through the parameter list, and I added the following to the command line:
--psm 6
-c preserve_interword_spaces=1
-c textord_dotmatrix_gap=6
-c classify_bln_numeric_mode=1
-c rej_alphas_in_number_perm=1
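
For anyone wanting to reproduce this, the options above can be assembled into a complete invocation. The following is only a minimal sketch: the file name `example_input.png` and the helper names are hypothetical, and it assumes the `tesseract` executable is on the PATH. Using `stdout` as the output base makes tesseract print the recognised text instead of writing a file.

```python
import subprocess
from typing import Dict, List, Optional

# The -c variables listed in the post, unchanged.
DOT_MATRIX_CONFIG: Dict[str, int] = {
    "preserve_interword_spaces": 1,
    "textord_dotmatrix_gap": 6,
    "classify_bln_numeric_mode": 1,
    "rej_alphas_in_number_perm": 1,
}


def build_tesseract_command(
    image_path: str,
    psm: int = 6,
    config: Optional[Dict[str, int]] = None,
    output_base: str = "stdout",
) -> List[str]:
    """Return the argument list: tesseract IMAGE OUTPUTBASE --psm N -c k=v ..."""
    cmd = ["tesseract", image_path, output_base, "--psm", str(psm)]
    for name, value in (config or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_tesseract(image_path: str) -> str:
    """Run tesseract with the dot-matrix options and return the recognised text."""
    cmd = build_tesseract_command(image_path, config=DOT_MATRIX_CONFIG)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; only prints the command, does not run it.
    print(" ".join(build_tesseract_command("example_input.png",
                                           config=DOT_MATRIX_CONFIG)))
```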

But I just get garbage out:

Oo -250 6 3a
190 & So
190 6 -100
1 $1290 6 ~140
1 $130 6 ~150

I've tried all sorts of additional image processing to improve the look of the text, but none of it works. In fact, this is the best output I've seen; it's usually worse. I'm really hoping someone who has worked with dot-matrix input can offer some magic incantation to make tesseract come to its senses. Thanks.

Slartybartfast

Nov 1, 2023, 4:30:51 PM
to tesseract-ocr
Doesn't anybody have any ideas?  :-(

La Monte H. P. Yarroll

Nov 2, 2023, 8:35:59 AM
to tesser...@googlegroups.com
I had a little success applying 2.5 pixels of blur and then thresholding at 217-255. FWIW, I used gimp for the preprocessing. Here's what I got after just a few minutes:
a i @)

-230 & 50
90 6 50
90 6 -100

130 6
130 6

~100
-130

I don't know what happened to the first column or why the last 2 lines got split the way they did.



La Monte H. P. Yarroll

Nov 2, 2023, 8:43:12 AM
to tesser...@googlegroups.com
I added more white space around the target text by scaling the canvas to 500 pixels wide, and then scaled up the whole image by a factor of 2.

-230 6 5O

90 6 50

90 6 -100
130 6 -100
130 6 -150
unnamed_blurred_2.5_threshold_217_resize_500_scale_2x.jpg

Slartybartfast

Nov 2, 2023, 6:36:14 PM
to tesseract-ocr
Thank you! The original has much more border around it; I just cropped it for easier viewing here. I had already done a little pre-processing, but it looks like I need to do more. Interesting that scaling up improved things. According to one analysis I read, accuracy depends on character height, and by that measure I already had the optimum character height, but maybe things have changed. The original scan was done at 300 dpi. I'll try 600.

Incidentally ... I got so frustrated I wrote my own OCR program today. Only took me a few hours. Much more accurate than Tesseract, though working with fixed-width fonts makes life a lot easier!! Just divide the image up into a grid, and pattern match each "cell". As I was only interested in the numbers, I only had 16 (hex digits) to match against.
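
The grid-and-match idea can be illustrated roughly as follows. This is not the program described above, only a minimal sketch under stated assumptions: the image is already binarised and aligned to a fixed cell grid, the 3x5 glyphs in `TEMPLATES` are toy placeholders (a real set would hold one bitmap per hex digit at the scan's actual cell size), and the "pattern match" is taken to be the smallest count of differing pixels.

```python
from typing import Dict, List, Sequence, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (ink) and '.' (paper)

# Toy 3x5 templates, purely illustrative.
TEMPLATES: Dict[str, Glyph] = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    " ": ("...", "...", "...", "...", "..."),
}


def render(text: str, templates: Dict[str, Glyph]) -> List[str]:
    """Build a one-line test image by placing template glyphs side by side."""
    height = len(next(iter(templates.values())))
    return ["".join(templates[ch][y] for ch in text) for y in range(height)]


def split_into_cells(image: Sequence[str], cell_w: int, cell_h: int) -> List[List[Glyph]]:
    """Cut the image into a grid of fixed-size cells, row by row."""
    rows, cols = len(image) // cell_h, len(image[0]) // cell_w
    return [
        [
            tuple(image[r * cell_h + y][c * cell_w:(c + 1) * cell_w]
                  for y in range(cell_h))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognise(image: Sequence[str], templates: Dict[str, Glyph],
              cell_w: int, cell_h: int) -> List[str]:
    """Label every cell with the template that differs in the fewest pixels."""
    lines = []
    for row in split_into_cells(image, cell_w, cell_h):
        lines.append("".join(
            min(templates, key=lambda ch: hamming(cell, templates[ch]))
            for cell in row
        ))
    return lines
```

Because every cell is compared against a small, closed set of glyphs, a few stray or missing dots do not change the nearest match, which is presumably why this works well for fixed-width numeric printouts.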

Cheers

La Monte H. P. Yarroll

Nov 3, 2023, 9:28:49 AM
to tesser...@googlegroups.com
I think the biggest improvement came from the blur followed by the right thresholding. That improves the division of the page into separate letters.

The added border allowed tesseract to pick up the right-hand side of the numbers better. I was hoping to pick up the 1's down the left-hand side, but that didn't work.

Scaling up is a heuristic trick I've used in the past, and it helped here.


Des Bw

Nov 5, 2023, 5:26:22 AM
to tesseract-ocr

Dear piggy, can you elaborate on what you did with the images, please?
Which tools did you use, and what modifications did you make?
I was trying to replicate what you did, but I am not getting the results you got.
Is scaling up the image the same thing as increasing the DPI of the image?
Could you look at the images I have attached and see what I could improve so that the OCR works better?

This is what I am getting with the attached image: 

0 -230 & 50
1 Q0 & S0

1 90 &6 —100
1 130 & —-100
1 130 &6 —-150
eg.tif

La Monte H. P. Yarroll

Nov 6, 2023, 5:06:59 PM
to tesser...@googlegroups.com
All of the transformations were applied with gimp 2.10.30. I don't think the tools are going to be much different for any recent version.

Blur is Filters -> Blur -> Gaussian Blur. Set Size X and Size Y to 2.50.

Colors -> Threshold... Set the left number to 217. The right number should be 255 already. You might be able to get better 5's and 3's by playing with these numbers a little bit.

We now have a binary image, which is generally best for OCR performance.

Next is Image -> Canvas size... Lock the Width:Height ratio with the rectangular chain thingy, set Height to 500, click the Center button and Resize.

Image -> Scale Image... Lock the Width:Height ratio by clicking the square chain thingy. Change the Height to 1000 pixels. The default interpolation of Cubic is fine. Hit "Scale".

Now File -> Export as... and save it as "fixedup.png". Don't use jpeg for OCR if you can possibly avoid it.


La Monte H. P. Yarroll

Nov 6, 2023, 5:26:12 PM
to tesser...@googlegroups.com
Unfortunately, gimp is an interactive application, so it is difficult to make it part of a cleanup pipeline. It can be done, and there is a tutorial on doing exactly that: https://www.gimp.org/tutorials/Automate_Editing_in_GIMP/

Once I work out the steps to clean my images, I usually code something up using the imagemagick suite. If it is exotic enough that I need to write C code for it, I generally use the Leptonica library.
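
As a rough illustration of what such a pipeline might look like, the GIMP steps described earlier in this thread (blur, threshold, enlarge the canvas, scale up 2x, save as PNG) can be approximated with a single ImageMagick command. This is a sketch under several assumptions, not the exact commands used for the results above: the function name and file names are hypothetical, the 500x500 canvas and white padding are guesses (the thread gives only one dimension), ImageMagick's blur and threshold may not match GIMP's pixel for pixel, and the executable is `magick` in ImageMagick 7 but `convert` in version 6.

```python
from typing import List, Tuple


def build_cleanup_command(
    src: str,
    dst: str,
    sigma: float = 2.5,                   # Gaussian blur size from the thread
    threshold_level: int = 217,           # lower threshold bound on a 0-255 scale
    canvas: Tuple[int, int] = (500, 500), # assumed canvas size in pixels
    scale_percent: int = 200,             # "scaled up by a factor of 2"
    program: str = "magick",              # use "convert" for ImageMagick 6
) -> List[str]:
    """Return an ImageMagick argument list approximating the GIMP clean-up."""
    # -threshold turns pixels strictly above the value white, so aim half a
    # level below 217 to keep 217 itself white, as in a 217-255 range.
    threshold_pct = (threshold_level - 0.5) / 255 * 100
    width, height = canvas
    return [
        program, src,
        "-colorspace", "Gray",
        "-gaussian-blur", f"0x{sigma}",
        "-threshold", f"{threshold_pct:.2f}%",
        # Pad with white around the centred text (assumed background colour).
        "-background", "white", "-gravity", "center",
        "-extent", f"{width}x{height}",
        "-filter", "Cubic", "-resize", f"{scale_percent}%",
        dst,
    ]


if __name__ == "__main__":
    # Hypothetical file names; prints the command rather than running it.
    print(" ".join(build_cleanup_command("example_input.jpg", "fixedup.png")))
```

The argument list can be passed to `subprocess.run` once the parameters have been tuned on a few sample scans; writing PNG output keeps the thresholded image free of JPEG artefacts.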