Dot Matrix Fonts and Tesseract's Connected Component Analysis


ameera...@gmail.com

Mar 22, 2019, 3:11:11 AM
to tesseract-ocr

I am trying to fine-tune Tesseract for dot-matrix fonts such as the one in the picture below. When the dots are closely spaced and touch, Tesseract can more or less handle the font after some fine-tuning and image processing. However, when the dots do not touch, as in the picture below, Tesseract struggles.


I read in "An Overview of the Tesseract OCR Engine" that the first step in Tesseract's processing pipeline is a connected component analysis (second paragraph of Section 2). Since the letters in a dot-matrix font do not form connected components, I am wondering whether this analysis is one reason that Tesseract struggles on the image below.


Is there a command to see how Tesseract performs connected component analysis on this image?  
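
To make the concern concrete, here is a minimal sketch of what a connected component analysis sees on a solid stroke versus non-touching dots. It is a plain flood fill in Python, not Tesseract's implementation, and the two small glyph grids are made up for illustration.

```python
from collections import deque

def count_components(grid):
    """Count 8-connected groups of '#' pixels in a list of equal-length strings."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            # New unvisited ink pixel: flood-fill everything connected to it.
            count += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return count

# Hypothetical glyphs: a solid "L" and a dot-matrix "L" whose dots do not touch.
SOLID_L = [
    "#....",
    "#....",
    "#....",
    "#....",
    "#####",
]
DOTTED_L = [
    "#........",
    ".........",
    "#........",
    ".........",
    "#........",
    ".........",
    "#.#.#.#.#",
]

print(count_components(SOLID_L))   # 1
print(count_components(DOTTED_L))  # 8
```

The solid letter is one component, while the dotted letter breaks into one component per dot, so any later step that assumes one blob per character (or per stroke) starts from a very different input.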


ex_20.jpg



Shree Devi Kumar

Mar 22, 2019, 5:22:37 PM
to tesser...@googlegroups.com
> I read in An Overview of the Tesseract OCR Engine that the first step in Tesseract's processing pipeline is a connected component analysis (second paragraph of Section 2).

That applies to base/legacy Tesseract (3.0x). Tesseract 4 is neural-net (LSTM) based, though it still supports the legacy models.

I tried some fine-tuning using dot-matrix fonts. The image you gave is still not recognized correctly as a whole, but if it is split into separate lines the results are better, though only with --psm 8 (single word).

/home/ubuntu/TEST/ex_20_1 dotslayer
LOT#S18FO70

/home/ubuntu/TEST/ex_20_2 dotslayer
EXP:09/2020

The traineddata file is attached.
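
For anyone who wants to reproduce a run like the one above, here is a small Python sketch that builds the Tesseract command line. It assumes the attached dotslayer.traineddata has been copied into the tessdata directory (or that `tessdata_dir` points at its folder); the helper name `tesseract_command` is made up for this example and is not necessarily how the results above were produced.

```python
import shlex

def tesseract_command(image_path, lang="dotslayer", psm=8, dpi=None,
                      tessdata_dir=None):
    """Build the argv list for one Tesseract run that prints text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if tessdata_dir is not None:
        # Only needed if the traineddata is not in the default tessdata folder.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd

# Print the commands for the two attached line images.
for name in ("ex_20_1.png", "ex_20_2.png"):
    print(shlex.join(tesseract_command(name)))
```

To actually run a command, pass the list to `subprocess.run(cmd, capture_output=True, text=True)` and read `.stdout`.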


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fbbc3452-62f5-4c34-bd9c-72fa3a52c97c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
Bhajans - Kirtans - Aarti @ http://bhajans.ramparivar.com
ex_20_1.png
ex_20_2.png
dotslayer.traineddata

ameera...@gmail.com

Mar 22, 2019, 6:03:14 PM
to tesseract-ocr
Hi Shree,

Thanks for sending these images and the traineddata file.  I confirmed that they worked.  Would you please tell me a little bit more about what kind of image processing you used to make the .png images and how you created your traineddata file using fine-tuning?

Thank you,
Ameera

Shree Devi Kumar

Mar 22, 2019, 10:13:40 PM
to tesser...@googlegroups.com
Hi Ameera,

Please do check with other images too, as I tested with only the one image that you sent.

I had initially tried fine-tuning (the "impact" and "plus" approaches), but those were not giving accurate results for the second line.

Then I tried replacing the top layer, using a new training text all in UPPER case, with many lines in the same format as the image you sent. I used just a couple of fonts that looked similar to the image.
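
As a rough sketch of what a "replace the top layer" run can look like, the Python helper below assembles an `lstmtraining` command in the style of the Tesseract 4 training tutorial. All file paths are hypothetical placeholders, the `--append_index 5` and `[Lfx256 O1c111]` values are the tutorial's example values rather than the settings used here, and the iteration count is only an illustrative budget.

```python
def replace_top_layer_command(continue_from, traineddata, train_listfile,
                              model_output, net_spec, append_index,
                              max_iterations):
    """Build an lstmtraining argv that cuts the network and retrains the top."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,      # LSTM model extracted from a base traineddata
        "--traineddata", traineddata,          # starter traineddata with the new unicharset
        "--append_index", str(append_index),   # keep layers below this index
        "--net_spec", net_spec,                # spec for the new layers on top
        "--model_output", model_output,        # checkpoint prefix
        "--train_listfile", train_listfile,    # list of .lstmf training files
        "--max_iterations", str(max_iterations),
    ]

# Hypothetical paths; adjust to your own training directory layout.
cmd = replace_top_layer_command(
    continue_from="eng.lstm",
    traineddata="dotslayer/dotslayer.traineddata",
    train_listfile="dotslayer.training_files.txt",
    model_output="output/dotslayer",
    net_spec="[Lfx256 O1c111]",   # tutorial example value, an assumption here
    append_index=5,               # tutorial example value, an assumption here
    max_iterations=2000,          # illustrative budget
)
print(" ".join(cmd))
```

The net spec and append index have to match the base model and the new unicharset, so treat them as things to check against the training documentation rather than copy as-is.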

Regarding the image, I tested different versions by changing it interactively in IrfanView. Mainly: straighten the image, convert it to black and white, resize it to half, and then half again. I haven't tested the new traineddata with the original image.
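
The steps above were done by hand in IrfanView, but the black-and-white and "half, then half again" steps can be sketched in a few lines of plain Python to show what they do to separated dots. This is only an illustration on a made-up 8x8 patch: straightening is left out, and the threshold and block averaging are assumptions, not IrfanView's actual algorithms.

```python
def binarize(gray, threshold=128):
    """Map grayscale rows (0-255 ints) to 0 (ink) or 255 (paper)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]

def halve(img):
    """Shrink by 2x in each direction by averaging every 2x2 block."""
    out = []
    for r in range(0, len(img) - 1, 2):
        row = []
        for c in range(0, len(img[0]) - 1, 2):
            total = img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]
            row.append(total // 4)
        out.append(row)
    return out

# Hypothetical patch: dark dots (30) on every other pixel, light paper (220).
patch = [[30 if r % 2 == 0 and c % 2 == 0 else 220 for c in range(8)]
         for r in range(8)]

bw = binarize(patch)
quarter = halve(halve(bw))

print(len(quarter), len(quarter[0]))  # 2 2
print(quarter[0][0])                  # 191
```

After the two halvings the isolated dots have blended into a uniform gray value, which may be part of why shrinking a dot-matrix image helps: neighbouring dots merge into continuous strokes.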

I will email you the training text and fonts used, if you want.

Shree Devi Kumar

Mar 22, 2019, 10:15:54 PM
to tesser...@googlegroups.com
I also changed the image to 300 dpi and used --dpi 300.

ameera...@gmail.com

Mar 23, 2019, 4:55:12 AM
to tesseract-ocr
Hi Shree,

Thanks for the files!  That's interesting that you tried replacing the top layer.  I haven't tried that yet.  How many iterations did you use?

I was thinking today that it is difficult to create a single strong learner with Tesseract because training from scratch requires so much data. However, with fine-tuning, it is easy to create a lot of weak learners. I am wondering if you know of any successes with an ensemble of Tesseract models.
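
To show the kind of ensemble I have in mind, here is a minimal sketch that combines the text output of several fine-tuned models by majority vote at each character position. The three readings are invented examples, not real Tesseract output, and voting by position only makes sense when the outputs have the same length, so the sketch simply drops readings of an unusual length.

```python
from collections import Counter

def vote(readings):
    """Majority-vote each character position across same-length OCR outputs."""
    # Keep only readings that have the most common length.
    length = Counter(len(r) for r in readings).most_common(1)[0][0]
    kept = [r for r in readings if len(r) == length]
    # Pick the most frequent character in each column.
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*kept))

# Hypothetical outputs from three weak models for the same line image.
readings = ["LOT#S18FO70", "LOT#518FO70", "LOT#S18F070"]
print(vote(readings))  # LOT#S18FO70
```

A more careful version would align the strings first (for example by edit distance) and weight each model by its confidence.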

Thanks again,
Ameera


Shree Devi Kumar

Mar 23, 2019, 6:50:52 AM
to tesser...@googlegroups.com
> That's interesting that you tried replacing the top layer.  I haven't tried that yet.  How many iterations did you use?

In this case the unicharset was limited to UPPERCASE letters, the digits 0-9, ":" and "/".
I used a training_text that followed the pattern of the image: lines starting with LOT# and EXP:, in a similar format.
I used 2 fonts which were very similar to the image.
So this was narrowly focused on a single use, and only 2000 iterations were needed with tessdata_best/eng to get the error rate down to 0.2 or so.
The number of iterations for plus training was similar, but it did not give the same accuracy (also, the traineddata file size is much smaller using this method).
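
A training text of this kind can be generated with a short script. The sketch below is only an illustration of the idea, not the actual training_text used: the lot-code length, the date range, and the inclusion of "#" in the allowed set (it is needed for the LOT# prefix) are assumptions.

```python
import random
import string

# Restricted character set: uppercase letters, digits, ":" and "/",
# plus "#" (assumed, since the LOT# prefix needs it).
ALLOWED = set(string.ascii_uppercase + string.digits + ":/#")

def make_lines(n_pairs, seed=0):
    """Return n_pairs of synthetic LOT#/EXP: lines for a training text."""
    rng = random.Random(seed)
    alnum = string.ascii_uppercase + string.digits
    lines = []
    for _ in range(n_pairs):
        lot = "".join(rng.choice(alnum) for _ in range(7))  # assumed code length
        month = rng.randint(1, 12)
        year = rng.randint(2019, 2030)                      # assumed date range
        lines.append(f"LOT#{lot}")
        lines.append(f"EXP:{month:02d}/{year}")
    return lines

for line in make_lines(3):
    print(line)
```

Writing `"\n".join(make_lines(2000))` to a file would give a text that can then be rendered with the chosen fonts by the usual training tools.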

Balaji Gurunathan

Jun 7, 2019, 9:18:39 AM
to tesseract-ocr
Hi,

I have a similar requirement to read dot-matrix fonts, but I'm not sure where to begin since I'm new to Tesseract. Could you please share some references or a guide?

Thanks.

ameera3

Jul 9, 2019, 8:34:42 PM
to tesseract-ocr
Hi Balaji,

You may find this GitHub repository useful: https://github.com/ameera3/OCR_Expiration_Date

Best wishes,
ameera3