Microscopy label, poor recognition


Martin Weihrauch

Dec 21, 2021, 5:08:21 AM
to tesseract-ocr

I have an image (the label of a microscopy slide) that I thought would be easy to OCR because it is easily readable for humans. I am using the latest Tesseract 5 from the command line under Windows. However, with

tesseract image.jpg image.txt --oem 1 --psm x

where x is any of the psm values I tried, the results are poor: it misses the bottom line with "LOT40446" and reads "+" as "4" after binarization of the image I post here. Is there anything I can do to improve the results?

I tried:

- Binarizing the image

- Setting DPI to 300 dpi

With both of these applied, it produced:

| +125 PROCock tai

 | 12/03/2021

| 36729/21 344


Do you have any suggestions for improvement? On a side note, I tried a9t9, an OCR library available on Windows 10, which was a lot better but also had weaknesses.

JBOBF.jpg

Merlijn B.W. Wajer

Dec 21, 2021, 5:53:44 AM
to tesser...@googlegroups.com
Hi Martin,

Some of the advice below applies to Tesseract 5 only...

On 21/12/2021 09:38, 'Martin Weihrauch' via tesseract-ocr wrote:
>
>
> I have an image (label of a microscopy slide), which I thought would be
> easy to OCR, because it is easily readable for humans. I am using the
> latest Tesseract V5 as a command line under Windows However, with
> tesseract image.jpg image.txt --oem 1 --psm x
>
> with "--psm x" x being any number, which I tried, the results are poor (it
> misses the bottom line with "LOT40446" and thinks "+" is a "4" after
> binarization of the image I post here. Is there anything I can do to
> improve the results?
>
> I tried:
>
> - Binarizing the image
>
> - Setting DPI to 300 dpi
>
> With these latter, it produced:
>
> *| +125 PROCock tai*
>
> * | 12/03/2021*
>
> *| 36729/21 344*

This seems to work decently for reading the text you pasted above:

> $ tesseract --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> | +125 PROCock tai
>
> | 12/03/2021
> | 36729/21 3+4

But it still doesn't pick up the other text, which looks more like a
segmentation problem. You can experiment with other psm values
(with --psm 11 it finds '40446').
You can try the other thresholding_method values (0, 1, 2) as well:

> $ tesseract --psm 11 --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> ay els
>
> 12/03/2021
>
> 36729/21 3+4
>
> LOT
>
> 40446
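
A minimal sketch of such an experiment as a shell loop, assuming a POSIX shell with Tesseract 5 on the PATH (on Windows, for example Git Bash or WSL). The helper names run_ocr and sweep, the DRY_RUN switch, the OCR_IMAGE variable, and the list of psm values are illustrative choices for this sketch, not part of Tesseract. As far as I know, thresholding_method 0 is the legacy Otsu, 1 is Leptonica's adaptive Otsu, and 2 is Sauvola:

```shell
#!/bin/sh
# Try several page segmentation modes and thresholding methods on one image.
# run_ocr, sweep, DRY_RUN and OCR_IMAGE are hypothetical names for this sketch.

run_ocr() {
    # $1 = psm value, $2 = thresholding_method (0, 1 or 2), $3 = image path
    set -- tesseract --psm "$1" --dpi 600 -c "thresholding_method=$2" -l eng "$3" -
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"            # only print the command line
    else
        "$@" 2>/dev/null     # OCR text goes to stdout because outputbase is "-"
    fi
}

sweep() {
    # $1 = image path; the psm list is an arbitrary selection to start from
    for psm in 4 6 11 12; do
        for method in 0 1 2; do
            echo "== psm=$psm thresholding_method=$method =="
            run_ocr "$psm" "$method" "$1"
        done
    done
}

# Run the sweep when an image path is given in OCR_IMAGE.
if [ -n "${OCR_IMAGE:-}" ]; then
    sweep "$OCR_IMAGE"
fi
```

Saved as, say, sweep.sh (a placeholder name), it could be run as `OCR_IMAGE=JBOBF.jpg sh sweep.sh`; each combination prints under its own header, so it is easy to see which setting picks up the LOT line.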

If the segmentation isn't what you hoped for, you could also try
manually segmenting the image, or at least cropping it a bit more (to
make it more clear) before passing it to Tesseract.
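
A rough sketch of the cropping approach, assuming ImageMagick's convert is installed alongside Tesseract. The helper name crop_and_ocr, the DRY_RUN switch, and any geometry you pass in are placeholders; the real coordinates have to be read off the image. Using --psm 7 (treat the image as a single text line) is my own assumption about what suits a cropped one-line region:

```shell
#!/bin/sh
# Crop a region out of the label and OCR it as a single text line.
# crop_and_ocr and DRY_RUN are hypothetical names for this sketch.

crop_and_ocr() {
    # $1 = input image, $2 = crop geometry WxH+X+Y, $3 = output PNG
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # only print the two commands that would be run
        echo "convert $1 -crop $2 +repage $3"
        echo "tesseract --psm 7 $3 -"
        return 0
    fi
    # +repage drops the original canvas offset from the cropped image
    convert "$1" -crop "$2" +repage "$3" &&
        tesseract --psm 7 "$3" -
}
```

A call would look like `crop_and_ocr JBOBF.jpg WxH+X+Y lot.png`, with WxH+X+Y replaced by the measured size and offset of the line you want.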

For microfiche labels (not microscopy), I resorted to manual
segmentation (with prior knowledge of the material) and also had to
retrain Tesseract to deal with dot matrix fonts, but you don't seem to
need that. Probably with a bit more tweaking of either image cleanup or
segmentation you can get pretty decent results.

Regards,
Merlijn

Martin Weihrauch

Dec 21, 2021, 5:57:43 AM
to tesseract-ocr
Thank you so much for your efforts!

Art Rhyno

Dec 21, 2021, 9:25:54 AM
to tesser...@googlegroups.com

One other idea that might help in a case like this is to apply a threshold first, using ImageMagick for example (though it adds some garbage to the output):

$ convert -threshold 20% sample.jpg sample.png

$ tesseract --psm 11 sample.png sample

$ more sample.txt

+125

PROCock tai

2

12/03/2021

36729/21 3+4

|

>

Nb

41

LOT, 40446
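
Since the best threshold level depends on the image, a small loop can compare a few of them. This is only a sketch, assuming ImageMagick's convert and Tesseract are both on the PATH; the helper name threshold_sweep, the DRY_RUN switch, the output file naming, and the list of levels are arbitrary choices, not something I tested on this label:

```shell
#!/bin/sh
# Binarize at several threshold levels and OCR each result.
# threshold_sweep and DRY_RUN are hypothetical names for this sketch.

threshold_sweep() {
    # $1 = input image; writes <name>-t<level>.png next to it
    base="${1%.*}"
    for level in 10 20 30 40; do
        out="${base}-t${level}.png"
        echo "== threshold ${level}% -> $out =="
        if [ "${DRY_RUN:-0}" != 1 ]; then
            # same steps as the two commands above, once per level
            convert "$1" -threshold "${level}%" "$out" &&
                tesseract --psm 11 "$out" -
        fi
    done
}
```

Calling `threshold_sweep sample.jpg` prints the OCR text for each level under its own header, so the level with the least garbage can be picked by eye.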


Keith M

Dec 21, 2021, 11:30:45 AM
to tesseract-ocr
Martin,

I'd normally reply privately here, but I don't think that's an option given the Google Groups configuration.

I know you didn't ask about this specifically, but I ran your sample image, unmodified, through AWS Textract and got great results. I'm happy to run a small subset of images through it if you have a wide range of inputs, image quality, etc.

Please contact me off-list at keith a_ t_ techtravels dot org.

Thanks,
Keith