I have an image (the label of a microscopy slide) that I thought would be easy to OCR, because it is easily readable for humans. I am using the latest Tesseract 5 from the command line under Windows. However, with
tesseract image.jpg image.txt --oem 1 --psm x
where x is any of the page segmentation modes I tried, the results are poor: it misses the bottom line with "LOT40446" and reads "+" as "4", even after binarizing the image I post here.
Is there anything I can do to improve the results?
I tried:
- Binarizing the image
- Setting DPI to 300 dpi
With both of these applied, it produced:
| +125 PROCock tai
| 12/03/2021
| 36729/21 344
Do you have any suggestions for improvement? On a side note, I tried a9t9, which is available on Windows 10. It was a lot better, but it also had weaknesses.
One other idea that might help in a case like this is to apply a threshold first, using ImageMagick for example (though it adds some garbage to the output):
$ convert -threshold 20% sample.jpg sample.png
$ tesseract --psm 11 sample.png sample
$ more sample.txt
+125
PROCock tai
2
12/03/2021
36729/21 3+4
|
>
Nb
41
LOT, 40446
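
Since both the threshold value and the --psm mode affect the result, one way to look for a better combination is to try several of each and compare the output by eye. The following is only a rough sketch in Python, not a tested recipe: the helper names (binarize_cmd, ocr_cmd, sweep), the threshold values and the chosen PSM modes are my own assumptions, and it expects ImageMagick and Tesseract to be on the PATH. Note that on Windows, and with ImageMagick 7, the binary is usually `magick` rather than `convert` (Windows ships an unrelated convert.exe).

```python
import shutil
import subprocess
from pathlib import Path


def binarize_cmd(image, threshold_pct, convert_bin="convert"):
    """Build the ImageMagick call that thresholds `image`.

    Returns (argument list, output path). The "_t<pct>.png" naming is
    just a convention chosen for this sketch.
    """
    src = Path(image)
    out = str(src.with_name(f"{src.stem}_t{threshold_pct}.png"))
    return [convert_bin, str(src), "-threshold", f"{threshold_pct}%", out], out


def ocr_cmd(image, psm, dpi=300):
    """Build the Tesseract call. The output base "stdout" prints the
    recognized text instead of writing a .txt file."""
    return ["tesseract", image, "stdout",
            "--oem", "1", "--psm", str(psm), "--dpi", str(dpi)]


def sweep(image, thresholds=(20, 35, 50), psm_modes=(4, 6, 11),
          convert_bin="convert"):
    """Run every threshold/PSM combination; return {(pct, psm): text}."""
    results = {}
    for pct in thresholds:
        argv, binarized = binarize_cmd(image, pct, convert_bin)
        subprocess.run(argv, check=True)  # binarize once per threshold
        for psm in psm_modes:
            done = subprocess.run(ocr_cmd(binarized, psm), check=True,
                                  capture_output=True, encoding="utf-8")
            results[(pct, psm)] = done.stdout
    return results


if __name__ == "__main__":
    # Prefer `magick` when present (ImageMagick 7 / Windows installs).
    convert_bin = "magick" if shutil.which("magick") else "convert"
    tools_ok = shutil.which(convert_bin) and shutil.which("tesseract")
    if tools_ok and Path("sample.jpg").exists():
        found = sweep("sample.jpg", convert_bin=convert_bin)
        for (pct, psm), text in found.items():
            print(f"--- threshold {pct}%, psm {psm} ---")
            print(text.strip())
```

Each combination prints under its own header, so it is easy to see which settings recover the "LOT" line or the "+" sign. The binarized PNGs are left on disk so the best one can be inspected afterwards.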