Tesseract accuracy.


Kyle Zeneki

Mar 25, 2023, 3:39:08 AM
to tesseract-ocr
Hello, I have these images and I'm trying to extract their text with Tesseract. I spent two hours fine-tuning Tesseract for a specific font, and the error rate was 0.163. I tried several font-detection websites, and the closest match was "Futura Now." However, Tesseract sometimes fails to read the "E" in "D V E O" but successfully reads the "E" in "EOPEO." It also occasionally misreads "S E G I E" as "Ss Ee G I E." I'm wondering if there's a way to train Tesseract from images rather than from a font. Alternatively, is there a better tool than Tesseract, such as EasyOCR?
Attachments: capture9.png, capture4.png, capture5.png, capture6.png, capture7.png, capture8.png

Zdenko Podobny

Apr 1, 2023, 3:20:00 AM
to tesser...@googlegroups.com
As the first step, I would suggest you read https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md

Next: the LSTM model is trained on words and lines of text, so it can have problems with "code"-like strings such as these. For images like these, legacy mode (--oem 0) works well. E.g.:

tesseract WCAZ.png - --psm 6 --oem 0
W C A Z
tesseract DVEO.png - --psm 6 --oem 0
D V E O

The legacy engine model is included in the language files in the tessdata repository (https://github.com/tesseract-ocr/tessdata). Many installations ship the fast models instead, which do not include the legacy model.
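
If you want to run the same commands over several images from a script, a minimal Python sketch could look like the one below. It only wraps the command line shown above. The helper names (build_tesseract_cmd, run_legacy_ocr) and the optional tessdata_dir argument are hypothetical, chosen for illustration. It assumes the tesseract executable is on your PATH and that the traineddata file it finds includes the legacy model.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_cmd(image_path: str, psm: int = 6, oem: int = 0,
                        tessdata_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for a Tesseract run that prints to stdout.

    Defaults mirror the commands above: --psm 6 and --oem 0 (legacy engine).
    """
    cmd = ["tesseract", image_path, "-"]  # "-" as output base means stdout
    if tessdata_dir is not None:
        # Assumption: this directory holds traineddata with the legacy model,
        # e.g. a checkout of the tessdata repository.
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += ["--psm", str(psm), "--oem", str(oem)]
    return cmd


def run_legacy_ocr(image_path: str, tessdata_dir: Optional[str] = None) -> str:
    """Run Tesseract with the legacy engine and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, tessdata_dir=tessdata_dir),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

For example, run_legacy_ocr("DVEO.png") runs the second command above. If the installed model lacks the legacy engine, Tesseract exits with an error and check=True raises subprocess.CalledProcessError, so a missing model is not silently ignored.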

Zdenko


