You could try making it smaller, something like:
convert -resize 50% text_l.png text_s.png
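If you would rather compute the percentage than guess 50%, here is a rough sketch. It assumes you have measured the height of a capital letter in the image yourself and have picked a target height to aim for. `resize_percent` and `build_convert_command` are made-up helper names; only the `convert -resize N%` form comes from the command above.

```python
import shlex


def resize_percent(measured_cap_height_px: float, target_cap_height_px: float) -> int:
    """Return the resize percentage that maps the measured letter height to the target."""
    if measured_cap_height_px <= 0 or target_cap_height_px <= 0:
        raise ValueError("heights must be positive")
    return round(100 * target_cap_height_px / measured_cap_height_px)


def build_convert_command(src: str, dst: str, percent: int) -> list:
    # Same argument order as the command in the message: convert -resize N% src dst
    return ["convert", "-resize", f"{percent}%", src, dst]


if __name__ == "__main__":
    # The 60 px and 30 px values are placeholders, not recommendations.
    pct = resize_percent(measured_cap_height_px=60, target_cap_height_px=30)
    print(shlex.join(build_convert_command("text_l.png", "text_s.png", pct)))
```

The target height is left as a parameter on purpose, since the best value depends on the font and the Tesseract version.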
Best,
art
From: tesser...@googlegroups.com <tesser...@googlegroups.com>
On Behalf Of Mishal Shanavas
Sent: Wednesday, December 20, 2023 7:29 AM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: [tesseract-ocr] inaccuracy in plane text
I cannot extract text with reliable accuracy from a simple text image.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
tesseract-oc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com.
--
Tesseract expects black text (lettering) on a white background: that's what it has been trained on and that's what will work best. Hence: try to convert anything to look like that before feeding it to Tesseract.
(Someone did in-depth research about this many years ago and published it on this list, including charts, but I can't find the link within 60 seconds. Lazy me, sorry.)
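To make the "convert it to black on white" advice concrete, here is a minimal sketch in plain Python. It assumes the image is already loaded as a grid of 8-bit grayscale values and that the background covers most of the image, so the mean brightness tells you the polarity. The function names and the threshold of 128 are my own choices for illustration, not anything Tesseract provides.

```python
from statistics import mean


def is_light_on_dark(pixels, threshold=128):
    """Guess polarity: True when the image is mostly dark (light text on dark background)."""
    flat = [value for row in pixels for value in row]
    return mean(flat) < threshold


def invert(pixels):
    """Flip every 8-bit grayscale value (0 becomes 255 and vice versa)."""
    return [[255 - value for value in row] for row in pixels]


def to_dark_on_light(pixels, threshold=128):
    """Return the image with dark text on a light background, inverting only if needed."""
    return invert(pixels) if is_light_on_dark(pixels, threshold) else pixels


if __name__ == "__main__":
    # A tiny 3x3 "image": one white pixel on a black background.
    white_on_black = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
    print(to_dark_on_light(white_on_black))
```

With a real file you would do the same thing with an image library or ImageMagick; the grid version just keeps the logic visible.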
Converting to black text on a white background is not needed (in all cases ;-) ): for LSTM, Tesseract inverts an image by itself and uses the OCR result with the best confidence. In practice this does not work 100% of the time. But if somebody cares about speed, the best way is to use a binarized image with a white background and black text, plus the parameter tessedit_do_invert=0 (or the new parameter invert_threshold=0.0).
The in-depth research mentioned above is by "Willus Dotkom". The link is in the most ignored part of Tesseract (the documentation): see https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling :-)
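For anyone who wants to try the tessedit_do_invert=0 / invert_threshold=0.0 suggestion from a script, here is a small sketch that builds the command line. The parameter names are taken from the message above and are passed with Tesseract's `-c VAR=VALUE` option; which of the two your installed version accepts is something to check yourself. `build_tesseract_command` and `run_ocr` are hypothetical helper names, and the file names are placeholders.

```python
import shlex
import subprocess


def build_tesseract_command(image_path, output_base, use_invert_threshold=False):
    """Build a tesseract call that skips the automatic inversion pass."""
    # Older parameter: tessedit_do_invert=0; newer one: invert_threshold=0.0
    setting = "invert_threshold=0.0" if use_invert_threshold else "tessedit_do_invert=0"
    return ["tesseract", image_path, output_base, "-c", setting]


def run_ocr(image_path, output_base, use_invert_threshold=False):
    """Run tesseract; raises CalledProcessError if the command fails."""
    cmd = build_tesseract_command(image_path, output_base, use_invert_threshold)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the command so it can be inspected before running it.
    print(shlex.join(build_tesseract_command("text_s.png", "out")))
```

This only makes sense if the input really is black text on a white background already, since the inversion fallback is switched off.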