Re: [tesseract-ocr] Hey

Ger Hobbelt

Jul 20, 2024, 10:15:44 AM
to tesser...@googlegroups.com
Too little information provided for anyone to try and (at least) reproduce your problem.

Besides, if this is your source image you're toast anyway. For you and others:

mekur-bad-rez2.webp


Your image reports as roughly 400x500-something pixels in size. (In the chart image above, the axis unit is *hundreds of pixels*, i.e. '4' = 400 px.) For tesseract to have a chance at all, a single text line's C[apitals]-height should be around 30px; anything larger can be scaled down if needed, during the image preprocessing done before feeding your stuff to tesseract.

TL;DR: that '30' number means the number of text lines in a 100-pixel-high section should be about 3 (or rather fewer, as line-height > C-height > x-height), not the *9* lines counted in your image!
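The arithmetic behind that count can be sketched in a few lines of Python. This is only a rough check under the rule of thumb stated above; the function names `line_pitch` and `looks_ocr_ready` are made up for illustration and are not part of tesseract or any library.

```python
# Rule-of-thumb numbers taken from the message above.
TARGET_CAP_HEIGHT_PX = 30   # suggested capital-letter height for tesseract
SECTION_PX = 100            # height of the strip in which lines are counted


def line_pitch(lines_in_section: int, section_px: int = SECTION_PX) -> float:
    """Average vertical distance between consecutive text lines, in pixels."""
    if lines_in_section <= 0:
        raise ValueError("lines_in_section must be positive")
    return section_px / lines_in_section


def looks_ocr_ready(lines_in_section: int) -> bool:
    """True when the line pitch leaves room for a ~30 px capital height.

    Line height is larger than capital height, so a pitch below 30 px means
    the capitals themselves must be smaller than the target.
    """
    return line_pitch(lines_in_section) >= TARGET_CAP_HEIGHT_PX


print(line_pitch(9))        # pitch for 9 lines per 100 px, as counted in the image
print(looks_ocr_ready(9))   # the image in question
print(looks_ocr_ready(3))   # the suggested density
```

With 9 lines per 100 px the pitch comes out at about 11 px per line, which is well under the 30 px capital height, let alone the line height that goes with it.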

I don't know this language, but for you & anyone else who would like at least a fighting chance of OCR-ing something: a 30px C-height implies a ball-park 20px x-height and "reasonable" line heights of 40px or more. And, please, don't get me started on "I'll resize the image if you want it to be bigger!" 🤦  To the machine, the above image is just a bunch of pixelated noise, alas, irrespective of what language the original was written in. Are your pixel measurements lower, falling short of that 30px benchmark? Redo your scans, get better hardware, do a better job at the image preprocessing. (This image also fails on that last front, incidentally, but one can write a book on that subject alone, so we'll leave that out.)
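For anyone who wants to measure their own scans instead of counting lines by eye, here is a minimal sketch of a horizontal projection profile in plain Python. It assumes the page has already been binarized into rows of 0 (background) and 1 (ink); the helper names `text_line_bands` and `median_band_height` are hypothetical. The band height it reports covers ascenders and descenders, so treat it as a loose upper bound on the capital height, not as the capital height itself.

```python
from typing import List, Tuple


def text_line_bands(binary_rows: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom_exclusive) row ranges that contain ink.

    A row counts as "inked" when it holds at least `min_ink` pixels set to 1.
    Consecutive inked rows are merged into one band, roughly one text line.
    """
    bands = []
    start = None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # band running to the bottom edge
        bands.append((start, len(binary_rows)))
    return bands


def median_band_height(bands: List[Tuple[int, int]]) -> int:
    """Median height in pixels of the detected bands (upper median if even)."""
    if not bands:
        raise ValueError("no text bands found")
    heights = sorted(bottom - top for top, bottom in bands)
    return heights[len(heights) // 2]


# Synthetic page: three 10-row "text lines" separated by blank rows.
blank = [0] * 8
inked = [0, 1, 1, 0, 1, 1, 1, 0]
page = [blank] + [inked] * 10 + [blank] * 2 + [inked] * 10 + [blank] * 2 + [inked] * 10 + [blank]

bands = text_line_bands(page)
print(bands)
print(median_band_height(bands))
```

If the median band height on a real scan lands far below 30 px, the advice above applies: rescan at a higher resolution. Noisy or skewed scans will merge or split bands, so this is a sanity check, not a layout analyzer.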



Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


On Wed, Jul 10, 2024 at 11:12 AM Mekuriaw Aze <mekur...@gmail.com> wrote:
Dear All
Cooperation request
My question is: when I try again and again in Python to convert the image to readable text, it gives me an error. Can you help me?
Is the image attached below? Is Geez an Ethiopian language?


Mekuriaw Aze

Jul 20, 2024, 2:00:38 PM
to tesseract-ocr
Thanks 