Miss lots of words in the detection

L ht

Jan 21, 2024, 11:00:46 AM
to tesseract-ocr
I am new to Tesseract, and it does not work as I expected. I have attached one example.

tesseract 5.3.2
tesseract 272525030292764523137280353496213864766.png - -l eng --psm 3 quiet
It detects only these words:
"Log in
Username
Password
Cancel"
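
One way to see where on the page those words were found is to ask Tesseract for TSV output instead of plain text, which adds a bounding box and a confidence value per word. The following is a minimal sketch, assuming the `tesseract` binary is on the PATH and that the `tsv` config file shipped with standard installs is available; the function names `run_tsv` and `parse_words` are illustrative, not part of any library.

```python
import csv
import io
import subprocess


def run_tsv(image_path, lang="eng", psm=3):
    """Run the tesseract CLI and return its TSV output as a string."""
    # "-" sends the result to stdout; the trailing "tsv" selects the TSV config.
    cmd = ["tesseract", image_path, "-", "-l", lang, "--psm", str(psm), "tsv"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_words(tsv_text):
    """Return (text, left, top, width, height, conf) for each recognized word."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; other levels describe
        # pages, blocks, paragraphs and lines and carry no text.
        if row.get("level") != "5" or not text:
            continue
        words.append(
            (
                text,
                int(row["left"]),
                int(row["top"]),
                int(row["width"]),
                int(row["height"]),
                float(row["conf"]),
            )
        )
    return words
```

Calling `parse_words(run_tsv("272525030292764523137280353496213864766.png"))` would list each detected word with its position, which makes it easier to tell whether whole regions of the screenshot were skipped or whether words were found but recognized poorly.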

I submitted this picture to several online image-to-text converters. They work well, detecting most of the text in the picture.
One example is https://www.imagetotext.info/, which claims to use Tesseract.

I am not sure whether I am using Tesseract correctly.
Could someone else test this picture and share what they get?
Thanks

272525030292764523137280353496213864766.png

Zdenko Podobny

Jan 21, 2024, 11:03:00 AM
to tesser...@googlegroups.com
Did you read the documentation or did you just set your expectations?


Zdenko


On Sun, Jan 21, 2024 at 12:00, L ht <lhta...@gmail.com> wrote:

L ht

Jan 22, 2024, 6:42:14 PM
to tesser...@googlegroups.com
Hi Zdenko,

Thanks for your response.
I read the Tesseract User Manual (https://tesseract-ocr.github.io/tessdoc/), but I have not read the code.

I tried both tessdata_best and tessdata and tried different --psm values, but I still cannot get more detections.

To provide some context, when I applied Tesseract to the entire image, it managed to identify only a few words, such as "Log in," "Username," "Password," and "Cancel," primarily within the central, well-lit portion. However, when I cropped the image to retain either the upper or left portions, Tesseract exhibited improved performance, successfully detecting numerous words in those respective areas.
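
The cropping experiment above can be automated roughly as follows. This is only a sketch: a fixed grid with a small overlap stands in for real text-block detection, the helper names `tile_boxes` and `ocr_tiles` are made up for illustration, and `ocr_tiles` assumes Pillow is installed and the `tesseract` binary is on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def tile_boxes(width, height, rows, cols, overlap=0):
    """Split a width x height image into rows x cols crop boxes.

    Each box is (left, top, right, bottom), widened by `overlap` pixels
    on every side and clamped to the image bounds.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = max(c * width // cols - overlap, 0)
            top = max(r * height // rows - overlap, 0)
            right = min((c + 1) * width // cols + overlap, width)
            bottom = min((r + 1) * height // rows + overlap, height)
            boxes.append((left, top, right, bottom))
    return boxes


def ocr_tiles(image_path, rows=2, cols=2, overlap=20, lang="eng", psm=3):
    """OCR each tile separately and return a list of (box, text) pairs."""
    # Pillow is third-party; importing it here keeps tile_boxes usable without it.
    from PIL import Image

    results = []
    with Image.open(image_path) as img, tempfile.TemporaryDirectory() as tmp:
        boxes = tile_boxes(img.width, img.height, rows, cols, overlap)
        for i, box in enumerate(boxes):
            tile_path = Path(tmp) / f"tile_{i}.png"
            img.crop(box).save(tile_path)
            # "-" as the output base makes tesseract print to stdout.
            cmd = ["tesseract", str(tile_path), "-", "-l", lang, "--psm", str(psm)]
            out = subprocess.run(
                cmd, capture_output=True, text=True, check=True
            ).stdout
            results.append((box, out.strip()))
    return results
```

A grid cut can slice through a line of text, which is why the boxes overlap; words in the overlap may then appear twice and would need de-duplication.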

Best,
Haitao

Zdenko Podobny

Jan 23, 2024, 6:02:55 AM
to tesser...@googlegroups.com
Hi,

The most critical part is this: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, but I need to stress that Tesseract is an OCR engine, not an OCR suite.
Unless your input is a book page scan without a difficult structure, you need to do your part, such as image processing and document segmentation (detection of text blocks).

This is why you get "unsatisfactory" results if you send complicated images with non-uniform text, graphics, etc.
However, if you use only the text part of the image for recognition, you can get very good results.

Best regards,

Zdenko


On Mon, Jan 22, 2024 at 19:42, L ht <lhta...@gmail.com> wrote:

L ht

Jan 23, 2024, 5:44:40 PM
to tesser...@googlegroups.com

Hi Zdenko,

Thanks. Your insights have been instrumental in helping me grasp the concepts behind Tesseract.

I've been experimenting with various thresholding methods, such as Otsu (0), LeptonicaOtsu (1), and Sauvola (2), and I've noticed that they yield distinct outcomes when applied to my images. It seems that I might need to develop custom preprocessing procedures tailored to the images (webpage screenshots) before passing them to Tesseract.
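
A comparison like the one described can be scripted along these lines. It is a sketch under assumptions: it uses the `thresholding_method` config variable of Tesseract 5 passed through `-c`, expects the `tesseract` binary on the PATH, and the helper names `build_command` and `compare_thresholding` are hypothetical.

```python
import subprocess

# Values of Tesseract 5's thresholding_method variable, as listed above.
THRESHOLDING_METHODS = {0: "Otsu", 1: "LeptonicaOtsu", 2: "Sauvola"}


def build_command(image_path, method, lang="eng", psm=3):
    """Build the tesseract command line for one thresholding method."""
    if method not in THRESHOLDING_METHODS:
        raise ValueError(f"unknown thresholding method: {method}")
    return [
        "tesseract", image_path, "-",  # "-" writes the text to stdout
        "-l", lang,
        "--psm", str(psm),
        "-c", f"thresholding_method={method}",
    ]


def compare_thresholding(image_path):
    """Run tesseract once per method and return {method name: list of words}."""
    results = {}
    for method, name in THRESHOLDING_METHODS.items():
        out = subprocess.run(
            build_command(image_path, method),
            capture_output=True, text=True, check=True,
        ).stdout
        results[name] = out.split()
    return results
```

Comparing the word lists per method (for example by length, or against a hand-typed reference for a few screenshots) gives a rough signal of which binarization suits a given kind of image before investing in custom preprocessing.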

Your guidance and suggestions are highly appreciated.


Best,

Haitao



Santhiya C

Jan 25, 2024, 12:08:18 PM
to tesseract-ocr
Hi guys, I am about to start developing OCR for image and PDF text extraction. What steps do I need to follow, and can you recommend the best model? I have already used the pytesseract engine, but I did not get proper extraction.

Best Regards,

Sandhiya