Whenever I use a 300 dpi image without preprocessing, Tesseract sometimes fails to detect the lower portion of the image. Any help?


Jyoti yadav

Sep 26, 2022, 7:15:39 AM
to tesser...@googlegroups.com
Hi,

In the image below, Tesseract fails to detect the lower portion of the image, although everything appears to be detected in the CSV file. In other images it detects the whole page properly.

Do I need to perform any kind of preprocessing? If someone has faced a similar issue, I would appreciate their guidance and help in this regard.

image.png


Thanks & Regards 
Jyoti Yadav

Shubham Trivedi

Sep 26, 2022, 8:49:58 AM
to tesser...@googlegroups.com
I had a similar issue while working on PDFs. I used the fitz library (PyMuPDF) to zoom in on the individual pages (increasing the resolution) and then converted the pages into high-definition images, which helped me overcome the problem.


Jyoti yadav

Sep 27, 2022, 1:48:00 AM
to tesser...@googlegroups.com
Could you please share some code showing what zoom factor you used and how you converted the pages into high-definition images to overcome this problem?

Jyoti yadav

Sep 27, 2022, 2:14:21 AM
to tesser...@googlegroups.com
Because this issue still seems to be there when I use the piece of code below:

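!pip install PyMuPDF
import fitz  # PyMuPDF

pdffile = '1.pdf'
doc = fitz.open(pdffile)
for page in doc:
    zoom = 5    # zoom factor
    mat = fitz.Matrix(zoom, zoom)
    # Note: when dpi is passed, PyMuPDF ignores the matrix argument,
    # so the zoom factor above has no effect on this call.
    pix = page.get_pixmap(matrix=mat, dpi=300)
    pix.save("page-%i.jpg" % page.number)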
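
A possible variation on the snippet above, offered as a sketch rather than a confirmed fix: PyMuPDF renders at 72 dpi when the zoom factor is 1, so a zoom factor and a dpi value express the same setting, and passing only one of them keeps the intent unambiguous. The helper names `zoom_for_dpi` and `render_pages` are hypothetical, and saving PNG instead of JPEG is an assumption made to avoid lossy compression artifacts before OCR.

```python
def zoom_for_dpi(target_dpi, base_dpi=72):
    """Return the zoom factor that yields target_dpi from a 72 dpi base."""
    return target_dpi / base_dpi


def render_pages(pdf_path, target_dpi=300):
    """Render every page of pdf_path to a PNG file and return the file names."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(target_dpi)
    mat = fitz.Matrix(zoom, zoom)
    saved = []
    doc = fitz.open(pdf_path)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)  # resolution is set by the matrix only
        name = "page-%i.png" % page.number
        pix.save(name)                     # output format follows the extension
        saved.append(name)
    doc.close()
    return saved
```

Calling `render_pages("1.pdf")` would write one PNG per page at roughly 300 dpi. Whether this changes the OCR result for the lower part of the page is not something this thread establishes.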

Shubham Trivedi

Sep 27, 2022, 7:09:41 AM
to tesser...@googlegroups.com
Have you tried playing around with the psm and oem flags in Tesseract?

Jyoti yadav

Sep 27, 2022, 7:14:28 AM
to tesser...@googlegroups.com
Yes, I have used this custom config for pytesseract:
custom_config = r'--oem 3 --psm 6'
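
For comparison, `--psm 6` tells Tesseract to assume a single uniform block of text, which may not suit every page layout. A minimal sketch for trying several page segmentation modes on the same image follows. The helper names `build_config` and `ocr_with_modes` are hypothetical, and the choice of modes (3, 4, 6, 11) is an illustrative assumption, not a recommendation made in this thread.

```python
def build_config(oem, psm):
    """Build a pytesseract config string such as '--oem 3 --psm 6'."""
    return "--oem %d --psm %d" % (oem, psm)


def ocr_with_modes(image_path, psm_modes=(3, 4, 6, 11), oem=3):
    """Run OCR once per page segmentation mode and return {psm: text}."""
    import pytesseract     # deferred so build_config works without it
    from PIL import Image

    image = Image.open(image_path)
    results = {}
    for psm in psm_modes:
        config = build_config(oem, psm)
        results[psm] = pytesseract.image_to_string(image, config=config)
    return results
```

Comparing the text returned for each mode, for example by checking which one contains words from the bottom of the page, could show whether the segmentation mode is involved.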

Shubham Trivedi

Sep 27, 2022, 7:24:21 AM
to tesser...@googlegroups.com
Then, in that case, I will have to check the code.
Another thing: are you trying to convert a table into a dataframe?

Jyoti yadav

Sep 27, 2022, 7:32:01 AM
to tesser...@googlegroups.com
No. The PDF also contains other fields laid out as a form, as well as tables, which are irrelevant for now, so I am assigning them the label "others". But for some PDFs Tesseract fails to recognise the text all the way to the bottom of the page, even though that text is detected in Tesseract's CSV file.
So I am in a dilemma about what is wrong here.
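
One way to check whether the recognised words really stop short of the bottom of the page is to look at the bounding boxes Tesseract reports. The sketch below assumes the "csv file" mentioned above is the word-level table produced by `pytesseract.image_to_data` (for example with `output_type=pytesseract.Output.DICT`); that reading is an assumption, and the function name `lowest_text_fraction` is hypothetical.

```python
def lowest_text_fraction(data, image_height):
    """Return how far down the page the lowest recognised word reaches (0.0 to 1.0).

    data is a dict with parallel lists 'text', 'top' and 'height'.
    """
    bottoms = [
        top + height
        for text, top, height in zip(data["text"], data["top"], data["height"])
        if str(text).strip()  # skip rows with no recognised text
    ]
    if not bottoms:
        return 0.0
    return max(bottoms) / float(image_height)


# Made-up sample values, for illustration only.
sample = {
    "text": ["", "Invoice", "Total", " "],
    "top": [0, 100, 1500, 2900],
    "height": [3000, 40, 50, 30],
}
print(lowest_text_fraction(sample, 3000))
```

A value well below 1.0 on a page whose text runs to the bottom would suggest the lower region was not recognised at all, while a value near 1.0 would point to a problem in a later processing step instead.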
