The text is not recognized from png

amrapalli karan

Apr 7, 2020, 1:38:25 AM

to tesseract-ocr

I have a .pdf file that I am able to read only partially. I am using R to extract the data from the PDF, in which the content is stored as an image.

The expected output is:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc

The actual output which I am getting:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567 
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.

The highlighted part of the text is missing when I extract the data (in the sample above, the CATHODE FULL row is dropped and 441567 is read as 44567). Part of the code that I am using in R is:

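library(pdftools)   # provides pdf_convert()
library(tesseract)  # provides ocr()

# event_url (defined elsewhere) is the path or URL of the PDF
pdf_convert(event_url,
            pages = 1,
            dpi = 850,
            filenames = "page1.png")
# what does the data look like
text <- ocr("page1.png")
cat(text)

What changes should I make that would help me fetch the complete data? Thanks in advance.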

0-637189269505122500-AnnualReport.pdf

Zdenko Podobny

Apr 7, 2020, 6:57:44 AM

to tesser...@googlegroups.com

You can start by reading the docs and then searching the issue tracker and the forum for "table".

Zdenko

Lakshay Saini

Apr 7, 2020, 8:36:55 AM

to tesseract-ocr

Hi,

1. Deskew the image to get straight text lines.
2. Use Tesseract's page segmentation mode (PSM) 6, which treats the page as a single uniform block of text. Each table row is then read horizontally as one line, which can be very useful in table extraction.
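
In R, the two steps above could be sketched roughly as follows. This is only a sketch under some assumptions: "page1.png" is the page image written by pdf_convert() earlier in the thread, the deskew threshold of 40 and the output file name "page1_deskewed.png" are illustrative choices, and whether image_deskew() fully straightens this particular scan is not guaranteed.

```r
library(magick)
library(tesseract)

# Assumption: "page1.png" is the page image produced earlier with pdf_convert().
img <- image_read("page1.png")

# Step 1: try to straighten the text lines.
# The threshold value is illustrative and may need tuning for this scan.
img <- image_deskew(img, threshold = 40)
image_write(img, path = "page1_deskewed.png", format = "png")

# Step 2: build an engine that uses page segmentation mode 6
# ("assume a single uniform block of text").
eng <- tesseract(language = "eng",
                 options = list(tessedit_pageseg_mode = 6))

text <- ocr("page1_deskewed.png", engine = eng)
cat(text)
```

If the rows still come out broken, comparing the output of this engine with the default one (ocr("page1_deskewed.png") without the engine argument) shows whether the PSM setting is what makes the difference.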

Tesseract can give great results, but they depend on the quality of the image you provide. It cannot give 100% accurate results all the time, although with a very good image 100% is possible. :)

I have attached the results after deskewing the image; please take a look. I did this in Python.

0-637189269505122500-AnnualReport-1-deskew-converted_ocr.txt

0-637189269505122500-AnnualReport-1-deskew-converted_ocr.pdf

amrapalli karan

Apr 8, 2020, 12:42:18 AM

to tesseract-ocr

Thanks for the post. When I try to use deskew in R, it throws an error during installation. I have a workaround that gave somewhat similar results. The magick package has image_deskew, but that didn't seem to work. The output now contains a stray '|' and 'CATHODEFULL' (without the space), and I am not sure why. Is there any way around this?

Code:

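library(magick)  # image_read_pdf() needs pdftools installed, image_ocr() needs tesseract

image <- image_read_pdf('https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf')

# rotate by 3 degrees to straighten the text, then run OCR
text <- image %>%
  image_rotate(3) %>%
  image_ocr()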

Lakshay Saini

Apr 8, 2020, 8:09:47 AM

to tesseract-ocr

If you have deskewed the image and are still not getting the desired results, you can experiment with Tesseract's engine modes, PSM modes and DPI settings. Other than that, I don't think anything else can be done to improve the results. As I mentioned earlier, Tesseract's output depends greatly on the quality of the image provided to it. Experiment with the image settings and see if you can improve the output.
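
One way to try out a few of these settings in R is a small loop like the one below. It is a rough sketch, not a tuned recipe: it assumes a local copy of the attached PDF under its original file name, and the DPI values (300, 600) and PSM values (4, 6) are just example choices. Engine modes are left out here to keep the sketch short.

```r
library(pdftools)
library(tesseract)

# Assumption: the PDF from the first post has been saved locally under this name.
pdf_file <- "0-637189269505122500-AnnualReport.pdf"

for (dpi in c(300, 600)) {            # example DPI values
  png_file <- sprintf("page1_%d.png", dpi)
  pdf_convert(pdf_file, format = "png", pages = 1,
              dpi = dpi, filenames = png_file)

  for (psm in c(4, 6)) {              # example page segmentation modes
    eng <- tesseract(language = "eng",
                     options = list(tessedit_pageseg_mode = psm))
    text <- ocr(png_file, engine = eng)

    # Print a header so the outputs of the different runs can be compared.
    cat(sprintf("--- dpi = %d, psm = %d ---\n", dpi, psm))
    cat(text)
  }
}
```

Looking for the "CATHODE FULL 434122" row in each run is a quick check of which combination, if any, recovers the missing line.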

And if the magick library is not working, try an older version, as those can sometimes be more reliable than the newest one. Reading its documentation can also be really helpful.