CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc

The actual output which I am getting:
CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.

The highlighted part of the text is missing when I am extracting the data. A part of the code that I am using in R is:

    pdf_convert(event_url, pages = 1, dpi = 850, filenames = "page1.png")
    # what does the data look like
    text <- ocr("page1.png")
    cat(text)

What changes should I make that would help me fetch the complete data? Thanks in advance
--
1. Deskew the image to get straight text lines.
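2. Use Tesseract's PSM 6 mode (page segmentation mode 6, "assume a single uniform block of text"). In this mode the page is read line by line from left to right instead of being split into columns, which can be very useful for table extraction.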
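
The two steps above could look roughly like this in R, continuing from the code in the question. This is only a sketch and not the Python code used for the attached results: it assumes the magick and tesseract R packages are installed, that page1.png was already produced by pdf_convert(), and that the English language data is available. The output file name page1_deskewed.png and the deskew threshold of 40 are illustrative choices and may need tuning for your scan.

```r
library(magick)     # image_read(), image_deskew(), image_write()
library(tesseract)  # tesseract(), ocr()

# Step 1: straighten the page so the text lines are horizontal.
# threshold = 40 is an assumed starting value, not a tuned one.
img <- image_read("page1.png")
img <- image_deskew(img, threshold = 40)

# "page1_deskewed.png" is a hypothetical file name used for this example.
image_write(img, path = "page1_deskewed.png", format = "png")

# Step 2: build an engine with page segmentation mode 6
# (assume a single uniform block of text) and run OCR with it.
engine <- tesseract(language = "eng",
                    options = list(tessedit_pageseg_mode = 6))
text <- ocr("page1_deskewed.png", engine = engine)
cat(text)
```

If rows are still missing, comparing cat(text) with and without the deskew step should show which of the two changes makes the difference.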
The Tesseract engine can give great results, but they depend on the quality of the image you provide. It cannot give you 100% accuracy all the time, although with a very good image it is possible. :)
I have attached the results I got after deskewing the image. Please take a look. I did this in Python.
And if the magick library is not working, try an older version, as those can be somewhat more reliable than the newest one. Its documentation can also be really helpful.
Also, if you have an older version of Tesseract, upgrade to a newer one. :)
Thanks
Lakshay