We are trying to extract text from both regular (text-based) PDFs and scanned (image) PDFs using tesseract-ocr.
For PDFs that contain tables, the table contents are not extracted properly.
We tried image_to_string, image_to_data, and an OpenCV-based approach.
The sample code used is:
from PIL import Image
import pytesseract
# OCR the table image and print the recognized text
text = pytesseract.image_to_string(Image.open('table_number.jpg'))
print(text)
We expect the rows and columns to be extracted properly, which is not happening at present. Kindly suggest a function or method to improve table content extraction with tesseract.
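One possible direction for getting rows and columns back, offered as a rough sketch rather than a recommended method: image_to_data returns a bounding box for every recognized word, so words can be grouped into rows by vertical position and then ordered left to right within each row. The helper names (group_words_into_rows, ocr_table_rows) and the row_tolerance value of 10 pixels are assumptions made for illustration; they are not part of pytesseract, and the tolerance would need tuning per document.

```python
def group_words_into_rows(data, row_tolerance=10):
    """Group OCR words into rows using their bounding boxes.

    `data` is assumed to be a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT),
    i.e. parallel lists under the keys "text", "left", "top", "height".
    `row_tolerance` (pixels) is a hypothetical tuning knob.
    """
    # Keep only non-empty words, tagged with their vertical centre.
    words = [
        (top + height / 2.0, left, text.strip())
        for text, left, top, height in zip(
            data["text"], data["left"], data["top"], data["height"]
        )
        if text.strip()
    ]
    words.sort(key=lambda w: w[0])  # top to bottom

    rows = []
    row_y = None
    for y, left, text in words:
        # Start a new row when the word is too far below the row's first word.
        if row_y is None or abs(y - row_y) > row_tolerance:
            rows.append([])
            row_y = y
        rows[-1].append((left, text))

    # Within each row, order the words left to right.
    return [[text for _, text in sorted(row, key=lambda c: c[0])] for row in rows]


def ocr_table_rows(image_path, row_tolerance=10):
    """Run Tesseract on an image and return words grouped into rows."""
    # Imported lazily so the grouping helper works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return group_words_into_rows(data, row_tolerance)
```

Calling ocr_table_rows('table_number.jpg') would return one list of words per detected row. This only groups words; it does not detect cell borders or merge multi-word cells, so tables with ruled lines or wrapped cells would still need extra handling.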
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6ddfa19c-8025-40f8-8f17-a393e5b5b2cc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Did you try changing the page segmentation mode (--psm)?
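For reference, a minimal sketch of how the page segmentation mode can be passed through pytesseract's config argument. The helper names (psm_config, ocr_with_psm) are made up for this example, and the default of 6 (treat the page as a single uniform block of text) is only a starting point; which mode works best for a given table is something to test, with modes such as 4 or 11 also worth trying.

```python
def psm_config(psm, keep_spaces=False):
    """Build a Tesseract config string for a page segmentation mode.

    Tesseract 4 accepts --psm values from 0 to 13.
    `keep_spaces` adds preserve_interword_spaces=1, which may help keep
    column gaps visible in the plain-text output.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    config = "--psm {}".format(psm)
    if keep_spaces:
        config += " -c preserve_interword_spaces=1"
    return config


def ocr_with_psm(image_path, psm=6, keep_spaces=False):
    """OCR an image with the given page segmentation mode."""
    # Imported lazily so psm_config can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), config=psm_config(psm, keep_spaces)
    )
```

For example, print(ocr_with_psm('table_number.jpg', psm=6, keep_spaces=True)) runs the same image as in the original post with a different layout assumption, so the outputs of a few modes can be compared side by side.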
On Fri, May 31, 2019, 15:57 Manasi sarode <manasi....@gmail.com> wrote:
That's fair enough.
On Fri, May 31, 2019, 3:55 PM Sayali begampure <sayalisb...@gmail.com> wrote:
We are trying to extract text from both regular (text-based) PDFs and scanned (image) PDFs using tesseract-ocr.
For PDFs that contain tables, the table contents are not extracted properly. We have observed the following issues:
- The contents of some cells (rows/columns) are missing. Sometimes the table heading is missing.
- When a table contains numbers, not all of the numbers are extracted.
- Some letters are recognized incorrectly, e.g. "i" is misread as "l".
- The column order gets interchanged because the text is parsed horizontally.
- Some extra characters are extracted along with the expected ones.