Tesseract OCR

Mohammad Waqas Shoukat Ali

Apr 24, 2021, 1:58:11 AM

to tesseract-ocr

Hi team,

I want to understand how I can train my Tesseract model to handle different file formats.

Zdenko Podobny

Apr 24, 2021, 2:50:47 AM

to tesser...@googlegroups.com

Please be more specific: provide an example of what your input is and what you want to achieve.

Zdenko


Mohammad Waqas Shoukat Ali

Apr 24, 2021, 3:19:08 AM

to tesser...@googlegroups.com

Hi Zdenko,

My input is a set of PDF documents, such as salary slips and other financial documents. We want to use Tesseract to extract fields such as name, email address, and amounts from these documents.


Zdenko Podobny

Apr 24, 2021, 3:40:24 AM

to tesser...@googlegroups.com

Hi,

PDF is a document format (like odt, doc, docx, rtf), whereas Tesseract processes images.

You did not mention which programming language(s) you plan to use, but there are plenty of tools for PDF text extraction, e.g. textract (Python) [1].

If you have a "stupid PDF" (one where somebody just embedded scanned images in a PDF), extract the images from the PDF and then feed them to Tesseract.

Another option is to convert the PDF to images (so you can process them with Tesseract). I have had very good experience with mupdf, but people also use ghostscript. There are plenty of examples of how to do this on the internet (e.g. in Python [2]).

A few days ago I found tesseract-ocr-wrapper [3], which focuses on OCRing "stupid PDFs". So maybe this can help you.
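The convert-then-OCR route can be sketched in Python by calling the command-line tools directly. This is only a rough sketch, not taken from the links in this post: it assumes the mupdf `mutool` binary and the `tesseract` binary are installed and on PATH, and the function names, the `page-%d.png` naming pattern, and the 300 dpi default are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def render_cmd(pdf, out_dir, dpi=300):
    """Build a `mutool draw` command that renders each page of `pdf` to a PNG."""
    # mutool replaces %d in the output pattern with the page number.
    pattern = str(Path(out_dir) / "page-%d.png")
    return ["mutool", "draw", "-r", str(dpi), "-o", pattern, str(pdf)]


def ocr_cmd(image, lang="eng"):
    """Build a tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf(pdf, out_dir, dpi=300, lang="eng"):
    """Render every page of `pdf` to an image, then OCR the pages in order."""
    for tool in ("mutool", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf, out, dpi), check=True)
    # Sort numerically so that page-10 comes after page-9.
    pages = sorted(out.glob("page-*.png"),
                   key=lambda p: int(p.stem.split("-")[1]))
    texts = []
    for page in pages:
        result = subprocess.run(ocr_cmd(page, lang), check=True,
                                capture_output=True, text=True)
        texts.append(result.stdout)
    return texts
```

Calling `ocr_pdf("slip.pdf", "pages")` (hypothetical file names) would return one string per page. Scans usually OCR better at a higher resolution, so the dpi value is worth experimenting with.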

Just use the already available tools.
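Once you have plain text (from a PDF text extractor or from Tesseract), pulling out fields like the email address and amounts is ordinary text processing and needs no Tesseract training. A minimal sketch with Python's standard `re` module follows. The patterns, the `Name:` label, and the sample text are all invented for illustration; real salary slips will need patterns tuned to their layout, and OCR errors can break any of them.

```python
import re

# Illustrative patterns only; adjust them to the real documents.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Amounts such as 200.00 or 1,234.56 (exactly two decimal places).
AMOUNT_RE = re.compile(r"\b\d+(?:,\d{3})*\.\d{2}\b")


def labeled_value(text, label):
    """Return the text after 'label:' on its own line, or None if absent."""
    match = re.search(rf"^{re.escape(label)}\s*:\s*(.+)$", text,
                      flags=re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None


def extract_fields(text):
    """Collect a name, all email addresses and all amounts from OCR text."""
    return {
        "name": labeled_value(text, "Name"),
        "emails": EMAIL_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }


# Invented sample standing in for the OCR output of one page.
SAMPLE = """Name: Jane Doe
Email: jane.doe@example.com
Gross pay: 1,234.56
Tax: 200.00
"""

fields = extract_fields(SAMPLE)
```

Label-based lookup like `labeled_value` tends to be more robust than free-form patterns when the documents share a template.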

[1] https://textract.readthedocs.io/en/latest/

[2] https://bucket401.blogspot.com/2021/03/pdf-to-imagemultipage-in-python.html

[3] https://github.com/Altabeh/tesseract-ocr-wrapper

Zdenko

