Tesseract ocr


Mohammad Waqas Shoukat Ali

Apr 24, 2021, 1:58:11 AM
to tesseract-ocr
Hi team,

I want to understand how I can train my tesseract model to handle different file formats.

Zdenko Podobny

Apr 24, 2021, 2:50:47 AM
to tesser...@googlegroups.com
Please be more specific: provide an example of what your input is and what you want to achieve.

Zdenko



Mohammad Waqas Shoukat Ali

Apr 24, 2021, 3:19:08 AM
to tesser...@googlegroups.com
Hi Zdenko,

My input is a set of PDF documents such as salary slips and other financial documents. We want to use tesseract to extract fields like name, email address, and amounts from these documents.

Zdenko Podobny

Apr 24, 2021, 3:40:24 AM
to tesser...@googlegroups.com
Hi,

PDF is a document format (like odt, doc, docx, rtf), whereas tesseract processes images.
You did not mention what programming language(s) you plan to use, but there are plenty of tools for PDF text extraction, e.g. textract (Python) [1].

If you have a "stupid PDF" (somebody just embedded scanned images in a PDF), extract the images from the PDF and then you can use them in tesseract.

Another option is to convert the PDF to images (so you can process them with tesseract). I have very good experience with mupdf, but people also use ghostscript. There are plenty of examples of how to do it on the internet (e.g. in Python [2]).
A few days ago I found tesseract-ocr-wrapper [3], which focuses on OCRing "stupid PDFs". So maybe this can help you.
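
A minimal sketch of the "convert PDF to images, then OCR" route, assuming the mupdf command-line tool `mutool` and the `tesseract` binary are both installed and on the PATH. The function names (`render_cmd`, `ocr_cmd`, `ocr_pdf`), the output file pattern, and the 300 dpi default are illustrative choices, not something taken from the tools referenced above.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf_path, out_dir, dpi=300):
    """Build the mutool command that renders each PDF page to a PNG file."""
    # mutool replaces %d in the output name with the page number.
    pattern = str(Path(out_dir) / "page-%d.png")
    return ["mutool", "draw", "-r", str(dpi), "-o", pattern, str(pdf_path)]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, out_dir, dpi=300, lang="eng"):
    """Render every page of the PDF, OCR each image, return one string per page."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf_path, out, dpi), check=True)

    # Sort numerically so that page-10 comes after page-2.
    images = sorted(out.glob("page-*.png"),
                    key=lambda p: int(p.stem.split("-")[1]))
    pages = []
    for image in images:
        result = subprocess.run(ocr_cmd(image, lang), check=True,
                                capture_output=True, text=True)
        pages.append(result.stdout)
    return pages
```

Calling `ocr_pdf("salary_slip.pdf", "pages")` (hypothetical file names) would leave the rendered PNGs in `pages/` and return the recognized text page by page. A higher dpi usually helps tesseract on small print, at the cost of slower rendering.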

Just use the already available tools.
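
Once the text is available (from the PDF text layer or from tesseract), fields such as the email addresses and amounts mentioned in the question can often be pulled out with regular expressions. The following is only a rough sketch: the patterns and the `extract_fields` helper are assumptions for illustration, they expect amounts with two decimal places, and they do not attempt names, which usually need layout-specific rules.

```python
import re

# Simple patterns; real documents may need looser or stricter rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches 1,234.50 style amounts first, then plain 1234.50 style amounts.
AMOUNT_RE = re.compile(r"\b\d{1,3}(?:,\d{3})*\.\d{2}\b|\b\d+\.\d{2}\b")


def extract_fields(text):
    """Return all email addresses and two-decimal amounts found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }


sample = "Employee: Jane Doe\nEmail: jane.doe@example.com\nNet pay: 1,234.50\nTax: 200.00"
fields = extract_fields(sample)
print(fields["emails"])
print(fields["amounts"])
```

OCR output is noisy (for example `O` read as `0`), so results from patterns like these should be validated before they are trusted.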

