pytesseract - how to improve quality of text

yoganand

Mar 22, 2019, 12:06:47 PM

to tesseract-ocr

Hello,

I'm building an OCR tool to read selected fields from invoices. I used Tesseract, and these are the problems I'm facing:

1) I am not able to get table structures as they appear in the image. I was expecting at least a pipe symbol between columns, which would help in parsing the text.

2) A few characters were not extracted correctly. How can I improve the quality? Does training Tesseract 4 help?

3) Why would you train Tesseract 4 additionally?

4) Is there any option I can use to keep the white space between words and the text alignment as they are in the image after conversion?

I have spent almost a month on this and have only been able to build an OCR tool with 40% accuracy.

Shree Devi Kumar

Mar 22, 2019, 1:25:27 PM

to tesser...@googlegroups.com

If the invoices have a fixed format, you can try UZN files.

See

https://github.com/jsoma/tesseract-uzn

https://jsoma.github.io/kull/#/
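
As a rough sketch of the UZN idea: as far as I understand from the repo linked above, a .uzn file lists one zone per line as "left top width height label" and sits next to the image with the same base name. The zone coordinates, labels, and helper names below are made up for illustration, so measure your own invoice layout before using them.

```python
from pathlib import Path

# Hypothetical field zones for one fixed invoice layout, in pixels:
# (left, top, width, height, label). These numbers are invented.
ZONES = [
    (50, 40, 400, 60, "invoice_number"),
    (50, 120, 400, 60, "invoice_date"),
    (600, 900, 300, 60, "total"),
]


def uzn_text(zones):
    """Format zones as UZN lines: 'left top width height label'."""
    return "".join(
        f"{left} {top} {width} {height} {label}\n"
        for left, top, width, height, label in zones
    )


def write_uzn(image_path, zones):
    """Write the zone file next to the image, sharing its base name."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(uzn_text(zones))
    return uzn_path
```

After write_uzn("invoice.png", ZONES), the repo suggests running Tesseract on invoice.png with page segmentation mode 4 so that only those zones are read. Please check that against the version you have installed.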

Or check out OpenCV.

See

https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/
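
For questions 1 and 4 (pipes between columns, keeping spacing), one possible approach is to take the word boxes from pytesseract's image_to_data and rebuild each line yourself, inserting a pipe wherever the horizontal gap between two words is large. This is only a sketch: the function names and the 40-pixel gap threshold are my own assumptions and need tuning per layout. The config string uses --psm 6 (treat the page as a single block) and the preserve_interword_spaces variable.

```python
def rows_with_pipes(data, gap=40):
    """Rebuild text lines from word boxes, putting ' | ' wherever the
    horizontal gap between neighbouring words exceeds `gap` pixels.

    `data` is the dict returned by pytesseract.image_to_data(...,
    output_type=Output.DICT). The default gap of 40 is a guess.
    """
    lines = {}
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty entries (block/paragraph/line rows)
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines.setdefault(key, []).append(
            (data["left"][i], data["width"][i], text)
        )

    out = []
    for key in sorted(lines):  # Tesseract's own block/line order
        words = sorted(lines[key])  # left to right within the line
        parts = [words[0][2]]
        for (prev_left, prev_width, _), (left, _, text) in zip(words, words[1:]):
            wide = left - (prev_left + prev_width) > gap
            parts.append((" | " if wide else " ") + text)
        out.append("".join(parts))
    return out


def ocr_rows(image_path, gap=40):
    """Run Tesseract on an image and return pipe-separated rows."""
    # Imported here so rows_with_pipes works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config="--psm 6 -c preserve_interword_spaces=1",
        output_type=pytesseract.Output.DICT,
    )
    return rows_with_pipes(data, gap)
```

Calling ocr_rows("invoice.png") should then give one string per text line, with " | " at the wide gaps. It will not recover ruled table borders, only column gaps, so treat it as a starting point for parsing.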

kailash hambarde

Apr 1, 2019, 1:12:19 AM

to tesseract-ocr

Same problem here. Did you find a solution?