Need Help with extracting info from Invoice

Vinay Matam

Nov 18, 2014, 2:53:08 PM

to tesser...@googlegroups.com

Hi All,

I really need your help with one of the projects I am working on. I am using Tesseract 3.02 on an Ubuntu machine.

I have an invoice (please see the attached file). I want to extract some information from it, such as Advisor Name, Invoice Number, Invoice Date, License No, Mileage, etc.

I have tried extracting all the data from the image to a text file. After some pre-processing of the image with ImageMagick, I was able to extract the info to some extent. However, I am not fully satisfied with the output.
I need your input on how I should extract the information. Should I first crop specific portions of the image into separate rectangles and then OCR them individually? I tried this and got great results. However, the images are not all the same size or resolution, so the rectangle coordinates will not work in every case. I don't think this method will work on all images (scanned, taken with a mobile phone, or PDF files).

Then I thought of running regular expressions over the extracted text to pick out the data I need from the whole text file. But this method does not seem to be working either.
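
A minimal sketch of the regular-expression approach, assuming the OCR text keeps each label next to its value. The sample text, the label spellings, and the patterns below are all hypothetical, since the actual asout.txt is not reproduced here; real OCR output will likely need looser patterns.

```python
import re

# Hypothetical OCR output; the real asout.txt from this thread is not shown.
SAMPLE_TEXT = """\
ADVISOR: JOHN SMITH      INVOICE NO: 123456
INVOICE DATE: 11/18/2014   LICENSE NO. ABC1234
MILEAGE IN/OUT 45,210 / 45,215
"""

# One tolerant pattern per field. Labels and value shapes are assumptions.
FIELD_PATTERNS = {
    # Name runs until a gap of two or more spaces or the end of the line.
    "advisor": re.compile(
        r"ADVISOR\s*[:.]?\s*([A-Za-z][A-Za-z .'-]*?)(?=\s{2,}|$)", re.I | re.M
    ),
    "invoice_number": re.compile(
        r"INVOICE\s*(?:NO|NUMBER|#)\s*[:.]?\s*([A-Z0-9-]+)", re.I
    ),
    "invoice_date": re.compile(
        r"INVOICE\s*DATE\s*[:.]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.I
    ),
    "license_no": re.compile(
        r"LICEN[SC]E\s*(?:NO|NUMBER|#)?\s*[:.]?\s*([A-Z0-9]+)", re.I
    ),
    # Takes the first number that follows the label on the same line.
    "mileage": re.compile(r"MILEAGE[^\d\n]*([\d,]+)", re.I),
}


def extract_fields(text):
    """Return a dict of field name -> matched string, or None if not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


if __name__ == "__main__":
    for field, value in extract_fields(SAMPLE_TEXT).items():
        print(field, "=", value)
```

Fields that come back as None show which labels the OCR garbled or dropped, which is often a sign that those regions need cropping or better pre-processing first.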

I am totally confused now. Any help or input is much appreciated. :) I have attached a sample image and the extracted output.

Thanks,
Vinay.

as.jpg

asout.txt

Allistair C

Nov 18, 2014, 6:56:34 PM

to tesser...@googlegroups.com

I wonder if there is anything consistent about the invoice design?

For instance, I notice that your invoice has "Honda" logos at the top left and top right, which essentially gives you two anchors from which you could extrapolate the resolution and the location/orientation of the table of data.
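
A rough sketch of the two-anchor idea, assuming the logo centres have already been located (for example by template matching, which is not shown). It computes the scale, rotation, and offset that carry two anchor points on a reference invoice onto the same anchors in a new image, then maps a crop rectangle across. All coordinates and function names are hypothetical.

```python
def similarity_from_anchors(ref_a, ref_b, img_a, img_b):
    """Return a function mapping reference (x, y) points to image (x, y).

    Points are treated as complex numbers, so one complex multiplier
    encodes both the scale and the rotation between the two layouts.
    """
    ra, rb = complex(*ref_a), complex(*ref_b)
    ia, ib = complex(*img_a), complex(*img_b)
    if ra == rb:
        raise ValueError("reference anchors must be distinct")
    scale_rot = (ib - ia) / (rb - ra)
    offset = ia - scale_rot * ra

    def transform(point):
        z = scale_rot * complex(*point) + offset
        return (z.real, z.imag)

    return transform


def map_box(transform, box):
    """Map a (left, top, right, bottom) box and return its bounding box."""
    left, top, right, bottom = box
    corners = [
        transform(p)
        for p in ((left, top), (right, top), (right, bottom), (left, bottom))
    ]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    # Hypothetical logo centres: reference scan vs. a scan at twice the size.
    to_image = similarity_from_anchors((100, 50), (900, 50), (200, 100), (1800, 100))
    print(map_box(to_image, (300, 400, 500, 450)))
```

Two anchors only fix scale, rotation, and translation. Perspective distortion from phone photos would need more anchor points and a different transform.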

You could also look at techniques for table recognition, which would let you automate the rectangular cropping.

http://www.researchgate.net/publication/220781373_Automatic_Table_Detection_in_Document_Images/links/0fcfd5107ee667db68000000

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6628801&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6628801

I would suggest that, rather than ImageMagick, you look at using OpenCV, as it provides more advanced algorithms for understanding your image (such as edge detection and pattern recognition, e.g. for the Honda logos).

I think your problem is best served by trying to identify the discrete rectangles; otherwise you will get noise that makes it difficult to filter out what you need, e.g. a person's name.

Cheers

Art W Rhyno

Nov 18, 2014, 7:12:56 PM

to tesser...@googlegroups.com

> Should I first crop specific portions of the image into separate rectangles and then OCR them individually?

Hi Vinay,

If you are getting better OCR results from individual rectangles, you might look at Olena [1]. There are some sample programs under the "contests" directory in the Olena distribution that give examples of identifying text sections in images. I have attached a sample that shows text blocks in green. That is without preprocessing the image; generally, anything that helps Tesseract will also help Olena identify text sections.

art
---
1. http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/WebHome

olena_as.png

Vinay Matam

Nov 19, 2014, 2:21:12 PM

to tesser...@googlegroups.com

Thanks for replying, Allistair. I have a wide variety of invoice types with no particular layout. All of them contain the fields I mentioned in my earlier post, but the fields may appear at different locations in the image. Our solution should be able to extract the necessary fields regardless of the invoice format.

I will certainly check the links you provided. I also had another idea; I will try to implement it and post an update here. :)

Thanks again. :)
Vinay

Vinay Matam

Nov 19, 2014, 2:22:47 PM

to tesser...@googlegroups.com

Hi Art Rhyno, thanks for your response. I will check it out.

Djibril Kaba

Dec 6, 2017, 2:32:50 PM

to tesseract-ocr

Hi Vinay,

I am trying to solve the same problem here. Have you managed to find a solution? Your help would be greatly appreciated. Looking forward to hearing from you.

Many thanks!!

Ha Hien

Jan 4, 2018, 7:16:40 AM

to tesseract-ocr

Hi Djibril,
I am afraid this is an old topic, and he may not be working with invoices anymore. I am also interested in extracting information from invoices. Have you tried using Tesseract with a dictionary to improve accuracy? Invoices have some particular data fields, so this may help. You can see the manual here:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
Tell me if you get better results. I will also tell you if I do.
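
A small sketch of the dictionary idea, assuming a Tesseract build whose command line accepts the `--user-words` option described in the manual linked above. The word list, file names, and helper functions are hypothetical, and whether the extra words actually change the output depends on the engine and version, so treat this as something to experiment with.

```python
import os
import tempfile


def write_user_words(words, path):
    """Write one word per line, dropping blanks and duplicates, keeping order."""
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return cleaned


def build_command(image_path, output_base, user_words_path):
    """Build (but do not run) a tesseract command line with a user word list."""
    return ["tesseract", image_path, output_base, "--user-words", user_words_path]


if __name__ == "__main__":
    # Hypothetical invoice vocabulary.
    words_path = os.path.join(tempfile.mkdtemp(), "invoice.user-words")
    write_user_words(["Invoice", "Advisor", "Mileage", "License"], words_path)
    print(" ".join(build_command("as.jpg", "asout", words_path)))
```

The printed command can be pasted into a terminal, or passed to `subprocess.run` as a list, to compare the output with and without the word list.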
Best,

On Wednesday, December 6, 2017 at 20:32:50 UTC+1, Djibril Kaba wrote:

saumitra mallick

Jan 10, 2018, 4:24:28 AM

to tesseract-ocr

Hello all,

I'm working on a similar project; in my case I'm reading bank statements. I noticed the following:

1. Tesseract performs much better when it is given a single line of text.

2. I'm using OpenCV to cut individual cells from a table (you always know the order of the cells, since you cut them yourself).

3. Once you have the data in individual cells (image files), single-line data gives much more accurate results than multi-line data. (Has anyone tried LSTM? Instead of reading the full text, maybe cut the individual cells down to individual lines and use line recognition with Tesseract? Please let me know the results.)
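
One way to keep track of cell order, sketched under the assumption that a detector such as OpenCV contour finding (not shown) has already produced `(x, y, width, height)` boxes in arbitrary order. The boxes are grouped into rows by vertical centre and sorted left to right. The `Row{row}_{col}.tif` naming and the tolerance value are assumptions for illustration.

```python
def order_cells(boxes, row_tolerance=10):
    """Group (x, y, w, h) boxes into rows, top to bottom, then left to right.

    A box joins the current row when its vertical centre lies within
    row_tolerance pixels of the centre of the first box in that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        center_y = box[1] + box[3] / 2
        if rows and abs(center_y - rows[-1]["center_y"]) <= row_tolerance:
            rows[-1]["cells"].append(box)
        else:
            rows.append({"center_y": center_y, "cells": [box]})
    return [sorted(row["cells"], key=lambda b: b[0]) for row in rows]


def cell_names(ordered_rows, pattern="Row{row}_{col}.tif"):
    """Map an output file name to each cell box, by row and column index."""
    return {
        pattern.format(row=r, col=c): box
        for r, row in enumerate(ordered_rows)
        for c, box in enumerate(row)
    }


if __name__ == "__main__":
    # Hypothetical cell boxes, deliberately out of order and slightly skewed.
    detected = [
        (210, 12, 100, 30),
        (10, 10, 100, 30),
        (10, 60, 100, 30),
        (110, 58, 100, 30),
        (110, 11, 100, 30),
    ]
    for name, box in cell_names(order_cells(detected)).items():
        print(name, box)
```

Each named cell can then be cropped and passed to Tesseract on its own, so the recognised text can be put back into the table by row and column.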

I need help with the following:

- How do I use Tesseract in my C++ code? For the time being I'm using Tesseract from the command line.

- Please post a sample program that does the following:

  - makes Tesseract read an image

  - generates text output from it and writes it to a file

If you are running into problems generating traineddata, this post might help:

http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/

Please let me know if anyone is interested in sharing knowledge with me about this.

Contact me at saumitr...@gmail.com

Best Regards

Saumitra Mallick

ShreeDevi Kumar

Jan 10, 2018, 4:47:01 AM

to tesser...@googlegroups.com

See https://github.com/tesseract-ocr/tesseract/wiki/APIExample for an example of using Tesseract in a program.

The training tutorial you refer to is old. See tesstrain.sh for creating synthetic training data.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b97b440c-3ecd-4cf5-9bad-f94a98b54654%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Afreen Ferdoash

Jan 10, 2018, 10:43:40 AM

to tesseract-ocr

I am trying to solve a similar problem, that of reading forms. Tesseract 4 is doing well but is DROPPING lots of words within boxes. I thought this problem of dropped words only existed with Indic languages, but here I am having this issue with English too!
I tried changing some parameters, but the handful I tried didn't lead to *any* change in the output.

@Shree: Can you please suggest something, since you faced this issue earlier with another language?

ShreeDevi Kumar

Jan 10, 2018, 10:57:20 AM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-356358284

@amitdo has offered a patch.

Afreen Ferdoash

Jan 11, 2018, 12:56:33 AM

to tesseract-ocr

It is still not making any difference.

saumitra mallick

Jan 11, 2018, 5:54:47 AM

to tesseract-ocr

Hello Shree,

Thanks for the API example. I'm facing an issue with the base API.

I get perfect output when I run this in my Ubuntu terminal:

$ tesseract Row0_0.tif Row0_0_out

But when I try to read the same file with BaseAPI code, e.g.

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

I'm running Tesseract on a number of files from C++ code, and I'm missing output in some cases. When I OCR the same image from the terminal, it works like a charm.

Is there a difference between the two methods (Linux terminal and C++ code), or should I pull the latest repo to get the changes made by Amitdo in this link: https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-356358284

Here is the info about my Tesseract installation. Please let me know if I need to change any of the installed dependencies.

saumitra@~$tesseract -v

tesseract 4.00.00alpha

leptonica-1.74.4

libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

Found AVX2

Found AVX

Found SSE

Your feedback is much appreciated.

Best Regards

Saumitra Mallick

Afreen Ferdoash

Jan 12, 2018, 12:21:28 PM

to tesseract-ocr

The dropped-words issue was resolved by using the default psm mode. I had been using psm 6 earlier.
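
A small sketch for comparing page segmentation modes from a script, assuming a Tesseract 4 command line where the option is spelled `--psm` (older 3.x builds used `-psm`). On the command line the default is mode 3 (fully automatic page segmentation), while mode 6 assumes a single uniform block of text. As far as I know, the C++ `TessBaseAPI` defaults to single-block mode unless `SetPageSegMode` is called, so that may be worth checking when the terminal and the API disagree. The helper names and output naming are hypothetical.

```python
import os
import subprocess


def tesseract_command(image_path, output_base, psm=None):
    """Build a tesseract command line; psm=None leaves the CLI default in place."""
    command = ["tesseract", image_path, output_base]
    if psm is not None:
        if not 0 <= psm <= 13:
            raise ValueError("psm must be between 0 and 13")
        command += ["--psm", str(psm)]
    return command


def run_with_modes(image_path, modes=(3, 6)):
    """Run the CLI once per mode and return the text files it should write.

    Requires the tesseract binary on PATH; raises CalledProcessError on failure.
    """
    outputs = []
    stem = os.path.splitext(image_path)[0]
    for mode in modes:
        output_base = f"{stem}_psm{mode}"
        subprocess.run(tesseract_command(image_path, output_base, mode), check=True)
        outputs.append(output_base + ".txt")
    return outputs


if __name__ == "__main__":
    # Only prints the commands; call run_with_modes("Row0_0.tif") to execute them.
    for mode in (None, 3, 6):
        print(" ".join(tesseract_command("Row0_0.tif", "Row0_0_out", mode)))
```

Diffing the text files written for each mode shows quickly whether dropped words are a segmentation effect or a recognition problem.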
