Training tesseract-ocr unicharset_extractor, mftraining, cntraining

Alain Ghawi

Apr 21, 2017, 2:43:56 AM
to tesseract-ocr
Hello all,

I am surprised that so many people call Tesseract the best open-source OCR tool, yet there is no video that walks step by step through the problems you can run into, nor a good explanation or documentation for OCR.

Even so, everyone loves a challenge, so here is the one I faced. I have many PDF invoices and I want to train Tesseract to OCR them as scanned images.
First, I converted the PDF files to TIFF files with ImageMagick:
magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
Nothing important here, other than that we get a 300 dpi image with the alpha channel off.

The .tif files must then be renamed to [lang].[name_font].exp0.tif. In my example that is com.test_font.exp0.tif.
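A minimal sketch of the naming convention described above, for anyone scripting the rename. The helper name `training_image_name` is hypothetical, and the rule that the language and font parts must not contain dots is an assumption made here because the dots act as separators.

```python
def training_image_name(lang: str, font: str, page: int) -> str:
    """Build a name of the form [lang].[fontname].exp[N].tif."""
    for part in (lang, font):
        # Assumption: dots separate the fields, so the parts themselves
        # must be non-empty and dot-free.
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    if page < 0:
        raise ValueError("page index must be non-negative")
    return f"{lang}.{font}.exp{page}.tif"


# Names for the two pages used in this post.
names = [training_image_name("com", "test_font", n) for n in range(2)]
print(names)
```

The resulting names can then be passed to whatever rename mechanism is at hand, for example `os.rename`.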

Great! After this step you need to create the box files, right? So I simply ran:
tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox

Then I corrected the box files with CowBoxEditor, which did the job, because I could not find the famous jTessBoxEditor online (weird, right?).
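As a complement to a graphical box editor, a small script can flag obviously broken entries. The sketch below assumes the Tesseract 3.x box format of one entry per line, `symbol left bottom right top page`, with the origin at the bottom-left of the image. The `Box` type, the helper names and the sample lines are illustrative only.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one 'symbol left bottom right top page' line."""
    # Split from the right so the symbol is whatever precedes the 5 numbers.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"not a box line: {line!r}")
    left, bottom, right, top, page = map(int, numbers)
    return Box(symbol, left, bottom, right, top, page)


def is_degenerate(box: Box) -> bool:
    """True when the box has no width or no height."""
    return box.right <= box.left or box.top <= box.bottom


# Made-up sample lines, not taken from the attached box file.
sample = ["8 120 2950 141 2980 0", "A 300 2950 300 2980 0"]
for text in sample:
    box = parse_box_line(text)
    print(box.symbol, "degenerate" if is_degenerate(box) else "ok")
```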

After that, I created my .tr files:
tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
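The makebox and box.train calls follow the same pattern for every page, so they can be generated from one base name. This is only a sketch: the function names are hypothetical, the arguments are copied from the commands in this post, and the script prints the commands instead of running them.

```python
from typing import List


def makebox_cmd(base: str) -> List[str]:
    """Command that writes <base>.box from <base>.tif."""
    return ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"]


def boxtrain_cmd(base: str) -> List[str]:
    """Command that writes <base>.tr from <base>.tif and its box file."""
    return ["tesseract", f"{base}.tif", base, "nobatch", "box.train"]


bases = [f"com.test_font.exp{n}" for n in range(2)]

# Step 1: box files (to be corrected by hand before step 2).
for base in bases:
    print(" ".join(makebox_cmd(base)))

# Step 2: .tr files.
for base in bases:
    print(" ".join(boxtrain_cmd(base)))
```

Deriving the image name and the output name from a single base keeps each image paired with its own output. To execute the commands, each list could be handed to `subprocess.run`.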

And here come the surprises!
Once you have your .tr files, you call unicharset_extractor.
First question: why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0? That looks wrong according to the documentation: https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
Second question: should I pass one box file per call, or all of them in a single call? Option 1: unicharset_extractor com.test_font.exp0.box or Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
Third question: why should I use set_unicharset_properties? It does not fix the metrics; it only specifies whether the script is Latin or Common. Link: https://github.com/tesseract-ocr/tesseract/issues/318
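To see how many entries carry the all-default metrics from the first question, the unicharset file can be inspected with a few lines of code. The sketch below assumes the line layout from the linked unicharset(5) page, where the third whitespace-separated field holds ten comma-separated glyph metrics. The metric names follow my reading of that page and both sample lines are illustrative. It only detects the placeholder; it does not explain where the values come from.

```python
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top", "min_width",
    "max_width", "min_bearing", "max_bearing", "min_advance", "max_advance",
)

# The value reported in this post.
PLACEHOLDER = (0, 255, 0, 255, 0, 0, 0, 0, 0, 0)


def glyph_metrics(line: str) -> dict:
    """Return the ten glyph metrics of one unicharset entry by name."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"too few fields: {line!r}")
    # int() raises ValueError for entries without numeric metrics.
    values = tuple(int(v) for v in fields[2].split(","))
    if len(values) != len(METRIC_NAMES):
        raise ValueError(f"expected 10 metrics: {line!r}")
    return dict(zip(METRIC_NAMES, values))


def is_placeholder(line: str) -> bool:
    """True when the entry still has the all-default metrics."""
    return tuple(glyph_metrics(line).values()) == PLACEHOLDER


# Illustrative entries, not copied from a real unicharset file.
default_entry = "A 5 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 A"
filled_entry = "N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N"
print(is_placeholder(default_entry), is_placeholder(filled_entry))
```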

With these questions still unanswered, I ran mftraining and cntraining (no problems). Finally, I renamed my inttemp, normproto, pffmtable and shapetable files and combined them with:
combine_tessdata com.
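The rename step can be scripted as well. This sketch assumes the four output files named in this post sit in one directory and that each should get the language code as a prefix, for example `com.inttemp`. The helper names are hypothetical.

```python
import os

# Output files named in this post.
TRAINING_OUTPUTS = ("inttemp", "normproto", "pffmtable", "shapetable")


def rename_plan(lang: str) -> dict:
    """Map each training output to its language-prefixed name."""
    return {name: f"{lang}.{name}" for name in TRAINING_OUTPUTS}


def apply_plan(plan: dict, directory: str = ".") -> None:
    """Rename the files on disk; raises if a source file is missing."""
    for src, dst in plan.items():
        os.rename(os.path.join(directory, src), os.path.join(directory, dst))


plan = rename_plan("com")
for src, dst in plan.items():
    print(f"{src} -> {dst}")
```

Printing the plan before calling `apply_plan(plan)` makes it easy to check the names first.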

Final question: if I name the files com.inttemp1 and com.inttemp2, does that work? The same goes for shapetable, normproto and pffmtable.

I think every new Tesseract user asks these questions at some point. If any Tesseract expert can answer them, it would be a great help to the whole community.
Please find attached the 2 tif files and the generated box files.
com.test_font.exp0.box
com.test_font.exp0.tif

ShreeDevi Kumar

Apr 21, 2017, 4:55:03 AM
to tesser...@googlegroups.com
If you want to OCR an invoice like the sample you posted, just use the eng.traineddata and OCR the page. You do not need to do any training.

Here is the output I get 



8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3


Did you know?


Your Comcast Business Internet

service gives you access to millions

of WiFi hotspots with the fastest WiFi

and even more coverage. Find out

more at businesscomcast.com/wifi.



Need help? We’re here for you.


9 Visit business.comcast.com/help

Call 1-800—391 -3000

A


Billing support

Open 6 am-9 pm MTN, Mon through Fri

and 7 am—8 pm Sat


Technical support

Open 24 hours, 7 days a week



Did you know?


Never miss a payment with text alerts.

Receive text message reminders when your

bill is ready to pay or past due. Sign up at

business.comcast.com/myaccount.



Your bill is ready




Please notify us immediately with any

questions regarding charges billed to your

account. Comcast will issue a credit or

refund for any verified billing error which is

brought to our attention within sixty (60) days

of the bill.


llllllllllllllllllllllllllllllllll


Additional payment options Moving? Let us help.


Automatic payment

Sign up at business.comcast.com/myaccount


a Oniine


Visit business.comcast.com/myaccount


a By phone

Call 1-800-391 -3000


if you're moving, give us as much

advanced notice as possible so we

can help make a smooth transition.


Call 1 -800-391 -3000


|||||||llllllllllllllllllllllllll




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Giriraj Bhojak

Apr 25, 2019, 11:49:36 PM
to tesseract-ocr
Hello Shree,

I realize this post is more than two years old now, but would appreciate any help.
I tried your suggestion on the same attached sample using Tesseract v4, and I am unable to get the result you posted.
I have tried all page segmentation modes, but none of them produced that result.
Could you please let me know what I might be doing wrong?

Here are the version details for the Tesseract on my machine:

tesseract 4.0.0
 leptonica-1.77.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

Here is the output I get for most of the psm modes:


8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3

Did you know? Did you know?

Your Comcast Business Internet Never miss a payment with text alerts.
service gives you access to millions Receive text message reminders when your
of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
and even more coverage. Find out business.comcast.com/myaccount.

more at business.comcast.conm/wifi.

Your bill is ready

   

Need help? We’re here for you.

 

> Visit business.comcast.com/help Please notify us immediately with any
Call 1-800-391-3000 questions regarding charges billed to your
aa account. Comcast will issue a credit or
Billing support refund for any verified billing error which is
Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
and 7 am-8 pm Sat of the bill.

Technical support
Open 24 hours, 7 days a week

TT

Automatic payment If you’re moving, give us as much
Sign up at business.comcast.com/myaccount advanced notice as possible so we

Se Online can help make a smooth transition.

IME

 

 

Regards,
Giriraj.

Shree Devi Kumar

Apr 26, 2019, 3:04:34 AM
to tesser...@googlegroups.com
Which eng.traineddata did you use?

There are three options: tessdata, tessdata_best and tessdata_fast.

Giriraj Bhojak

Apr 26, 2019, 11:54:17 AM
to tesseract-ocr
Hi Shree,

Thank you for the quick response.

I ran the following commands for each of these datasets and changed psm from 1 to 13, but the output is more or less like the one I posted. I could not get output like yours, with the text in the right reading order.

tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
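Trying three model directories with psm 1 to 13 means 39 runs, which is easier to manage from a script. The sketch below only builds the command lines. The directory names and the `--tessdata-dir` and `--psm` arguments are copied from the commands above. Giving every run its own output base name is an assumption added here so that results do not overwrite each other.

```python
from itertools import product
from typing import List

TESSDATA_DIRS = ("tessdata_best-master", "tessdata_fast-master", "tessdata-master")
PSM_MODES = range(1, 14)  # psm 1 to 13, as tried in this post


def ocr_cmd(tessdata_dir: str, image: str, psm: int) -> List[str]:
    """Build one tesseract call with a unique output base name."""
    out_base = f"sample_{tessdata_dir}_psm{psm}"
    return ["tesseract", "--tessdata-dir", tessdata_dir,
            image, out_base, "--psm", str(psm)]


commands = [ocr_cmd(d, "sample.tif", p)
            for d, p in product(TESSDATA_DIRS, PSM_MODES)]

for cmd in commands[:3]:
    print(" ".join(cmd))
print(len(commands), "commands in total")
```

With separate output files per run, the results of different modes can be compared side by side afterwards.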

I am not sure what I am doing wrong here and would appreciate your help with this.

Regards,
Giriraj

Shree Devi Kumar

Apr 26, 2019, 12:35:05 PM
to tesser...@googlegroups.com
My output is from April 2017, so it was probably produced with a 3.0x version. Try the 3.05 branch.

@zdenop zdenop released this on Jun 1, 2017 · 26 commits to 3.05 since this release 

Giriraj Bhojak

Apr 26, 2019, 12:55:14 PM
to tesseract-ocr
Thank you, I will try it out next.
I wanted to use version 4 of Tesseract because it uses an LSTM-based OCR engine. High accuracy is one of the essential requirements for my use case.
Do you know whether v4 supports extracting text from an image with a two-column text layout at all?
Thank you for your quick response, Shree!

Regards,
Giriraj.

Shree Devi Kumar

Apr 26, 2019, 1:42:17 PM
to tesser...@googlegroups.com
@zdenko Please check this image (from the first post) with 3.0x and current 4.0x code to see if there is a regression in terms of recognition of 2 columns.

Giriraj Bhojak

Apr 26, 2019, 5:27:50 PM
to tesseract-ocr
Hi Shree,

I just tried v3.05.02 as well with different modes, and I still could not reproduce the output you posted for the image file.
I am wondering whether I am doing something wrong.
Here is the command I ran with Tesseract v3.05.02, changing the psm mode from 1 to 13:

/usr/local/Cellar/tesseract/3.05.02/bin/tesseract --tessdata-dir /usr/local/Cellar/tesseract/3.05.02/share/ "sample.tif" test --psm 3

It still produced the same output as earlier.
Please let me know what I might be doing incorrectly here.
Once again, thank you for your prompt responses.


Regards,
Giriraj.

Shree Devi Kumar

Apr 26, 2019, 11:04:17 PM
to tesser...@googlegroups.com
I did not post the command that I used. It was probably run with the default psm and the code as of April 2017. If you really want to investigate, check out a commit from the master branch as of that time and test with it.

In theory tesseract 4 should recognize two columns with the default psm. But there seem to be some issues with layout analysis.

You could try other means of selecting text regions and using tesseract on those.
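One naive reading of "selecting text regions" is to cut the page into column strips and OCR each strip separately. The sketch below is only that: a fixed equal-width split, not layout detection, and not necessarily what was meant above. The function name, the `overlap` parameter and the 2550 x 3300 page size (8.5 x 11 inches at 300 dpi) are assumptions for illustration.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def column_boxes(width: int, height: int,
                 columns: int = 2, overlap: int = 0) -> List[Region]:
    """Split a page into equal-width vertical strips.

    overlap widens each strip by that many pixels on both sides,
    clamped to the page, so text near a cut is not sliced in half.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    boxes = []
    for i in range(columns):
        left = max(0, i * width // columns - overlap)
        right = min(width, (i + 1) * width // columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes


for region in column_boxes(2550, 3300, columns=2, overlap=20):
    print(region)
```

Each region uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so the strips could be cropped, saved and passed to tesseract one at a time. A real invoice would likely need the cut position chosen per layout instead of a fixed midpoint.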


Giriraj Bhojak

Apr 28, 2019, 4:03:23 PM
to tesser...@googlegroups.com
Hi Shree,

Does this mean there is a bug in Tesseract 4, and should I open an issue on GitHub for two-column text with the default psm?

Also, could you please expand on what you meant by "other means of selecting text regions"? Is there anything in Tesseract that I can use to identify text regions?

Regards,
Giriraj
