Issue 627 in tesseract-ocr: Need Help: OCRA font for English for simple numeric glyphs

tesser...@googlecode.com

Feb 18, 2012, 2:47:26 PM
to tesserac...@googlegroups.com
Status: New
Owner: ----

New issue 627 by jlpool...@gmail.com: Need Help: OCRA font for English for
simple numeric glyphs
http://code.google.com/p/tesseract-ocr/issues/detail?id=627

I have a blob of numbers, in OCRA font, that I want to recognize. Other
fonts such as Arial, Times New Roman, Courier, & Palatino work fine for
recognizing the numeric glyphs. Ironically, OCRA, which was designed to
assure accuracy in optical character recognition, is failing with a
standard Tesseract install.

My problem is that OCRA font was used on output that now needs to be
optically recognized. So, I figured I needed to train Tesseract to handle
OCRA font numeric glyphs. I'm unable to successfully train Tesseract and
need help or a pointer on where I ran afoul.

Attached (OCRA_numbers_variety.png) is a sample PNG file showing a variety
of fonts, but most importantly a sample of OCRA font for the number set
0..9.

Tesseract's attempt to recognize the string "0123456789" in OCRA
results in: ULE3H5E?Bq

Here is what I did in an attempt to train tesseract.

I created two samples in Open Office. The first was a single line
with "0123456789" on it. The second had each number on its own line, so
there is a column of 0-9. I printed to Adobe Acrobat at 400 dpi. In
Acrobat, I exported the images at 400 dpi to PNG images.

On a Gentoo Linux box, I did the following, using Sample set 2 for the

#
# create the box file
#
tesseract eng.ocra.exp2.png eng.ocra.exp2 batch.nochop makebox

jlpoole@hermes ~/work/tess/samples $ tesseract eng.ocra.exp2.png eng.ocra.exp2 batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
jlpoole@hermes ~/work/tess/samples $ cat eng.ocra.exp2.box
U 322 4027 348 4068 0
L 322 3957 348 3998 0
E 322 3888 348 3929 0
3 322 3818 348 3859 0
H 323 3749 347 3790 0
5 322 3679 348 3720 0
E 322 3610 348 3651 0
? 322 3541 348 3582 0
B 322 3471 348 3512 0
q 322 3402 348 3443 0
jlpoole@hermes ~/work/tess/samples $
#
# edit the box file to correct the characters
# after the edits:
#
jlpoole@hermes ~/work/tess/samples $ nano eng.ocra.exp2.box
jlpoole@hermes ~/work/tess/samples $ cat eng.ocra.exp2.box
0 322 4027 348 4068 0
1 322 3957 348 3998 0
2 322 3888 348 3929 0
3 322 3818 348 3859 0
4 323 3749 347 3790 0
5 322 3679 348 3720 0
6 322 3610 348 3651 0
7 322 3541 348 3582 0
8 322 3471 348 3512 0
9 322 3402 348 3443 0
jlpoole@hermes ~/work/tess/samples $
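
When the ground truth is known in advance, as with the "0123456789" sample here, the hand edit in nano can be scripted. The following is only a sketch: relabel_boxes is a hypothetical helper name, not a Tesseract tool, and it assumes one box per line, ASCII labels, and boxes listed in the same order as the truth string.

```shell
# relabel_boxes TRUTH BOXFILE   (hypothetical helper, not part of Tesseract)
# Replace the label (first field) of the Nth box line with the Nth character
# of TRUTH. Coordinates and the page number are passed through unchanged.
relabel_boxes() {
    truth=$1
    boxfile=$2
    nboxes=$(awk 'END { print NR }' "$boxfile")
    # Refuse to guess if the box count and the truth length disagree.
    if [ "$nboxes" -ne "${#truth}" ]; then
        echo "relabel_boxes: $nboxes boxes but ${#truth} truth characters" >&2
        return 1
    fi
    awk -v truth="$truth" '{ $1 = substr(truth, NR, 1); print }' "$boxfile"
}

# Example use (not run here):
#   relabel_boxes 0123456789 eng.ocra.exp2.box > fixed.box &&
#       mv fixed.box eng.ocra.exp2.box
```

The length check is there because a merged or split blob would otherwise shift every later label by one.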

#
# run in the training mode, "Run Tesseract for Training"
#
tesseract eng.ocra.exp2.png eng.ocra.exp2 nobatch box.train

jlpoole@hermes ~/work/tess/samples $ tesseract eng.ocra.exp2.png eng.ocra.exp2 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 10
Found 10 good blobs.
TRAINING ... Font name = ocra
Generated training data for 1 words
jlpoole@hermes ~/work/tess/samples $


#
# "Compute the Character Set"
#
unicharset_extractor eng.ocra.exp2.box

jlpoole@hermes ~/work/tess/samples $ unicharset_extractor eng.ocra.exp2.box
Extracting unicharset from eng.ocra.exp2.box
Wrote unicharset file ./unicharset.
jlpoole@hermes ~/work/tess/samples $ cat unicharset
11
NULL 0 NULL 0
0 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # 0 [30 ]0
1 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0 # 1 [31 ]0
2 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 # 2 [32 ]0
3 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 # 3 [33 ]0
4 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 5 0 0 # 4 [34 ]0
5 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 6 0 0 # 5 [35 ]0
6 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 7 0 0 # 6 [36 ]0
7 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 8 0 0 # 7 [37 ]0
8 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 9 0 0 # 8 [38 ]0
9 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 10 0 0 # 9 [39 ]0
jlpoole@hermes ~/work/tess/samples $
#
# Create file "font_properties"
#
jlpoole@hermes ~/work/tess/samples $ cat font_properties
ocra 0 0 1 0 0

jlpoole@hermes ~/work/tess/samples $
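
For reference, the five flags after the font name are, as far as I understand the 3.0x training notes, italic, bold, fixed, serif and fraktur, each 0 or 1, so "ocra 0 0 1 0 0" marks OCRA as fixed pitch only. The font name should be the same "ocra" that box.train reported above ("Font name = ocra"). A small sketch for writing the line; write_font_properties is a hypothetical helper name:

```shell
# write_font_properties FONT ITALIC BOLD FIXED SERIF FRAKTUR [FILE]
# Hypothetical helper: writes one font_properties line after checking that
# every flag is 0 or 1. FILE defaults to ./font_properties.
write_font_properties() {
    out=${7:-font_properties}
    for flag in "$2" "$3" "$4" "$5" "$6"; do
        case $flag in
            0|1) ;;
            *) echo "write_font_properties: flags must be 0 or 1" >&2
               return 1 ;;
        esac
    done
    printf '%s %s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5" "$6" > "$out"
}

# Example use (not run here):
#   write_font_properties ocra 0 0 1 0 0
```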

#
# Run MF Training, "Clustering" Step 1: mftraining
#
mftraining -F font_properties -U unicharset -O eng.unicharset eng.ocra.exp2.tr

jlpoole@hermes ~/work/tess/samples $ mftraining -F font_properties -U unicharset -O eng.unicharset eng.ocra.exp2.tr
Read shape table shapetable of 0 shapes
Reading eng.ocra.exp2.tr ...
Warning: no protos/configs for 0 in CreateIntTemplates()
Warning: no protos/configs for 1 in CreateIntTemplates()
Warning: no protos/configs for 2 in CreateIntTemplates()
Warning: no protos/configs for 3 in CreateIntTemplates()
Warning: no protos/configs for 4 in CreateIntTemplates()
Warning: no protos/configs for 5 in CreateIntTemplates()
Warning: no protos/configs for 6 in CreateIntTemplates()
Warning: no protos/configs for 7 in CreateIntTemplates()
Warning: no protos/configs for 8 in CreateIntTemplates()
Warning: no protos/configs for 9 in CreateIntTemplates()
Done!
jlpoole@hermes ~/work/tess/samples $
#
# cntraining, "Clustering" Step 2: cntraining
#
cntraining eng.ocra.exp2.tr

jlpoole@hermes ~/work/tess/samples $ cntraining eng.ocra.exp2.tr
Reading eng.ocra.exp2.tr ...
Clustering ...

Writing normproto ...
jlpoole@hermes ~/work/tess/samples $
#
# Was a file "unicharambigs" created?
# conclusion: no
#

jlpoole@hermes ~/work/tess/samples $ ls uni*
unicharset
jlpoole@hermes ~/work/tess/samples $

#
# "Putting It All Together"
#
combine_tessdata eng.

jlpoole@hermes ~/work/tess/samples $ combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is -1
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is -1
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
jlpoole@hermes ~/work/tess/samples $
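
An offset of -1 means that component was not packed into eng.traineddata; in the listing above only type 1 made it in. My recollection of the 3.0x training notes, which should be treated as an assumption, is that types 1, 3, 4 and 5 (unicharset, inttemp, pffmtable, normproto) all need a real offset. A sketch for checking a saved copy of the combine_tessdata output; missing_tessdata_types is a hypothetical helper name:

```shell
# missing_tessdata_types LOGFILE TYPE...
# Hypothetical helper: prints the requested type numbers that have no
# non-negative offset in a saved combine_tessdata log, and returns non-zero
# if any are missing. Which types are mandatory is NOT decided here; the
# caller passes the list (1 3 4 5 is an assumption from memory).
missing_tessdata_types() {
    log=$1
    shift
    missing=""
    for t in "$@"; do
        # A line such as "Offset for type 1 is 140" counts as present;
        # "Offset for type 1 is -1" does not match because of the minus sign.
        if ! grep -q "^Offset for type *$t is [0-9]" "$log"; then
            missing="$missing $t"
        fi
    done
    echo "${missing# }"
    [ -z "$missing" ]
}

# Example use (not run here):
#   combine_tessdata eng. > combine.log 2>&1
#   missing_tessdata_types combine.log 1 3 4 5 || echo "traineddata incomplete"
```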

#
# try
#

tesseract OCRA_numbers_variety.png output -l eng

jlpoole@hermes ~/work/tess/samples $ cat output.txt
0123456789 Aﬁal

0123456789 Tnnes

0123456789 Courier
0123456789 Courier SWA
0123456789 Palatino

0123456789 Djvu Sans Mono
ULE3H5E?Bq OCRA

ULE3H5b?Bq OCRA-A -Std

jlpoole@hermes ~/work/tess/samples $

#
# Does not being root for the final combination affect the outcome?
# Conclusion: no.
#

jlpoole@hermes ~/work/tess/samples $ su
Password:
hermes samples # /usr/local/bin/combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is -1
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is -1
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
hermes samples # tesseract OCRA_numbers_variety.png output -l eng
bash: tesseract: command not found
hermes samples # /usr/local/bin/tesseract OCRA_numbers_variety.png output -l eng
Tesseract Open Source OCR Engine v3.02 with Leptonica
hermes samples # cat output.txt
0123456789 Aﬁal

0123456789 Tnnes

0123456789 Courier
0123456789 Courier SWA
0123456789 Palatino

0123456789 Djvu Sans Mono
ULE3H5E?Bq OCRA

ULE3H5b?Bq OCRA-A -Std

hermes samples #


It looks like something went wrong at the MF Training Step 1, as indicated
by the warnings.
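
To see at a glance which classes mftraining could not build prototypes for, the warning lines can be filtered out of a saved log. This is only a sketch: unclustered_classes is a hypothetical helper name, and it matches the exact warning text shown in the session above.

```shell
# unclustered_classes LOGFILE
# Hypothetical helper: prints one class label per line for every
# "no protos/configs" warning found in a saved mftraining log.
unclustered_classes() {
    sed -n 's/^Warning: no protos\/configs for \(.*\) in CreateIntTemplates()$/\1/p' "$1"
}

# Example use (not run here):
#   mftraining -F font_properties -U unicharset -O eng.unicharset \
#       eng.ocra.exp2.tr > mftraining.log 2>&1
#   unclustered_classes mftraining.log
```

In the run above this would list all ten digits. One possible cause, not confirmed in this thread, is that a single sample per digit gives the clustering step too little to work with.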


Attachments:
OCRA_numbers_variety.png 19.6 KB

tesser...@googlecode.com

Feb 18, 2012, 7:19:23 PM
to tesserac...@googlegroups.com

Comment #1 on issue 627 by jlpool...@gmail.com: Need Help: OCRA font for

I neglected to copy three generated files so they have an "eng" prefix:

cp normproto eng.normproto
cp inttemp eng.inttemp
cp pffmtable eng.pffmtable

There was no Microfeat file in my directory, so I concluded it is not needed.
After creating these prefixed files, I reran the combine command.
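
The three cp commands can be wrapped so that a missing or empty file stops the run before combine_tessdata is called. A sketch only; prefix_training_files is a hypothetical helper name, and it covers just the three files named above (eng.unicharset was already written by mftraining via -O).

```shell
# prefix_training_files LANG [DIR]
# Hypothetical helper: copies normproto, inttemp and pffmtable in DIR
# (default: current directory) to LANG.normproto and so on, and fails if
# any of them is missing or empty.
prefix_training_files() {
    lang=$1
    dir=${2:-.}
    for f in normproto inttemp pffmtable; do
        if [ ! -s "$dir/$f" ]; then
            echo "prefix_training_files: $dir/$f is missing or empty" >&2
            return 1
        fi
        cp "$dir/$f" "$dir/$lang.$f"
    done
}

# Example use (not run here):
#   prefix_training_files eng && combine_tessdata eng.
```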

I also determined that I had to deploy the eng.traineddata to
/usr/local/share/tessdata (after copying the existing eng.traineddata that
came with tesseract, to preserve a working solution). After deploying
eng.traineddata, I got an error as follows:

jlpoole@hermes ~/work/tess/samples $ tesseract OCRA_numbers_variety.png output -l eng
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@hermes ~/work/tess/samples $
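
Since deploying over the stock eng.traineddata is what made this failure disruptive, here is a sketch of the backup-then-install step as one function. install_traineddata and the .orig suffix are hypothetical choices, not Tesseract conventions.

```shell
# install_traineddata FILE TESSDATA_DIR
# Hypothetical helper: copies FILE into TESSDATA_DIR. If a file of the same
# name is already there and no backup exists yet, it is first saved as
# NAME.orig, so repeated installs never overwrite the original backup.
install_traineddata() {
    src=$1
    dest_dir=$2
    name=$(basename "$src")
    if [ -e "$dest_dir/$name" ] && [ ! -e "$dest_dir/$name.orig" ]; then
        cp -p "$dest_dir/$name" "$dest_dir/$name.orig"
    fi
    cp "$src" "$dest_dir/$name"
}

# Example use (not run here, needs write access to the tessdata directory):
#   install_traineddata eng.traineddata /usr/local/share/tessdata
```

Training under a separate language code and selecting it with -l avoids touching the stock eng data at all; the num.traineddata attached later in this thread takes that route.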


tesser...@googlecode.com

Feb 26, 2012, 6:53:41 AM
to tesserac...@googlegroups.com

Comment #2 on issue 627 by withbles...@gmail.com: Need Help: OCRA font for

Tested under version 3.02. Please see the attached files, which are
self-explanatory.

tesser...@googlecode.com

Feb 26, 2012, 7:13:44 AM
to tesserac...@googlegroups.com

Comment #3 on issue 627 by withbles...@gmail.com: Need Help: OCRA font for

Tested under tesseract 3.02; the attached files are self-explanatory.
The font names are misspelled in the output text, even though the box file
contains the correct spelling. I successfully trained Tesseract to handle
the OCRA numeric glyphs, but not the English (alphabetic) glyphs. I don't
know whether this meets the poster's expectation.

Attachments:
testnumocra.txt 165 bytes
num.traineddata 304 KB
num.unicharset 2.2 KB
tesseract.log 1.7 KB
num.OCRA.variety.tr 225 KB
num.OCRA.variety.png 19.6 KB
num.OCRA.variety.box 3.2 KB

tesser...@googlecode.com

Feb 26, 2012, 2:37:43 PM
to tesserac...@googlegroups.com

Comment #4 on issue 627 by jlpool...@gmail.com: Need Help: OCRA font for

Since Issue #629 embodies the same problem identified in this Issue #627,
I'm considering this issue closed and am pursuing the matter concerning
tesseract 3.02. [Version 681] in Issue #629. I updated my version of
tesseract to today's build and I still had problems. Reference should be
made to Issue #629 unless someone advises otherwise.

Thank you.

tesser...@googlecode.com

Feb 26, 2012, 11:11:07 PM
to tesserac...@googlegroups.com

Comment #5 on issue 627 by withbl...@gmail.com: Need Help: OCRA font

Re: "I still had problems": please explain in detail exactly what
problems remain. I would like to test after downloading the latest version,
r-683, on WinXP. Please upload the sample text so that I can generate the
tif/box files myself for testing and feedback.

tesser...@googlecode.com

Feb 27, 2012, 12:02:20 AM
to tesserac...@googlegroups.com

Comment #6 on issue 627 by jlpool...@gmail.com: Need Help: OCRA font for

When I tried to run tesseract against a newly built traineddata (build 681), I
got this error message instead of output:

jlpoole@themis ~/work/tess/samples_b681 $ tesseract num.ocra.exp0.png output -l num


tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: Assertion `*unichar_repr != '\0'' failed.
Aborted

jlpoole@themis ~/work/tess/samples_b681 $

tesser...@googlecode.com

Feb 27, 2012, 2:01:51 AM
to tesserac...@googlegroups.com
Updates:
Status: No-longer-an-issue

Comment #7 on issue 627 by zde...@gmail.com: Need Help: OCRA font for

(No comment was entered for this change.)

tesser...@googlecode.com

Feb 28, 2012, 10:10:38 AM
to tesserac...@googlegroups.com

Comment #8 on issue 627 by withbles...@gmail.com: Need Help: OCRA font for

@jlpool
(I don't have your email address, so I am posting here.)

I have not been able to generate the exe files for r-684 because the new
build procedure confuses me. Would you kindly forward all the exe files you
generated under r-681 to me (email: withblessing....@gmail.com), along with
the step-by-step procedure you followed to generate them under VS2008?
I would be grateful.
With regards, Sriranga (79 yrs)

tesser...@googlecode.com

May 10, 2012, 2:31:02 AM
to tesserac...@googlegroups.com
Updates:
Status: New

Comment #9 on issue 627 by zde...@gmail.com: Need Help: OCRA font for
@jlpoole56:

If you still have this problem, please post your files.

tesser...@googlecode.com

May 10, 2012, 11:25:06 AM
to tesserac...@googlegroups.com

Comment #10 on issue 627 by jlpool...@gmail.com: Need Help: OCRA font for
I solved my problem in a later bug where I posted a perl script that can be
used to train. This bug may be closed.

tesser...@googlecode.com

May 10, 2012, 12:30:01 PM
to tesserac...@googlegroups.com
Updates:
Status: No-longer-an-issue

Comment #11 on issue 627 by zde...@gmail.com: Need Help: OCRA font for