Status: New
Owner: ----
New issue 627 by
jlpool...@gmail.com: Need Help: OCRA font for English for
simple numeric glyphs
http://code.google.com/p/tesseract-ocr/issues/detail?id=627
I have a blob of numbers in the OCRA font that I want to recognize. Other
fonts such as Arial, Times New Roman, Courier, and Palatino work fine for
recognizing the numeric glyphs. Ironically, OCRA, which was designed to
assure accuracy in optical character recognition, fails with a
standard Tesseract install.
My problem is that OCRA font was used on output that now needs to be
optically recognized. So, I figured I needed to train Tesseract to handle
OCRA font numeric glyphs. I'm unable to successfully train Tesseract and
need help or a pointer on where I ran afoul.
Attached (OCRA_numbers_variety.png) is a sample PNG file showing a variety
of fonts, but most importantly a sample of OCRA font for the number set
0..9.
Tesseract's attempt to recognize the character
sequence "0123456789" results in: ULE3H5E?Bq
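To put a number on how far off that result is, here is a minimal sketch that counts the positions where the recognized string agrees with the expected one. The function name count_matches is hypothetical and not part of Tesseract; it only compares two strings character by character.

```shell
# Hypothetical helper: count_matches EXPECTED ACTUAL
# Prints how many character positions hold the same character in both strings.
count_matches() {
    expected=$1
    actual=$2
    n=0
    while [ -n "$expected" ] && [ -n "$actual" ]; do
        # Take the first character of each remaining string.
        e=${expected%"${expected#?}"}
        a=${actual%"${actual#?}"}
        if [ "$e" = "$a" ]; then
            n=$((n + 1))
        fi
        # Drop the first character and continue.
        expected=${expected#?}
        actual=${actual#?}
    done
    echo "$n"
}

count_matches 0123456789 'ULE3H5E?Bq'   # prints 2 (only "3" and "5" agree)
```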
Here is what I did in an attempt to train tesseract.
I created two samples in OpenOffice. The first was a single line
with "0123456789" on it. The second had each number on its own line, so
there is a column of 0-9. I printed to Adobe Acrobat at 400 dpi. In
Acrobat, I exported the images at 400 dpi to PNG images.
On a Gentoo Linux box, I did the following, using sample set 2:
#
# create the box file
#
tesseract eng.ocra.exp2.png eng.ocra.exp2 batch.nochop makebox
jlpoole@hermes ~/work/tess/samples $ tesseract eng.ocra.exp2.png
eng.ocra.exp2 batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
jlpoole@hermes ~/work/tess/samples $ cat eng.ocra.exp2.box
U 322 4027 348 4068 0
L 322 3957 348 3998 0
E 322 3888 348 3929 0
3 322 3818 348 3859 0
H 323 3749 347 3790 0
5 322 3679 348 3720 0
E 322 3610 348 3651 0
? 322 3541 348 3582 0
B 322 3471 348 3512 0
q 322 3402 348 3443 0
jlpoole@hermes ~/work/tess/samples $
#
# edit the box file to correct the characters;
# after the edits:
#
jlpoole@hermes ~/work/tess/samples $ nano eng.ocra.exp2.box
jlpoole@hermes ~/work/tess/samples $ cat eng.ocra.exp2.box
0 322 4027 348 4068 0
1 322 3957 348 3998 0
2 322 3888 348 3929 0
3 322 3818 348 3859 0
4 323 3749 347 3790 0
5 322 3679 348 3720 0
6 322 3610 348 3651 0
7 322 3541 348 3582 0
8 322 3471 348 3512 0
9 322 3402 348 3443 0
jlpoole@hermes ~/work/tess/samples $
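The manual edit in nano could also be scripted. The following is only a sketch, assuming each box line holds exactly one glyph and the lines appear in the same order as the expected characters. The function name relabel_boxes is hypothetical.

```shell
# Hypothetical helper: relabel_boxes LABELS BOXFILE
# Replaces the first field of line N with the Nth character of LABELS and
# leaves the coordinates and the page number untouched.
# Assumes LABELS contains no backslashes (awk -v would interpret them).
relabel_boxes() {
    awk -v labels="$1" '{ $1 = substr(labels, NR, 1); print }' "$2"
}

# Example (writes to a new file rather than editing in place):
#   relabel_boxes 0123456789 eng.ocra.exp2.box > eng.ocra.exp2.box.new
```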
#
# run in the training mode, "Run Tesseract for Training"
#
tesseract eng.ocra.exp2.png eng.ocra.exp2 nobatch box.train
jlpoole@hermes ~/work/tess/samples $ tesseract eng.ocra.exp2.png
eng.ocra.exp2 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 10
Found 10 good blobs.
TRAINING ... Font name = ocra
Generated training data for 1 words
jlpoole@hermes ~/work/tess/samples $
#
# "Compute the Character Set"
#
unicharset_extractor eng.ocra.exp2.box
jlpoole@hermes ~/work/tess/samples $ unicharset_extractor eng.ocra.exp2.box
Extracting unicharset from eng.ocra.exp2.box
Wrote unicharset file ./unicharset.
jlpoole@hermes ~/work/tess/samples $ cat unicharset
11
NULL 0 NULL 0
0 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # 0 [30 ]0
1 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0 # 1 [31 ]0
2 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 # 2 [32 ]0
3 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 # 3 [33 ]0
4 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 5 0 0 # 4 [34 ]0
5 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 6 0 0 # 5 [35 ]0
6 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 7 0 0 # 6 [36 ]0
7 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 8 0 0 # 7 [37 ]0
8 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 9 0 0 # 8 [38 ]0
9 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 10 0 0 # 9 [39 ]0
jlpoole@hermes ~/work/tess/samples $
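As a quick sanity check on the file above, this sketch verifies that the count on the first line (11 here) matches the number of entries that follow (the NULL entry plus ten digits). The function name unicharset_count_ok is hypothetical.

```shell
# Hypothetical helper: unicharset_count_ok FILE
# Succeeds (exit status 0) when the number on the first line equals the
# number of entry lines that follow it.
unicharset_count_ok() {
    awk 'NR == 1 { declared = $1 } END { exit !(NR - 1 == declared) }' "$1"
}

# Example:
#   unicharset_count_ok unicharset && echo "count matches"
```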
#
# Create file "font_properties"
#
jlpoole@hermes ~/work/tess/samples $ cat font_properties
ocra 0 0 1 0 0
jlpoole@hermes ~/work/tess/samples $
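For reference, the fields on that line follow the order given in the Tesseract 3.x training instructions: font name, italic, bold, fixed, serif, fraktur. A small sketch that builds such a line (the function name font_properties_line is hypothetical):

```shell
# Hypothetical helper: font_properties_line NAME ITALIC BOLD FIXED SERIF FRAKTUR
# Prints one font_properties entry; each flag is 0 or 1.
font_properties_line() {
    printf '%s %s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# OCRA marked as fixed-pitch only, matching the file shown above:
#   font_properties_line ocra 0 0 1 0 0 > font_properties
```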
#
# Run MF Training, "Clustering" Step 1: mftraining
#
mftraining -F font_properties -U unicharset -O eng.unicharset
eng.ocra.exp2.tr
jlpoole@hermes ~/work/tess/samples $ mftraining -F font_properties -U
unicharset -O eng.unicharset
eng.ocra.exp2.tr
Read shape table shapetable of 0 shapes
Reading
eng.ocra.exp2.tr ...
Warning: no protos/configs for 0 in CreateIntTemplates()
Warning: no protos/configs for 1 in CreateIntTemplates()
Warning: no protos/configs for 2 in CreateIntTemplates()
Warning: no protos/configs for 3 in CreateIntTemplates()
Warning: no protos/configs for 4 in CreateIntTemplates()
Warning: no protos/configs for 5 in CreateIntTemplates()
Warning: no protos/configs for 6 in CreateIntTemplates()
Warning: no protos/configs for 7 in CreateIntTemplates()
Warning: no protos/configs for 8 in CreateIntTemplates()
Warning: no protos/configs for 9 in CreateIntTemplates()
Done!
jlpoole@hermes ~/work/tess/samples $
#
# Run CN Training, "Clustering" Step 2: cntraining
#
cntraining
eng.ocra.exp2.tr
jlpoole@hermes ~/work/tess/samples $ cntraining
eng.ocra.exp2.tr
Reading
eng.ocra.exp2.tr ...
Clustering ...
Writing normproto ...
jlpoole@hermes ~/work/tess/samples $
#
# Was a file "unicharambigs" created?
# conclusion: no
#
jlpoole@hermes ~/work/tess/samples $ ls uni*
unicharset
jlpoole@hermes ~/work/tess/samples $
#
# "Putting It All Together"
#
combine_tessdata eng.
jlpoole@hermes ~/work/tess/samples $ combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is -1
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is -1
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
jlpoole@hermes ~/work/tess/samples $
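The offset listing is easier to read when reduced to the types that actually received an offset. The sketch below rests on an assumption that the output itself does not state: that an offset of -1 marks a component that was not included in the combined file. The function name present_types is hypothetical.

```shell
# Hypothetical helper: present_types
# Reads combine_tessdata output on stdin and prints the type numbers whose
# offset is not -1 (assumed here to mean "component present").
present_types() {
    awk '/^Offset for type/ && $NF != -1 { print $4 }'
}

# Example:
#   combine_tessdata eng. 2>&1 | present_types
```

Applied to the listing above, this prints only type 1.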
#
# try
#
tesseract OCRA_numbers_variety.png output -l eng
jlpoole@hermes ~/work/tess/samples $ cat output.txt
0123456789 Aﬁal
0123456789 Tnnes
0123456789 Courier
0123456789 Courier SWA
0123456789 Palatino
0123456789 Djvu Sans Mono
ULE3H5E?Bq OCRA
ULE3H5b?Bq OCRA-A -Std
jlpoole@hermes ~/work/tess/samples $
#
# Does not being root for the final combination affect the outcome?
# Conclusion: no.
#
jlpoole@hermes ~/work/tess/samples $ su
Password:
hermes samples # /usr/local/bin/combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is -1
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is -1
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
hermes samples # tesseract OCRA_numbers_variety.png output -l eng
bash: tesseract: command not found
hermes samples # /usr/local/bin/tesseract OCRA_numbers_variety.png output
-l eng
Tesseract Open Source OCR Engine v3.02 with Leptonica
hermes samples # cat output.txt
0123456789 Aﬁal
0123456789 Tnnes
0123456789 Courier
0123456789 Courier SWA
0123456789 Palatino
0123456789 Djvu Sans Mono
ULE3H5E?Bq OCRA
ULE3H5b?Bq OCRA-A -Std
hermes samples #
It looks like something went wrong at the MF Training Step 1, as indicated
by the warnings.
Attachments:
OCRA_numbers_variety.png 19.6 KB