Limited Japanese + digits


Frank Bennett

Mar 21, 2008, 8:50:24 PM
to tesseract-ocr
I have put together a set of training files covering about 60
Japanese characters in addition to digits and the other usual
suspects, and I made some little discoveries in the process. I'm sure
this is all old hat to most of the people here, but some open
questions remain (see below), and maybe some of this will be of use to
future neophytes like myself. I would be curious to know whether there
are specific limitations in mftraining that would cause it to segfault
under the conditions described below (I have the training data to
hand if anyone is interested):

- It is possible to splice additional chunks of graphic at the top of
the existing training pages. By starting with a "blank" that has
3600x3600 white space below the additional patch of characters, it is
possible to add characters to the standard training set without
altering the base training data.
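
For anyone who wants to try the same trick, here is a minimal sketch
of the splice in Python with PIL. The filenames and the assumption
that the base page is the usual 3600x3600 image are mine, not anything
official:

# splice_patch.py -- a rough sketch, not official tooling.  Prepend a
# patch of new characters above an existing training page, assumed
# here to be 3600x3600.  Filenames are made up.
from PIL import Image

base = Image.open("base_training_page.tif").convert("1")
patch = Image.open("japanese_patch.tif").convert("1")

# New canvas: the patch on top, the base page below, white elsewhere.
canvas = Image.new("1", (base.size[0], base.size[1] + patch.size[1]), 1)
canvas.paste(patch, (0, 0))
canvas.paste(base, (0, patch.size[1]))
canvas.save("spliced_training_page.tif")

# Box files measure y from the bottom of the image, so the base page's
# box entries remain valid as-is; only the new patch needs fresh
# boxes, with y coordinates offset upward by the base page's height.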

- Tesseract will happily recognize Japanese characters; there seems to
be nothing in the glyphs themselves that will confound her. On the
other hand, without interword spaces, I assume you have to get by
without dictionary-based heuristics.

- In training, tess behaves very badly when a glyph is not replicated
across all pages. When a page whose font lacked the glyph for one
character was accidentally used, the recognition rate deteriorated as
she progressed through the page, even though the box file correctly
described that page. This was true even on the training pages
themselves. When the missing glyph was put in place and tess was
retrained, the recognition rate on the training pages jumped to 100%.

- Feeding duplicate graphics and box data to tess in separate training
pages _may_ be a bad idea.

o When I cloned the images and box data for 9 Japanese fonts across
the 32 training pages, mftraining segfaulted.

o When I filtered out everything but the Japanese characters in the
box files for the same pages, I got the same error. (A rough sketch of
this sort of box-file filtering appears after this list.)

o When I reduced the set to 9 pages with unique Japanese fonts in
each, training completed successfully with the full complement of
English characters plus the newcomers.

o The segfault problem may have been due to other causes:

+ If tess tracks irrelevant blobs during training, the crash might
have been an overflow of some sort, since there are more potential
blobs in the amended page images.

+ Also, most of the Japanese character boxes in my training data
are displaced about 30% vertically, which could be a problem (I have a
manuscript deadline, and I'm saving all that cursor key joy for a
later time).

+ Or I guess an off-by-one error in vertical alignment might have
crept in when the graphic was spliced, one that snowballed during the
training process and eventually caused mftraining to die.
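
For what it's worth, the box-file filtering mentioned above (and the
vertical shift I'm postponing) is easy enough to sketch in Python.
The filenames are invented; the box lines themselves follow the 2.x
format of glyph, left, bottom, right, top:

# filter_boxes.py -- rough sketch with invented filenames.  Keep only
# the non-ASCII (here, Japanese) entries of a Tesseract 2.x box file,
# optionally shifting every box vertically by a fixed pixel offset.
# Each box line reads: <glyph> <left> <bottom> <right> <top>, with y
# measured from the bottom of the image.
def filter_boxes(src, dst, y_offset=0):
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed lines
            glyph, left, bottom, right, top = parts
            if glyph.isascii():
                continue  # drop digits, Latin letters, punctuation
            bottom = int(bottom) + y_offset
            top = int(top) + y_offset
            fout.write(f"{glyph} {left} {bottom} {right} {top}\n")

filter_boxes("page01.box", "page01.ja.box")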

Frank Bennett

Frank Bennett

Mar 23, 2008, 1:54:15 AM
to tesseract-ocr
I have pushed this as far as appears to be possible with Tesseract
2.01. We now have a "language" configuration built from 10 training
pages, each containing 123 glyphs, composed of the numbers 0-9, the
comma, assorted punctuation marks, and 106 Japanese characters. This
is the limit of what mftraining will handle with this combination of
characters; adding another training page triggers a segfault. But
tess runs the 10-page language configuration without complaint.

Obviously, most of the text we get back for a financial statement fed
to tess is garble, but the numbers and the known Japanese glyphs are
recognized with a good degree of accuracy. Cleaning the output will
give us what we need for our research purposes. Looks like we've just
barely managed to thread the needle.
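
To give a flavor of what the cleaning amounts to: the figures we care
about are comma-grouped digit strings, so even something as simple as
the following toy snippet (not the actual tool) pulls candidates out
of otherwise garbled lines:

# extract_amounts.py -- toy version of the cleanup step, not the real
# tool.  Pull comma-grouped figures out of noisy OCR text and strip
# the grouping commas.
import re

AMOUNT = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")

def extract_amounts(text):
    return [int(m.group().replace(",", "")) for m in AMOUNT.finditer(text)]

print(extract_amounts("garble 12,345 noise 6,789,000 xx 42"))
# -> [12345, 6789000, 42]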

Thanks to Ray Smith, to HP, to Google, and to everyone who has put in
effort on tesseract. More to come, I'm sure, but even at this stage
this is great stuff.

Frank Bennett
Nagoya

Frank Bennett

Mar 24, 2008, 2:45:56 PM
to tesseract-ocr
Aha. With a current svn checkout (#154), mftraining is happy with
an 11th training page.

FB

Ted Rolle

Mar 24, 2008, 5:36:15 PM
to tesser...@googlegroups.com
Frank: now that you've done all the work, will you make the training pages available to all? :D

Frank Bennett

Mar 24, 2008, 7:25:09 PM
to tesseract-ocr
Oh, _that!_

I've placed a bundle in the files area that contains the training
pages, a readme, a config file and a couple of sample pages.

http://groups.google.co.jp/group/tesseract-ocr/

If you take these out for a spin, be sure to take a look at the
readme first; it contains some hints about small traps for the unwary.

Frank Bennett
Nagoya

pierre.inalco

Mar 29, 2008, 7:05:11 AM
to tesseract-ocr
Hello Frank.
I am new to the OCR thing and I am trying to make it work on my
computer.
I followed the instructions in the README.npx file, but it ends with
"Unable to load unicharset file /usr/.../tessdata/npx.unicharset".
Do you have any idea what's going on?

Also, you mentioned a 2.02 version that I did not find anywhere...
Where is it?

Thank you,

/pierre.

Frank Bennett

Mar 29, 2008, 8:20:54 AM
to tesseract-ocr
Pierre,

The tif and box files are just the training data. You need to process
them with tess and create some files to build the language
configuration. The cookbook for that is here:

http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

The Tesseract 2.02 (#154) code is available for checkout from the
Google subversion repository. The page for that is here:

http://code.google.com/p/tesseract-ocr/source/checkout

To set up the code checked out directly from the subversion
repository, you need the "autoconf" utility. But with tess v2.01 you
can build a language configuration from the sample training files just
fine, as long as you stay within nine or ten pages.
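
For the impatient, the cookbook boils down to roughly the sequence
below. This is only my reading of the wiki page, wrapped in Python for
convenience; the file names are invented and the exact commands and
flags may differ between versions, so treat the wiki as authoritative.

# train_npx.py -- my rough reading of the 2.0x training cookbook;
# file names are invented and flags may vary by version, so check
# the wiki before relying on this.
import glob
import subprocess

pages = sorted(name[:-len(".tif")] for name in glob.glob("page*.tif"))

# 1. Produce a .tr feature file for each training page from its
#    .tif/.box pair.
for page in pages:
    subprocess.run(
        ["tesseract", page + ".tif", page, "nobatch", "box.train"],
        check=True)

# 2. Build the character set from all the box files.
subprocess.run(["unicharset_extractor"] + [p + ".box" for p in pages],
               check=True)

# 3. Cluster the features: mftraining emits inttemp and pffmtable,
#    cntraining emits normproto.
trs = [p + ".tr" for p in pages]
subprocess.run(["mftraining"] + trs, check=True)
subprocess.run(["cntraining"] + trs, check=True)

# 4. Rename the outputs with the language prefix (npx.inttemp,
#    npx.normproto, npx.pffmtable, npx.unicharset) and put them in
#    tessdata/; see the wiki for the dawg and DangAmbigs files that
#    complete the set.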

Hope this helps!
Frank Bennett

Ray Smith

Apr 18, 2008, 6:54:23 PM
to tesser...@googlegroups.com
Hi Frank,

Nice work. As you noticed, the 2.02 code (of which an incomplete preliminary version is on svn) has been fixed to handle larger character sets and characters with more features; I suspect those limits were the cause of your segfaults. You can also generate and utilize multi-page tiffs for training, which should help a lot.
Ray.


Frank Bennett

May 11, 2008, 12:47:31 PM
to tesseract-ocr
For our little project to recover income figures for Japanese non-
profits, I've built a tool to repair formatting breakage and seek
totals in Tesseract output from PDF files containing the statements.
There is room for improvement, but in the first trial run the tool was
able to identify income components and valid grand totals for 35% of
an initial set of 2205 filing statements -- not bad considering that
the documents are in multiple fonts and formats, and that no total
will be found if a single component digit is misrecognized.
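
The validity check itself is simple to state: a grand total is
accepted only when the recognized components actually sum to it, which
is why a single bad digit anywhere sinks the whole statement. A toy
illustration (the real tool is at the URL below):

# validate_total.py -- toy illustration of the component-sum check;
# a grand total is accepted only if the recognized income components
# add up to it exactly.
def validate(components, grand_total):
    return sum(components) == grand_total

print(validate([120000, 34500, 980], 155480))  # True
print(validate([120000, 34500, 986], 155480))  # False: one bad digit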

The post-parsing tool is a Python script, and includes unit tests that
exercise the validation machinery against raw Tesseract output. On
the off chance that it might be of interest, the code is available for
download from http://gsl-nagoya-u.net/appendix/software/renumerate.

Frank Bennett
Nagoya


Frank Bennett

May 16, 2008, 8:54:23 AM
to tesseract-ocr
Earlier, I reported a 35% resolution rate for a set of financial
statements. I ferreted out some bugs while extending the test suites
for our post-processing tool, and the actual rate turns out to be
more like 47.9%; our project now has legs. :)

Frank Bennett
