Tesseract for special characters

nms_uk

Apr 2, 2008, 9:03:57 AM
to tesseract-ocr
Is there a way of recognizing special characters such as delta,
lambda, etc. using Tesseract?
Is it possible to represent one recognized character as multiple
characters?
I.e., say we have an @ sign: can Tesseract be trained to output it as
'at' instead of @?
Or 'dash' instead of -?
If so, how?
Any clues, please?

thx all

Frank Bennett

Apr 2, 2008, 5:15:53 PM
to tesseract-ocr
On Apr 2, 10:03 pm, nms_uk <n.sad...@gmail.com> wrote:
> is there a way of recognizing special characters such as delta,
> lambda ....etc using tesseract

Yes.

> is it possible to represent one recognized character as multiple
> characters?
> ie, say we have a @ sign, can tesseract be trained to output it as
> 'at' instead of @?
> or 'dash' instead of -??

Yes.

> if so, how????
> any clues plz???????

For special characters, you can supplement the training pages for your
target language. It's a _bit_ of work, but there is a clearly written
cookbook on the wiki:

http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

Training requires that you specify the exact locations of the Little
Boxes (hat tip to Malvina Reynolds) surrounding each character to be
recognized, for every font in the training set. There is apparently a
Windows utility for this. For other operating systems, there is a
Python script that works, but it struggles under the volume of
characters in the full-sized training files. When I did this under
Linux for a trial on Japanese characters, I ended up doing the following:

1. Used a word processor to create a line containing the new
characters.

2. Printed the line to a file, one for each font in the training
set. The font size was chosen to produce glyphs the same size as
those in the training files when the image was printed to 3600 pixels
in width (the width of the existing training pages).

3. Resized, skewed and cropped the images containing the line (the
training pages for English are skewed about 0.08 degrees from the
horizontal).

4. Ran Tess over each of these files in makebox mode to create *.box
files.

5. Used a box training utility to check and fix the boxes and
characters for each glyph in the new line.

6. Spliced the new images onto the top of the existing training
pages, one per customer.

7. Used a small script to add 3600 to the Y dimension of each entry
in the new *.box files, and prepended this modified text to the top of
the *.box file of the appropriate training pages.

8. Scripted the procedures in the training cookbook for training
with all the pages.

In my case, the mftraining tool blew up with a segmentation fault
during (8), at about the 10th page processed (which _really_ made my
day, as you can probably imagine). The internal limits that caused
this have been removed, or significantly expanded, in the prerelease
of v.2.02, available under the Source tab of the wiki cited above. (In
the code checked out from subversion, the wordlist2dawg utility seemed
to run away with the machine's memory, so I used the 2.01 version of
this utility instead; in our application it worked fine with the 2.01
tesseract.) Your mileage may vary.

As you can see, there's a lot of monotonous work involved. But if
Tess can find something as a blob, it _can_ be trained to return it as
a character (or string).
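
For what it's worth, the box-file adjustment in the list above (adding
3600 to the Y values and putting the new entries ahead of the existing
ones) could be scripted roughly as below. This is only a sketch, not
the script actually used: the function names are made up for
illustration, and it assumes each box line has the five-field layout
"glyph left bottom right top", so that the Y values sit in the third
and fifth fields.

```shell
# shift_boxes OFFSET
# Hypothetical helper: reads box lines on stdin and adds OFFSET to both
# Y fields. Assumes the layout "glyph left bottom right top".
shift_boxes() {
    awk -v off="$1" 'NF >= 5 { $3 += off; $5 += off } { print }'
}

# prepend_boxes NEW_BOX PAGE_BOX OFFSET
# Hypothetical helper: prints the shifted entries from NEW_BOX first,
# then the untouched entries from PAGE_BOX.
prepend_boxes() {
    shift_boxes "$3" < "$1"
    cat "$2"
}
```

Something like `prepend_boxes newline.box page1.box 3600 > page1.new.box`
would then build the combined file (the file names are placeholders).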

One thought; if you're looking at formula displays, the fluctuation in
baselines is probably going to make it hard for Tess to find character
blobs and correctly assess their sequence. (Like, you know, sort of,
"why can't tesseract give me the LaTeX for this formula" and stuff
like that.)

To return an arbitrary string instead of a character (up to 8 bytes in
2.01, I think), all you need to do is modify the *.box files that go
with each of the training pages, and retrain Tess. The details for
that are also explained in the training howto. Or you can just post-
process the output, which is less cumbersome and produces the same
result.
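
The post-processing route might look something like this: a small
filter that spells out chosen symbols in the recognized text. It is a
sketch only; the function name is made up, and the two substitutions
are just the '@' and '-' examples from the question.

```shell
# spell_out_symbols
# Hypothetical filter: reads OCR text on stdin and writes it with '@'
# replaced by 'at' and '-' replaced by 'dash'.
spell_out_symbols() {
    sed -e 's/@/at/g' -e 's/-/dash/g'
}
```

After a normal run such as `tesseract mytext.tif mytext -l eng`, the
filter would be applied with `spell_out_symbols < mytext.txt`. Note that
it replaces every '-' in the text, including hyphens inside words.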

Frank Bennett

>
> thx all

Frank Bennett

Apr 2, 2008, 5:22:28 PM
to tesseract-ocr
Sorry, slight correction. In the above post, "in our application it
worked fine with the 2.01 tesseract" should have read "_2.02_
tesseract".

FB

nms_uk

Apr 3, 2008, 8:04:02 AM
to tesseract-ocr
Thank you very much Frank,
I've actually managed to do it by editing the box files and writing
the symbol name there.
That is, after running tesseract in box mode, it detected the sigma as
6, so inside the box file, instead of: 6 X1 Y1 X2 Y2, I've made it:
sigma X1 Y1 X2 Y2
and it worked well after that.
But is there a way of getting the LaTeX of any text using tesseract?

thanks a lot
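
The box-file edit described above (changing the label on one box from
6 to sigma) could also be done with a small filter instead of by hand.
This is a sketch under assumptions: the function name is made up, and
it assumes five-field box lines where the four coordinates identify
the box to relabel, so genuine 6s elsewhere on the page are left alone.

```shell
# relabel_box X1 Y1 X2 Y2 NEW_LABEL
# Hypothetical helper: reads box lines on stdin and replaces the label
# of the one box whose four coordinates match; other lines pass through.
relabel_box() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" -v name="$5" '
        $2 == a && $3 == b && $4 == c && $5 == d { $1 = name }
        { print }
    '
}
```

For example, `relabel_box 10 20 30 40 sigma < page.box > page.fixed.box`,
with the coordinates and file names standing in for real values.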

Frank Bennett

Apr 3, 2008, 9:23:21 AM
to tesseract-ocr
Always up for a challenge ...

tesseract mytext.tif mytext -l eng
printf '\\begin{verbatim}\n%s\n\\end{verbatim}\n' "$(cat mytext.txt)"

!!