Work with special characters / different language

Tim_legg

Apr 30, 2008, 4:01:24 PM

to tesseract-ocr

Hello,

My need is somewhat specialized. I would like to OCR scanned text of
the Cherokee syllabary. The data is parsed into paragraph-sized,
monochromatic image files. There are approximately 100 hours of typing
to be saved if I can configure the OCR software to recognize the
symbols.

I had considered a couple of approaches. The first was to write a
procedure that builds a linked list of vectors tracing the character
and guesses what symbol it corresponds to. The second was to create a
collection of images, about 20 for each symbol, and average them, then
do a boolean comparison between the pixels to be OCR'd and the averaged
symbol to provide scoring data.
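
The second approach could be sketched roughly as follows. This is only
an illustration, not what any OCR package does: it assumes each glyph
has already been cut out and scaled to the same pixel dimensions, and
that an image is a list of rows of 0/1 values (1 = black). The function
names and the 0.5 threshold are hypothetical choices.

```python
def average_template(samples):
    """Average equally sized binary glyph images (lists of rows of 0/1)."""
    rows, cols = len(samples[0]), len(samples[0][0])
    n = len(samples)
    return [[sum(s[r][c] for s in samples) / n for c in range(cols)]
            for r in range(rows)]


def score(glyph, template, threshold=0.5):
    """Fraction of pixels where the glyph agrees with the thresholded template."""
    rows, cols = len(glyph), len(glyph[0])
    matches = sum(
        1
        for r in range(rows)
        for c in range(cols)
        # Boolean comparison: is the pixel black in both, or white in both?
        if bool(glyph[r][c]) == (template[r][c] >= threshold)
    )
    return matches / (rows * cols)


def classify(glyph, templates):
    """Return the name of the template with the highest agreement score."""
    return max(templates, key=lambda name: score(glyph, templates[name]))
```

With about 20 samples per symbol, `templates` would be a dict mapping
each symbol name to its averaged image, and `classify` would pick the
best-scoring entry for every glyph cut from a page.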

Fortunately, the documents all use the same font. The uppercase
symbols are identical to the lowercase except that they are scaled
about 20% larger. The document does have a lot of noise, mostly
black specks that are usually less than 2% of the rectangular area of
a symbol.
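
Specks of that size could in principle be filtered out before OCR by
dropping every connected blob of black pixels whose pixel count falls
below 2% of a typical symbol's bounding rectangle. A minimal sketch,
assuming the image is a list of rows of 0/1 values (1 = black) and that
`symbol_area` (width times height of a typical symbol box, in pixels)
is measured by hand; the function name is hypothetical:

```python
from collections import deque


def remove_specks(image, symbol_area, max_fraction=0.02):
    """Return a copy of image with small connected black blobs erased."""
    rows, cols = len(image), len(image[0])
    limit = symbol_area * max_fraction
    cleaned = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if not image[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill over the 8-connected neighbourhood.
            component = []
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and image[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Erase blobs smaller than the speck limit.
            if len(component) < limit:
                for r, c in component:
                    cleaned[r][c] = 0
    return cleaned
```

One caveat: a filter like this would also erase any legitimate dot in
a glyph that happens to fall under the limit, so the fraction may need
tuning against real pages.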

I tried ocrad and found that the routine for parsing the characters
worked very well, but the code was so cluttered, hard to read, and for
the most part undocumented that I quickly abandoned that software.

I am

My strength is in C, but I did take a C++ class about 10 years ago at
UND

Also...
The software compiled cleanly in Debian Etch, but keep in mind that
Debian by default does not come with compilers or C libraries.

Tim_legg

Apr 30, 2008, 5:06:18 PM

to tesseract-ocr

(A crappelanche on my desk caused the e-mail to send prematurely)

As I was going to say, I am interested in trying to adapt the software
to read the Sequoyah syllabary and output it as text. Since the symbols
represent phonetics, they will result in more than one Roman character
per symbol. I am willing to use the UTF-8 symbols if convenience
demands it.
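
One way the two output options could be combined is to have the OCR
stage emit the UTF-8 syllabary characters and convert them to Roman
letters in a separate pass. A minimal sketch of that pass follows. The
table is deliberately partial: it covers only the first 13 code points
of the Unicode Cherokee block (U+13A0 to U+13AC), and the Roman
spellings are taken from the Unicode character names, which is an
assumption about the romanization wanted. The function name is
hypothetical.

```python
# Partial table: the first 13 characters of the Unicode Cherokee block.
SYLLABARY_TO_LATIN = {
    "\u13A0": "a", "\u13A1": "e", "\u13A2": "i",
    "\u13A3": "o", "\u13A4": "u", "\u13A5": "v",
    "\u13A6": "ga", "\u13A7": "ka", "\u13A8": "ge",
    "\u13A9": "gi", "\u13AA": "go", "\u13AB": "gu", "\u13AC": "gv",
}


def transliterate(text, table=SYLLABARY_TO_LATIN, unknown="?"):
    """Replace syllabary characters with Roman syllables.

    Characters in the Cherokee block that are missing from the table
    become `unknown`; everything else (spaces, punctuation) passes through.
    """
    out = []
    for ch in text:
        if ch in table:
            out.append(table[ch])
        elif "\u13A0" <= ch <= "\u13FF":
            out.append(unknown)
        else:
            out.append(ch)
    return "".join(out)
```

Keeping the syllabary text as the primary output and deriving the
Roman form from it means the mapping can be corrected later without
re-running the OCR.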

Frank Bennett

May 1, 2008, 10:42:19 AM

to tesseract-ocr

Tim,

Tesseract is fully trainable. If the symbols you need to read run left
to right, have space between them, and are in a fixed font (printed
rather than handwritten), you should be able to handle the text by
going through the steps to build training files for the symbol set and
installing them as a custom language. You should be able to get away
without digging into the underlying code. The training process is
explained here:

http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

If my quick scan of the net hasn't led me astray, the syllabary has
85 glyphs, which is not a problem; Tess can handle that with ease. If
you prepare training data, it may come in handy for others later on if
you place it on the Tesseract wiki.

For noise in the text, there are various document cleaning utilities
around. Under Linux, I have had good results with a tool called
"unpaper", but my knowledge is thin; there are lots of other options.

Good luck, here's hoping it goes smoothly!

Frank Bennett

Tim_legg

May 5, 2008, 6:05:54 PM

to tesseract-ocr

Thanks for the link. All the symbols come from one font, in a book
that is a copy of the 1860 text. I am cutting out and pasting 15
copies of each symbol into a large image file.

There are a few quirks.

The uppercase symbols are scaled slightly larger than the lowercase.
Should I insert the uppercase symbols into the image file, or is the
OCR able to scale the text? Also, some of the pages were scanned at
slightly different resolutions, causing some symbols to be slightly
larger than others. Also, some of the lowercase symbols do not
properly reflect the detail of the uppercase (i.e. curls become dots,
90 degree serifs become 45 degree wedges). I am trying to add the
symbols anyway, but some of them are unbelievably rare.

Also, for the image file, I have been picking out only the very best
symbols for training. Should I also include some poorly reproduced
symbols to reflect what might actually be seen 25% of the time? For
example: a very large number of 'h' will appear as two symbols because
the arch is broken.

Tim Legg

Frank Bennett

May 6, 2008, 6:44:31 AM

to tesseract-ocr

Tim,

My experience with Tess is limited to the special conditions of
scanning a limited set of Japanese characters embedded in financial
reports, together with the finance numbers. I can't help with the
issue of relative character size. The shortest route with that one
will probably be to try and see. If the discrimination doesn't work
well, are there patterns or conventions you can leverage for post-
processing? Concerning the broken characters, I would think that
including lower quality symbols in the training set is a good idea.
Non-contiguous characters should not be a problem; Japanese has lots
of them, and Tess copes just fine.

The training process is cumbersome to perform by hand, but if you
script it in your environment, you can run an early trial with a
minimal training page, and quickly check the way that adding
particular glyph examples affects the results. It took a weekend of
hacking to get Tess trained with (IIRC) about 60 Japanese characters
in 10+ different fonts.

Frank

Ray Smith

May 8, 2008, 9:55:00 PM

to tesser...@googlegroups.com

If you are building the training images by hand like that, there are some important tips that you need to be aware of. I think these points are made in the training wiki, but they bear repeating:

The biggest thing is to make sure that you don't fool the baseline and x-height estimators with your training text. What this means in practice is don't group together symbols that don't sit on the baseline:

GOOD: (some) (training) [words] (that) {spread} [brackets] {around}
BAD: some training words, and oh we need some brackets too (((((([[[[[[{{{}}}]]]]])))

Secondly, if your characters come in multiple sizes, it is best to put them in separate images, as Tesseract isn't very good at working with multiple sizes in a single image, but can otherwise scale them OK.
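
If the cut-out glyphs are being assembled by script rather than by hand, the size separation could be done along these lines. This is only a sketch: it assumes each glyph is described by a label and a pixel height, and the function name and the 10% tolerance are hypothetical (the tolerance is chosen to sit below the roughly 20% upper/lowercase difference mentioned earlier in the thread).

```python
def group_by_height(boxes, tolerance=0.10):
    """Split (label, height_px) pairs into size classes.

    Glyphs are sorted by height; a new group starts whenever a glyph is
    more than `tolerance` taller than the smallest glyph in the current group.
    """
    groups = []
    for label, height in sorted(boxes, key=lambda b: b[1]):
        if groups and height <= groups[-1][0][1] * (1 + tolerance):
            groups[-1].append((label, height))
        else:
            groups.append([(label, height)])
    return groups
```

Each resulting group would then be pasted into its own training image, so that no single image mixes the two sizes.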

Ray.
