Tesseract and old fonts


daemon-s

Jan 18, 2011, 3:24:13 AM

to tesseract-ocr

*** On behalf of Andy Syme who could not post in this group probably
due to spam removal artefacts ***

...my problem is that I have some documents written in 1890-1920 that
I scanned & want to OCR. They are in English & using the standard
English language file I was getting 40-50% recognition. I then tried
to train a new font. I made an image file with at least 1, often 3 or
4, copies of each character & used pyTesseract to make the box file for
this new font. I rebuilt the trained data file (after some trial &
error), including adding the new font & updating the ambiguous
character sets, e.g. g’ = g, \\’ = W etc.

When I reran Tesseract the OCR recognition was no better. I then
created a language file which was basically all the English files but
with only the ‘new font’ in it. OCR accuracy dropped.

Is there something I’m doing wrong? The new box file had all the
letters (upper & lower), numbers & some punctuation but no newer
symbols (e.g. &s or @s) as they are not present in these docs.

I can send the files I made if it will help you.

Will post this again if you prefer but I am desperately looking for
some help in this.

Andy

*** End Andy Syme ***

The provided file can be downloaded here:
https://docs.google.com/leaf?id=0B4FRY5H4TwI8ZWUzZDkzNjYtZTFiNC00NTBmLWIyY2ItMDFmNDAxZGI1ZTdk&hl=en

daemon-s

Jan 18, 2011, 3:26:21 AM

to tesseract-ocr

Dear Andrew,

I've a couple of observations on your problem.

- The "standard" English language file was created from a set of
training images of well-known computer fonts such as Arial, Times,
Verdana and some Ghostscript fonts, plus their italic and bold versions.
The characters in your book have strokes that are much thinner
than those in the above fonts. Moreover, it seems to me your scanner's
settings made the letter strokes even thinner. These are the reasons why
Tess failed to show a good recognition rate.

- Your image is gray-scale, which means it will undergo binarization
prior to either training or recognition. Tess uses a pretty simple
Otsu binarization procedure and, although it doesn't seem to corrupt
your kind of images, it still might ruin some important character
details. To make sure it doesn't, you may use the DumpPGM() method of
the TessBaseAPI class to dump the binarized image and inspect it. But
I think it's easier in your case to set up your scanner to produce
monochrome images and rescan.

- And the most important thing. As the TrainingTesseract document
says, "training from real images is actually quite hard, due to
the spacing requirements". This is true, but the sentence lacks just a
single word before "due": "particularly". I mean there are lots of
details you need to take into account when training from real images.
Describing all the nuances is a challenging task, so the document
doesn't say much on this subject. As for your image, a closer look at
it reveals many character imperfections. Due to scanning
artefacts, many characters are split (broken) or even totally lack
some thin stroke segments. In some rows (say the top one or the bottom
one) the situation is even worse. You may see that characters in these
rows are randomly carved and punched; probably they are intentionally
printed pale or dithered in the paper source. From Tess's
point of view, these imperfections are important parts of the character
"prototype". Including such glyphs in the training set is usually
not beneficial and may also confuse Tess during recognition.
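
To make the binarization point above more concrete, here is a minimal
pure-Python sketch of the textbook Otsu method. It is not Tesseract's
actual code; the function names and the sample pixel values are my own
assumptions for illustration. It picks the gray level that maximizes the
between-class variance, then marks every pixel at or below that level as
ink.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold t for 8-bit gray values (0..255).

    Pixels with value <= t form the dark class, the rest the light class.
    Textbook formulation, not Tesseract's implementation.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # nothing in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break             # nothing left for the light class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map gray values to 1 (ink, dark) or 0 (paper, light)."""
    return [1 if p <= threshold else 0 for p in pixels]


# Hypothetical scan line: dark ink, a few faint stroke pixels, light paper.
sample = [20] * 30 + [120] * 5 + [230] * 65
t = otsu_threshold(sample)
print("threshold:", t, "ink pixels:", sum(binarize(sample, t)))
```

Whether the faint pixels (value 120 in the made-up sample) end up as ink
or as paper depends entirely on where the threshold lands, which is how a
thin stroke can vanish during binarization.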

So when training Tess one should adhere to the following: if a
character's imperfection is unusual or random, then the glyph should
not be included in the training set. Also be aware that including a
character even with a frequent imperfection may confuse Tesseract. For
instance, the letter "B" is split in such a way that it lacks the three
thin horizontal strokes. As a result, it resembles "I3" in
some fonts. If you add this disjoint sample to the training set you
may start getting "B" as a recognition result while in the source it's
really "I3".
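
One rough way to spot split glyphs like that "B" is to count the
connected ink regions inside each box. The sketch below is only an
illustration under my own assumptions: the tiny bitmaps are made up, the
helper names are hypothetical, and nothing here is part of Tesseract. A
glyph that should be one piece but yields two or more regions is a
candidate for exclusion. Characters such as "i", "j", ":" and ";"
naturally have two regions, so the count is a hint, not a verdict.

```python
def to_bitmap(rows):
    """Turn rows of '#' (ink) and '.' (paper) into lists of 1s and 0s."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def count_components(bitmap):
    """Count 8-connected regions of ink pixels (1s) in a 2D list."""
    n_rows = len(bitmap)
    n_cols = len(bitmap[0]) if n_rows else 0
    seen = set()
    count = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if bitmap[r][c] != 1 or (r, c) in seen:
                continue
            # New region found: flood-fill it so it is counted once.
            count += 1
            seen.add((r, c))
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < n_rows and 0 <= nx < n_cols
                                and bitmap[ny][nx] == 1
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count


# Made-up 5x5 glyphs: an intact "B" and one missing its thin horizontals.
WHOLE_B = to_bitmap([
    "####.",
    "#...#",
    "####.",
    "#...#",
    "####.",
])
SPLIT_B = to_bitmap([
    "#.##.",
    "#...#",
    "#.##.",
    "#...#",
    "#.##.",
])

print(count_components(WHOLE_B), count_components(SPLIT_B))
```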

So what can you do about your images? First of all, if you have
"unlimited" access to the paper originals, you can try to eliminate
scanning artefacts as much as possible by tweaking your scanner's
settings. Second, you may try to pre-process your images. Third, you
need to prepare your .box files more carefully, i.e. make decisions
on damaged box/glyph pairs and remove the unwanted ones.
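
For the third step, a small script can at least shortlist the boxes worth
looking at. The sketch below assumes the usual box file layout of one
"char left bottom right top" entry per line (any trailing field such as a
page number is kept untouched). The "much smaller than the median area
for that character" rule and the 0.7 ratio are my own guesses, not a
Tesseract rule, and the final decision on each glyph still has to be made
by eye. Whether deleting the line is enough, or the glyph must also be
erased from the training image, may depend on your Tesseract version, so
treat that part as an assumption too.

```python
from statistics import median


def parse_box_lines(text):
    """Parse box file text into a list of dicts, keeping the raw line."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(v) for v in parts[1:5])
        entries.append({
            "char": parts[0],
            "area": (right - left) * (top - bottom),
            "raw": line,
        })
    return entries


def flag_suspect_boxes(entries, min_ratio=0.7):
    """Return indices of boxes much smaller than the median for their char.

    min_ratio is a hypothetical tuning knob. A character with a single
    sample is never flagged, because it is its own median.
    """
    areas_by_char = {}
    for entry in entries:
        areas_by_char.setdefault(entry["char"], []).append(entry["area"])

    suspects = []
    for index, entry in enumerate(entries):
        typical = median(areas_by_char[entry["char"]])
        if typical > 0 and entry["area"] < min_ratio * typical:
            suspects.append(index)
    return suspects


def drop_boxes(entries, rejected):
    """Rebuild box file text without the entries at the rejected indices."""
    rejected = set(rejected)
    kept = [e["raw"] for i, e in enumerate(entries) if i not in rejected]
    return "\n".join(kept) + "\n"


# Made-up box data: the third "B" is far smaller than the other two.
SAMPLE = """B 10 10 30 40
B 50 10 70 40
B 90 10 100 25
a 120 10 135 25
"""
sample_entries = parse_box_lines(SAMPLE)
sample_suspects = flag_suspect_boxes(sample_entries)
print(drop_boxes(sample_entries, sample_suspects))
```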

Not so straightforward but that's all I can help you with.

Regards,
Dmitry
