Tesseract 3.04 - Distinguishing Font Characteristics

vroscigno

Sep 17, 2015, 2:12:20 PM
to tesseract-ocr
Hello All,

I am using Tesseract 3.04 on Windows to analyze scanned paper forms, which often contain non-contiguous text labels of varying size, position, and font style. I am attempting to deduce simple typeface characteristics such as serif vs. sans-serif, fixed vs. variable pitch, italics, bold, etc., in an effort to loosely classify the identified text labels.

I started out by using LTRResultIterator::WordFontAttributes for recognized words, but then learned that the returned font properties are from the *best* matching font, not from an accumulation of actual character attributes for the recognized word.
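To make "an accumulation of actual character attributes" concrete, here is a minimal sketch of what I have in mind: a per-attribute majority vote over the characters of a word. The `CharAttrs` and `WordAttrs` structs and the `AccumulateWordAttrs` function are hypothetical names of my own; nothing here is part of the Tesseract API, and the sketch assumes per-character estimates are available from some other source.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-character estimate; NOT a Tesseract type.
struct CharAttrs {
    bool serif;
    bool bold;
    bool italic;
};

// Hypothetical word-level summary produced by the vote below.
struct WordAttrs {
    bool serif;
    bool bold;
    bool italic;
};

// Per-attribute majority vote over the characters of one word.
// An attribute is set only if more than half of the characters have it;
// ties (and empty words) resolve to false.
WordAttrs AccumulateWordAttrs(const std::vector<CharAttrs>& chars) {
    std::size_t serif = 0, bold = 0, italic = 0;
    for (const CharAttrs& c : chars) {
        if (c.serif) ++serif;
        if (c.bold) ++bold;
        if (c.italic) ++italic;
    }
    const std::size_t n = chars.size();
    return WordAttrs{2 * serif > n, 2 * bold > n, 2 * italic > n};
}
```

With a vote like this, a single character that looks serifed would not flip the whole word, which is the behavior I was hoping to get from the word-level attributes.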

As an example of this, I have observed cases where sequences of ARIAL (sans-serif, variable-pitch) characters are measured and determined to be fixed-pitch (for example: "BOOK"), and the best matching font is a COURIER variant (fixed-pitch, serif). In this case, none of the characters have serifs, but the determined pitch (fixed) seems to carry significant weight when matching fonts.
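A rough way to see why a short word of similar-width capitals can look fixed-pitch is to compare the spread of character widths to their mean. The sketch below is only an illustration under my own assumptions: the function names, the 0.10 threshold, and the pixel widths in the usage note are invented, and it is not meant to mirror Tesseract's pitch-detection code.

```cpp
#include <cmath>
#include <vector>

// Coefficient of variation (population stddev / mean) of character widths.
// Returns 0.0 for empty input or a non-positive mean.
double WidthVariation(const std::vector<double>& widths) {
    if (widths.empty()) return 0.0;
    double sum = 0.0;
    for (double w : widths) sum += w;
    const double mean = sum / widths.size();
    if (mean <= 0.0) return 0.0;
    double sq = 0.0;
    for (double w : widths) sq += (w - mean) * (w - mean);
    return std::sqrt(sq / widths.size()) / mean;
}

// Treats a word as fixed-pitch when its widths vary little.
// The default threshold is a hypothetical value, not Tesseract's.
bool LooksFixedPitch(const std::vector<double>& widths,
                     double max_variation = 0.10) {
    return WidthVariation(widths) <= max_variation;
}
```

For example, invented widths of {20, 23, 23, 20} pixels for "BOOK" pass this test as fixed-pitch, while {28, 7, 7, 7} for a word like "Will" do not, even though both could come from the same variable-pitch font.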

I intend to study the font classification logic a bit to be sure I understand it.

I also suspect that the Adaptive Classifier may propagate this effect to 'downstream' results. (True/False? Opinions?)

I thought about exploring the following:
1. disabling fixed-pitch classification and handling
2. disabling the adaptive classifier or limiting its influence
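
As a starting point for both experiments, a config file along the following lines might work. This is a sketch under assumptions: the parameter names are the ones I recall from the 3.x source, I have not confirmed their exact effect on font matching, and they should be checked against the output of `tesseract --print-parameters` before relying on them.

```
# Assumed parameter names; verify with: tesseract --print-parameters
# Experiment 1: treat all text as proportional (skip fixed-pitch handling)
textord_all_prop 1
# Experiment 2: turn off adaptive learning and the adaptive matcher
classify_enable_learning 0
classify_enable_adaptive_matcher 0
```

The same values could presumably be set through `TessBaseAPI::SetVariable` instead of a config file.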

Does anyone have any suggestions or opinions?

Regards,

Vince Roscigno
