Mapping Tesseract Fonts to Windows?


ch...@sc3.net

Sep 20, 2013, 4:42:14 PM
to tesser...@googlegroups.com
I would like to show the user the OCR output in my Windows application in graphical form: the OCR'd characters, in the specified font, in the right location. To do that, I need to pick a font to draw the OCR output text in, and it seems I have two choices:
1) Map the Tesseract font to something Windows can understand
2) Use the actual Tesseract font

For #1, Tesseract uses a lot of fonts that I've got on my Windows box (Times New Roman, Arial, etc.), but it also comes up with some I don't have (Century Schoolbook). Is there a way to enumerate all the names of the fonts that Tesseract might return? I can then decide whether it's easier to find Windows equivalents for all the fonts or to download the fonts (if they are free and have nice licensing).

For #2, it's not enough to just display the selected portion of the source image; that doesn't tell the user anything. I would need a way to ask Tesseract, "what is the glyph for an uppercase G in an Arial font of height 34?" Does that exist?

Thanks,
Chris

Quan Nguyen

Sep 20, 2013, 7:39:07 PM
to tesser...@googlegroups.com
You'll need to use the Tesseract API for such information, specifically ResultIterator and TessResultIteratorWordFontAttributes. Check out the API Example page.

Quan

ch...@sc3.net

Sep 21, 2013, 8:39:04 AM
to tesser...@googlegroups.com
Thanks for the quick response, but I already know about those APIs - let me try to explain with an example.

Let's say that ResultIterator says that it found the word "hello" in the image at position (100, 100), and TessResultIteratorWordFontAttributes says it's in font "Arial" with a height of 16.  In my Windows application, I can construct a 16-high Arial font and draw the word "hello" at (100, 100) and I am doing a good job of showing the user the OCR output.

But now let's say that ResultIterator continues and says that it found the word "goodbye" in the image at position (100, 300), and TessResultIteratorWordFontAttributes says it's in font "DejaVu Sans" with a height of 16. If I tell Windows to construct a font named "DejaVu Sans", Windows won't have any idea what that is, and it will pick some other font from its list. When I then have my Windows application draw the word "goodbye" at (100, 300), it's highly likely that the character widths in the font that Windows is using are very different from the character widths in the actual DejaVu Sans font, so the word "goodbye" will take up the wrong amount of space and I'll end up with either lots of white space or (more often) words that run over each other.

Does that make more sense?

Thanks,
Chris

Quan Nguyen

Sep 21, 2013, 11:07:11 AM
to tesser...@googlegroups.com
I don't think Tesseract has any knowledge about system fonts. It gets the font info from the .traineddata file which includes information defined in the font_properties file used during training. So it means the fonts used in training may not exist on the machine it's being run on. Moreover, the font name specified in font_properties may not reflect the actual font name; e.g., "Times New Roman" may be shortened to "times".

As such, you will need to map the font returned by Tesseract to some font available on your system that has similar glyph characteristics.
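
A minimal sketch of such a mapping, in Python for brevity. Everything here is an assumption for illustration: the table entries, the style-word stripping, and the fallback fonts are placeholders, not verified metric matches, and the helper names (`normalize`, `windows_font_for`) are hypothetical. The idea is to normalize whatever name Tesseract reports (which may be shortened or carry style suffixes) before looking it up, and to fall back on a generic serif or sans-serif face when there is no entry.

```python
import re

# Hypothetical table: keys are normalized Tesseract font names, values are
# Windows font families. The pairings are placeholders, not verified matches.
FONT_MAP = {
    "arial": "Arial",
    "times": "Times New Roman",
    "timesnewroman": "Times New Roman",
    "centuryschoolbook": "Georgia",
    "dejavusans": "Verdana",
}

# Style words that may be appended to the family name in training data.
STYLE_WORDS = re.compile(r"(bold|italic|oblique|regular)", re.IGNORECASE)


def normalize(name):
    """Drop style words, then keep only lower-case letters and digits."""
    name = STYLE_WORDS.sub("", name)
    return re.sub(r"[^a-z0-9]", "", name.lower())


def windows_font_for(tesseract_name, is_serif=False):
    """Return a Windows font family for a font name reported by Tesseract.

    Falls back to a generic serif or sans-serif face when the name is
    missing or has no entry in FONT_MAP.
    """
    key = normalize(tesseract_name or "")
    if key in FONT_MAP:
        return FONT_MAP[key]
    return "Times New Roman" if is_serif else "Arial"
```

The `is_serif` argument is meant to be fed from the serif flag that the font-attributes call reports alongside the name, so an unknown font at least lands in the right broad class.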

ch...@sc3.net

Sep 21, 2013, 11:19:07 AM
to tesser...@googlegroups.com
When you say that I will need to "map the font returned by Tesseract to some font available on your system that has similar glyph characteristics", you have restated my original question.

So maybe this rephrasing will help you understand my question:

How can I map the font returned by Tesseract to some font available on Windows that has similar glyph characteristics?

Quan Nguyen

Sep 21, 2013, 3:09:08 PM
to tesser...@googlegroups.com
Not from inside Tesseract. A review of the API shows that Tesseract does not expose a public method that enumerates the font names in a particular .traineddata file. Therefore, you will have to visually inspect the glyphs, identify a matching font, install it if necessary, and substitute that font manually in your program.
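
Since the names cannot be enumerated up front, one workaround is to collect them empirically: run OCR over a representative set of images and tally whatever font names come back. The sketch below assumes you have already pulled (text, font name) pairs out of the result iterator yourself; the pair format and the function names are hypothetical, and the list will only ever be as complete as the sample you feed it.

```python
from collections import Counter


def tally_font_names(words):
    """Count how often each font name appears in OCR results.

    `words` is an iterable of (text, font_name) pairs; font_name may be
    None when no font was reported for a word.
    """
    counts = Counter()
    for _text, font_name in words:
        counts[font_name or "<none>"] += 1
    return counts


def unmapped_names(counts, known):
    """Return the names seen in OCR output that are not in `known`,
    ordered from most to least frequent."""
    return [name for name, _ in counts.most_common() if name not in known]
```

Sorting by frequency means the fonts worth matching by hand first show up at the top of the list.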

Robert Komar

Sep 21, 2013, 4:02:54 PM
to tesser...@googlegroups.com
The number of fonts used in training tesseract must
be finite. If you can get the list somehow, then
you can do the matching once by hand and use the
results of that forever more.

However, there's still no guarantee that the best
matching font in the training set is actually what
was used in the image. Even if you had the exact
same font as tesseract says it is, it still might
not be the same length as the original. So, maybe
you need to do this differently.

Cheers,
Rob Komar
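
One possible way of doing this differently, sketched in Python under stated assumptions: rather than trusting the reported font to have the right widths, measure the word in whatever font you ended up with and rescale it to the width of the word's bounding box from the OCR result. The function name is hypothetical, the measured width has to come from your drawing toolkit (on Windows, for example, GDI's GetTextExtentPoint32), and the calculation assumes text width grows roughly linearly with font size.

```python
def fitted_font_size(font_size, measured_width, box_width):
    """Return a font size that makes the drawn word about as wide as its OCR box.

    font_size      -- size the word was measured at
    measured_width -- width of the word when drawn at font_size
    box_width      -- width of the word's bounding box from the OCR result

    Assumes width scales linearly with font size, which is only approximate.
    """
    if measured_width <= 0:
        raise ValueError("measured_width must be positive")
    return font_size * box_width / measured_width
```

For example, a word measured at 80 pixels wide in a size-16 font, with a 60-pixel box, would be redrawn at size 12. The height will then drift from the original, so this trades vertical fidelity for words that no longer overlap.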