How should I make Tesseract support multiple fonts?

smwikipedia smwikipedia

May 11, 2015, 10:31:24 PM

to tesser...@googlegroups.com

I am trying to use Tesseract to do OCR for some screenshots.

But the characters on screen can be of multiple fonts.

I see that in the latest 3.03 version, the training tool `text2image` can easily generate training tif/box pairs from a training text and font files.

If I want to support multiple fonts, do I need to make multiple *.traineddata files and switch between them at runtime?

zdenop

May 13, 2015, 3:40:41 AM

to tesser...@googlegroups.com, smwik...@gmail.com

On Tuesday, 12 May 2015 04:31:24 UTC+2, smwikipedia smwikipedia wrote:

I am trying to use Tesseract to do OCR for some screenshots.

I guess you will need to pre-process the images to improve their quality (see the Wiki; VietOCR also has a screenshot mode).

But the characters on screen can be of multiple fonts.

This should not be a problem. But example images would help more than words ;-)

I see that in the latest 3.03 version, the training tool `text2image` can easily generate training tif/box pairs from a training text and font files.

Why do you think you need to do training? Common experience is that image pre-processing is needed more than training. For some fonts (e.g. 7-segment displays) and some cases it can be better to use other tools...
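
As a rough illustration of the kind of pre-processing meant here, the sketch below enlarges a low-resolution screenshot, puts dark text on a light background, and binarizes it. It is only a sketch under assumptions: the image is modelled as a plain list of rows of grayscale values, and the function names, the scale factor of 3, and the threshold of 128 are hypothetical choices for illustration, not values taken from Tesseract. In practice an image library such as Pillow or OpenCV would do the same steps on real files.

```python
# Sketch only: pure-Python stand-ins for common screenshot clean-up steps.
# An image is a list of rows of grayscale values (0 = black, 255 = white).


def upscale_nearest(pixels, factor):
    """Enlarge the image by an integer factor using nearest-neighbour sampling."""
    out = []
    for row in pixels:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(pixels, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if value < threshold else 255 for value in row] for row in pixels]


def invert_if_dark_background(pixels):
    """Flip light-on-dark images so the text ends up dark on a light background."""
    flat = [value for row in pixels for value in row]
    if sum(flat) / len(flat) < 128:  # mostly dark, so assume a dark background
        return [[255 - value for value in row] for row in pixels]
    return pixels


def preprocess(pixels, factor=3):
    """Invert if needed, enlarge, then binarize (factor and order are assumptions)."""
    return binarize(upscale_nearest(invert_if_dark_background(pixels), factor))
```

Whether these particular steps help depends on the screenshots, so it is worth comparing OCR output with and without each one.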

If I want to support multiple fonts, do I need to make multiple *.traineddata files and switch between them at runtime?

No.
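
If training does turn out to be necessary, the usual 3.x workflow is to generate one tif/box pair per font and then build a single traineddata file from all of them, which is why no runtime switching is needed. The sketch below only builds the `text2image` command lines, one per font, following the `[lang].[fontname].exp[num]` naming convention. The helper names, the font names ("Microsoft YaHei" is assumed here to be the font behind MSYH), the training text file name, and the fonts directory are hypothetical placeholders.

```python
import subprocess


def text2image_command(lang, font, text_file, fonts_dir, exp=0):
    """Build the text2image call for a single font (nothing is executed here)."""
    font_tag = font.replace(" ", "")  # file names should not contain spaces
    outputbase = f"{lang}.{font_tag}.exp{exp}"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def commands_for_fonts(lang, fonts, text_file, fonts_dir):
    """One command per font; all resulting pairs feed the same traineddata."""
    return [text2image_command(lang, font, text_file, fonts_dir) for font in fonts]


def run_all(commands):
    """Run the commands; requires the Tesseract training tools to be installed."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    fonts = ["Microsoft YaHei", "Arial"]
    for command in commands_for_fonts(
        "eng", fonts, "training_text.txt", "/usr/share/fonts"
    ):
        print(" ".join(command))
```

Calling `run_all(...)` on the returned list would produce one `.tif`/`.box` pair per font, ready for the remaining training steps.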

smwikipedia smwikipedia

May 13, 2015, 10:44:59 PM

to tesser...@googlegroups.com

I checked the `tessdata/eng.cube.size` file. It seems Tesseract supports several fonts. But unfortunately, I don't see the fonts I need, such as MSYH.

@zdenop, could you explain how this file is generated so I can add more fonts?
