TRAINING ... Font name = UnknownFont.

zdpo

Apr 17, 2010, 4:01:58 AM

to tesseract-ocr

Hello,

Can somebody suggest how to make tesseract recognize the font name
during training?

When I run 'tesseract arial.tif junk nobatch box.train.stderr'

I get this message:

Tesseract Open Source OCR Engine
APPLY_BOXES:
Boxes read from boxfile: 231
Initially labelled blobs: 231 in 7 rows
Box failures detected: 0
Duped blobs for rebalance: 0
"l" has fewest samples: 1
Total unlabelled words: 0
Final labelled words: 231
Generating training data
TRAINING ... Font name = UnknownFont.
Generated training data for 231 blobs

I would like tesseract to use the correct font name during this process
(e.g. arial) rather than "UnknownFont".

Br,

Zd.

zdpo

Apr 18, 2010, 6:53:02 PM

to tesseract-ocr

I think I have found a way to do this in tesseract 3.00, after testing and
reading the source code.
I will try to describe it this week at http://www.sk-spell.sk.cx/tesseract-ocr-en...

Zd.

MARTIN Pierre

Apr 19, 2010, 3:05:58 AM

to tesser...@googlegroups.com

Hello Zdpo,

As I said in my mail of 13 April, in answer to Sriranga:

>> I am extremely thankful for the attachment. I could not understand "OCRB font", which I don't have. Is it presumed that any font can be used?
> Exactly. Basically, you'll have to create your custom language, which will contain a certain number of fonts. Each font can be trained with multiple pictures. That's why the file names for the boxes are decomposed this way: xxx.FFFFF.ppp.box (xxx=language, FFFFF=font, ppp=page if you have multiple training pictures per font); this way the files are better organised.

As you can see, the names of the input files used when training Tesseract (especially the .tr files) determine the font names.

This is visible in the source code too: if you search the whole source code for "CurrentFont", you'll see what I mean.
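
To make the naming scheme concrete, here is a minimal Python sketch that splits a file name of the form xxx.FFFFF.ppp.box into its parts. It only illustrates the convention described above and is not Tesseract's own parsing code; the function name parse_training_name and the example file name eng.arial.001.box are hypothetical, and treating the page field as optional is an assumption.

```python
import os
from typing import NamedTuple, Optional


class TrainingName(NamedTuple):
    language: str
    font: str
    page: Optional[str]


def parse_training_name(filename: str) -> TrainingName:
    """Split a name such as 'eng.arial.001.box' into language, font and page.

    Hypothetical helper: it mirrors the xxx.FFFFF.ppp.box convention from
    this thread, not the logic inside Tesseract itself.
    """
    # Ignore any directory part, then split on the dots.
    fields = os.path.basename(filename).split(".")
    # The last field is the extension (box, tif, tr, ...); drop it.
    stem = fields[:-1]
    if len(stem) == 2:
        # Assumption: the page field may be left out (xxx.FFFFF.box).
        return TrainingName(stem[0], stem[1], None)
    if len(stem) == 3:
        return TrainingName(stem[0], stem[1], stem[2])
    raise ValueError(f"{filename!r} does not follow xxx.FFFFF.ppp.ext")


if __name__ == "__main__":
    print(parse_training_name("eng.arial.001.box"))
```

A name like arial.tif has no separate language and font fields, so this sketch rejects it with a ValueError.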

Pierre.

Zdenko Podobný

Apr 24, 2010, 5:46:58 AM

to tesser...@googlegroups.com

When I ran tests on Linux, tesseract crashed. I tried to understand the
source code (and did some work with a debugger ;-) ), and I think there is
a bug (or at least the code does not handle all possible inputs
correctly). My experience (plus a patch for my problems) can be found at
http://www.sk-spell.sk.cx/tesseract-ocr-en-language-training-300...

Zdenko
