Font Limit = 64 fonts in traineddata, really ??


Albrecht Hilker

Jul 7, 2014, 8:55:37 PM
to tesser...@googlegroups.com
The manual "Training Tesseract 3" says:

> Tesseract needs to know about different shapes of the same character by having different fonts separated explicitly.
> This used to be limited to 32 fonts, but the limit has been raised to 64.
> It is set by the constant MAX_NUM_CONFIGS defined in intproto.h.
> Note that runtime is heavily dependent on the number of fonts provided, and training more than 32 will result in a significant slow-down.



I analyzed the number of fonts in eng.traineddata and was very surprised to find that 358 fonts have been trained!
get_fontinfo_table().size() returns 358.


Can anybody explain this contradiction to me?




Fonts in eng.traineddata:

 AR_PL_UKai_CN,
 AR_PL_UKai_Patched,
 AR_PL_UKai_TW,
 AR_PL_UMing_CN_Light,
 AR_PL_UMing_Patched_Light,
 AR_PL_UMing_TW_MBE_Light,
 Aboriginal_Sans,
 Aboriginal_Sans_Bold_Italic,
 Aboriginal_Sans_Italic,
 Aboriginal_Serif,
 Aboriginal_Serif_Bold,
 Aboriginal_Serif_Bold_Italic,
 Aboriginal_Serif_Italic,
 Abyssinica_SIL,
 AlArabiya,
 AlBattar,
 AlHor,
 AlManzomah,
 AlMohanad,
 Andale_Mono,
 Ani,
 AnjaliOldLipi,
 Arab,
 Arial,
 Arial_Black,
 Arial_Bold,
 Arial_Bold_Italic,
 Arial_Italic,
 BPG_Chveulebrivi,
 BPG_Chveulebrivi_Bold,
 BPG_Courier,
 BPG_Courier_Bold,
 BPG_Elite,
 BPG_Elite_Bold,
 BPG_Glaho,
 BPG_Glaho_Bold,
 BPG_Rioni,
 BPG_Rioni_Bold,
 BPG_Unicode_Standard,
 Baekmuk_Batang,
 Baekmuk_Batang_Patched,
 Baekmuk_Dotum,
 Baekmuk_Gulim,
 Baekmuk_Headline,
 Bangla,
 Bitstream_Vera_Sans,
 Bitstream_Vera_Sans_Bold,
 Bitstream_Vera_Sans_Bold_Oblique,
 Bitstream_Vera_Sans_Mono,
 Bitstream_Vera_Sans_Mono_Bold,
 Bitstream_Vera_Sans_Mono_Bold_Oblique,
 Bitstream_Vera_Sans_Mono_Oblique,
 Bitstream_Vera_Sans_Mono_Roman,
 Bitstream_Vera_Sans_Oblique,
 Bitstream_Vera_Sans_Roman,
 Bitstream_Vera_Serif,
 Bitstream_Vera_Serif_Bold,
 Bitstream_Vera_Serif_Roman,
 CaslonishFraxx,
 Century_Schoolbook_L,
 Century_Schoolbook_L_Bold,
 Century_Schoolbook_L_Bold_Italic,
 Century_Schoolbook_L_Italic,
 Century_Schoolbook_L_Roman,
 Chandas,
 Cloister_Black_Light,
 Comic_Sans_MS,
 Comic_Sans_MS_Bold,
 Cortoba,
 Courier_New,
 Courier_New_Bold,
 Courier_New_Bold_Italic,
 Courier_New_Italic,
 DejaVu_Sans,
 DejaVu_Sans_Bold,
 DejaVu_Sans_Bold_Oblique,
 DejaVu_Sans_Condensed,
 DejaVu_Sans_Condensed_Bold,
 DejaVu_Sans_Condensed_Bold_Oblique,
 DejaVu_Sans_Condensed_Oblique,
 DejaVu_Sans_Mono,
 DejaVu_Sans_Mono_Bold,
 DejaVu_Sans_Mono_Bold_Oblique,
 DejaVu_Sans_Mono_Oblique,
 DejaVu_Sans_Oblique,
 DejaVu_Sans_Ultra-Light,
 DejaVu_Serif,
 DejaVu_Serif_Bold,
 DejaVu_Serif_Bold_Italic,
 DejaVu_Serif_Bold_Oblique,
 DejaVu_Serif_Bold_Semi-Condensed,
 DejaVu_Serif_Condensed_Bold,
 DejaVu_Serif_Condensed_Bold_Italic,
 DejaVu_Serif_Condensed_Italic,
 DejaVu_Serif_Italic,
 DejaVu_Serif_Oblique,
 DejaVu_Serif_Semi-Condensed,
 Dimnah,
 Dustismo,
 Dustismo_Roman,
 Dustismo_Roman_Bold,
 Dustismo_Roman_Italic,
 Dustismo_Roman_Italic_Bold,
 Dyuthi,
 East_Syriac_Adiabene,
 East_Syriac_Ctesiphon,
 Electron,
 Estrangelo_Antioch,
 Estrangelo_Edessa,
 Estrangelo_Midyat,
 Estrangelo_Nisibin,
 Estrangelo_Quenneshrin,
 Estrangelo_Talada,
 Estrangelo_TurAbdin,
 FreeMono,
 FreeMono_Bold,
 FreeMono_Bold_Italic,
 FreeMono_Bold_Oblique,
 FreeMono_Italic,
 FreeMono_Oblique,
 FreeSans,
 FreeSans_Bold,
 FreeSans_Bold_Oblique,
 FreeSans_Oblique,
 FreeSerif,
 FreeSerif_Bold,
 FreeSerif_Bold_Italic,
 FreeSerif_Italic,
 Furat,
 Garuda,
 Garuda_Bold,
 Garuda_Bold_Oblique,
 Garuda_Oblique,
 GentiumAlt,
 GentiumAlt_Italic,
 Georgia,
 Georgia_Bold,
 Georgia_Bold_Italic,
 Georgia_Italic,
 Granada,
 Graph,
 Hani,
 Haramain,
 Hor,
 IPAGothic,
 IPAMincho,
 IPAPGothic,
 IPAPMincho,
 IPAUIGothic,
 Impact,
 Impact_Condensed,
 Jamrul,
 Jamrul_Semi-Expanded,
 Japan,
 Jet,
 Kalimati,
 Kalyani,
 Kayrawan,
 Kedage,
 Kedage_Bold,
 Kedage_Bold_Italic,
 Kedage_Italic,
 Khalid,
 Khmer_OS,
 Khmer_OS_Battambang,
 Khmer_OS_Bokor,
 Khmer_OS_Content,
 Khmer_OS_Fasthand,
 Khmer_OS_Freehand,
 Khmer_OS_Metal_Chrieng,
 Khmer_OS_Muol,
 Khmer_OS_Muol_Light,
 Khmer_OS_Muol_Pali,
 Khmer_OS_Siemreap,
 Khmer_OS_System,
 Kochi_Gothic,
 Kochi_Mincho,
 LKLUG,
 Lateef,
 Likhan,
 Linux_Biolinum_O,
 Linux_Biolinum_O_Bold,
 Linux_Libertine_O,
 Linux_Libertine_O_Bold,
 Linux_Libertine_O_Bold_Italic,
 Linux_Libertine_O_C,
 Linux_Libertine_O_Italic,
 Lohit_Assamese,
 Lohit_Bengali,
 Lohit_Gujarati,
 Lohit_Hindi,
 Lohit_Malayalam,
 Lohit_Oriya,
 Lohit_Punjabi,
 Lohit_Tamil,
 Lohit_Telugu,
 Loma,
 Loma_Bold,
 Loma_Bold_Oblique,
 Loma_Oblique,
 Lucida_Bright,
 Lucida_Bright_Italic,
 Lucida_Bright_Semi-Bold,
 Lucida_Bright_Semi-Bold_Italic,
 Lucida_Sans,
 Lucida_Sans_Oblique,
 Lucida_Sans_Semi-Bold,
 Lucida_Sans_Semi-Bold_Oblique,
 Lucida_Sans_Typewriter,
 Lucida_Sans_Typewriter_Bold,
 Lucida_Sans_Typewriter_Bold_Oblique,
 Mallige,
 Mallige_Bold,
 Mallige_Bold_Italic,
 Mallige_Italic,
 Mashq,
 Meera,
 Metal,
 Mitra_Mono,
 Monapo,
 Mukti_Narrow,
 Mukti_Narrow_Bold,
 Nada,
 Nagham,
 Nice,
 Norasi,
 Norasi_Bold,
 Norasi_Bold_Italic,
 Norasi_Bold_Oblique,
 Norasi_Italic,
 Norasi_Oblique,
 OpenSymbol,
 Ostorah,
 Padauk,
 Padauk_Bold,
 Petra,
 Phetsarath_OT,
 Pothana2000,
 Proclamate_Light,
 Purisa_Light,
 Rachana,
 Rachana_w01,
 RaghuMalayalam,
 Rehan,
 Rekha,
 Saab,
 Salem,
 Samanata,
 Samyak_Gujarati,
 Samyak_Oriya,
 Sazanami_Gothic,
 Sazanami_Mincho,
 Scheherazade,
 Serto_Batnan,
 Serto_Batnan_Bold,
 Serto_Jerusalem,
 Serto_Jerusalem_Bold,
 Serto_Jerusalem_Italic,
 Serto_Kharput,
 Serto_Malankara,
 Serto_Mardin,
 Serto_Mardin_Bold,
 Serto_Urhoy,
 Serto_Urhoy_Bold,
 Shado,
 Sharjah,
 TAMu_Kadambri,
 TAMu_Kalyani,
 TAMu_Maduram,
 TSCu_Comic,
 TSCu_Paranar,
 TSCu_Paranar_Bold,
 TSCu_Paranar_Italic,
 TSCu_Times,
 TakaoExGothic,
 TakaoExMincho,
 TakaoGothic,
 TakaoMincho,
 TakaoPGothic,
 TakaoPMincho,
 Tarablus,
 Tholoth,
 Tibetan_Machine_Uni,
 Times_New_Roman,
 Times_New_Roman_Bold,
 Times_New_Roman_Bold_Italic,
 Times_New_Roman_Italic,
 TlwgMono,
 TlwgMono_Bold,
 TlwgMono_Bold_Oblique,
 TlwgMono_Oblique,
 TlwgTypewriter,
 TlwgTypewriter_Bold,
 TlwgTypewriter_Bold_Oblique,
 TlwgTypewriter_Oblique,
 Trebuchet_MS,
 Trebuchet_MS_Bold,
 Trebuchet_MS_Bold_Italic,
 Trebuchet_MS_Italic,
 URW_Bookman_L,
 URW_Bookman_L_Bold,
 URW_Bookman_L_Bold_Italic,
 URW_Bookman_L_Italic,
 URW_Bookman_L_Light_Italic,
 UmePlus_Gothic,
 UmePlus_P_Gothic,
 UnBatang,
 UnBatang_Bold,
 UnDotum,
 UnDotum_Bold,
 UnifrakturMaguntia,
 Unikurd_Web,
 Uttara,
 VL_Gothic,
 VL_PGothic,
 Vemana2000,
 Verdana,
 Verdana_Bold,
 Verdana_Bold_Italic,
 Verdana_Italic,
 Walbaum-Fraktur,
 Webdings,
 WenQuanYi_Zen_Hei,
 Wyld,
 Wyld_Italic,
 aakar,
 batang,
 chandas1-1,
 chandas1-2,
 cheluvi,
 dotum,
 gargi,
 gulim,
 hline,
 ipag,
 ipagp,
 ipagui,
 ipam,
 ipamp,
 kalimati,
 kochi-gothic,
 kochi-gothic-subst,
 kochi-mincho,
 kochi-mincho-subst,
 lklug,
 lohit_bn,
 lohit_gu,
 lohit_hi,
 lohit_ml,
 lohit_or,
 lohit_pa,
 lohit_ta,
 lohit_te,
 monapo,
 ori1Uni,
 padmaa,
 padmaa_Bold,
 suruma
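
A minimal sketch for checking a pasted list like the one above without linking against Tesseract. It is plain Python, not Tesseract code: `parse_font_list`, `case_insensitive_groups` and `DOCUMENTED_LIMIT` are names made up for illustration, and the 64 is simply the figure the manual attributes to MAX_NUM_CONFIGS. It counts the entries and also reports names that differ only by letter case, since the list contains pairs such as Monapo / monapo.

```python
# Hypothetical helper, not part of Tesseract: counts font names in a pasted
# comma-separated list and compares the total against the documented limit.
DOCUMENTED_LIMIT = 64  # value the manual attributes to MAX_NUM_CONFIGS


def parse_font_list(text):
    """Split a pasted list on commas and newlines, dropping empty entries."""
    names = []
    for line in text.splitlines():
        for part in line.split(","):
            name = part.strip()
            if name:
                names.append(name)
    return names


def case_insensitive_groups(names):
    """Return groups of names that differ only by letter case."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


# A short excerpt of the list above; paste the full list to check all of it.
sample = """
 Arial,
 Arial_Bold,
 Kalimati,
 LKLUG,
 Monapo,
 kalimati,
 lklug,
 monapo,
 suruma
"""

fonts = parse_font_list(sample)
print(len(fonts), "names; over documented limit:", len(fonts) > DOCUMENTED_LIMIT)
print(case_insensitive_groups(fonts))
```

Run on the full list, the count should agree with the 358 reported by get_fontinfo_table().size(). The case-insensitive grouping only flags similar names; whether such pairs are really the same font trained twice is not something this sketch can tell.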

Paul

Jul 8, 2014, 7:34:34 AM
to tesser...@googlegroups.com
If you have a look at intproto.h, you'll see there is a similar limitation, but it's much more complicated. Unfortunately I don't have an overview of what is possible yet, but I'm working on it. :) Just use normproto.h as a reference.

Shree Devi Kumar

Jul 8, 2014, 8:16:58 AM
to tesser...@googlegroups.com
As far as I understand, the font limitation applies up to Tesseract 3.02.

Major changes to training are currently in the works in SVN for 3.03 (not fully released yet, hence you see a large number of fonts for the English traineddata but not for the others). Traineddata for the other languages may be forthcoming in future.

Ray/Zdenko/Nick may be able to give an idea of the expected timeline for release.

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Albrecht Hilker

Jul 8, 2014, 4:48:39 PM
to tesser...@googlegroups.com

> As far as I understand, the font limitation applies up to tesseract 3.02. Major changes to training are currently in the works in SVN for 3.03

The files I am talking about are downloaded from
https://code.google.com/p/tesseract-ocr/downloads/list

They are all declared as version 3.02.
For example: tesseract-ocr-3.02.eng.tar.gz


> hence you see large number of fonts for english traineddata but not for others

This is not correct.
The Spanish traineddata has the same 358 fonts.

shree

Jul 9, 2014, 1:49:49 AM
to tesser...@googlegroups.com
My information IS dated - I haven't followed the recent changes. Please see this thread, almost a year old, which talked of the upcoming changes for training.

Nick White

Jul 9, 2014, 1:39:11 PM
to tesser...@googlegroups.com
On Tue, Jul 08, 2014 at 10:49:49PM -0700, shree wrote:
> My information IS dated - I haven't followed the recent changes. Please see
> this thread - almost a year old which talked of the upcoming changes for
> training ....
>
> https://groups.google.com/forum/#!searchin/tesseract-dev/fonts/tesseract-dev/4lxGjCGLBSw/CH1cZsovPjIJ

This thread only really has information about the new training
tools; I don't think any major changes in the formats / limits of
things are planned. Those new training tools do exist in SVN now,
incidentally; see the training/ and training/langdata directories,
and if you're curious to see how they can be used, check out the
Makefile of my training[0].

Albrecht, thanks for digging around like this and finding
inconsistencies in the documentation. I haven't looked at the font
limits myself, so will try to dip into the code soon to see if I can
figure out a more definitive answer. If you get there first, let me
know and I can update the TrainingTesseract3 page as appropriate.

Nick

0. git clone http://ancientgreekocr.org/grc.git