text2image crash

Philip Pearl

Mar 31, 2015, 3:43:23 PM
to tesser...@googlegroups.com
Hi All

I'm trying to train Tesseract for the first time on my Mac. I'm running text2image as follows, but it crashes in Pango because the priv data on the font is NULL.

/usr/local/Cellar/tesseract/HEAD/bin//text2image --leading=32 --fonts_dir=/Library/Fonts --box_padding=0 --strip_unrenderable_words --char_spacing=0.0 --exposure=0 --find_fonts=true --outputbase=/tmp/tesstrain/eng/eng.Helvetica_Neue_Thin.exp0 --text=./tesslang/eng/eng.training_text


Thread 0 Crashed:: Dispatch queue: com.apple.main-thread

0   libpangoft2-1.0.0.dylib             0x00000001090fad9e pango_fc_font_get_glyph + 25

1   text2image                          0x000000010858bf58 tesseract::PangoFontInfo::CanRenderString(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) const + 322

2   text2image                          0x000000010858d0ab tesseract::FontUtils::SelectFont(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) + 287

3   text2image                          0x0000000108592c06 tesseract::StringRenderer::RenderAllFontsToImage(double, char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, Pix**) + 108

4   text2image                          0x0000000108584149 main + 2750

5   libdyld.dylib                       0x00007fff932315fd start + 1


I installed from HEAD using Homebrew, following the instructions I found here: https://ryanfb.github.io/etc/2014/11/19/installing_tesseract_training_tools_on_mac_os_x.html

  • Any ideas how to get around this crash?
  • Am I crazy running this on my Mac?  Would I be better off with a Linux VM?
  • Does training from fonts work or am I better off starting with images (my data is analog HD screen captures of TV menus!)? I know the font the menus use.
Thanks in advance for any help or advice you are able to give me.

Phil

Ryan Baumann

Apr 1, 2015, 12:51:20 PM
to tesser...@googlegroups.com
This appears to be an issue with --find_fonts and/or --strip_unrenderable_words. The following command succeeds for me:

$ text2image --exposure=0 --font "Helvetica Neue Thin" --outputbase=eng.Helvetica_Neue_Thin.exp0 --text=/Users/ryan/source/tesseract/tesseract-ocr.langdata/eng/eng.training_text --leading=32 --char_spacing=0.0 --box_padding=0
Initializing fontconfig
Rendered page 0 to file eng.Helvetica_Neue_Thin.exp0.tif
Rendered page 1 to file eng.Helvetica_Neue_Thin.exp0.tif

-Ryan
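
A minimal shell sketch of the workaround above, for anyone who wants to render several fonts without --find_fonts or --strip_unrenderable_words. The helper names `outputbase_for` and `render_font` are hypothetical, not part of Tesseract; the text2image flags are the ones from the command that succeeded above, and the output name follows the `<lang>.<Font_Name>.exp<N>` pattern used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: build the output base name seen in this thread,
# e.g. "eng" + "Helvetica Neue Thin" + 0 -> eng.Helvetica_Neue_Thin.exp0
outputbase_for() {
    lang=$1; font=$2; exposure=$3
    # Spaces in the font name become underscores.
    printf '%s.%s.exp%s\n' "$lang" "$(printf '%s' "$font" | tr ' ' '_')" "$exposure"
}

# Hypothetical helper: render one explicitly named font.
# Deliberately omits --find_fonts and --strip_unrenderable_words,
# the two flags suspected of triggering the crash.
render_font() {
    lang=$1; font=$2; text=$3
    text2image --exposure=0 --font "$font" \
        --outputbase="$(outputbase_for "$lang" "$font" 0)" \
        --text="$text" --leading=32 --char_spacing=0.0 --box_padding=0
}
```

With those functions defined, something like `render_font eng "Helvetica Neue Thin" ./tesslang/eng/eng.training_text` should reproduce the successful run for a single font; whether it also avoids the crash on a given machine is untested here.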

Ryan Baumann

Apr 1, 2015, 12:59:35 PM
to tesser...@googlegroups.com
Also, to answer your other questions:

  • There appear to be some other issues with Pango/Cairo rendering under OS X that may affect the training process. Because of that, and for general replicability, I now use a Dockerized Linux environment to do Tesseract training on my Mac: https://github.com/ryanfb/tesseract_latinocr_docker
  • Training from fonts works surprisingly well, but if your pipeline/capture process introduces significant artifacts, you may get better accuracy by manually boxing and training against images.
-Ryan

Philip Pearl

Apr 2, 2015, 12:11:28 PM
to tesser...@googlegroups.com
Hi Ryan

Thanks very much for such a useful answer! I'm building your Docker container as I type, and I'll try font training when it's built.

I tried looking at training with boxes and images, but it complained about a good number of my boxes, saying it couldn't detect blobs within them. I'm guessing my problem is that I don't have good separation of characters, so I plan to look at whether I can just remove those boxes or whether to edit the images to remove some characters.

Phil