text2image crash

Philip Pearl

Mar 31, 2015, 3:43:23 PM

to tesser...@googlegroups.com

Hi All

I'm trying to train tesseract for the first time on my Mac. I'm running text2image as follows, but it crashes in Pango because the priv data on the font is NULL.

/usr/local/Cellar/tesseract/HEAD/bin//text2image --leading=32 --fonts_dir=/Library/Fonts --box_padding=0 --strip_unrenderable_words --char_spacing=0.0 --exposure=0 --find_fonts=true --outputbase=/tmp/tesstrain/eng/eng.Helvetica_Neue_Thin.exp0 --text=./tesslang/eng/eng.training_text

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread

0 libpangoft2-1.0.0.dylib 0x00000001090fad9e pango_fc_font_get_glyph + 25

1 text2image 0x000000010858bf58 tesseract::PangoFontInfo::CanRenderString(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) const + 322

2 text2image 0x000000010858d0ab tesseract::FontUtils::SelectFont(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) + 287

3 text2image 0x0000000108592c06 tesseract::StringRenderer::RenderAllFontsToImage(double, char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, Pix**) + 108

4 text2image 0x0000000108584149 main + 2750

5 libdyld.dylib 0x00007fff932315fd start + 1

I installed from HEAD using homebrew and the instructions I found here https://ryanfb.github.io/etc/2014/11/19/installing_tesseract_training_tools_on_mac_os_x.html

Any ideas how to get around this crash?
Am I crazy running this on my Mac? Would I be better off with a Linux VM?
Does training from fonts work or am I better off starting with images (my data is analog HD screen captures of TV menus!)? I know the font the menus use.

Thanks in advance for any help or advice you are able to give me.

Phil

Ryan Baumann

Apr 1, 2015, 12:51:20 PM

to tesser...@googlegroups.com

This appears to be an issue with --find_fonts and/or --strip_unrenderable_words. The following command succeeds for me:

$ text2image --exposure=0 --font "Helvetica Neue Thin" --outputbase=eng.Helvetica_Neue_Thin.exp0 --text=/Users/ryan/source/tesseract/tesseract-ocr.langdata/eng/eng.training_text --leading=32 --char_spacing=0.0 --box_padding=0

Initializing fontconfig

Rendered page 0 to file eng.Helvetica_Neue_Thin.exp0.tif

Rendered page 1 to file eng.Helvetica_Neue_Thin.exp0.tif
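A minimal sketch of the same workaround as a script, in case it helps: it checks that the font is visible before rendering and leaves out --find_fonts and --strip_unrenderable_words. The variable names and the fallback messages are my own, the paths are taken from the commands in this thread, and I'm assuming the name printed by --list_available_fonts matches the name passed to --font.

```shell
#!/bin/sh
# Sketch only: render one named font, skipping --find_fonts and
# --strip_unrenderable_words (the flags suspected of triggering the crash).
FONT="Helvetica Neue Thin"
FONTS_DIR="/Library/Fonts"
TEXT="./tesslang/eng/eng.training_text"

# Build the output base from the font name: spaces become underscores.
OUT="eng.$(printf '%s' "$FONT" | tr ' ' '_').exp0"

if ! command -v text2image >/dev/null 2>&1; then
  echo "text2image not found on PATH; would have written $OUT.tif"
elif [ ! -f "$TEXT" ]; then
  echo "training text $TEXT not found"
elif text2image --list_available_fonts --fonts_dir="$FONTS_DIR" 2>&1 \
    | grep -F -q "$FONT"; then
  # The font is listed, so render it by name.
  text2image --exposure=0 --font="$FONT" --fonts_dir="$FONTS_DIR" \
    --outputbase="$OUT" --text="$TEXT" \
    --leading=32 --char_spacing=0.0 --box_padding=0
else
  echo "font \"$FONT\" not listed under $FONTS_DIR"
fi
```

If the font is not listed, the output of --list_available_fonts shows the exact names that can be passed to --font.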

-Ryan

Ryan Baumann

Apr 1, 2015, 12:59:35 PM

to tesser...@googlegroups.com

Also, to answer your other questions:

There appear to be some other issues with Pango/Cairo rendering under OS X which may impact the training process. Because of that, and for general replicability, I now use a Dockerized Linux environment to do Tesseract training on my Mac: https://github.com/ryanfb/tesseract_latinocr_docker
Training from fonts works surprisingly well, but if there are significant artifacts introduced by your pipeline/capture process, you may get better accuracy with a manual box/train against images.
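For reference, a rough sketch of what a manual box/train pass looks like with the Tesseract 3.x command line. The base name eng.menufont.exp0 and the function names are placeholders, and it assumes a TIFF of that name in the current directory.

```shell
#!/bin/sh
# Sketch only: the two tesseract invocations of a manual box/train pass.
# BASE is a hypothetical lang.fontname.expN name; BASE.tif must exist.
BASE="eng.menufont.exp0"

# Step 1: let tesseract guess the boxes, writing BASE.box.
make_box() {
  tesseract "$BASE.tif" "$BASE" batch.nochop makebox
}

# Step 2: run only after BASE.box has been corrected by hand;
# writes BASE.tr for the later training steps.
train_box() {
  tesseract "$BASE.tif" "$BASE" box.train
}
```

Call make_box first, fix the resulting box file in a box editor, then call train_box.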

-Ryan

Philip Pearl

Apr 2, 2015, 12:11:28 PM

to tesser...@googlegroups.com

Hi Ryan

Thanks very much for such a useful answer! I'm building your docker container as I type and I'll try font training when it's built.

I tried looking at training with boxes and images, but it complained about a good number of my boxes, saying it couldn't detect blobs within them. I'm guessing my problem is that I don't have good separation of characters, so I plan to look at whether I can just remove those boxes or whether to edit the images to remove some characters.

Phil
