mftraining core dump - Illegal malloc request size on Ubuntu...

Rob Stewart

May 16, 2014, 11:56:50 AM

to tesser...@googlegroups.com

Hi!
I've been trying to train Tesseract, and after a hard day getting all the dependencies downloaded and compiled I've made it part of the way through the training documentation.

I'm using Ubuntu 14.04LTS and I've downloaded and compiled leptonica-1.70.

I ended up creating a shell script after compiling and installing tesseract and tesseract-training...

---- Start of file (called "commands.sh")...

#!/bin/bash

# Get a copy of Tesseract src code...
#   svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
#
# Make a folder, let's call it 'training_text'
#   mkdir training_text
#   cd training_text
#
# Create a '1.txt' file containing the training text. (Try Project Gutenberg).
# Copy 'font_properties' from tesseract-ocr-read-only/training/langdata...
#   cp ../tesseract-ocr-read-only/training/langdata/font_properties .
#
# Run this commands file...
#   commands.sh

# Remove any previously generated files (-f keeps rm quiet
# if this is the first run and the files don't exist yet)...

rm -f eng.FreeSans.exp0.box
rm -f eng.FreeSans.exp0.tif
rm -f eng.FreeSans.exp0.tr
rm -f eng.FreeSans.exp0.txt
rm -f shapetable
rm -f unicharset
rm -f unicharset.out

# Try to generate them again...

text2image --text=1.txt --outputbase=eng.FreeSans.exp0 --font='FreeSans' --fonts_dir=/usr/share/fonts/truetype/freefont

tesseract eng.FreeSans.exp0.tif eng.FreeSans.exp0 box.train

unicharset_extractor eng.FreeSans.exp0.box

set_unicharset_properties -U unicharset -O unicharset.out --script_dir=../tesseract-ocr-read-only/training/langdata

shapeclustering -F font_properties -U unicharset eng.FreeSans.exp0.tr
#shapeclustering -F font_properties -U unicharset.out eng.FreeSans.exp0.tr

mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
#mftraining -F font_properties -U unicharset.out -O eng.FreeSans.exp0.tr

#cntraining eng.FreeSans.exp0.tr

---- End of file

Once I get down to shapeclustering, I can't tell from the documentation which unicharset file to use: the first one produced, or the one produced by the 'set_unicharset_properties' command.

Either way, mftraining usually fails. Sometimes a second attempt at running shapeclustering and mftraining outside of this shell script works, but almost every time I get the following error...

---- Start of Error (mftraining)

Error: Illegal malloc request size!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
./commands.sh: line 40: 20958 Segmentation fault      (core dumped) mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr

---- End of Error

And even worse the cntraining command doesn't work at all...

---- Start of Error (cntraining)

Error: Illegal short name for a feature!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
Segmentation fault (core dumped)

---- End of Error

What am I doing wrong?
Any help would be appreciated. Also I think adding this kind of shell script (or equivalent) to a 'fast start' for training could be useful.

Rob

--
Texthelp Ltd is a limited company registered in Belfast, N. Ireland with registration number NI31186 having its registered office and principal place of business at Lucas Exchange, 1 Orchard Way, Antrim, N. Ireland, BT41 2RU.

Nick White

May 16, 2014, 12:20:15 PM

to tesser...@googlegroups.com

Hi Rob,

You're getting there, don't worry :)

On Fri, May 16, 2014 at 08:56:50AM -0700, Rob Stewart wrote:
> [snip]

> unicharset_extractor eng.FreeSans.exp0.box
>
> set_unicharset_properties -U unicharset -O unicharset.out --script_dir=../tesseract-ocr-read-only/training/langdata
>
> shapeclustering -F font_properties -U unicharset eng.FreeSans.exp0.tr
> #shapeclustering -F font_properties -U unicharset.out eng.FreeSans.exp0.tr
>
> mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
> #mftraining -F font_properties -U unicharset.out -O eng.FreeSans.exp0.tr
>
> #cntraining eng.FreeSans.exp0.tr

> Once I get down to shapeclustering I can't tell from the documentation which
> unicharset file to use: the first one produced or the one produced by the
> 'set_unicharset_properties' command.

The one produced by set_unicharset_properties is always better to
use, as it should have correct attributes for each character.

Note that shapeclustering is generally not recommended for most
scripts (I think it's only used for Devanagari scripts at the
moment). I tested with and without it for my grc training, and
results were far better without it.

> Either way the mftraining usually fails, sometimes a second attempt at running
> shapeclustering and mftraining outside of this shell file works, but almost
> every time I get the following error...

You're calling mftraining slightly incorrectly. The -O argument is
for the resulting unicharset, not the .tr file. As written, the .tr
name is taken as the value of -O, so no training file is passed as
input at all, and mftraining is probably getting upset at you
overwriting the .tr with a unicharset file while (or maybe even
before) reading it. In my grc Makefile, I call it like this:

mftraining -F font_properties -U grc.earlyunicharset -O grc.unicharset grc*tr

(grc.earlyunicharset is the output from set_unicharset_properties).
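
To make this mistake harder to repeat, a script could check the arguments before calling mftraining. The following is only a sketch: check_unicharset_out is a hypothetical helper name, not part of Tesseract, and it inspects file names only. It never runs mftraining itself.

```shell
#!/bin/bash
# Hypothetical helper (not part of Tesseract).
# Usage: check_unicharset_out OUT_FILE INPUT...
# Returns non-zero if OUT_FILE (the intended -O target) ends in .tr or is
# also listed among the inputs; returns zero otherwise.
check_unicharset_out() {
    local out="$1"; shift
    case "$out" in
        *.tr)
            echo "refusing: -O target '$out' looks like a .tr file" >&2
            return 1
            ;;
    esac
    local f
    for f in "$@"; do
        if [ "$f" = "$out" ]; then
            echo "refusing: -O target '$out' is also an input" >&2
            return 1
        fi
    done
    return 0
}

# Possible use (commented out so the sketch has no side effects):
# check_unicharset_out grc.unicharset grc*tr &&
#     mftraining -F font_properties -U grc.earlyunicharset -O grc.unicharset grc*tr
```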

> Any help would be appreciated. Also I think adding this kind of shell script
> (or equivalent) to a 'fast start' for training could be useful.

You may find the Makefile from my grc repository helpful. Get it
with:

git clone http://ancientgreekocr.org/grc.git

I decided to use a Makefile rather than a shell script so that I can
test changes and only the appropriate parts are re-run, rather than
everything.
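
For anyone who prefers to stay with a shell script, the "only re-run what changed" behaviour can be roughly approximated with a timestamp comparison. This is a minimal sketch assuming bash; is_stale is a hypothetical helper name, and it is far less thorough than make's dependency tracking.

```shell
#!/bin/bash
# Hypothetical helper.
# Usage: is_stale TARGET SOURCE...
# Succeeds (returns 0) when TARGET is missing or older than any SOURCE,
# i.e. when the step that produces TARGET should be re-run.
is_stale() {
    local target="$1"; shift
    [ -e "$target" ] || return 0
    local src
    for src in "$@"; do
        if [ "$src" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}

# Possible use (commented out so the sketch has no side effects):
# if is_stale eng.FreeSans.exp0.tr eng.FreeSans.exp0.tif eng.FreeSans.exp0.box; then
#     tesseract eng.FreeSans.exp0.tif eng.FreeSans.exp0 box.train
# fi
```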

Nick

Sriranga(80yrs)

May 18, 2014, 6:52:52 AM

to tesser...@googlegroups.com

Nick,

When I tried to download the Makefile from your grc repository "http://ancientgreekocr.org/grc.git", the repository displayed the error message "403 - Forbidden". This is brought to your kind notice.

With regards,

sriranga(80+)


Rob Stewart

May 19, 2014, 5:07:13 AM

to tesser...@googlegroups.com

Thanks Nick!

Regarding mftraining - I just couldn't see what was wrong, I must have gone a bit code blind there.

Things are working now with a simple change to that one line...

mftraining -F font_properties -U unicharset.out -O unicharset.out2 eng.FreeSans.exp0.tr
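
For reference, the tail of commands.sh with that change might look like the sketch below. The file names are the ones used in this thread; DRY_RUN, run and training_tail are hypothetical conveniences added for illustration (not Tesseract features), and by default the commands are only printed so the order can be checked before anything is executed. shapeclustering is left out, following the advice above.

```shell
#!/bin/bash
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 runs it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print or execute the given command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: training_tail BASE LANGDATA_DIR
#   BASE         e.g. eng.FreeSans.exp0
#   LANGDATA_DIR directory passed to --script_dir
# Stops at the first step that fails.
training_tail() {
    local base="$1"
    local langdata="$2"
    run unicharset_extractor "$base.box" || return 1
    run set_unicharset_properties -U unicharset -O unicharset.out \
        --script_dir="$langdata" || return 1
    # -U reads the unicharset that has properties set; -O names a NEW
    # unicharset file; the .tr file is passed last as the input.
    run mftraining -F font_properties -U unicharset.out \
        -O unicharset.out2 "$base.tr" || return 1
    run cntraining "$base.tr" || return 1
}

# Possible use:
# training_tail eng.FreeSans.exp0 ../tesseract-ocr-read-only/training/langdata
```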

So it's onto testing to see what difference all this can make.

Good idea about the Makefile.

Thanks once more!

--

Rob

Nick White

May 19, 2014, 2:49:24 PM

to tesser...@googlegroups.com

Hi Sriranga,

On Sun, May 18, 2014 at 04:22:52PM +0530, Sriranga(80yrs) wrote:
> When I tried to download the Makefile from your grc repository
> "http://ancientgreekocr.org/grc.git", the repository displayed the
> error message "403 - Forbidden". This is brought to your kind notice.

That's because it's a git repository, not a webpage (despite the
http). So you have to use git to download it. I didn't bother with a
web interface, for several reasons, not least of which being
security.

Nick

Dovhani Foneworx

Aug 19, 2014, 7:06:26 AM

to tesser...@googlegroups.com

Hi, I have a problem. When I run:

set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=/home/foneworx/DM/Tesseracting/tesseract-3.03/training/langdata

I get the following output:

Loaded unicharset of size 3 from file input_unicharset
Setting unichar properties
Other case JOINED of Joined is not in unicharset
Other case |BROKEN|0|1 of |Broken|0|1 is not in unicharset
Writing unicharset to file output_unicharset

What does this mean?

Has anyone seen this before? Did I get something wrong, based on this output?

Nick White

Aug 20, 2014, 10:50:20 AM

to tesser...@googlegroups.com

Hi Dovhani,

On Tue, Aug 19, 2014 at 04:06:26AM -0700, Dovhani Foneworx wrote:
> Hi I have a problem that when I run:
>
> set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=/home/foneworx/DM/Tesseracting/tesseract-3.03/training/langdata
>
> I get the following output:
>
> Loaded unicharset of size 3 from file input_unicharset
> Setting unichar properties
> Other case JOINED of Joined is not in unicharset
> Other case |BROKEN|0|1 of |Broken|0|1 is not in unicharset
> Writing unicharset to file output_unicharset

Sometimes unicharsets have lines beginning Joined and |Broken| near
the top. I'm not sure what they mean, but they don't screw anything
up, so don't worry about it. The output you see there is just
set_unicharset_properties warning you that they look weird (they
are, but it's fine).

Sorry to be a bit vague; I don't have time to look into exactly what
they mean or why they're there at the moment.
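
If you want to look at those lines yourself, something like the sketch below may help. It assumes the usual text layout of a unicharset file (an entry count on the first line, then one entry per line with the character as the first space-separated field), and it assumes that NULL, Joined and |Broken|0|1 are reserved placeholder entries rather than real characters; special_entries and real_entry_count are hypothetical helper names. If the second helper prints 0, the file would hold nothing but placeholders, which is one possible reading of "size 3" and might be worth checking in your box file.

```shell
#!/bin/bash
# Hypothetical helper: print, with line numbers, the unicharset entries
# whose first field starts with Joined or |Broken|.
special_entries() {
    grep -nE '^(Joined|\|Broken\|)' "$1"
}

# Hypothetical helper: count entries after the first (count) line whose
# first field is not NULL, Joined, or something starting with |Broken|.
real_entry_count() {
    awk 'NR > 1 && $1 != "NULL" && $1 != "Joined" && $1 !~ /^\|Broken\|/ { n++ }
         END { print n + 0 }' "$1"
}

# Possible use:
# special_entries input_unicharset
# real_entry_count input_unicharset
```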

Nick
