unicharset_extractor extracting zero values


David Barishev

Jun 19, 2017, 6:39:29 AM
to tesseract-ocr
Hello all!
I'm trying to train Tesseract to recognize a new English font (supercell-magic).
I created a .tif file and a matching .box file with jTessBoxEditor (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box) and ran Tesseract training on them.

Here is Tesseract's output:
$ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1
row xheight=30, but median xheight = 37.5455
APPLY_BOXES:
   Boxes read from boxfile:    1559
   Found 1559 good blobs.
Generated training data for 34 words
Page 2
APPLY_BOXES:
   Boxes read from boxfile:    1677
   Found 1677 good blobs.
Generated training data for 34 words
Page 3
APPLY_BOXES:
   Boxes read from boxfile:    1362
   Found 1362 good blobs.
Generated training data for 28 words


The next step is to extract the character set with unicharset_extractor.
Its output looked normal:
$ unicharset_extractor eng.supercell-magic.exp0.box
Extracting unicharset from eng.supercell-magic.exp0.box
Wrote unicharset file ./unicharset.

But when I view the file, almost every value is 0 or 255, which does not match the example in the wiki.

The wiki's example of a unicharset file:

110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
...

Mine looks more like this:
74
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # t [74 ]
h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # h [68 ]
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # n [6e ]
P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # P [50 ]
o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # o [6f ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
: 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # : [3a ]
r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # r [72 ]
l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # l [6c ]
i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # i [69 ]
1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # 1 [31 ]
N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # N [4e ]

Why is that? Thanks in advance.
I'm using Ubuntu 16.04 with this Tesseract version:
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
I have attached the box and TIFF files, the data file, and the unicharset file.
data.tar.gz

ShreeDevi Kumar

Jun 19, 2017, 7:58:40 AM
to tesser...@googlegroups.com
Do you have the common and latin unicharset files in your langdata directory?


Try to build the latest 3.05.01 version.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/cd052525-9eb7-4527-b75b-82e1a687997d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

David Barishev

Jun 19, 2017, 11:04:30 AM
to tesseract-ocr
Thanks for the reply.
If you mean the latin and common unicharset files in the tessdata directory (/usr/share/tesseract-ocr/tessdata): I downloaded them and placed them there, and I still get the same behavior.
I also tried from my Windows machine, which has version 3.05, and saw the same behavior.

ShreeDevi Kumar

Jun 19, 2017, 11:36:43 AM
to tesser...@googlegroups.com
Where do you have your source files for english langdata?

If it is in a directory such as ../langdata/eng/
then put the common.unicharset, latin.unicharset and font_properties etc in 
../langdata
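
The layout described here can be checked before training. This is a minimal sketch, assuming a langdata checkout at `../langdata`. The helper name `check_langdata` is hypothetical, and the match is case-insensitive because the files may be capitalized differently in your checkout (for example `Latin.unicharset`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the shared files named in the post
# sit directly under the langdata directory (not under eng/).
check_langdata() {
  local dir="$1" missing=0 name
  for name in common.unicharset latin.unicharset font_properties; do
    # -iname: case-insensitive match; -maxdepth 1: top level of $dir only
    if [ -n "$(find "$dir" -maxdepth 1 -iname "$name" -print 2>/dev/null)" ]; then
      echo "found:   $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the path is an assumption; adjust it to your checkout):
#   check_langdata ../langdata
```

The function returns a non-zero status when any file is missing, so it can gate a training script.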



ShreeDevi Kumar

Jun 19, 2017, 11:50:32 AM
to tesser...@googlegroups.com
You could also try training on your Windows PC with 3.05.01 by running tesstrain.sh under `git for windows`, which provides a shell for running bash scripts.


ShreeDevi Kumar

Jun 19, 2017, 11:58:57 AM
to tesser...@googlegroups.com
I would also suggest that you add spaces between words in your input text.


David Barishev

Jun 19, 2017, 5:02:43 PM
to tesseract-ocr
Hey, I am now trying to build Tesseract from source. After building Leptonica, the Tesseract build failed with this error:

/bin/bash ../libtool  --tag=CXX   --mode=link g++  -g -O2 -std=c++11   -o tesseract tesseract-tesseractmain.o libtesseract.la  -lrt -lpthread 
libtool: link: g++ -g -O2 -std=c++11 -o .libs/tesseract tesseract-tesseractmain.o  ./.libs/libtesseract.so -lrt -lpthread
/usr/bin/ld: tesseract-tesseractmain.o: undefined reference to symbol 'lept_free'
//usr/local/lib/liblept.so.5: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
Makefile:598: recipe for target 'tesseract' failed
make[2]: *** [tesseract] Error 1
make[2]: Leaving directory '/home/david/project/tesseract-3.05.01/api'
Makefile:489: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/david/project/tesseract-3.05.01'
Makefile:398: recipe for target 'all' failed
make: *** [all] Error 2


Any idea why?

shree

Jun 19, 2017, 11:59:58 PM
to tesseract-ocr

See https://github.com/tesseract-ocr/tesseract/issues/318
regarding the unicharset format

I was able to do regular tesseract training (not lstm) using tesseract 4.00.00 version from github master and create new unicharset and traineddata with your box/tiff pair. The output on the same tiff file is enclosed.

I think you will get better results with the training input text having interword spaces.
eng.supercell-magic.exp0-eng-magic.txt

David Barishev

Jun 20, 2017, 3:50:35 AM
to tesseract-ocr
Thank you so much for your help. I found my error: I need to set the script dir to the langdata folder when running set_unicharset_properties.
Do you know why my Tesseract isn't compiling? I would really love an updated version on my Ubuntu.

Thank you again.
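
For readers who hit the same all-`0,255` unicharset, the fix described in this post amounts to passing the langdata folder as the script directory. This is a minimal sketch, assuming the langdata checkout lives at `../langdata` and that `unicharset.props` is an acceptable output name. `count_placeholder_entries` is a hypothetical helper. The `-U`, `-O` and `--script_dir` options are the ones shown in the Tesseract 3.0x training instructions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count unicharset entries that still carry the
# placeholder metrics seen in the original post.
count_placeholder_entries() {
  grep -c ' 0,255,0,255,0,0,0,0,0,0 ' "$1"
}

# Assumption: location of the langdata checkout; override via environment.
LANGDATA_DIR="${LANGDATA_DIR:-../langdata}"

# Only run when the tool and an input unicharset are actually present.
if command -v set_unicharset_properties >/dev/null 2>&1 && [ -f unicharset ]; then
  echo "placeholder entries before: $(count_placeholder_entries unicharset)"
  set_unicharset_properties -U unicharset -O unicharset.props \
    --script_dir="$LANGDATA_DIR"
  echo "placeholder entries after:  $(count_placeholder_entries unicharset.props)"
fi
```

If the second count is still close to the first, the script directory probably does not contain the expected `*.unicharset` files.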

ShreeDevi Kumar

Jun 20, 2017, 4:03:25 AM
to tesser...@googlegroups.com
> Do you know why my tesseract isn't compiling? I would really love an updated version on my Ubuntu.

Not sure. I haven't built 3.05 branch. For master, I follow the usual autotools method.

Have you also built leptonica? Make sure you don't have any old leptonica version already.

Make sure you use either autotools for both or cmake for both tesseract and leptonica. Use the latest sources for both from github.
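
One way to act on the advice about stale Leptonica copies is to list every `liblept` shared object in the usual library directories. This is a minimal sketch. The helper name `list_leptonica_libs` and the default directory list are assumptions, so extend them for your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every liblept shared library found under the
# given directories (the defaults are common locations, not a complete list).
list_leptonica_libs() {
  if [ "$#" -eq 0 ]; then
    set -- /usr/lib /usr/local/lib
  fi
  # Ignore find's exit status so a missing directory does not abort the listing.
  { find "$@" -name 'liblept.so*' 2>/dev/null || true; } | sort
}

# Usage:
#   list_leptonica_libs                  # search the default directories
#   list_leptonica_libs /opt/lept/lib    # or name directories explicitly
```

Seeing more than one version in the output (for example a distribution package under `/usr/lib` next to a source build under `/usr/local/lib`) suggests that an older copy is still installed.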






David Barishev

Jun 20, 2017, 10:15:04 AM
to tesseract-ocr
After several tests, I have mixed results.

If I download Leptonica 1.74.4, build it, and then build the master branch, it works fine.
With the same version of Leptonica, the 3.05.01 release fails with the following error:

 
libtool: link: g++ -g -O2 -std=c++11 -o .libs/tesseract tesseract-tesseractmain.o  ./.libs/libtesseract.so -lrt -lpthread
/usr/bin/ld: tesseract-tesseractmain.o: undefined reference to symbol 'lept_free'
/usr/local/lib//liblept.so.5: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
Makefile:598: recipe for target 'tesseract' failed
make[2]: *** [tesseract] Error 1
make[2]: Leaving directory '/home/david/project/tesseract-3.05.01/api'
Makefile:489: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/david/project/tesseract-3.05.01'
Makefile:398: recipe for target 'all' failed
make: *** [all] Error 2


The docs state only a minimum Leptonica version needed to build Tesseract, so the latest Leptonica should also work with Tesseract 3.05.01.

Can you please try to build version 3.05.01?



ShreeDevi Kumar

Jun 20, 2017, 10:28:23 AM
to tesser...@googlegroups.com
Master branch currently includes the legacy engine. So you can easily build your custom traineddata using the following command (modify it for your fonts location, training text, font name etc)


training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --tessdata_dir ../tessdata \
  --training_text ../langdata/eng/eng.training_text \
  --langdata_dir ../langdata \
  --lang eng  \
  --exposures "0"    \
  --fontlist "Supercell Magic" \
  --output_dir ~/tesstutorial/engtest


shree

Jun 20, 2017, 11:39:53 AM
to tesseract-ocr
I got the same error building 3.05.01 and have filed it as an issue - https://github.com/tesseract-ocr/tesseract/issues/1000

shree

Jun 21, 2017, 7:56:22 AM
to tesseract-ocr


This has been fixed by @stweil via https://github.com/tesseract-ocr/tesseract/pull/1003

Please try with the latest code from 3.05 branch on github. 

David Barishev

Jun 26, 2017, 6:30:17 AM
to tesseract-ocr
I have successfully compiled from the latest branch.
Thank you for all the support. 