Tesseract debug output printed to console

Matt Hill

Aug 2, 2015, 2:05:29 PM
to tesseract-ocr

Using Tesseract as a library, I get a ton of this information printed to the console:


Total count=0
Min=0.00 Really=0
Lower quartile=0.00
Median=0.00, ile(0.5)=0.00
Upper quartile=0.00
Max=0.00 Really=0
Range=1
Mean= 0.00
SD= 0.00
Bottom=0, top=38, base=0, x=0


Is there any option or way to disable this?

zdenko podobny

Aug 2, 2015, 2:45:11 PM
to tesser...@googlegroups.com
This is not standard behavior. How are you using the library?
Please provide a simple example case.

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a1e18d78-c2ca-431a-9b16-5b5cbcb13eb5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ppsupport1

Nov 28, 2017, 1:45:47 AM
to tesseract-ocr


We see this same error using a build of the Tesseract 3.04 branch (with Leptonica 1.72).

We use the default build:
./configure
make DESTDIR=/tmp/tesseract304/ install

Then we run 'strip --strip-debug *' on the bin and lib folders and install into /usr/local/lib using mksquashfs.

We do not see these errors when we run the same code built on Windows.
In both cases everything works as expected; only on Linux do we see these errors.

It's not clear exactly where in the code these are emanating from.

Example output:
Total count=0
Min=0.00 Really=0
Lower quartile=0.00
Median=0.00, ile(0.5)=0.00
Upper quartile=0.00
Max=0.00 Really=0
Range=1
Mean= 0.00
SD= 0.00
Bottom=0, top=533, base=0, x=0

Total count=0
Min=0.00 Really=0
Lower quartile=0.00
Median=0.00, ile(0.5)=0.00
Upper quartile=0.00
Max=0.00 Really=0
Range=1
Mean= 0.00
SD= 0.00
Bottom=0, top=533, base=0, x=0

Library info:
 ldd /usr/local/lib/libtesseract.so
        linux-gate.so.1 (0xb77c3000)
        liblept.so.4 => /usr/local/lib/liblept.so.4 (0xb7160000)
        libpng16.so.16 => /usr/local/lib/libpng16.so.16 (0xb713a000)
        libjpeg.so.62 => /usr/local/lib/libjpeg-turbo/libjpeg.so.62 (0xb70c2000)
        libtiff.so.5 => /usr/local/lib/libtiff.so.5 (0xb7058000)
        liblzma.so.5 => /usr/local/lib/liblzma.so.5 (0xb703a000)
        libz.so.1 => /usr/lib/libz.so.1 (0xb7028000)
        libpthread.so.0 => /lib/libpthread.so.0 (0xb7000000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb6edd000)
        libm.so.6 => /lib/libm.so.6 (0xb6e9d000)
        libc.so.6 => /lib/libc.so.6 (0xb6d88000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xb6d71000)
        /lib/ld-linux.so.2 (0x80026000)
 ldd /usr/local/lib/liblept172.so
        linux-gate.so.1 (0xb771f000)
        libm.so.6 => /lib/libm.so.6 (0xb74d5000)
        libz.so.1 => /usr/lib/libz.so.1 (0xb74c3000)
        libpng16.so.16 => /usr/local/lib/libpng16.so.16 (0xb749d000)
        libjpeg.so.62 => /usr/local/lib/libjpeg-turbo/libjpeg.so.62 (0xb7425000)
        libtiff.so.5 => /usr/local/lib/libtiff.so.5 (0xb73bb000)
        libc.so.6 => /lib/libc.so.6 (0xb72a8000)
        /lib/ld-linux.so.2 (0x8002a000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xb7291000)
        liblzma.so.5 => /usr/local/lib/liblzma.so.5 (0xb7273000)
        libpthread.so.0 => /lib/libpthread.so.0 (0xb725c000)

# via git branch 3.04
/tesseract-ocr/tesseract$ git describe
3.04.01-63-g5afface

leptonica used source: leptonica-1.72.tar.gz

It seems other people are now also experiencing this issue, but I don't see much in the way of a resolution anywhere. Here are some build logs from another party with the issue:

We're using Tesseract through Emgu; here is example code:

public class Program
{
    public static void Main(string[] args)
    {
        const string TESSCHAR_ALPHA = @"ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        // imgh, t2 and d are defined elsewhere in our code (not shown here).
        using (var tmpimg = imgh.MakeBlank<Bgr,Byte>(150,50, Color.White).DrawText("HELLOWORLD! 0123456789",Color.Black,8,5,15).DrawText(t2,Color.Black,8,5,30).ToBitMap())
        {
            double tp;
            // Verbatim string: in a regular literal, "\t" would be read as a tab character.
            var txtvals = GetImageText<Gray>(d, @"c:\tessdata", tmpimg, out tp, string.Empty, Tesseract.PageSegMode.SingleChar, true, 300, 0, TESSCHAR_ALPHA);
            if (txtvals.Length == 0) throw new Exception("unable to get text");
            else
                foreach (var v in txtvals) Console.WriteLine(v);
        }
    }
}

        public static string GetImageText<T>(DebugDelegate d, string tessdata, Emgu.CV.Image<T, Byte> img, out double p, string regexfilter = "", PageSegMode textmode = PageSegMode.SingleBlock, bool isremovespecialchars = false, int TargetImgWidth = 300, int TargetImgH = 0, string txtdomain_chars = "", string tesslang = TESSLANG_DEFAULT, bool isthrowmissingdata = true) where T : struct, Emgu.CV.IColor
        {
            //setd(d);
            p = 0;
            if (isthrowmissingdata && string.IsNullOrWhiteSpace(tessdata))
                throw new Exception("No TESSDATA folder was defined in call, GetImageText failed.");
            System.Text.StringBuilder text = new System.Text.StringBuilder();
            using (var tessimg = GetTesserImage(img,TargetImgWidth,TargetImgH))
            using (var eng = new TesseractEngine(tessdata, tesslang, EngineMode.Default))
            {
                if (!string.IsNullOrWhiteSpace(txtdomain_chars))
                {
                    eng.SetVariable(TESSENGVAR_CHARWHITELIST, txtdomain_chars);
                    //v("whitelist: " + eng.GetVar<string>(TESSENGVAR_CHARWHITELIST));
                }
                using (var bmp = tessimg.ToBitmap())
                using (var block = eng.Process(bmp, textmode))
                {
                    var raw = block.GetText();
                    // Argument order matches GetFilteredText(text, isremovespecial, regexp).
                    var finaltext = GetFilteredText(raw, isremovespecialchars, regexfilter);
                    text.Append(finaltext);
                    p = block.GetMeanConfidence();
                }
            }



            return text.ToString();
        }

        public static string GetFilteredText(string text, bool isremovespecial = true, string regexp = "", bool isshrinkspaces = true)
        {
            string rem;
            var txt = isremovespecial ? RemoveSpecialCharacters(text, out rem) : text;
            if (isshrinkspaces)
                txt = GetTextShrink(txt);
            if (!string.IsNullOrWhiteSpace(regexp))
                txt = Util.rxm(txt, regexp); // one-liner wrapper around System.Text.RegularExpressions.Regex.Match
            return txt;
        }

        public static string GetTextShrink(string text, string compress2oneval = " ")
        {
            var ctxt = Util.rxr(text, "([" + compress2oneval + "]){2,}", "$1", false); // one-liner wrapper around Regex.Replace
            return ctxt;
        }

        public static Emgu.CV.Image<TColor, Byte> GetTesserImage<TColor>(Emgu.CV.Image<TColor, Byte> src,
            int targetwidth = 300, int targetheight = 0, bool ispreproc = false) where TColor : struct, Emgu.CV.IColor
        {
            var isw = targetwidth > 0;
            var ish = targetheight > 0;
            Emgu.CV.Image<TColor, Byte> tgtimg = null;

            // test for disable
            if (isw || ish)
            {
                if (!isw)
                    targetwidth = src.Width;
                else if (!ish)
                    targetheight = src.Height;
            }
            var isresize = (src.Width < targetwidth) || (src.Height < targetheight);
            if (!isresize) tgtimg = src.Not();


            if (isresize)
            {

                var srcratio = (double)src.Width / src.Height;
                var tgtratiow = (double)targetwidth / src.Width;
                var tgtratioh = (double)targetheight / src.Height;
                var isratw = tgtratiow > tgtratioh;
                int tw = 0, th = 0;
                if (isratw)
                {
                    tw = (int)(src.Width * tgtratiow);
                    th = (int)(tw / srcratio);
                }
                else
                {
                    th = (int)(src.Height * tgtratioh);
                    tw = (int)(th * srcratio); // srcratio is width / height, so width = height * srcratio
                }
                tgtimg = src.Not().Resize(tw, th, Emgu.CV.CvEnum.Inter.Cubic);
            }

            if (ispreproc)
            {
                using (Emgu.CV.Image<Emgu.CV.Structure.Gray, Byte> graytgt = imgh.Convert<TColor, Emgu.CV.Structure.Gray>(tgtimg))
                {
                    Emgu.CV.CvInvoke.AdaptiveThreshold(graytgt, graytgt, 255, Emgu.CV.CvEnum.AdaptiveThresholdType.GaussianC, Emgu.CV.CvEnum.ThresholdType.Binary, 15, 20);
                    Emgu.CV.CvInvoke.EqualizeHist(graytgt, graytgt);
                    tgtimg = imgh.Convert<Emgu.CV.Structure.Gray, TColor>(graytgt);
                }

            }

            return tgtimg;
        }

On the Emgu side, Tesseract.TesseractEngine.cs has this informational (auto-generated) header:
#region Assembly Tesseract-4Tgt.dll, v4.0.30319
// C:\Users\ppsupport1\Documents\Visual Studio 2010\Projects\bp\Libraries\bs\dlls\Tesseract-4Tgt.dll
#endregion

It was run with no special parameters:
mono test.exe
$ mono --version
Mono JIT compiler version 4.4.2 (Stable 4.4.2.11/f72fe45 Sun Sep 18 09:57:57 UTC 2016)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       altstack
        Notifications: epoll
        Architecture:  x86
        Disabled:      none
        Misc:          softdebug
        LLVM:          supported, not enabled.
        GC:            sgen

$ ldd `which mono`
        linux-gate.so.1 (0xb7734000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb760d000)
        libm.so.6 => /lib/libm.so.6 (0xb75bc000)
        librt.so.1 => /lib/librt.so.1 (0xb75b4000)
        libdl.so.2 => /lib/libdl.so.2 (0xb75b0000)
        libpthread.so.0 => /lib/libpthread.so.0 (0xb7599000)
        libc.so.6 => /lib/libc.so.6 (0xb7486000)
        /lib/ld-linux.so.2 (0x800b4000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xb746f000)



I realize there is probably a simpler way to reproduce this, but since the output seems to be informational only, no debugging was enabled, and others are seeing it too, I figured I'd share some of our data.

We also have many Windows 10 builds of the same software that do not produce these messages.

Zdenko Podobný

Nov 28, 2017, 1:47:53 AM
to tesser...@googlegroups.com
Version 3.04 is too old. Nobody will check or improve it. Please use the current Tesseract version.

Zdenko


Joshua Franta

Nov 28, 2017, 2:03:23 AM
to tesser...@googlegroups.com

This output appears to originate from here:

tesseract/ccstruct/statistc.cpp
/**********************************************************************
 * STATS::print_summary
 *
 * Print a summary of the stats.
 **********************************************************************/
void STATS::print_summary() const {
  if (buckets_ == NULL) {
    return;
  }
  inT32 min = min_bucket();
  inT32 max = max_bucket();
  tprintf("Total count=%d\n", total_count_);
  tprintf("Min=%.2f Really=%d\n", ile(0.0), min);
  tprintf("Lower quartile=%.2f\n", ile(0.25));
  tprintf("Median=%.2f, ile(0.5)=%.2f\n", median(), ile(0.5));
  tprintf("Upper quartile=%.2f\n", ile(0.75));
  tprintf("Max=%.2f Really=%d\n", ile(1.0), max);
  tprintf("Range=%d\n", max + 1 - min);
  tprintf("Mean= %.2f\n", mean());
  tprintf("SD= %.2f\n", sd());
}

It is called in the same file here:

void STATS::print() const { /* ... */ }
The generic 'print' name, how often it is invoked, and my unfamiliarity with how Tesseract is organized make it tough to see where the call can be disabled at the moment.

Any suggestions/pointers are much appreciated.
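
One workaround that does not require patching Tesseract, sketched under assumptions: if the summary reaches the console through tprintf writing to stderr (which I believe is the default when no debug file is configured, but nothing in this thread confirms it), the process can point file descriptor 2 at /dev/null around the noisy call. StderrSilencer and noisy_library_call() below are hypothetical names for illustration, not Tesseract APIs, and the code is POSIX-only. Tesseract also has a debug_file variable that may achieve the same effect through SetVariable; that is untested here as well.

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// RAII guard (hypothetical helper, not part of Tesseract): while an instance
// is alive, everything written to stderr (fd 2) is sent to /dev/null.
class StderrSilencer {
 public:
  StderrSilencer() {
    std::fflush(stderr);                     // flush anything still pending
    saved_fd_ = dup(STDERR_FILENO);          // remember the real stderr
    int null_fd = open("/dev/null", O_WRONLY);
    if (saved_fd_ >= 0 && null_fd >= 0) {
      dup2(null_fd, STDERR_FILENO);          // fd 2 now points at /dev/null
    }
    if (null_fd >= 0) {
      close(null_fd);                        // fd 2 keeps its own reference
    }
  }

  ~StderrSilencer() {
    std::fflush(stderr);
    if (saved_fd_ >= 0) {
      dup2(saved_fd_, STDERR_FILENO);        // restore the original stderr
      close(saved_fd_);
    }
  }

  StderrSilencer(const StderrSilencer&) = delete;
  StderrSilencer& operator=(const StderrSilencer&) = delete;

 private:
  int saved_fd_ = -1;
};

// Hypothetical stand-in for a library call that prints statistics to stderr.
void noisy_library_call() {
  std::fprintf(stderr, "Total count=0\n");
}
```

To use it, wrap only the recognition call in a scope that holds the guard, for example `{ StderrSilencer quiet; noisy_library_call(); }`; stderr is restored when the scope ends. The redirection is process-wide, so it also hides genuine error messages and is not safe if other threads write to stderr at the same time.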



Joshua Franta

Nov 28, 2017, 2:45:04 AM
to tesser...@googlegroups.com

Good advice, but we'd like to see the existing build work well on Linux before we switch versions.

This may not be an impediment; I'm just noting it since everyone who reported this before had no resolution. 3.04 was fairly popular, good work there. :)

The developers posted a fix for this on GitHub a moment ago; cross-posting to bump up the answer:

