Tesseract converts image to gibberish


Dusayanta Prasad

Feb 25, 2018, 2:44:42 PM
to tesseract-ocr

I am trying to convert the image below using Tesseract on Linux with the following command:

tesseract img.jpg out -l eng

and I am getting a result like this:



Please help me out.


Zdenko Podobny

Feb 25, 2018, 2:46:38 PM
to tesser...@googlegroups.com

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3fb7240e-612c-4e64-abc8-99a07c3a0447%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Feb 25, 2018, 4:18:32 PM
to tesser...@googlegroups.com
Which version of Tesseract are you using?

See the attached results with Tesseract 4 and eng from tessdata_fast.



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

eng-skewed.txt

Greg Dunkel

Feb 26, 2018, 1:57:02 AM
to tesser...@googlegroups.com
The scan is probably at too low a DPI. It is also slightly skewed.

--
/greg
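
As a rough aid for the DPI point above, here is a minimal shell sketch that computes how much a scan would need to be enlarged to reach a target resolution. The 300 dpi default is a commonly cited rule of thumb for Tesseract, not a figure from this thread, and the function name `scale_percent_for_dpi` and the example value are hypothetical.

```shell
#!/bin/sh
# scale_percent_for_dpi CURRENT_DPI [TARGET_DPI]
# Prints the enlargement, in percent (rounded up), needed to bring a scan
# from CURRENT_DPI to TARGET_DPI. Prints 100 if the scan already meets the target.
# Assumptions: CURRENT_DPI is a positive integer; 300 dpi is only a rule-of-thumb default.
scale_percent_for_dpi() {
    current=$1
    target=${2:-300}
    if [ "$current" -ge "$target" ]; then
        echo 100
    else
        # Integer ceiling of target * 100 / current
        echo $(( (target * 100 + current - 1) / current ))
    fi
}

scale_percent_for_dpi 150   # a 150 dpi scan would need 200% enlargement
```

The resulting percentage could be passed to an image tool's resize option before running Tesseract. Upscaling cannot restore detail the scan never captured, so rescanning at a higher resolution is usually preferable where that is possible.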

Dusayanta Prasad

Feb 26, 2018, 7:15:16 PM
to tesseract-ocr
Could you please send me the link for Tesseract 4?
Also, please tell me the method you used to perform the OCR.


Dusayanta Prasad

Feb 26, 2018, 7:44:20 PM
to tesseract-ocr
I am using Tesseract from the Ubuntu command line. The version is:
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

Regarding the gibberish text: I had to convert the image to .tif format. Then I ran Tesseract on the .tif image as:
tesseract img.tif outtif
The text generated in my case differs notably from yours. Yours has good accuracy. Please tell me how you achieved that.
Have a look at the attached text file.

outtif.txt

ShreeDevi Kumar

Feb 27, 2018, 4:25:18 AM
to tesser...@googlegroups.com
You can download the latest version of tesseract-ocr and the appropriate traineddata from 


I ran tesseract via command line with default values.


You may need to remove the existing old version before installing the new one.
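
For readers who want to try a similar run, here is a minimal sketch of a default command-line invocation that points Tesseract at a specific traineddata directory. The directory `$HOME/tessdata_fast` and the helper name `run_ocr` are hypothetical; `--tessdata-dir` and `-l` are standard tesseract command-line options. This is an illustration only and not necessarily the exact command used for the attached results.

```shell
#!/bin/sh
# Hypothetical location of a directory holding eng.traineddata from tessdata_fast.
TESSDATA_FAST_DIR="${TESSDATA_FAST_DIR:-$HOME/tessdata_fast}"

# run_ocr IMAGE OUTPUT_BASE
# Runs tesseract with default settings; the text is written to OUTPUT_BASE.txt.
run_ocr() {
    tesseract "$1" "$2" --tessdata-dir "$TESSDATA_FAST_DIR" -l eng
}

# Only run when tesseract is installed and the example image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f img.jpg ]; then
    run_ocr img.jpg out
fi
```

Passing the directory explicitly avoids depending on whichever traineddata an older system-wide installation may still have in place.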


Dusayanta Prasad

Mar 3, 2018, 9:39:50 AM
to tesseract-ocr
What if I build Leptonica and Tesseract from source, following the method on GitHub?

shree

Mar 3, 2018, 10:08:07 AM
to tesseract-ocr
Sure, if you are comfortable building software on Linux. You will need to make sure all the dependencies are installed.

Dusayanta Prasad

Mar 3, 2018, 10:29:41 AM
to tesseract-ocr
Please tell me one more thing. Before feeding the image to Tesseract, do you perform any kind of pre-processing, such as binarising the image?
I didn't get the same result as yours even after trying Tesseract 4 with eng tessdata_best.

Dusayanta Prasad

Mar 3, 2018, 12:27:11 PM
to tesseract-ocr
Please help with this:

dusayanta@dusayanta:~/tessy$ tesseract -v
tesseract 4.00.00dev-731-gb9b08c7
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

 Found AVX
 Found SSE


dusayanta@dusayanta:~/tessy$ tesseract book.tif book -l eng
Error opening data file /home/dusayanta/tesseract/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
dusayanta@dusayanta:~/tessy$ cd ..
dusayanta@dusayanta:~$ ls tesseract/tessdata/
configs  eng.traineddata  eng.user-patterns  eng.user-words  Makefile.am  Makefile.in  pdf.ttf  tessconfigs

dusayanta@dusayanta:~$ printenv
XDG_VTNR=7
XDG_SESSION_ID=c2
CLUTTER_IM_MODULE=xim
XDG_GREETER_DATA_DIR=/var/lib/lightdm-data/dusayanta
SESSION=ubuntu
GPG_AGENT_INFO=/home/dusayanta/.gnupg/S.gpg-agent:0:1
SHELL=/bin/bash
TERM=xterm-256color
VTE_VERSION=4205
QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1
TESSDATA_PREFIX=/home/dusayanta/tesseract/tessdata/
WINDOWID=75497482
UPSTART_SESSION=unix:abstract=/com/ubuntu/upstart-session/1000/1099
GNOME_KEYRING_CONTROL=
GTK_MODULES=gail:atk-bridge:unity-gtk-module
USER=dusayanta

ShreeDevi Kumar

Mar 3, 2018, 12:28:38 PM
to tesser...@googlegroups.com
No, I did not pre-process the image.

I used tessdata_fast, NOT tessdata_best.


ShreeDevi Kumar

Mar 3, 2018, 12:53:03 PM
to tesser...@googlegroups.com
ls -l  /home/dusayanta/tesseract/tessdata/eng.traineddata

combine_tessdata -d /home/dusayanta/tesseract/tessdata/eng.traineddata
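
The two commands above inspect the traineddata file itself. A minimal sketch of the same idea as a script follows. It assumes the goal is to rule out a missing, unreadable, or empty file, which is a guess at the intent and not something stated in the thread. The function name `check_traineddata` is hypothetical; the default path is the one from the error message.

```shell
#!/bin/sh
# check_traineddata FILE
# Prints one of: missing, unreadable, empty, ok
# ("missing" is also printed when the path is not a regular file).
check_traineddata() {
    if [ ! -f "$1" ]; then
        echo missing
    elif [ ! -r "$1" ]; then
        echo unreadable
    elif [ ! -s "$1" ]; then
        echo empty
    else
        echo ok
    fi
}

# Path taken from the error message; pass another path as the first argument.
file="${1:-/home/dusayanta/tesseract/tessdata/eng.traineddata}"
status=$(check_traineddata "$file")
echo "$file: $status"

# If the file looks usable and the training tools are installed,
# list the components stored inside it.
if [ "$status" = ok ] && command -v combine_tessdata >/dev/null 2>&1; then
    combine_tessdata -d "$file"
fi
```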



ShreeDevi Kumar

Mar 3, 2018, 12:55:40 PM
to tesser...@googlegroups.com
Also check:

tesseract --list-langs


ShreeDevi Kumar

Mar 3, 2018, 12:56:58 PM
to tesser...@googlegroups.com
The exact directory will depend both on the type of training data and on your Linux distribution. Possibilities are /usr/share/tesseract-ocr/tessdata, /usr/share/tessdata, or /usr/share/tesseract-ocr/4.00/tessdata.
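
One way to find out which of these directories actually holds the language data is to probe each candidate in turn. Here is a minimal sketch, assuming the candidate paths from the message above plus whatever TESSDATA_PREFIX is set to; the function name `find_traineddata` is hypothetical.

```shell
#!/bin/sh
# find_traineddata LANG DIR...
# Prints the first DIR that contains a readable LANG.traineddata.
# Returns non-zero if none of the directories has it.
find_traineddata() {
    lang=$1
    shift
    for dir in "$@"; do
        if [ -r "$dir/$lang.traineddata" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Probe TESSDATA_PREFIX first (if set), then the directories named above.
find_traineddata eng \
    ${TESSDATA_PREFIX:+"$TESSDATA_PREFIX"} \
    /usr/share/tesseract-ocr/tessdata \
    /usr/share/tessdata \
    /usr/share/tesseract-ocr/4.00/tessdata \
    || echo "eng.traineddata not found in the candidate directories" >&2
```

The output can be compared with what `tesseract --list-langs` reports to see whether Tesseract is looking in the same place.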


Dusayanta Prasad

Mar 3, 2018, 5:57:33 PM
to tesseract-ocr
Which produces the better result: tessdata_fast or tessdata_best?

ShreeDevi Kumar

Mar 3, 2018, 6:15:20 PM
to tesser...@googlegroups.com
The recommendation from Ray is to use tessdata_fast.


