Install Tesseract 4 on CentOS and Red Hat [SOLVED!]


Eugene Huang

Apr 23, 2018, 2:22:40 PM
to tesseract-ocr
Hello! Most people are probably running Tesseract 4 on Ubuntu, macOS, and Windows. Unfortunately, there are no clear instructions for installing Tesseract 4 on other flavors of Linux, most notably CentOS and Red Hat.

After going through dependency hell, I successfully installed Tesseract 4 onto CentOS 7. I presume that the installation script should also work for Red Hat. I want to give credit to EisenVault because this script is essentially a modified version of his script. This is my first contribution to open source software, so any tips will be highly appreciated!

When running this script line by line, you will probably have to prefix each line with "sudo". Alternatively, copy and paste everything into a bash script and run that script with sudo. I have tested both approaches on a fresh image of CentOS 7 on VirtualBox.

Cheers!

# (Estimated Time of Completion: 45 minutes)
# Instructions taken (and slightly modified) from https://github.com/EisenVault/install-tesseract-redhat-centos/blob/master/install-tesseract.sh
cd /opt
# The following line will take 30 minutes to install.
yum -y update
yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
yum group install -y "Development Tools"


# Install Leptonica from Source
wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
tar -zxvf leptonica-1.75.3.tar.gz
cd leptonica-1.75.3
./autobuild
./configure
make -j
make install
cd ..
# Delete tar.gz file if you like


# Sanity checks
# check if libpng is installed: type "whereis libpng" and expect to see a directory; a blank line is not good
# check if leptonica is installed: type "ls /usr/local/include" and expect to see "leptonica"


# Install Tesseract from Source
wget https://github.com/tesseract-ocr/tesseract/archive/4.0.0-beta.1.tar.gz
tar -zxvf 4.0.0-beta.1.tar.gz
cd tesseract-4.0.0-beta.1/
./autogen.sh
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
make install
ldconfig
cd ..
# Delete tar.gz file if you like


# Download and install tesseract language files (Tesseract 4 traineddata files)
wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/master/equ.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata
# download any other languages you like
mv *.traineddata /usr/local/share/tessdata


# Sanity check
# check if tesseract is installed: type "tesseract --version" and expect to see 1st line (tesseract), 2nd line (leptonica), 3rd line (libraries for images)
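
The sanity checks in the script are described in comments only. A minimal sketch of how they could be automated is below. The helper names `check_cmd` and `check_path` are made up for illustration, and the paths assume the default /usr/local prefix used by the script above. The helpers report problems instead of aborting, so the script is safe to run at any stage.

```shell
#!/bin/sh
# Post-install sanity checks; reports problems instead of aborting.

# Print "ok: NAME" if a command is on PATH, otherwise "MISSING: NAME".
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "MISSING: $1"
    fi
}

# Print "ok: PATH" if a file or directory exists, otherwise "MISSING: PATH".
check_path() {
    if [ -e "$1" ]; then
        echo "ok: $1"
    else
        echo "MISSING: $1"
    fi
}

check_cmd tesseract
# Assumed locations: the default /usr/local prefix from the script above.
check_path /usr/local/include/leptonica
check_path /usr/local/share/tessdata/eng.traineddata

# Only query tesseract if it is actually installed.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version
    tesseract --list-langs || echo "could not list languages"
fi
```

Any "MISSING" line points at the step to repeat: Leptonica for the include directory, the traineddata download for the language file.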

ShreeDevi Kumar

Apr 23, 2018, 2:37:09 PM
to tesser...@googlegroups.com
Thanks for the script to install tesseract on CentOS.

I would suggest using traineddata files from tessdata_fast or tessdata_best repos for better accuracy and speed.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d41ebcc5-b3b1-4e66-af8a-c7896814a7cc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Eugene Huang

Apr 24, 2018, 11:38:27 AM
to tesseract-ocr
@Shree
Thanks for the tip. Just 2 quick questions. 
1) From https://github.com/tesseract-ocr/tesseract/wiki/Data-Files, it says that "osd" and "equ" traineddata files are compatible between Tesseract 3 and 4. In the GitHub tessdata_fast repo (https://github.com/tesseract-ocr/tessdata_fast), "osd" is there with the commit "Use legacy Orientation Script Detector (OSD) because that is the only thing that currently works." However, "equ" is not in the repo. Was this simply a small mistake where the maintainer forgot to include the "equ" data file?

2) Also, with tessdata_fast, I was able to get Tesseract 4 running faster than with tessdata. However, is Tesseract 4 supposed to be slower than Tesseract 3? That is what I am experiencing.




# Here are the updated instructions to download tessdata_fast, which I tested to indeed perform faster than tessdata.
# However, when calling Tesseract from the command line, using the arguments "--oem 2" will no longer work. 
# Use "--oem 1" since only the neural net LSTM model exists if using tessdata_fast.
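
The updated commands themselves do not appear in the message above. A minimal sketch of what they might look like follows, assuming the tessdata_fast repository uses the same raw-download URL layout as the tessdata links in the original script (that layout, the helper names, and the `RUN` switch are assumptions, not something confirmed in this thread). "equ" is left out because, as noted above, it is not in the tessdata_fast repo. By default the script only prints the wget commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: fetch LSTM-only models from tessdata_fast (dry run by default).

# Assumed URL layout, mirroring the tessdata links in the original script.
BASE_URL="https://github.com/tesseract-ocr/tessdata_fast/raw/master"
TESSDATA_DIR="/usr/local/share/tessdata"

# Print the download URL for one language code, e.g. "eng".
tessdata_fast_url() {
    echo "$BASE_URL/$1.traineddata"
}

# Print (or, with RUN=1, execute) the wget command for each language.
fetch_langs() {
    for lang in "$@"; do
        url=$(tessdata_fast_url "$lang")
        if [ "${RUN:-0}" = "1" ]; then
            wget -O "$TESSDATA_DIR/$lang.traineddata" "$url"
        else
            echo "wget -O $TESSDATA_DIR/$lang.traineddata $url"
        fi
    done
}

fetch_langs osd eng chi_sim
```

Once the files are in place, a call such as `tesseract image.png output --oem 1 -l eng` selects the LSTM engine; `image.png` and `output` are placeholder names.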

ShreeDevi Kumar

Apr 24, 2018, 12:09:31 PM
to tesser...@googlegroups.com
I have never used equ.traineddata. From feedback in the forum I don't think it works very well. Maybe equ has not been trained via LSTM training, I have no way of knowing. Only Ray Smith or other developers from Google can answer that.

Only LSTM models exist in tessdata_best and tessdata_fast.

Depending on the language and the hardware that you are running on, Tesseract 4 can be slower than Tesseract 3; see the various performance-related issues on GitHub. However, accuracy has improved a lot, and a larger number of languages are available for Tesseract 4.

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com


Александр Поздняков

Apr 24, 2018, 12:34:55 PM
to tesseract-ocr
Hi. I built an rpm package of tesseract-ocr for CentOS, Fedora, Scientific Linux, and openSUSE. It still needs to be tested.
https://build.opensuse.org/project/show/home:Alexander_Pozdnyakov

On Monday, April 23, 2018 at 21:22:40 UTC+3, Eugene Huang wrote:

Eugene Huang

Apr 25, 2018, 9:30:01 AM
to tesseract-ocr
Hello Александр!

I took a look at your work; it is very extensive. If all the installations work, this should be front-paged! I have never used openSUSE. Could you point me to some resources on how to use your installation packages?


@shree
Thanks for the info. I definitely notice that Tesseract 4 is more accurate. For example, Tesseract 4 can read small italic fonts, whereas Tesseract 3 makes lots of mistakes. Seems like Tesseract 4 is the future!

shree

Apr 25, 2018, 11:57:13 AM
to tesseract-ocr
Thanks for the rpm package, Alex. I have added the info to https://github.com/tesseract-ocr/tesseract/wiki 

Александр Поздняков

Apr 25, 2018, 12:47:15 PM
to tesseract-ocr
For CentOS:
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
yum update
yum install tesseract

For example, to add the German language pack:
yum install tesseract-langpack-deu


On Wednesday, April 25, 2018 at 16:30:01 UTC+3, Eugene Huang wrote:

Adrian

Jul 13, 2018, 6:20:38 AM
to tesseract-ocr
As of today, when running yum install tesseract, it times out.

Periasamy Kanagavel

Sep 6, 2018, 11:54:41 PM
to tesseract-ocr
I am new to CentOS and am trying the steps mentioned here. Up to the step "PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ...", there were no issues. While running the command "LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j", I got the error "libtool: Version mismatch error.  This is libtool 2.4.6, but the". Am I missing anything?
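
A libtool "Version mismatch" message generally indicates that the generated build files in the source tree (aclocal.m4, ltmain.sh) were produced by a different libtool version than the one now installed. One commonly suggested remedy is to regenerate those files and then configure and build again. The sketch below has not been verified against this particular setup; the `run` and `regenerate_and_rebuild` helpers and the `RUN` switch are made up for illustration. It is meant to be run from inside the tesseract source directory, and by default it only prints the commands.

```shell
#!/bin/sh
# Sketch: regenerate the autotools files, then configure and build again.
# Dry run by default: commands are printed; set RUN=1 to execute them.

# Print the command, or execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Rebuild sequence; the configure and make settings are the ones
# from the original script earlier in this thread.
regenerate_and_rebuild() {
    # Recreate aclocal.m4, ltmain.sh and configure with the installed libtool.
    run autoreconf --force --install
    run env PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include \
        ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
    run env LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
}

regenerate_and_rebuild
```

If the mismatch persists, comparing the output of `libtool --version` with the version named in the full error message may show which of the two is stale.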

Eugene Huang

Sep 9, 2018, 7:26:27 PM
to tesser...@googlegroups.com
Hello Periasamy!

Do you know what version of CentOS you're using? I used CentOS 7, and I haven't tried this installation script on other versions of CentOS. I never got the libtool error, so I'm sorry that I don't know what the exact solution is. If you are using CentOS 7, perhaps there have been updates to packages (like libtool) that make installation a little trickier.

Good luck! If you discover the solution, feel free to post it here.
Eugene




shree

Oct 25, 2018, 4:17:48 PM
to tesseract-ocr
Hi Alex,

Do you have a package for Fedora 28 for tesseract 4?

Александр Поздняков

Oct 26, 2018, 3:35:28 PM
to tesser...@googlegroups.com
Hi.
Yes.

On Thu, Oct 25, 2018 at 23:17, shree <shree...@gmail.com> wrote:

Vinsec

Feb 17, 2019, 6:59:27 AM
to tesseract-ocr
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
make[2]: Leaving directory `/root/tesseract-4.0.0/src/classify'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/tesseract-4.0.0'
make: *** [all] Error 2

I'm using CentOS 7 and installing Tesseract with your script, but the build failed when I ran the command above.
I would appreciate it if you could give me some advice :)
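
The lines quoted in this post are only the tail of the build log, so the actual compiler error is not visible. A bare `make -j` places no limit on the number of parallel jobs, which interleaves the output and can exhaust memory on a small virtual machine. The sketch below picks a bounded job count and prints a build command that also saves the log; the `cpu_count` helper and the `build.log` file name are made up for illustration.

```shell
#!/bin/sh
# Sketch: choose a bounded job count for make instead of an unlimited "-j".

# Print the number of online CPUs, falling back to 1 if it cannot be determined.
cpu_count() {
    n=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # Guard against empty or non-numeric output.
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}

JOBS=$(cpu_count)

# Print the suggested build command; build.log is a hypothetical file name.
echo "LDFLAGS=\"-L/usr/local/lib\" CFLAGS=\"-I/usr/local/include\" make -j$JOBS 2>&1 | tee build.log"
```

Searching the saved log for the first line containing "error" usually points closer to the real cause than the final "Error 1" and "Error 2" summary lines.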

On Tuesday, April 24, 2018 at 2:22:40 AM UTC+8, Eugene Huang wrote:

Eugene Huang

Mar 5, 2019, 10:15:51 AM
to tesseract-ocr
Hello Vinsec!

Sorry for the slow reply. Work is keeping me busy. No mystery there.
I have been moved to another project, so I haven't used Tesseract for a while. I hope somebody more experienced, like Shree or Александр Поздняков, can give you the right answer.

Good luck!
Eugene

Ivan Auffret

Jul 2, 2019, 12:45:31 PM
to tesseract-ocr
Hi Alexander, 

I am trying to install from your repo but I am getting the following error:

Public key for tesseract-langpack-eng-4.00~git30-5.1.noarch.rpm is not installed

Does anyone know what I can do?

Александр Поздняков

Jul 3, 2019, 6:25:45 AM
to tesser...@googlegroups.com
Hi.
You need to add the repository key:

rpm --import repomd.xml.key
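
The command above expects `repomd.xml.key` to be present locally. A sketch of where the key might be fetched from is below. It assumes the key is published under `repodata/` inside the repository URL given earlier in this thread, which is a common layout but is not confirmed here; the `repo_key_url` helper is made up for illustration. The script only prints the commands.

```shell
#!/bin/sh
# Sketch: locate and import the signing key for the repository added earlier.

REPO_URL="https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7"

# Assumption: the key is published at repodata/repomd.xml.key under the repo URL.
# A trailing slash on the repo URL is tolerated.
repo_key_url() {
    echo "${1%/}/repodata/repomd.xml.key"
}

KEY_URL=$(repo_key_url "$REPO_URL")

# Print the commands to run (with sudo where needed).
echo "wget $KEY_URL"
echo "rpm --import repomd.xml.key"
```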


On Tue, Jul 2, 2019 at 19:45, Ivan Auffret <yvan.a...@gmail.com> wrote:

Lakshay Saini

Dec 20, 2019, 2:54:11 AM
to tesseract-ocr
Hello all,

I am a bit late to join this conversation. I recently followed your script and installed Tesseract on CentOS 7 successfully.
So far it is working fine. The problem is that I am unable to use the "--psm 6" mode. I tried it from the terminal and successfully created a PDF from a JPEG; however, it does not produce the expected results.

On Windows 10, on the other hand, I tried the same thing from the command line, and it produces the expected results. I am unable to identify the problem. Maybe the Tesseract version is the problem?

CentOS 7 "tesseract --version":

tesseract 4.00.00alpha
 leptonica-1.78.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0 : libopenjp2 2.3.1
 Found AVX
 Found SSE

Windows 10 "tesseract --version":

tesseract v4.0.0.20181030
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0

Also, if the version is the problem, how can I install v4.0.0.20181030 on CentOS 7?

Regards
Lakshay Saini
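
The two version strings above differ ("4.00.00alpha" on CentOS, "v4.0.0.20181030" on Windows), which suggests the two machines are running different builds. A small sketch for extracting and comparing the version from the first line of `tesseract --version` follows, along with the tarball URL for the 4.0.0 tag, built on the same pattern as the 4.0.0-beta.1 link in the original script. The helper names are made up for illustration, and whether a newer build changes the "--psm 6" results is not established in this thread.

```shell
#!/bin/sh
# Sketch: compare tesseract builds by their reported version string.

# Extract the version from the first line of "tesseract --version" output,
# dropping the leading "tesseract " and an optional "v" prefix.
parse_tesseract_version() {
    printf '%s\n' "$1" | sed -n '1s/^tesseract v\{0,1\}//p'
}

# Build the source tarball URL for a release tag, following the URL
# pattern used for 4.0.0-beta.1 in the original script.
release_tarball_url() {
    echo "https://github.com/tesseract-ocr/tesseract/archive/$1.tar.gz"
}

# The two first lines reported in the post above.
parse_tesseract_version "tesseract 4.00.00alpha"
parse_tesseract_version "tesseract v4.0.0.20181030"

# Tarball that could replace the beta.1 download in the original script.
release_tarball_url 4.0.0
```

With that tarball, the "Install Tesseract from Source" steps would use `4.0.0.tar.gz` and `tesseract-4.0.0` in place of the beta.1 names.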

On Monday, April 23, 2018 at 11:52:40 PM UTC+5:30, Eugene Huang wrote: