Hi all,
I'm in a time-critical situation: I need to deliver new software to our customer on 5 January 2017.
While things worked well in the test environment, problems came up after deploying the software to the production environment.
Before describing the failure in detail, here is some information about the setup and the environment.
Environment & Installation
Operating System: SUSE Linux Enterprise Server 12 SP1
$ uname -a
Linux 3.12.62-60.64.8-default #1 SMP Tue Oct 18 12:21:38 UTC 2016 (42e0a66) x86_64 x86_64 x86_64 GNU/Linux
Since this environment is managed, I cannot update any system libraries such as glibc.
So the newest, and only officially supported, version of tesseract I found for "Suse 12 SP1 x86_64" is 3.02.
Installed Packages:
libgif4-4.1.6-34.1.1.x86_64.rpm
liblept3-1.69-16.1.x86_64.rpm
libtesseract3-3.02.02-3.2.1.x86_64.rpm
libwebp4-0.3.1-34.1.x86_64.rpm
tesseract-3.02.02-59.1.x86_64.rpm
tesseract version
$ tesseract -v
tesseract 3.02.02
leptonica-1.69
libgif 4.1.6 : libjpeg 8d : libpng 1.5.22 : libtiff 4.0.6 : zlib 1.2.8
Release details
$ zypper info tesseract
Information for package tesseract:
----------------------------------
Repository: @System
Name: tesseract
Version: 3.02.02-59.1
Arch: x86_64
Vendor: obs://build.opensuse.org/home:koprok
Support Level: unknown
Installed: Yes
Status: up-to-date
Installed Size: 3.8 MiB
Summary: Open Source OCR Engine
Description: […]
Traindata & Languages
Traindata
The traineddata files were downloaded manually from GitHub and placed in /usr/share/tessdata/:
$ ls -la /usr/share/tessdata/
drwxr-xr-x 1 root root 230 Dec 31 16:37 configs/
-rw-r--r-- 1 root root 2438081 Dec 30 15:31 deu.traineddata
-rw-r--r-- 1 root root 171918 Dec 30 20:16 eng.cube.bigrams
-rw-r--r-- 1 root root 38 Dec 30 20:16 eng.cube.fold
-rw-r--r-- 1 root root 181 Dec 30 20:16 eng.cube.lm
-rw-r--r-- 1 root root 857304 Dec 30 20:16 eng.cube.nn
-rw-r--r-- 1 root root 254 Dec 30 20:16 eng.cube.params
-rw-r--r-- 1 root root 13020078 Dec 30 20:16 eng.cube.size
-rw-r--r-- 1 root root 2444187 Dec 30 20:16 eng.cube.word-freq
-rw-r--r-- 1 root root 996 Dec 30 20:16 eng.tesseract_cube.nn
-rw-r--r-- 1 root root 21876572 Dec 30 20:16 eng.traineddata
drwxr-xr-x 1 root root 88 Dec 31 16:37 tessconfigs/
tesseract detects 'deu' and 'eng' as available languages
$ tesseract --list-langs
List of available languages (2):
deu
eng
Application & Problem
The software application is built on the Spring Boot framework and calls tesseract as follows:
Runtime.getRuntime().exec(new String[] {
    "tesseract",
    "--tessdata-dir", "/usr/share/tessdata",
    "-l", lang.getISO3Language(),
    inputTiff.toAbsolutePath().toString(), extractedcntPath });
The application log file says:
2016-12-30 20:30:02,320 [https-jsse-nio-8443-exec-7] WARN PDFContentExtractor - read_params_file: parameter not found: II*
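For reference, here is a minimal sketch of how the call above could be wrapped so that the exit code and tesseract's console output both end up in the application log. The class and method names (`TesseractCommand`, `build`, `run`) are hypothetical and not part of our actual code. The `--tessdata-dir` option is only passed through when a directory is given; whether the installed 3.02.02 binary accepts that option is exactly the open question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TesseractCommand {

    /** Builds the argument list; pass null as tessdataDir to omit the option. */
    static List<String> build(String tessdataDir, String lang, Path input, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        if (tessdataDir != null) {
            cmd.add("--tessdata-dir");
            cmd.add(tessdataDir);
        }
        cmd.add("-l");
        cmd.add(lang);
        cmd.add(input.toAbsolutePath().toString());
        cmd.add(outputBase);
        return cmd;
    }

    /** Runs the command and returns "exit=<code>" followed by the merged stdout/stderr. */
    static String run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        StringBuilder out = new StringBuilder();
        // Drain the output before waitFor() so a full pipe buffer cannot block the child.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        int exit = p.waitFor();
        return "exit=" + exit + "\n" + out;
    }
}
```

Calling `build(null, ...)` produces the same argument order as the working command line without the tessdata directory, which makes it easy to switch between the two variants while testing.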
Executing tesseract with --tessdata-dir fails:
$ tesseract --tessdata-dir /usr/share/tessdata -l deu inputPdf6632237754781472255.tiff out4
read_params_file: parameter not found: II*
Executing tesseract without --tessdata-dir works well:
$ tesseract -l deu inputPdf6632237754781472255.tiff out5
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
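One observation that may help with diagnosis: the string `II*` in the error message matches the first three bytes of a little-endian TIFF file (`0x49 0x49 0x2A`, followed by `0x00`). This could mean that tesseract is reading the input image as a parameter file when `--tessdata-dir` is given, but that is only a guess on my part. Below is a minimal sketch to check whether a file begins with that signature; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class TiffMagic {

    /** True if the bytes start with the little-endian TIFF signature "II*\0". */
    static boolean isLittleEndianTiff(byte[] head) {
        return head.length >= 4
                && head[0] == 'I' && head[1] == 'I'
                && head[2] == 0x2A && head[3] == 0x00;
    }

    /** Reads the first four bytes of the file and checks them. */
    static boolean isLittleEndianTiff(Path file) throws IOException {
        byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(file)) {
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    return false; // file is shorter than four bytes
                }
                n += r;
            }
        }
        return isLittleEndianTiff(head);
    }
}
```

If the input TIFF passes this check, the `II*` in the log is at least consistent with the image content being parsed as text.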
Questions & Ideas
Why does tesseract work well and detect the available languages without the --tessdata-dir parameter set?
Why does tesseract crash during initialization when the --tessdata-dir parameter is set?
Is there any difference between running tesseract with/without the --tessdata-dir parameter set?
What can I do to fix this problem?
Install a newer version of tesseract?
Compile a version from sources?
Use other traindata/tessdata?
Run tesseract without the --tessdata-dir param?
If anybody can help me get this issue solved in the upcoming week, it would make not only me but our whole team happy.
Thank you very much in advance!
Rüdiger Kurz