Differing output / How do I find out which parameters are being used on a given run?

Jonathan Zwart

Sep 5, 2019, 4:56:29 AM
to tesseract-ocr
This refers to my Stack Overflow question (https://stackoverflow.com/questions/57794165/tesseract-differing-output-how-do-i-find-out-which-parameters-are-being-used).

Consider this small png image depicting the word 'Account' in black on a white background.

For this ground-truth image the output differs between the following two Tesseract command-line operations, with (A) better than (B). I need (B) in order to have any hope of sensibly controlling Tesseract's large number of configuration parameters, but preferably with (A)'s excellent extraction performance.


Case A (no config file):

tesseract -v test.png test

tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE
Tesseract Open Source OCR Engine v4.1.0 with Leptonica

cat test.txt

Account
^L


Case B (using config file, which is obviously desirable, to avoid trying to discover the default parameters by brute force):

tesseract --print-parameters > tess_default.cfg
tesseract -v test.png test test_default.cfg

ccot
^L      Page separator (default is form feed control character)

I believe the output should be the same in both cases, but it is not: Case A reads the word correctly as 'Account', while Case B returns 'ccot'. Q1. Why?

Q2. How does one otherwise discover the current configuration of Tesseract if not using --print-parameters?

Thanks for all help.


Environment:

* **Tesseract Version**: 4.1.0
* **Commit Number**: [executed: brew install tesseract]
* **Platform**: macOS High Sierra 10.13.6 / Darwin redacted.office 17.7.0 Darwin Kernel Version 17.7.0: Sun Jun  2 20:31:42 PDT 2019; root:xnu-4570.71.46~1/RELEASE_X86_64 x86_64

--ENDS----

Zdenko Podobny

Sep 5, 2019, 5:12:32 AM
to tesser...@googlegroups.com
1. --print-parameters is not designed to create a config file.
2. There are init and non-init variables, and there could also be variables in the language data, etc.
 
Zdenko
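
The Case B output above shows the text "Page separator (default is form feed control character)" printed after the form feed, which suggests that the description column of the --print-parameters dump was read as part of the parameter value. As a rough sketch for inspecting the dump, the shell function below keeps only the name and value columns. It assumes the dump is tab-separated (name, value, description) with a one-line header; the function name print_params_as_config and the sample lines are illustrative, not part of Tesseract. Per the reply above, even a cleaned file may not reproduce Case A, because init-only variables and variables stored in the language data are not covered.

```shell
#!/bin/sh
# Sketch: reduce a --print-parameters dump to "name value" lines.
# Assumption: each parameter line is name<TAB>value<TAB>description,
# and the header line contains no tabs.
print_params_as_config() {
  # Skip lines without three tab-separated fields (e.g. the header)
  # and parameters whose value is empty.
  awk -F'\t' 'NF >= 3 && $2 != "" { print $1 " " $2 }'
}

# Illustrative input only; a real dump comes from the tesseract binary.
printf 'Tesseract parameters:\nclassify_debug_level\t0\tClassify debug level\nunrecognised_char\t|\tOutput char for unidentified blobs\nexample_empty\t\tHypothetical empty string value\n' \
  | print_params_as_config

# With Tesseract installed, the same filter could be applied to a real dump:
#   tesseract --print-parameters | print_params_as_config > name_value.txt
```

A smaller config file containing only the parameters you actually want to change, or individual -c VAR=VALUE options on the command line, avoids overriding values that come from the language data.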


On Thu, Sep 5, 2019 at 10:56, Jonathan Zwart <jonatha...@sprinthive.com> wrote: