Improving Tesseract testing and continuous integration


James R Barlow

Apr 11, 2017, 2:47:04 AM
to tesseract-dev

Hi everyone,


Tesseract does not have much in the way of a testing infrastructure. 


Tesseract does have the UNLV OCR accuracy test, whose setup is described in testing/README, but the information in that file is out of date. As a result, there is nothing community contributors can easily use to check that their changes have not broken Tesseract in some way.


Continuous integration currently only checks whether Tesseract builds.


I propose a simple validation testing framework. Testing could focus on these areas:

-CLI validation

-input validation, such as rejection of malformed images

-output validation - are PDFs well-formed, etc.

-API validation - to avoid unintentional API breakage as came up recently between 3.03 and 3.04

-ensuring that accelerated versions of Tesseract (OpenMP, OpenCL, AVX) produce results that match non-accelerated builds, within reasonable tolerance
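
As a minimal sketch of the last item, here is one way two builds' text output could be compared "within reasonable tolerance". The function name outputs_match and the 0.98 threshold are placeholders of mine, not anything that exists in Tesseract, and a real comparison might work on bounding boxes or confidences rather than plain text.

```python
import difflib


def outputs_match(reference: str, candidate: str, min_ratio: float = 0.98) -> bool:
    """Return True if two OCR text outputs are similar enough.

    The 0.98 default is an arbitrary placeholder, not a measured value.
    """
    # Collapse runs of whitespace so line-wrapping differences are ignored.
    ref = " ".join(reference.split())
    cand = " ".join(candidate.split())
    # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
    ratio = difflib.SequenceMatcher(None, ref, cand, autojunk=False).ratio()
    return ratio >= min_ratio
```

The idea is that the reference string comes from a plain build and the candidate from an OpenMP, OpenCL or AVX build run on the same image, so the test checks agreement between builds rather than OCR accuracy.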


Ideally this would be a quick series of tests that a typical developer PC could grind through in 5-10 minutes, so that people will actually use it, and so it's suitable for CI. The tests shouldn't be dependent on OCR accuracy or repeatability, so that OCR can improve.


To simplify the creation of tests, I would use Python 3, pytest and pybind11. Python 3's excellent Unicode support makes it a sound choice. pytest is a solid and widely used testing framework that eliminates a lot of the busywork of writing tests. pybind11 is a terrific header-only C++11 template library for exposing C++ code to Python, which makes it easy to write tests against C++.
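
To give a feel for what an output-validation test might look like in pytest style, here is a rough sketch. The helper looks_like_pdf is hypothetical and only checks the header and the end-of-file marker; it is not a real PDF validator. The PDF bytes are a hand-made stand-in so the example runs without Tesseract installed.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap structural check: PDF header at the start, EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def test_pdf_output_is_well_formed():
    # In a real test these bytes would be read from a PDF written by Tesseract.
    fake_pdf = b"%PDF-1.5\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
    assert looks_like_pdf(fake_pdf)


def test_truncated_pdf_is_rejected():
    # A file cut off before the trailer must not pass the check.
    assert not looks_like_pdf(b"%PDF-1.5\n1 0 obj\n")
```

pytest collects functions whose names start with test_ and reports a failure when a plain assert does not hold, so no test classes or registration code are needed.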


Are the core developers interested in seeing this added?


ShreeDevi Kumar

Apr 11, 2017, 6:40:47 AM
to tesser...@googlegroups.com
I think it is an excellent idea.

>>The tests shouldn't be dependent on OCR accuracy or repeatability, so that OCR can improve.

I would suggest some minimal OCR evaluation, just to ensure that there is no regression.

E.g., see https://github.com/Shreeshrii/tess4eval_marathi, the result of my experimenting with Travis a couple of months back.



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-dev+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-dev/76bf9d0a-c66a-4123-8f07-dea1305fba25%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Egor Pugin

Apr 12, 2017, 4:02:36 PM
to tesseract-dev
My vote is for C++ test suite.

1. Python is an unnecessary dependency.
2. C++ has a lot of nice test libraries.
3. std::string handles UTF-8 well enough.
4. C++ gives you every other capability you might need, while with Python you can expect some weaknesses. (They begin with pybind11 in your post, because of the Python-to-C++ connection.)

James R Barlow

Apr 13, 2017, 4:47:45 PM
to tesseract-dev


On Wednesday, 12 April 2017 13:02:36 UTC-7, Egor Pugin wrote:
My vote is for C++ test suite.

1. Python is an unnecessary dependency.
2. C++ has a lot of nice test libraries.
3. std::string handles UTF-8 well enough.
4. C++ gives you every other capability you might need, while with Python you can expect some weaknesses. (They begin with pybind11 in your post, because of the Python-to-C++ connection.)


You're right, Python would introduce a new language to Tesseract, and that is a drawback. Here are the reasons why I think the advantages of a Python based test suite outweigh the disadvantages:

I develop ocrmypdf in Python (it manages the conversion of a PDF to an OCR'd PDF using Tesseract and Ghostscript), and some of my tests for that software could be easily retooled for testing Tesseract directly. For example, a test to confirm that an output PDF is valid and visually resembles the input would work for both. I can't offer a turnkey solution in C++.

Dependency management in C++ is a lot harder at the moment, and a test suite is going to need libraries that core Tesseract doesn't need, for font-to-image rasterizing, XML inspection, and PDF inspection. With C++ it is often necessary to modify the host system to install new packages (apt-get install libxml2-dev), while in Python the packages needed for testing can be pinned to an exact version and installed into a virtual environment. That is a clear advantage for reproducibility and ease of use.

There are good reasons to test the Tesseract C API from another language. I found and fixed a Tesseract bug a while back where a struct that was part of the C API was in fact a C++ class, so its layout was determined at compile time, meaning that the ABI was underspecified and unstable across Tesseract versions. Anyone writing tests with a C++ compiler probably wouldn't notice this issue.
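
For illustration, here is a minimal sketch of calling the C API from Python through ctypes, which only sees the exported C symbols and none of the C++ headers. It assumes the shared library can be located under the name "tesseract" on the test machine, and it returns None rather than failing when it cannot. The wrapper name tesseract_version is mine; TessVersion is the C API function it calls. This does not reproduce the struct layout bug, it only shows the pattern.

```python
import ctypes
import ctypes.util
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the version string reported by the C API, or None if the
    shared library cannot be found or loaded on this machine."""
    path = ctypes.util.find_library("tesseract")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    # TessVersion() takes no arguments and returns a const char*.
    lib.TessVersion.restype = ctypes.c_char_p
    lib.TessVersion.argtypes = []
    return lib.TessVersion().decode("utf-8", errors="replace")
```

Because the caller has to spell out every argument and return type by hand, anything the C headers leave underspecified tends to show up quickly in tests written this way.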

If tests are structured in the way I propose, any number of test cases could still be written in C++ and wired into pytest using either the pytest-cpp plugin or pybind11.

Ray Smith

Apr 16, 2017, 8:14:11 PM
to tesser...@googlegroups.com
We have a barrage of tests in the Google version of the codebase.

I don't think there would be any objections from Google to open source the tests. They use gunit which is an additional dependency, but it would only be required by the tests.

The tests would have to be modified somewhat to avoid using any other Google specific code, but the changes would be fairly easy.

It isn't something I have time to do right now, but if there are community volunteers to work on the porting, I could put in the work to get the approvals. Is anyone familiar with gunit? I know it's open sourced.


James R Barlow

Apr 20, 2017, 8:20:59 PM
to tesseract-dev
Ray,

Clearly this is a better option than the plan I proposed. 

I think there's value for the community to see what is being tested and for OS distributors (Debian etc.) to validate the build on their systems.

It seems there are a few testing frameworks that all go by the name "gunit". I guess you mean Google Test (https://github.com/google/googletest) but possibly GNOME gUnit or something else? I'm not familiar with either test framework, but I would be willing to help with the porting.

Ray Smith

Apr 20, 2017, 8:41:31 PM
to tesser...@googlegroups.com
Yes, Google Test is the framework that I meant.
I had a quick look at what it would take, and decided the biggest difficulty would be the images that are used in the tests. There will be copyright concerns, as they are taken from a variety of sources.
One simple solution might be to solicit submissions of test images and/or collect images from existing issues that have been submitted to the github site.
The tests could then be modified to expect appropriate results from these images instead of the ones that are currently used.


ShreeDevi Kumar

Apr 23, 2017, 8:37:29 AM
to tesseract-dev
On Friday, April 21, 2017 at 6:11:31 AM UTC+5:30, Ray wrote:
Yes, Google Test is the framework that I meant.
I had a quick look at what it would take, and decided the biggest difficulty would be the images that are used in the tests. There will be copyright concerns, as they are taken from a variety of sources.
One simple solution might be to solicit submissions of test images and/or collect images from existing issues that have been submitted to the github site.
The tests could then be modified to expect appropriate results from these images instead of the ones that are currently used.


Ray,

Please elaborate on what types of images and additional information (ground truth etc.) are needed for testing. I can help collect samples for Indian languages.
We can also share the info on the tesseract-ocr Google group to solicit submissions from the user community.

When do you expect to update github with the new training/codebase and testing framework?

Jeff Breidenbach

Jul 12, 2017, 6:29:47 PM
to tesseract-dev
I'm taking a look at the Google Test framework. Inside Google, all the tests
run in the cloud. As far as I can tell, that is not the case for a GitHub project.
I think they expect you to run tests locally, with "make" or "cmake" or something
like that.

There's a bunch of tutorials on the web, not obvious which one is best.
First step is to make a working but empty test for Tesseract under this 
framework. Once that exemplar is in place, we can migrate the existing 
tests that currently run inside Google.

@Ray: Are the tests small + fast enough to run locally, or do they require 
a cluster of computers to be practical?

Ray Smith

Jul 12, 2017, 6:34:56 PM
to tesser...@googlegroups.com
There are on the order of 50 tests, some of which complete in <1s. Some take ~10 mins to run, but very few.
You could probably run the whole lot on a single machine in about an hour.


James R Barlow

Jul 12, 2017, 7:30:13 PM
to tesseract-dev

On Wednesday, 12 July 2017 15:29:47 UTC-7, Jeff Breidenbach wrote:
I'm taking a look at the Google Test framework. Inside Google, all the tests
run in the cloud. As far as I can tell, that is not the case for a GitHub project.
I think they expect you to run tests locally, with "make" or "cmake" or something
like that.

GitHub does not discover or run tests on its own, if that's what Google does internally. There will need to be some explicit change to the main Makefile.am to teach it that "make test" should compile and run the test suite. Both developers and Travis CI/AppVeyor clone the repo and run whatever script they are told to in the cloud, but again they don't discover tests in any way.

I think Travis uses Google Compute Engine, so the tests may actually just be moving a few aisles over in the data center :).

Stefan Weil

Jul 21, 2017, 1:18:04 PM
to tesseract-dev
Ray, what about providing the Google test code in a new Tesseract branch (`test` would be a good name) on GitHub as soon as this is legally possible?

It's not necessary to add copyrighted images to that branch.

The Tesseract community could then have a look at that code and see what is missing and what has to be replaced or fixed for an open source test.

Ray Smith

Jul 21, 2017, 2:05:32 PM
to tesser...@googlegroups.com
I have the OK to "throw the tests over the wall" already, i.e. provide them in a non-working form.
There are actually very few copyrighted images that would need to be replaced. Most of the tests run on synthetic data, existing test data, or don't require images.
If someone would put together the build pieces necessary to build and run an empty test (using Google test), then I will port at least one example, and then push the rest out.


ShreeDevi Kumar

Jul 23, 2017, 8:06:45 AM
to tesser...@googlegroups.com
Resending the message regarding Google Test integration with Tesseract.

On Fri, Jul 21, 2017 at 11:35 PM, Ray Smith wrote:
I have the OK to "throw the tests over the wall" already, i.e. provide them in a non-working form.
There are actually very few copyrighted images that would need to be replaced. Most of the tests run on synthetic data, existing test data, or don't require images.
If someone would put together the build pieces necessary to build and run an empty test (using Google test), then I will port at least one example, and then push the rest out.


I could build Google Test locally in the tesseract directory with the following commands and run the sample test.


ln -s ./googletest/googletest ./test
cd test/make
make
./sample1_unittest

ShreeDevi


I tried the submodule option. However, https://github.com/google/googletest/tree/master/googletest lists multiple ways to do this. It would be useful to get input from Ray and the Tesseract developer community on what the recommended way forward should be:

Making GoogleTest's source code available to the main build can be done a few different ways:

  • Download the GoogleTest source code manually and place it at a known location. This is the least flexible approach and can make it more difficult to use with continuous integration systems, etc.
  • Embed the GoogleTest source code as a direct copy in the main project's source tree. This is often the simplest approach, but is also the hardest to keep up to date. Some organizations may not permit this method.
  • Add GoogleTest as a git submodule or equivalent. This may not always be possible or appropriate. Git submodules, for example, have their own set of advantages and drawbacks.
  • Use CMake to download GoogleTest as part of the build's configure step. This is just a little more complex, but doesn't have the limitations of the other methods.

Jeff Breidenbach

Aug 2, 2017, 7:06:47 PM
to tesseract-dev
That's a good start. The next step is to check an empty sample test into the Tesseract
codebase, and find a recipe that runs it.

In slightly related news, I have checked in the latest Leptonica (1.74.4) into Debian
and anticipate putting some sort of Tesseract 4.x release there in October. This will
be well after the Ubuntu 17.10 cutoff date.

Jeff Breidenbach

Aug 2, 2017, 7:27:20 PM
to tesseract-dev
One thing we can do is create a directory called "unittest" inside Tesseract. 
This can be populated by the example files mentioned above, with a tiny 
tweak to the Makefile so that GTEST_DIR = ../googletest/googletest

Then the recipe looks like this. The main problem is that this example is totally self
contained, so we still need to figure out how to link a test against libtesseract.
cd unittest
make
./sample1_unittest 

Jeff Breidenbach

Aug 2, 2017, 7:28:19 PM
to tesseract-dev
Ray, why don't you throw a tiny piece of one test "over the wall" right now and we'll see if we can get it to run.

Ray Smith

Aug 2, 2017, 8:37:38 PM
to tesser...@googlegroups.com
77c44cde..2fbcba62

It's the simplest test I could find, as it uses no external data, no google code other than gunit, and is a test of a module that will still be in use even if the legacy engine is removed.

On Wed, Aug 2, 2017 at 4:28 PM, Jeff Breidenbach <breid...@gmail.com> wrote:
Ray, why don't you throw a tiny piece of one test "over the wall" right now and we'll see if we can get it to run.


ShreeDevi Kumar

Aug 3, 2017, 2:25:28 AM
to tesser...@googlegroups.com
Ray,

we are missing

#include "matrix.h"
#include "gunit.h"


ShreeDevi

ShreeDevi Kumar

Aug 3, 2017, 2:25:28 AM
to tesser...@googlegroups.com
I got it to work and have created a PR.


./matrix_test
Running main() from gtest_main.cc
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from MatrixTest
[ RUN      ] MatrixTest.RotatingTranspose_3_1
[       OK ] MatrixTest.RotatingTranspose_3_1 (0 ms)
[ RUN      ] MatrixTest.RotatingTranspose_2_0
[       OK ] MatrixTest.RotatingTranspose_2_0 (0 ms)
[ RUN      ] MatrixTest.RotatingTranspose_1_3
[       OK ] MatrixTest.RotatingTranspose_1_3 (0 ms)
[ RUN      ] MatrixTest.RotatingTranspose_0_2
[       OK ] MatrixTest.RotatingTranspose_0_2 (0 ms)
[----------] 4 tests from MatrixTest (1 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (2 ms total)
[  PASSED  ] 4 tests.


ShreeDevi

Ray Smith

Aug 3, 2017, 6:53:32 PM
to tesser...@googlegroups.com
OK, nice that it works!
I've pushed a more difficult one. It uses Google-specific code for logging, file I/O to access the testdata and langdata directories, and string manipulation, so porting it in a way that will be back-portable to run at Google as well could be tricky...
I've deleted the google includes, redacted the exact paths, but otherwise left the function calls in place.
If you need help understanding what the non-compilable code does, I can help.

ShreeDevi Kumar

Aug 4, 2017, 5:39:50 AM
to tesser...@googlegroups.com
I got errors trying to build that one and have posted them as comments on the commit.

Someone with better knowledge about C++ and tesseract will have to look at this.

ShreeDevi
