Distributing trainingdata files

Nick White

Apr 16, 2014, 9:55:34 PM
to tesser...@googlegroups.com
Hi all,

I recently went through the process of installing Tesseract on a
virtual install of Windows, using gImageReader (whose new version
makes this pretty easy, incidentally, which is great). I'm keen for
installation to be as simple as possible, and gImageReader mostly
accomplishes that. I wrote it up as a simple guide for a mini
website I'm preparing about the Ancient Greek training, designed
very much for novice computer users, which you can see at [0] if you
like (any suggestions most welcome).

One thing that is tricky at the moment is that the training files
are wrapped in gzipped tarballs, which Windows users need a 3rd
party application (like 7zip) to extract. This isn't exactly rocket
science, but at the same time it is an extra couple of steps that I
think is not worth the saving of space & bandwidth.

So I reckon we should just offer the straight lang.traineddata files
directly, from the next release onwards. Does anyone have any
objection to that?

Nick

0. http://ancientgreekocr.org/windows.html

Tom Powers

Apr 16, 2014, 10:32:16 PM
to tesseract-dev
On Wed, Apr 16, 2014 at 6:55 PM, Nick White <nick....@durham.ac.uk> wrote:
> One thing that is tricky at the moment is that the training files
> are wrapped in gzipped tarballs, which Windows users need a 3rd
> party application (like 7zip) to extract. This isn't exactly rocket
> science, but at the same time it is an extra couple of steps that I
> think is not worth the saving of space & bandwidth.
>
> So I reckon we should just offer the straight lang.traineddata files
> directly, from the next release onwards. Does anyone have any
> objection to that?

Windows 7 (and I think Vista) has built-in support for unpacking .zip files. Depending on how big these files are (the tessdata folder for 3.02 was 600+ MB, right?), I think some sort of compression would still be wise.

    -- Tom

Nick White

Apr 16, 2014, 11:09:41 PM
to tesser...@googlegroups.com
XP has built-in support for .zip too, yes. But we're expecting end
users to just download the trainings they're interested in, which
will typically only be 2 or 3 at most. They're around 8MB each
uncompressed, so I still think it's not worth it (never
underestimate how easy it is to confuse users ;)). Though I agree
that if we go for compression, .zip is a better idea.

Ray Smith

Apr 18, 2014, 1:14:47 PM
to tesser...@googlegroups.com
I have no objection to switching to zip (with no tar) for the tessdata files. That should be usable by everybody more easily.

I don't know if there is any good reason to continue to use tar.gz for the source distribution. It seems to be the standard.



--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nick White

Apr 18, 2014, 2:50:34 PM
to tesser...@googlegroups.com
On Fri, Apr 18, 2014 at 10:14:47AM -0700, Ray Smith wrote:
> I don't know if there is any good reason to continue to use tar.gz for the
> source distribution. It seems to be the standard.

Yeah, .tar.gz for the source makes sense (.tar.xz would be fine
too); we can reasonably expect people compiling from source to be
able to extract from a tarball, which is less true of people
downloading languages to use with e.g. GUI precompiled packages.

zdenko podobny

Apr 18, 2014, 5:31:23 PM
to tesser...@googlegroups.com
Ray,

Did you consider or test using zip instead of combine_tessdata?
I have the impression that Google invests a lot of resources in compression algorithms (e.g. [1] or [2]), so I would expect some compression to be used for the language data files ;-). And maybe the cube data files could then easily be added to the zip archive, so we could have one file per language.

libz is already a required dependency because of leptonica and/or PNG support in tesseract, so decompressing a zip archive should not be an issue. This would be welcomed by mobile developers (less storage space needed), and maybe it could speed up the tesseract init process (at least that is my impression based on articles like [3] and [4]). I am not a fan of implementing xz, though, because of its memory consumption [1] and perhaps the additional dependency.


Zdenko

Tom Powers

Apr 18, 2014, 10:59:05 PM
to tesseract-dev
On Fri, Apr 18, 2014 at 2:31 PM, zdenko podobny <zde...@gmail.com> wrote:
> Did you consider or test using zip instead of combine_tessdata?
> I have the impression that Google invests a lot of resources in compression algorithms (e.g. [1] or [2]), so I would expect some compression to be used for the language data files ;-). And maybe the cube data files could then easily be added to the zip archive, so we could have one file per language.
>
> libz is already a required dependency because of leptonica and/or PNG support in tesseract, so decompressing a zip archive should not be an issue. This would be welcomed by mobile developers (less storage space needed), and maybe it could speed up the tesseract init process (at least that is my impression based on articles like [3] and [4]). I am not a fan of implementing xz, though, because of its memory consumption [1] and perhaps the additional dependency.


Just as a test, I used Windows 7, and did "Send to Compressed (zipped) Folder" and it reduced 3.02's tessdata folder from 605MB down to 226MB. It took a minute or so to create the archive. This is not necessarily the same as using the zlib library directly, but probably similar results will be obtained.

But... just because a library is used by leptonica doesn't mean (at least on Windows) that you can directly call any functions in that library. While this is true when using the static libraries (where you have to specify all Leptonica's dependent libraries when linking) it is **not** true with the Leptonica DLL. You can then only use functions mentioned in allheaders.h. This was the whole impetus for my earlier work on the --enable-visibility flag. When this flag is not used programs can link with tesseract's shared libraries (that by default make everything visible) but can't link to the tesseract Windows DLL.

Currently, allheaders.h only has:

   LEPT_DLL extern l_uint8 * zlibCompress ( l_uint8 *datain, size_t nin, size_t *pnout );
   LEPT_DLL extern l_uint8 * zlibUncompress ( l_uint8 *datain, size_t nin, size_t *pnout );

I'm completely ignorant of zlib, but I doubt these alone would support creating multi-file compressed archives? I gather from [1] that zlib has gzip File Access Functions but from [2]:

   Although [gzip's] file format also allows for multiple such streams to be
   concatenated (zipped files are simply decompressed concatenated as if
   they were originally one file), gzip is normally used to compress
   just single files. Compressed archives are typically created by
   assembling collections of files into a single tar archive, and then
   compressing that archive with gzip.

Therefore, I don't think Zdenko's statement that "libz is already needed dependency because of leptonica and/or png support in tesseract. So decompression of zip archive should not be issue" is quite correct? It is probably true to say only that "the decompression of individual files is not an issue".
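
A small sketch of the distinction drawn here, using Python's standard gzip module rather than Leptonica's zlibCompress/zlibUncompress wrappers. The two payloads are made-up stand-ins for traineddata components. It shows that concatenated gzip streams come back as one undivided buffer, so the format by itself carries no usable per-file boundaries:

```python
import gzip

# Two made-up payloads standing in for separate traineddata components.
unicharset = b"unicharset-bytes"
config = b"config-bytes"

# Compress each on its own, then concatenate the two gzip members.
combined = gzip.compress(unicharset) + gzip.compress(config)

# Decompression yields one undivided stream: the boundary between the
# two payloads is gone, which is why a tar layer is normally added.
restored = gzip.decompress(combined)
print(restored == unicharset + config)
```

A reader of such a stream would need some extra framing (tar, or a custom index) to split the components apart again.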

Additionally, even these two functions aren't directly tested by any of Dan's regression tests (which are run by me and others for every Leptonica release, beta or otherwise). Possibly these functions are indirectly called by other tests, but you'll have to ask Dan to be sure. I suggest that if you do indeed intend to start using zlib in tesseract, then explicit tests of these functions should be added to the current regression test suite for Leptonica 1.71.

[1] http://www.zlib.net/manual.html

[2] http://en.wikipedia.org/wiki/Gzip

    -- Tom

zdenko podobny

Apr 19, 2014, 4:24:50 PM
to tesser...@googlegroups.com
Tom,

Thanks for the corrections. I meant that tesseract should use libz directly (not via leptonica), and I expected that libz could process zip files, which is not true (libz's contrib directory has the minizip project, which can read/write zip, but it is not part of libz).


Zdenko



Ray Smith

Apr 19, 2014, 9:59:11 PM
to tesser...@googlegroups.com
I spent some time looking at zlib. It doesn't seem to make it easy to randomly access named entities in a gzip file, unless I am missing something. The memory compress/uncompress functions are quite nice though.
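
For contrast, the zip container does keep a directory of named entries. The following is only an illustration with Python's standard zipfile module (not zlib's C API or minizip), and the entry names are hypothetical. It reads one named entry without touching the others:

```python
import io
import zipfile

# Build a small zip archive in memory; the entry names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("eng.unicharset", b"unicharset-bytes" * 100)
    archive.writestr("eng.config", b"config-bytes" * 100)

# The central directory lets a reader jump straight to one named entry
# without decompressing the others.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    config = archive.read("eng.config")

print(names)
print(len(config))
```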

For the next version it would be nice to:
  • Update tessdatamanager to cope with compressed components.
  • Eliminate fread/fscanf from file input code and allow everything to read from a memory buffer.
These can probably both be achieved with the TFile class that I added for 3.03.

This is a change in direction from my previous work with new classifier experiments, where I have been writing everything to use Serialize/DeSerialize and FILE streams, but this doesn't seem to be as portable as I had hoped, due to its reliance on fmemopen. It seems it would be better to make everything use memory buffers and push the file I/O responsibility out to TessDataManager/TFile, which could then just as easily deal with compressed files or in-memory data.
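
A rough sketch of the idea of compressed components read from a memory buffer, written in Python for brevity. The index layout and the function names pack and load are hypothetical and are not the TessDataManager or TFile API. Each component is compressed on its own with zlib's in-memory functions, so any one of them can be decompressed by name:

```python
import zlib


def pack(components):
    """Compress each component separately and return (index, blob)."""
    index = {}   # name -> (offset, compressed_length)
    chunks = []
    offset = 0
    for name, data in components.items():
        compressed = zlib.compress(data)
        index[name] = (offset, len(compressed))
        chunks.append(compressed)
        offset += len(compressed)
    return index, b"".join(chunks)


def load(index, blob, name):
    """Decompress one named component straight from the memory buffer."""
    offset, length = index[name]
    return zlib.decompress(blob[offset:offset + length])


# Made-up component contents, for illustration only.
components = {"config": b"key value\n" * 50, "unicharset": b"a b c\n" * 50}
index, blob = pack(components)
config = load(index, blob, "config")
print(config == components["config"])
```

Because the caller only ever hands over a bytes buffer, the same code path would serve a file read from disk or data that is already in memory.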


Nick White

Apr 29, 2014, 11:52:24 AM
to tesser...@googlegroups.com
Hi all,

On Sat, Apr 19, 2014 at 06:59:11PM -0700, Ray Smith wrote:
> I spent some time looking at zlib. It doesn't seem to make it easy to randomly
> access named entities in a gzip file, unless I am missing something. The memory
> compress/uncompress functions are quite nice though.

I took a look at zlib this morning, and agree, the
compress/uncompress functions look pretty straightforward.
Apparently the lzma API is pretty similar, so it would be easy to
add that as an option at some point if we wanted to.

> For the next version it would be nice to:
>
> • Update tessdatamanager to cope with compressed components.
> • Eliminate fread/fscanf from file input code and allow everything to read
> from a memory buffer.

One slightly different approach would be to just have the
traineddata file be a straightforward .tar.gz, which is read and
parsed into memory more-or-less in one go, e.g. in
init_tesseract_lang_data.

I can see a few advantages and a few disadvantages to that approach:

Advantages:
- combine_tessdata can go away (nothing wrong with it, but it's nice
to make it obsolete), and regular file archiving tools can be used
to read the training data.
- we need to read all of the training data into memory soon after
starting anyway.
- should be simple to code.

Disadvantages:
- tar is slow at random access, so we could be loading parts when it
turns out they won't be needed. For example, if the cube parts were
encountered before the config part, would we load them and then just
discard them if it turns out we aren't going to be using them?
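
To make the trade-off concrete, here is a minimal sketch in Python using the standard tarfile module. The member names and contents are hypothetical, and this is not how Tesseract reads its data. It walks a .tar.gz front to back and keeps only the members it was asked for:

```python
import io
import tarfile


def build_archive(components):
    """Create a .tar.gz in memory from a name -> bytes mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in components.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def load_components(archive_bytes, wanted):
    """Read the archive front to back, keeping only the wanted members."""
    loaded = {}
    # "r|gz" is stream mode: members are visited strictly in order.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|gz") as tar:
        for member in tar:
            if member.name in wanted:
                loaded[member.name] = tar.extractfile(member).read()
            # Unwanted members are skipped, but the gzip stream still has
            # to be decompressed past them to reach the next header.
    return loaded


archive = build_archive({
    "eng.cube.example": b"cube" * 1000,   # encountered before the config
    "eng.config": b"example setting\n",
})
loaded = load_components(archive, wanted={"eng.config"})
print(sorted(loaded))
```

Skipping a member avoids keeping it in memory, but not the cost of decompressing past it, so the order of members in the archive matters.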

Thoughts?

Nick

Stefan Weil

May 12, 2017, 10:36:16 AM
to tesseract-dev
On Friday, April 18, 2014 at 7:14:47 PM UTC+2, Ray wrote:
> I have no objection to switching to zip (with no tar) for the tessdata files. That should be usable by everybody more easily.

I wrote a first proof of concept for zip support and published it on GitHub. Other file formats can give even better compression (sizes in bytes):

31873501 eng.traineddata
16461487 eng.traineddata.zip (default)
16372645 eng.traineddata.zip (maximum compression)
15193532 eng.traineddata.tar.bz2
13274164 eng.traineddata.tar.xz
13273173 eng.traineddata.7z
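
As a quick worked calculation on the sizes listed above (the numbers are copied from the list and assumed to be byte counts), zip leaves a little over half of the original size, while xz and 7z leave a little over 40%:

```python
original = 31873501  # eng.traineddata, uncompressed

compressed = {
    "zip (default)": 16461487,
    "zip (maximum compression)": 16372645,
    "tar.bz2": 15193532,
    "tar.xz": 13274164,
    "7z": 13273173,
}

# Fraction of the original size that remains after compression.
ratios = {name: size / original for name, size in compressed.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of the original size")
```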

Zdenko Podobný

May 13, 2017, 6:53:21 AM
to tesser...@googlegroups.com
Stefan,

Did you test how the different compression methods affect the speed of the tesseract init process?

Zdenko


Stefan Weil

May 15, 2017, 11:19:25 AM
to tesseract-dev
On Saturday, May 13, 2017 at 12:53:21 PM UTC+2, Zdenko Podobný wrote:
> Did you test how the different compression methods affect the speed of the tesseract init process?

Yes, I tested the time with several different libraries which support zip archives
and also different kinds of other compressed archives with libarchive, a library
which supports most common formats and compression methods. See
https://github.com/tesseract-ocr/tesseract/pull/911#issuecomment-301323598
for the results with mya.traineddata, the largest traineddata file. There are also
numbers for eng.traineddata in the same pull request.
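
For anyone who wants a feel for this kind of measurement, below is a rough sketch in Python. It is not the benchmark from the pull request: it uses Python's standard zlib, bz2 and lzma modules instead of libarchive, and a synthetic payload instead of a real traineddata file, so its numbers say nothing about Tesseract itself.

```python
import bz2
import lzma
import time
import zlib

# Synthetic stand-in payload; real traineddata compresses differently.
payload = bytes(range(256)) * 4096

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

timings = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(payload)
    start = time.perf_counter()
    unpacked = decompress(packed)
    timings[name] = time.perf_counter() - start
    assert unpacked == payload  # the round trip must be lossless
    print(f"{name}: {len(packed)} bytes, "
          f"{timings[name] * 1000:.2f} ms to decompress")
```

A realistic comparison would time the whole init path on real traineddata files and repeat each run several times.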

Stefan

Zdenko Podobný

Nov 12, 2018, 9:00:13 AM
to tesseract-dev
FYI: there is a discussion in the Linux kernel community about kernel support for compression (adding zstd and dropping bz2 and lzma1) [1].
