Hi all,
On Sat, Apr 19, 2014 at 06:59:11PM -0700, Ray Smith wrote:
> I spent some time looking at zlib. It doesn't seem to make it easy to randomly
> access named entities in a gzip file, unless I am missing something. The memory
> compress/uncompress functions are quite nice though.
I took a look at zlib this morning, and I agree: the compress/
uncompress functions look pretty straightforward. Apparently the lzma
API is pretty similar, so it would be easy to add that as an option
at some point if we wanted to.
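To illustrate how close the two one-shot interfaces are, here is a rough sketch using Python's standard zlib and lzma modules rather than the C libraries themselves. The component bytes are a made-up placeholder, not real traineddata content.

```python
import lzma
import zlib

# Hypothetical stand-in for one traineddata component held in memory.
component = b"placeholder component contents\n" * 100

# zlib: one-shot, memory-to-memory round trip.
z_packed = zlib.compress(component)
z_restored = zlib.decompress(z_packed)

# lzma: the same one-shot shape, so swapping codecs is a small change.
x_packed = lzma.compress(component)
x_restored = lzma.decompress(x_packed)
```

If the C APIs line up the same way, the codec choice could probably sit behind a single pair of pack/unpack functions.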
> For the next version it would be nice to:
>
> • Update tessdatamanager to cope with compressed components.
> • Eliminate fread/fscanf from file input code and allow everything to read
> from a memory buffer.
One slightly different approach would be to just have the
traineddata file be a straightforward .tar.gz, which is read and
parsed into memory more-or-less in one go, e.g. in
init_tesseract_lang_data.
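As a rough sketch of the "read and parse in one go" idea, here it is in Python with the standard tarfile module (the real thing would be C++ inside Tesseract). The helper names and the member names eng.config / eng.unicharset are hypothetical, and the demo archive is built in memory just so the example runs without real data.

```python
import io
import tarfile


def load_components(archive_bytes):
    """Return {member name: contents} for every regular file in a .tar.gz."""
    components = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                components[member.name] = tar.extractfile(member).read()
    return components


def make_archive(files):
    """Build a small in-memory .tar.gz from (name, bytes) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical member names and placeholder contents.
demo = make_archive([
    ("eng.config", b"placeholder config\n"),
    ("eng.unicharset", b"placeholder unicharset\n"),
])
loaded = load_components(demo)
```

After this, every component is an in-memory buffer keyed by name, which would fit with dropping fread/fscanf from the input code.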
I can see a few advantages and a few disadvantages to that approach:
Advantages:
- combine_tessdata can go away (nothing wrong with it, but it's nice
to make it obsolete), and regular file archiving tools can be used
to read the training data.
- we need to read all of the training data into memory soon after
starting anyway.
- should be simple to code.
Disadvantages:
- tar has no index and is slow at random access, so we could end up
loading parts that turn out not to be needed. For example, if the
cube parts were encountered before the config part, would we load
them and then just discard them if it turns out we aren't going to
be using them?
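One possible way to handle that ordering problem, sketched in Python: make a single sequential pass, hold the raw bytes of anything seen before the config, and settle what to keep once the config turns up. Everything here is an assumption for illustration: the member names, the choose() rule, the "use_cube" marker, and the fallback of keeping everything when no config is found.

```python
import io
import tarfile


def load_needed(archive_bytes, config_name, choose):
    """One sequential pass; keep only members that choose(config, name) accepts."""
    pending = {}  # raw bytes read before the config member was seen
    kept = {}
    config = None
    # "r|gz" reads the archive as a forward-only stream, in member order.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            if member.name == config_name:
                config = data
                kept[member.name] = data
                # Settle the members that arrived before the config.
                for name, held in pending.items():
                    if choose(config, name):
                        kept[name] = held
                pending.clear()
            elif config is None:
                pending[member.name] = data
            elif choose(config, member.name):
                kept[member.name] = data
    if config is None:
        kept.update(pending)  # assumed fallback: no config, keep everything
    return kept


def make_archive(files):
    """Build a small in-memory .tar.gz from (name, bytes) pairs, in order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def choose(config, name):
    # Hypothetical rule: cube members are wanted only if the config says so.
    return "cube" not in name or b"use_cube" in config


# The cube member deliberately comes before the config member.
demo = make_archive([
    ("eng.cube.params", b"placeholder cube data\n"),
    ("eng.config", b"placeholder config\n"),
    ("eng.unicharset", b"placeholder unicharset\n"),
])
kept = load_needed(demo, "eng.config", choose)
```

The cost is that unwanted members still get decompressed and briefly held as raw bytes; only the parsing is avoided. Writing the config as the first member of the archive would sidestep the buffering entirely.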
Thoughts?
Nick