Convert traineddata to integer ("fast" variant)

Stefan Weil

Feb 1, 2018, 10:35:39 AM

to tesseract-dev

Hi,

Does anybody know exactly how the tessdata_fast/*.traineddata files were built?

I tried these steps to convert `tessdata_best/eng.traineddata` to a fast variant:

    combine_tessdata -u tessdata_best/eng.traineddata /tmp/tmp.
    lstmtraining --convert_to_int=true --continue_from /tmp/tmp.lstm --traineddata tessdata_best/eng.traineddata --stop_training

The result is a file called `lstmtrain`, which looks like a traineddata file. It is largely identical to `tessdata_fast/eng.traineddata`, but the size of the lstm component differs:

    $ combine_tessdata -d lstmtrain 2>&1 | grep lstm:size
    17:lstm:size=1487588, offset=192
    $ combine_tessdata -d tessdata_best/eng.traineddata 2>&1 | grep lstm:size
    17:lstm:size=11689099, offset=192
    $ combine_tessdata -d tessdata_fast/eng.traineddata 2>&1 | grep lstm:size
    17:lstm:size=401636, offset=192

Obviously the fast model is not simply a best model converted to integer: its lstm component is much smaller than the converted one, so the LSTM data must have been reduced further.
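
As a rough cross-check of that conclusion, the three reported sizes can be turned into ratios. This is only a sketch: the `ratio` helper is a made-up name, and the numbers are copied from the `combine_tessdata -d` output above.

```shell
# Hypothetical helper: print a / b rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

best=11689099       # lstm size in tessdata_best/eng.traineddata
converted=1487588   # lstm size after lstmtraining --convert_to_int=true
fast=401636         # lstm size in tessdata_fast/eng.traineddata

echo "best / converted: $(ratio "$best" "$converted")"
echo "converted / fast: $(ratio "$converted" "$fast")"
```

Integer conversion alone shrinks the component by a factor of about 7.9, and the published fast model is smaller again by a factor of about 3.7, which conversion by itself does not explain.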

Regards
Stefan Weil

Jeff Breidenbach

Feb 8, 2018, 12:26:13 PM

to tesseract-dev

I have a magical file called lang_map_fast.txt. It comes with a comment from Ray that says "Added quantized 'fast' models [...] and the lang_map that was used to select them."

The beginning of the file is shown below. Does this seem helpful? It doesn't mean anything to me, because I was not involved in the generation at all.

Beyond this, the only person on the planet who can say more is Ray.

    afr l36-64-96-512
    ara l48-64-96-192
    bel l36-48-96-128
    ben l36-64-96-192
    bul l36-48-96-128
    ...

ShreeDevi Kumar

Feb 8, 2018, 2:09:42 PM

to tesser...@googlegroups.com

I would guess that it lists the parameters used to build the network spec for the different languages, e.g.

    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'
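
To make that guess concrete, here is a sketch that fills a lang_map_fast.txt entry into the spec shown above. The mapping is an assumption, not something confirmed in this thread: it treats the four numbers as the input height, the `Lfys` size, the shared `Lfx`/`Lrx` size, and the final `Lfx` size. The function name `spec_from_entry` is hypothetical, and `O1c111` is simply copied from the example (the class count depends on the language).

```shell
# Hypothetical helper: build a net_spec from one lang_map_fast.txt entry.
# ASSUMPTION: "l<H>-<A>-<B>-<C>" means input height H, Lfys<A>,
# Lfx<B> and Lrx<B>, then Lfx<C>. This is a guess, not documented behaviour.
spec_from_entry() {
    lang=$1
    dims=${2#l}            # strip the leading "l"
    old_ifs=$IFS
    IFS=-
    set -- $dims           # split the four numbers on "-"
    IFS=$old_ifs
    # O1c111 is copied from the example; the class count varies by language.
    printf '%s [1,%s,0,1 Ct3,3,16 Mp3,3 Lfys%s Lfx%s Lrx%s Lfx%s O1c111]\n' \
        "$lang" "$1" "$2" "$3" "$3" "$4"
}

spec_from_entry bel l36-48-96-128
```

Under this reading, the `bel` entry would give `[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx128 O1c111]`. Only Ray could confirm whether the entries really mean this.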

Stefan Weil

May 10, 2018, 2:32:42 AM

to tesseract-dev

On Thursday, February 1, 2018 at 4:35:39 PM UTC+1, Stefan Weil wrote:

> I tried these steps to convert `tessdata_best/eng.traineddata` to a fast variant:
>
>     combine_tessdata -u tessdata_best/eng.traineddata /tmp/tmp.
>     lstmtraining --convert_to_int=true --continue_from /tmp/tmp.lstm --traineddata tessdata_best/eng.traineddata --stop_training
>
> [...]

The easiest way to create a fast variant is this:

    combine_tessdata -c eng.traineddata

It reads a traineddata file with a best (float) model, converts the model into a fast (integer) one, and writes the result back, replacing the original file.
Unlike Ray's tessdata_fast models, it does not change the LSTM network spec.
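
Because `combine_tessdata -c` overwrites its input, it may be safer to convert a copy. A minimal sketch follows; the function name `make_fast_copy` and the `COMBINE_TESSDATA` override variable are made up for illustration.

```shell
# Hypothetical wrapper: convert a copy of a best model and leave the
# original file untouched. COMBINE_TESSDATA is an assumed override for
# the tool path; it defaults to combine_tessdata from PATH.
make_fast_copy() {
    src=$1
    dst=$2
    cp -- "$src" "$dst" &&
        "${COMBINE_TESSDATA:-combine_tessdata}" -c "$dst"
}
```

For example, `make_fast_copy tessdata_best/eng.traineddata /tmp/eng.traineddata` followed by `combine_tessdata -d /tmp/eng.traineddata 2>&1 | grep lstm:size` shows the size of the converted lstm component.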
