Convert traineddata to integer ("fast" variant)

Stefan Weil

Feb 1, 2018, 10:35:39 AM
to tesseract-dev
Hi,

does anybody know how exactly the tessdata_fast/*.traineddata were built?

I tried these steps to convert `tessdata_best/eng.traineddata` to a fast variant:

    combine_tessdata -u tessdata_best/eng.traineddata /tmp/tmp.
    lstmtraining --convert_to_int=true --continue_from /tmp/tmp.lstm --traineddata tessdata_best/eng.traineddata --stop_training

The result is a file called `lstmtrain` which looks like a traineddata file. In large parts it is identical to `tessdata_fast/eng.traineddata`, but the size of the lstm component differs:

    $ combine_tessdata -d lstmtrain 2>&1 | grep lstm:size
    17:lstm:size=1487588, offset=192
    $ combine_tessdata -d tessdata_best/eng.traineddata 2>&1 | grep lstm:size
    17:lstm:size=11689099, offset=192
    $ combine_tessdata -d tessdata_fast/eng.traineddata 2>&1 | grep lstm:size
    17:lstm:size=401636, offset=192

So the fast model is evidently not just the best model converted to integer: its LSTM component is less than a third the size of my converted one, so the network itself must have been reduced further.
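To put numbers on the comparison above, here is a minimal shell sketch that extracts the `lstm:size` value from `combine_tessdata -d` output and prints the size ratios. The helper names `lstm_size` and `size_ratio` are made up for this illustration; the sizes are the ones quoted above.

```shell
# Print the lstm component size from `combine_tessdata -d` output on stdin.
# (lstm_size is a hypothetical helper name, not part of Tesseract.)
lstm_size() {
    sed -n 's/.*lstm:size=\([0-9][0-9]*\),.*/\1/p' | head -n 1
}

# Print how many times larger size $1 is than size $2, to two decimals.
size_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Sizes as reported in the listing above.
best=$(echo '17:lstm:size=11689099, offset=192' | lstm_size)
converted=$(echo '17:lstm:size=1487588, offset=192' | lstm_size)
fast=$(echo '17:lstm:size=401636, offset=192' | lstm_size)

echo "best / converted: $(size_ratio "$best" "$converted")"   # 7.86
echo "converted / fast: $(size_ratio "$converted" "$fast")"   # 3.70
```

With real files, the same helper can be fed directly, e.g. `combine_tessdata -d lstmtrain 2>&1 | lstm_size`.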

Regards
Stefan Weil

Jeff Breidenbach

Feb 8, 2018, 12:26:13 PM
to tesseract-dev
I have a magical file called lang_map_fast.txt. It comes with a comment from Ray that
says "Added quantized 'fast' models [...] and the lang_map that was used to select them."
The beginning of the file is shown below. Does this seem helpful? It doesn't mean anything
to me, because I was not involved in the generation at all.

Beyond this, the only person on the planet who can say more is Ray.

afr l36-64-96-512
ara l48-64-96-192
bel l36-48-96-128
ben l36-64-96-192
bul l36-48-96-128
...



ShreeDevi Kumar

Feb 8, 2018, 2:09:42 PM
to tesser...@googlegroups.com
I would guess that it lists the parameters used to build the network spec for the different languages, e.g.:
-net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'



Stefan Weil

May 10, 2018, 2:32:42 AM
to tesseract-dev


On Thursday, February 1, 2018 at 4:35:39 PM UTC+1, Stefan Weil wrote:
I tried these steps to convert `tessdata_best/eng.traineddata` to a fast variant:

    combine_tessdata -u tessdata_best/eng.traineddata /tmp/tmp.
    lstmtraining --convert_to_int=true --continue_from /tmp/tmp.lstm --traineddata tessdata_best/eng.traineddata --stop_training
[...]

The easiest way to create a fast variant is this:

     combine_tessdata -c eng.traineddata
 
It reads a traineddata file containing a best (float) model, converts the model to a fast (integer) one, and writes the result back, replacing the original file.
Unlike Ray's tessdata_fast models, it does not change the LSTM network spec.
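
Since `combine_tessdata -c` overwrites its input, it may be safer to run it on a copy. A minimal sketch, assuming `combine_tessdata` is on the PATH; the function name `convert_copy_to_int` and the example paths are made up for this illustration:

```shell
# Convert a copy of a best (float) traineddata file to an integer model,
# leaving the original file untouched.
# (convert_copy_to_int is a hypothetical helper name, not part of Tesseract.)
convert_copy_to_int() {
    src=$1
    dst=$2
    # Work on a copy, because `combine_tessdata -c` rewrites the file in place.
    cp -- "$src" "$dst" || return 1
    combine_tessdata -c "$dst"
}

# Example call (paths are placeholders):
#   convert_copy_to_int tessdata_best/eng.traineddata /tmp/eng.traineddata
```

Afterwards, `combine_tessdata -d` on the copy should report a smaller `lstm:size` than on the original.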