Re: [tesseract-ocr] How to get the net_spec


Shree Devi Kumar

Sep 16, 2023, 4:09:14 PM
to tesseract-ocr
https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md


Version string : 4.00.00alpha : [Network specification] for tessdata_best

tessdata_best models: the list on that page is incomplete and stops at Kannada.

The flags value is a bit field of TrainingFlags from lstmrecognizer.h: 0x40 means the unicharset is compressed and 1 means integer mode. The tessdata_best models have flags=40, i.e. compressed unicharset and not integer mode.
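As a rough illustration of the bit arithmetic above, here is a minimal Python sketch. It assumes the printed flags value is hexadecimal (as the "0x40" reading implies), and the names `TF_INT_MODE`, `TF_COMPRESS_UNICHARSET` and `decode_flags` are chosen here for illustration; check lstmrecognizer.h for the authoritative definitions.

```python
# Bit values as described in the post; the constant names are assumptions.
TF_INT_MODE = 0x01             # model uses integer arithmetic
TF_COMPRESS_UNICHARSET = 0x40  # unicharset is compressed (recoded)


def decode_flags(printed_value, base=16):
    """Decode a printed flags value, assuming it is hexadecimal by default."""
    value = int(printed_value, base)
    return {
        "int_mode": bool(value & TF_INT_MODE),
        "compress_unicharset": bool(value & TF_COMPRESS_UNICHARSET),
    }


if __name__ == "__main__":
    # flags=40 is the value shown for the tessdata_best models in this thread.
    print(decode_flags("40"))
```

Under that assumption, "40" decodes to a compressed unicharset without integer mode, which matches the description above.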

afr
Version string:4.00.00alpha:afr:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1], 
flags=40, iteration=286700, sample_iteration=286724, null_char=95, 
learning_rate=0.001, momentum=0.5, adam_beta=0.999

amh
Version string:4.00.00alpha:amh
LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], 
flags=40, iteration=6112200, sample_iteration=6112270, null_char=284, 
learning_rate=0.001, momentum=0.5, adam_beta=0.999
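
If the listing has been saved as text, the net_spec and the training parameters can be pulled out with a couple of regular expressions. This is only a sketch based on the layout of the two listings above; the helper names `extract_net_spec` and `parse_training_info` are hypothetical, and other Tesseract versions may print the information differently.

```python
import re

# The net_spec is the bracketed expression; key=value pairs follow it.
NET_SPEC_RE = re.compile(r"\[[^\[\]]+\]")
KEY_VALUE_RE = re.compile(r"(\w+)=([\w.]+)")


def extract_net_spec(listing):
    """Return the first bracketed network spec in the text, or None."""
    match = NET_SPEC_RE.search(listing)
    return match.group(0) if match else None


def parse_training_info(listing):
    """Return the key=value pairs as a dict of strings (values not converted)."""
    return dict(KEY_VALUE_RE.findall(listing))


# The amh listing quoted in this thread.
AMH_LISTING = (
    "Version string:4.00.00alpha:amh\n"
    "LSTM training info:Network str:"
    "[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1],\n"
    "flags=40, iteration=6112200, sample_iteration=6112270, null_char=284,\n"
    "learning_rate=0.001, momentum=0.5, adam_beta=0.999\n"
)

if __name__ == "__main__":
    print(extract_net_spec(AMH_LISTING))
    print(parse_training_info(AMH_LISTING))
```

A version line that carries no bracketed spec simply yields None from `extract_net_spec`.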


On Fri, Sep 15, 2023, 9:50 PM Des Bw <desal...@gmail.com> wrote:
For the last couple of days, I have been trying to train the amh data to include some missing characters. 

I have seen that Shree was able to add the Norwegian Æ by removing the top layer and training on it (https://groups.google.com/g/tesseract-ocr/c/l33zsTEPj70/m/wPzPv6HiEQAJ). 

I was trying to do the same, but the Amharic traineddata doesn't include the net_spec in its version line.

Version:4.00.00alpha:amh:synth20170629
17:lstm:size=3356155, offset=192
18:lstm-punc-dawg:size=3154, offset=3356347
19:lstm-word-dawg:size=5007810, offset=3359501
20:lstm-number-dawg:size=810, offset=8367311
21:lstm-unicharset:size=18906, offset=8368121
22:lstm-recoder:size=2578, offset=8387027
23:version:size=30, offset=8389605


Can somebody (@Shree, please) help me get the net_spec, or tell me how to proceed to add a layer to introduce the missing characters?


Shree Devi Kumar

Sep 16, 2023, 4:11:48 PM
to tesseract-ocr
The language name headings seem to be missing from the tessdoc page for tessdata_fast.

Please revert to an older version of the page from its history.

Shree Devi Kumar

Sep 16, 2023, 4:25:51 PM
to tesseract-ocr
combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact tessdata components in [lang].traineddata files.

Options to list network details

-d .traineddata FILE...: Lists directory of components from the .traineddata file.

-l .traineddata FILE...: List the network information.

Des Bw

Sep 17, 2023, 2:50:08 PM
to tesseract-ocr
Thank you so much, dear Shree. You are a lifesaver.

Tom Morris

Oct 26, 2023, 10:34:24 AM
to tesseract-ocr
On Saturday, September 16, 2023 at 4:11:48 PM UTC-4 shree wrote:
The language name headings seem to be missing from the tessdoc page for tessdata_fast.

Please revert to an older version of the page from its history.
 
I've regenerated the tessdata_fast page using the latest `combine_tessdata` and included the language names.

I also regenerated the tessdata_best page so that it is no longer truncated at Kannada. Both pages now separate the models for languages and scripts to reflect the new directory structure.

Tom

Tom Morris

Oct 26, 2023, 10:35:55 AM
to tesseract-ocr
On Saturday, September 16, 2023 at 4:25:51 PM UTC-4 shree wrote:
combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact tessdata components in [lang].traineddata files.

Options to list network details

-d .traineddata FILE...: Lists directory of components from the .traineddata file.

-l .traineddata FILE...: List the network information.

It was undocumented (until now), but you can also combine the above as -dl or -ld to get both sets of information at once.
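
For anyone scripting this, a minimal Python sketch of calling the tool with those options might look like the following. The helper names `build_list_command` and `list_traineddata` and the file name `amh.traineddata` are made up for illustration, and the sketch assumes `combine_tessdata` is installed and on PATH.

```python
import shutil
import subprocess


def build_list_command(traineddata_path, directory=True, network=True):
    """Build the argv for -d, -l, or the combined -dl form."""
    if not (directory or network):
        raise ValueError("select at least one of directory or network")
    flag = "-" + ("d" if directory else "") + ("l" if network else "")
    return ["combine_tessdata", flag, traineddata_path]


def list_traineddata(traineddata_path, **kwargs):
    """Run combine_tessdata and return whatever it printed."""
    if shutil.which("combine_tessdata") is None:
        raise FileNotFoundError("combine_tessdata is not on PATH")
    result = subprocess.run(
        build_list_command(traineddata_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    # Both streams are returned, in case the tool writes to either one.
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Hypothetical file name; point this at a real .traineddata file.
    print(build_list_command("amh.traineddata"))
```

Keeping the command construction separate from the call makes it easy to check the arguments without having Tesseract installed.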

Tom 

Des Bw

Oct 26, 2023, 12:14:59 PM
to tesseract-ocr
Thank you for adding those improvements, dear Tom.