Next problem with training (tesseract 4.0)


Adam Funk

Sep 17, 2019, 10:54:03 AM
to tesseract-ocr
Hi again,

Using the instructions at
<https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch>,
I'm getting a bit further, but when my script gets to this part:

combine_lang_model \
--input_unicharset "${UNICHARSET_FILE}" \
--script_dir "${TESSDATA_PREFIX}" \
--output_dir "${OUTPUT_DIR}" \
--pass_through_recoder \
--lang "${LANG_CODE}"

it fails with this error:

Config file is optional, continuing...
Failed to read data from: /home/adam/sandboxes/TEST/tessdata/mem/mem.config
Failed to read data from: /home/adam/sandboxes/TEST/tessdata/radical-stroke.txt
Error reading radical code table /home/adam/sandboxes/TEST/tessdata/radical-stroke.txt


I can't figure out from these instructions or the Tesseract
documentation on GitHub where the mem.config and radical-stroke.txt
files are supposed to come from. Any help would be greatly appreciated!

Also, the previous tesseract command is creating the *.lstmf files in
the same directory as the *.box and *.tif files --- are they supposed to
be in the TESSDATA_PREFIX directory instead?

Thanks,
Adam

Shree Devi Kumar

Sep 17, 2019, 11:38:19 AM
to tesseract-ocr
Config files exist only for some languages. They are in the langdata or langdata_lstm repos. radical-stroke.txt is also there.

You can also look at the training instructions in the wiki or in shreeshrii/tess4training.
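
As a rough illustration of the advice above, the sketch below checks the directory passed as --script_dir before combine_lang_model is run. It treats radical-stroke.txt as required (the failing run earlier in this thread stopped on it) and the per-language config as optional. The helper names check_script_dir and fetch_radical_stroke are hypothetical, and the download URL (including the `main` branch name) is an assumption that should be verified against the langdata_lstm repository.

```shell
# Hypothetical helper: report whether the script dir has what combine_lang_model reads.
# Usage: check_script_dir <script_dir> <lang>
check_script_dir() (
    script_dir="$1"
    lang="$2"
    status=0
    if [ -r "$script_dir/radical-stroke.txt" ]; then
        echo "ok: radical-stroke.txt"
    else
        echo "missing: radical-stroke.txt (copy it from langdata or langdata_lstm)"
        status=1
    fi
    # The per-language config is optional, so its absence is not an error.
    if [ -r "$script_dir/$lang/$lang.config" ]; then
        echo "ok: $lang/$lang.config"
    else
        echo "optional, absent: $lang/$lang.config"
    fi
    exit "$status"
)

# Hypothetical helper: download radical-stroke.txt into the script dir.
# Usage: fetch_radical_stroke <script_dir>
fetch_radical_stroke() (
    script_dir="$1"
    # Assumed URL (raw file on the repository's main branch); verify before relying on it.
    url="https://github.com/tesseract-ocr/langdata_lstm/raw/main/radical-stroke.txt"
    mkdir -p "$script_dir" || exit 1
    curl -fsSL -o "$script_dir/radical-stroke.txt" "$url"
)
```

For example, `check_script_dir "${TESSDATA_PREFIX}" "${LANG_CODE}" || fetch_radical_stroke "${TESSDATA_PREFIX}"` would only download the file when the check fails.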


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b685cfec-0144-fc06-b90f-e9ba54771316%40sheffield.ac.uk.

J Adam Funk

Sep 18, 2019, 9:19:40 AM
to tesseract-ocr
Those look very useful --- thanks!



J Adam Funk

Sep 20, 2019, 7:13:00 AM
to tesseract-ocr
Hi again,

I've tried using combine_tessdata -u to unpack the contents of the standard eng.traineddata to use as a "starter" for all the required files, but combine_lang_model is still failing with a "Failed to read data from: /home/adam/sandboxes/TEST/tessdata/eng/eng.config" error. (I have TESSDATA_PREFIX="/home/adam/sandboxes/TEST/tessdata" set.) Where do I get that file?

Thanks,
Adam
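
For reference, the unpacking step mentioned in this message can be sketched roughly as below. unpack_traineddata is a hypothetical wrapper and the DRY_RUN switch is an illustration device, not a Tesseract feature; the only real tool call is `combine_tessdata -u <traineddata> <output prefix>`.

```shell
# Hypothetical wrapper around "combine_tessdata -u".
# Usage: unpack_traineddata <file.traineddata> <output_dir>
# With DRY_RUN=1 the command is only printed, not executed.
unpack_traineddata() (
    traineddata="$1"
    out_dir="$2"
    lang="$(basename "$traineddata" .traineddata)"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "combine_tessdata -u $traineddata $out_dir/$lang."
        exit 0
    fi
    mkdir -p "$out_dir" || exit 1
    # The trailing "." matters: the second argument is a path prefix,
    # and each component's suffix is appended to it.
    combine_tessdata -u "$traineddata" "$out_dir/$lang."
)
```

Running it with `DRY_RUN=1` first is a cheap way to confirm the output prefix before any files are written.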



Shree Devi Kumar

Sep 20, 2019, 7:35:47 AM
to tesseract-ocr
English does not have a config file. It is optional and only used for some languages.


J Adam Funk

Sep 20, 2019, 9:12:50 AM
to tesseract-ocr
OK, so that "Failed..." is just a warning.  
Thanks!



Shree Devi Kumar

Sep 20, 2019, 9:22:08 AM
to tesseract-ocr
"Failed to read" is a generic warning from the common file-read routine (as far as I know).


For example, a training run log shows:

=== Constructing LSTM training data ===
Creating new directory ../tesstutorial/engtrain
[Mon Apr 1 08:36:38 UTC 2019] /usr/local/bin/combine_lang_model --input_unicharset /tmp/eng-2019-04-01.Q4Z/eng.unicharset --script_dir ../langdata --words ../langdata/eng/eng.wordlist --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir ../tesstutorial/engtrain --lang eng
Loaded unicharset of size 111 from file /tmp/eng-2019-04-01.Q4Z/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
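
Since the same "Failed to read data from" wording is printed for both the optional config and files that really are needed, one possible way to tell them apart is to filter the output by file name. The sketch below assumes each message sits on a single line, as in the log above, and that only paths ending in .config are safe to ignore; real_read_failures is a hypothetical helper name.

```shell
# Hypothetical helper: read combine_lang_model output on stdin and print only
# the "Failed to read data from" paths that are not optional .config files.
real_read_failures() {
    # "|| true" keeps the exit status at 0 when nothing is left after filtering.
    sed -n 's/^Failed to read data from: *//p' | grep -v '\.config$' || true
}
```

A possible use is `combine_lang_model ... 2>&1 | real_read_failures`, where an empty result suggests that only the optional config was missing.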

 


