How to regenerate the training text

Dingyuan Wang

Jun 15, 2017, 9:35:51 AM

to tesseract-ocr

Dear all,

I'm trying to generate a training text for chi_sim to train Tesseract, because I have a better dictionary and better unigram/bigram data than the defaults. I found the following comments in training/language-specific.sh

(line 845)

# Set language-specific values for several global variables, including
#   ${TEXT_CORPUS}
#      holds the text corpus file for the language, used in phase F
#   ${FONTS[@]}
#      holds a sequence of applicable fonts for the language, used in
#      phase F & I. only set if not already set, i.e. from command line
#   ${TRAINING_DATA_ARGUMENTS}
#      non-default arguments to the training_data program used in phase T
#   ${FILTER_ARGUMENTS} -
#      character-code-specific filtering to distinguish between scripts
#      (eg. CJK) used by filter_borbidden_characters in phase F
#   ${WORDLIST2DAWG_ARGUMENTS}
#      specify fixed length dawg generation for non-space-delimited lang
# TODO(dsl): We can refactor these into functions that assign FONTS,
# TEXT_CORPUS, etc. separately.

So I suppose there are scripts called training_data (phase T) and filter_borbidden_characters (sic, phase F) that create the training text from some wordlists and unigram/bigram frequency data.

Where are these scripts, or how can I otherwise generate training text from dictionary/corpus data?
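
To make the question concrete, here is a rough Python sketch of the kind of thing I have in mind. It is only an assumption about how such a generator could work, not what the scripts mentioned in the comments actually do: it samples words in proportion to their unigram counts and joins them into lines, with no separator since chi_sim is not space-delimited. The function name, parameters, and toy counts are all hypothetical, and bigram data is ignored.

```python
import random


def generate_training_text(unigram_counts, num_lines=5, words_per_line=10,
                           separator="", seed=0):
    """Build text lines by sampling words in proportion to their counts.

    Hypothetical helper for illustration only; it is not part of Tesseract.
    """
    if not unigram_counts:
        raise ValueError("unigram_counts must not be empty")
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    words = list(unigram_counts)
    weights = [unigram_counts[word] for word in words]
    lines = []
    for _ in range(num_lines):
        # Weighted sampling with replacement: frequent words appear more often.
        sampled = rng.choices(words, weights=weights, k=words_per_line)
        lines.append(separator.join(sampled))
    return "\n".join(lines) + "\n"


# Toy frequency data, made up for this example.
toy_counts = {"的": 50, "我们": 20, "文字": 10, "识别": 5}
training_text = generate_training_text(toy_counts, num_lines=3, words_per_line=8)
print(training_text, end="")
```

The result would presumably be written to the file that ${TEXT_CORPUS} points to, but I don't know whether that matches how the default training text was produced.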

Thanks.

ShreeDevi Kumar

Jun 15, 2017, 10:49:22 PM

to tesser...@googlegroups.com

> Where are these scripts, or how can I otherwise generate training text from dictionary/corpus data?

These are (most probably) internal scripts at Google which have not been open sourced.

Please see https://groups.google.com/forum/#!searchin/tesseract-ocr/training$20text%7Csort:date/tesseract-ocr/-B0mWBwki5w/zuR4R6AGAgAJ

which has Ray's comments regarding the generation of training text.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar

Jun 15, 2017, 10:51:31 PM

to tesser...@googlegroups.com

You can also see https://ancientgreekocr.org/ for Nick White's method of creating training data for Ancient Greek.

ShreeDevi
