Dear all,
I'm trying to generate a training text (chi_sim) for training Tesseract, because I have a better dictionary and better unigram/bigram data than the defaults. I found the following comments in training/language-specific.sh
(line 845)
# Set language-specific values for several global variables, including
# ${TEXT_CORPUS}
# holds the text corpus file for the language, used in phase F
# ${FONTS[@]}
# holds a sequence of applicable fonts for the language, used in
# phase F & I. only set if not already set, i.e. from command line
# ${TRAINING_DATA_ARGUMENTS}
# non-default arguments to the training_data program used in phase T
# ${FILTER_ARGUMENTS} -
# character-code-specific filtering to distinguish between scripts
# (eg. CJK) used by filter_borbidden_characters in phase F
# ${WORDLIST2DAWG_ARGUMENTS}
# specify fixed length dawg generation for non-space-delimited lang
# TODO(dsl): We can refactor these into functions that assign FONTS,
# TEXT_CORPUS, etc. separately.
So I suppose there are scripts called training_data (phase T) and filter_borbidden_characters (sic, phase F) that create the training text from wordlists and unigram/bigram frequency data.
Where are these scripts, or how can I otherwise generate training text from dictionary/corpus data?
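To make the second question concrete, here is a minimal sketch of the kind of generation I have in mind, assuming the unigram data is available as a word-to-count table. The function name, the parameters and the tiny frequency table are all hypothetical, and this is not how Tesseract's own tools work; it only samples words in proportion to their frequency and joins them without spaces, since chi_sim is not space-delimited.

```python
import random


def make_training_text(unigram_freq, n_lines=5, min_chars=20, seed=0):
    """Build lines of text by frequency-weighted sampling of words.

    unigram_freq: dict mapping word -> count (hypothetical input format).
    Words are concatenated without spaces until each line has at least
    min_chars characters.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    words = list(unigram_freq)
    weights = [unigram_freq[w] for w in words]
    lines = []
    for _ in range(n_lines):
        line = ""
        while len(line) < min_chars:
            # random.choices draws with replacement, proportional to weights
            line += rng.choices(words, weights=weights, k=1)[0]
        lines.append(line)
    return lines


if __name__ == "__main__":
    # Toy frequency table, made up for illustration only.
    freq = {"的": 100, "中国": 50, "我们": 30}
    for text_line in make_training_text(freq, n_lines=3, min_chars=10):
        print(text_line)
```

A bigram-aware version would condition each draw on the previous word instead of sampling independently, but I don't know whether the existing tooling expects anything like that.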
Thanks.