- I also have my training files (the text files, box files and .lstmf files) inside the oro-ground-truth folder.
But I am having trouble proceeding from there. All the instructions for training from scratch talk about using tesstrain.sh, which the manual calls unsupported and outdated.
- What should I do? Can you guys help me please?
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
Hi Jephthah,
Creating a starter traineddata:
You need:
1. unicharset: you can prepare it by hand. You can take the English sample and modify it.
2. script: if the language is written in the Latin script, you can download the Latin script file from the tesseract langdata_lstm repo on GitHub (https://github.com/tesseract-ocr/langdata_lstm). If the language uses Cyrillic, download the Cyrillic one instead.
The following are optional:
3. word: if you want to add a word list, you can create one.
4. number: if you have patterns in which numbers appear.
5. punc: if you have patterns in which punctuation appears.
(A sixth one is the radical-stroke file. You can download it from the above repo, but my experience is that tesseract creates it automatically.)
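If you would rather not write the word list and unicharset entirely by hand, here is a rough sketch of how they might be derived from the ground truth you already have. The *.gt.txt naming, the function names and --norm_mode 2 are my assumptions (that is what tesstrain uses by default, as far as I know); adjust them to your files and script.

```shell
#!/bin/sh
# Sketch only: helper names and file naming are assumptions, not part of tesseract.

# Build a word list: one whitespace-separated token per line, each kept once.
# Punctuation stays attached to the words, so you may want to clean the result.
make_wordlist() {
  gt_dir=$1
  out=$2
  cat "$gt_dir"/*.gt.txt \
    | tr -s '[:space:]' '\n' \
    | sed '/^$/d' \
    | sort -u > "$out"
}

# Build a unicharset from the box files with tesseract's unicharset_extractor.
# Requires the tesseract training tools to be installed.
make_unicharset() {
  gt_dir=$1
  out=$2
  unicharset_extractor --output_unicharset "$out" --norm_mode 2 "$gt_dir"/*.box
}

# Example usage (uncomment to run):
# make_wordlist oro-ground-truth jep.word
# make_unicharset oro-ground-truth jep.unicharset
```

You should still look over the generated files by hand, since any typo in the ground truth ends up in them.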
Assume the name of your language is Jephthah: you are going to organize those files as:
jep.unicharset
jep.word
jep.punc
jep.number
Put these files together in one folder (call it langModel for simplicity). Create two more folders, script and myOutput, inside the langModel folder (the script folder holds the script files you downloaded in item 2). Then point your terminal to the langModel folder and run:
combine_lang_model --input_unicharset jep.unicharset --script_dir script --output_dir myOutput --lang jep --words jep.word --puncs jep.punc --numbers jep.number
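That will produce a traineddata file, jep.traineddata, under the myOutput folder (combine_lang_model normally writes it to a subfolder named after the language, i.e. myOutput/jep/jep.traineddata). That is your starter traineddata.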
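Put together, the steps might look roughly like the sketch below. The folder names and the jep code are just the placeholders used in this thread, and LANG_CODE and MODEL_DIR are variable names I made up. The script skips the tesseract calls if the tools or the unicharset are missing, so it is safe to run before everything is in place.

```shell
#!/bin/sh
# Sketch only: LANG_CODE and MODEL_DIR are placeholder names from this thread.
LANG_CODE="${LANG_CODE:-jep}"
MODEL_DIR="${MODEL_DIR:-langModel}"

# Folder layout: language files in MODEL_DIR, script files in script/,
# generated starter model in myOutput/.
mkdir -p "$MODEL_DIR/script" "$MODEL_DIR/myOutput"

# Copy the script file(s) from langdata_lstm into "$MODEL_DIR/script" by hand.

if command -v combine_lang_model >/dev/null 2>&1 \
   && [ -f "$MODEL_DIR/$LANG_CODE.unicharset" ]; then
  combine_lang_model \
    --input_unicharset "$MODEL_DIR/$LANG_CODE.unicharset" \
    --script_dir "$MODEL_DIR/script" \
    --output_dir "$MODEL_DIR/myOutput" \
    --lang "$LANG_CODE" \
    --words "$MODEL_DIR/$LANG_CODE.word" \
    --puncs "$MODEL_DIR/$LANG_CODE.punc" \
    --numbers "$MODEL_DIR/$LANG_CODE.number"

  # List the components packed into the starter traineddata as a sanity check.
  starter="$MODEL_DIR/myOutput/$LANG_CODE/$LANG_CODE.traineddata"
  if [ -f "$starter" ]; then
    combine_tessdata -d "$starter"
  fi
else
  echo "combine_lang_model or $LANG_CODE.unicharset not found; skipping" >&2
fi
```

If the listing shows the unicharset, recoder and the word/punc/number dawgs, the starter file is most likely complete.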