How to start from scratch (new language) in Tesseract 5


Des Bw

Sep 10, 2023, 1:19:15 PM
to tesseract-ocr
I am trying to train a new language. I have prepared all the necessary files as per the manual, and I have combined them into a traineddata file using the combine_lang_model command.

- I also have my training files (the text files, box files and .lstmf files) inside the oro-ground-truth folder.


But I am having trouble proceeding from there. All the instructions for training from scratch talk about using tesstrain.sh, which the manual calls unsupported and outdated.

- What should I do? Can you guys help me please?

Des Bw

Sep 10, 2023, 2:06:56 PM
to tesseract-ocr

I was having a bit of trouble with the directory locations; setting TESSDATA_PREFIX seems to work better:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
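
Before starting a long training run, it may be worth checking that every line image in the ground-truth folder has a transcription next to it. What follows is only a minimal sketch: check_ground_truth is a made-up helper name, and it assumes the images are .tif or .png files whose transcriptions share the same base name with a .gt.txt extension.

```shell
# Report line images that have no matching (non-empty) .gt.txt file.
# Assumption: images are .tif or .png; transcriptions are <base>.gt.txt.
check_ground_truth() {
    local dir="$1" missing=0 img base
    for img in "$dir"/*.tif "$dir"/*.png; do
        [ -e "$img" ] || continue      # the glob matched nothing
        base="${img%.*}"
        if [ ! -s "$base.gt.txt" ]; then
            echo "missing or empty transcription: $base.gt.txt"
            missing=$((missing + 1))
        fi
    done
    echo "unpaired images: $missing"
    [ "$missing" -eq 0 ]               # non-zero status if anything is unpaired
}
```

Usage would be something like check_ground_truth data/oro-ground-truth, where the path is whatever folder your ground truth actually lives in.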

Ali hussain

Sep 11, 2023, 8:43:02 AM
to tesseract-ocr
Follow what I said in the previous conversations (https://groups.google.com/g/tesseract-ocr/c/gSzwpxa1oMM and https://groups.google.com/g/tesseract-ocr/c/-G7TZEnVHgE). When you run the training command, just remove MODEL_NAME=oro and I think it should work.

Des Bw

Sep 12, 2023, 6:55:59 AM
to tesseract-ocr
Thank you for the help, brother.

Jephthah Anga

Nov 16, 2023, 10:39:28 AM
to tesseract-ocr
Hi Des,

I am attempting to walk the same path you just walked and was hoping you could tell me where to start. I want to train a new language in Tesseract so that it recognizes text in that language. How do I create the files you mentioned above? Is there a central wiki with all the info I need to get started? What were the biggest challenges you faced, and in your opinion is it feasible to attempt to create a new language?

Thank you for your help

Des Bw

Nov 16, 2023, 1:10:52 PM
to tesseract-ocr

Hi Jephthah, 


To create a starter traineddata, you need:

1. unicharset: you can prepare it by hand. You can take the English sample and modify it.

2. script: if the language is written in Latin script, you can download the Latin script files from the tesseract GitHub repo (https://github.com/tesseract-ocr/langdata_lstm). If the language uses Cyrillic, download the respective script.

The following are optional:

3. word: a word list, if you want to add one.

4. number: patterns in which numbers appear, if you have them.

5. punc: patterns in which punctuation appears, if you have them.

(A 6th one is the radical-stroke file. You can download it from the above repo, but my experience is that tesseract creates it automatically.)


Assume the name of your language is Jephthah. You are going to name those files as:

jep.unicharset

jep.word

jep.punc

jep.number


Put these files together in one folder (call it langModel for simplicity). Create two more folders inside the langModel folder: script (for the downloaded script files) and myOutput. Then point your terminal to the langModel folder and run:

combine_lang_model --input_unicharset jep.unicharset --script_dir script --output_dir myOutput --lang jep --words jep.word --puncs jep.punc --numbers jep.number


That will produce a traineddata file, jep.traineddata, inside the myOutput folder. That is your starter traineddata.
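
The folder setup and the combine_lang_model call can be wrapped in a small function so that the optional lists are only passed when they exist. This is just a sketch under assumptions: build_starter is a hypothetical helper, the file names follow the combine_lang_model command above (jep.word, jep.punc, jep.number), and the script folder still has to be filled by hand with the files downloaded from langdata_lstm.

```shell
# Create the langModel layout and build the starter traineddata.
# Hypothetical helper: $1 is the langModel folder, $2 the language code.
build_starter() {
    local root="$1" lang="$2"
    mkdir -p "$root/script" "$root/myOutput"
    # The unicharset is required; the other three lists are optional.
    if [ ! -f "$root/$lang.unicharset" ]; then
        echo "missing $root/$lang.unicharset" >&2
        return 1
    fi
    # Collect the arguments in the positional parameters (portable, no arrays).
    set -- --input_unicharset "$root/$lang.unicharset" \
           --script_dir "$root/script" \
           --output_dir "$root/myOutput" \
           --lang "$lang"
    [ -f "$root/$lang.word" ]   && set -- "$@" --words "$root/$lang.word"
    [ -f "$root/$lang.punc" ]   && set -- "$@" --puncs "$root/$lang.punc"
    [ -f "$root/$lang.number" ] && set -- "$@" --numbers "$root/$lang.number"
    if command -v combine_lang_model >/dev/null 2>&1; then
        combine_lang_model "$@"
    else
        echo "combine_lang_model not found; would run it with: $*"
    fi
}
```

Calling build_starter langModel jep from the folder above langModel should be equivalent to the manual steps, minus any list file that is not present.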

Des Bw

Nov 16, 2023, 1:15:18 PM
to tesseract-ocr
Once you have the starter model, you can produce training materials such as the ground-truth sentences. You need at least 100,000 lines of text since you are going to train from scratch. Once you have those lines of text, run the text2image tool to produce the tif images and box files which tesseract will use for the training.
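
One way to go from a text file to line-level samples is to write each line to its own .gt.txt file and render it with text2image. The following is a rough sketch and not an official recipe: make_line_samples is a made-up name, and the font name and the size values are placeholders that you would have to adjust to a font that is installed and covers your characters.

```shell
# Split a text file into one .gt.txt per non-empty line and, if text2image
# is installed, render each line to <base>.tif and <base>.box.
# Assumptions: the default font and the size values are placeholders.
make_line_samples() {
    local text="$1" outdir="$2" font="${3:-DejaVu Sans}" n=0 line base
    mkdir -p "$outdir"
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue     # skip blank lines
        n=$((n + 1))
        base="$outdir/line_$n"
        printf '%s\n' "$line" > "$base.gt.txt"
        if command -v text2image >/dev/null 2>&1; then
            text2image --text="$base.gt.txt" --outputbase="$base" \
                --font="$font" --max_pages=1 --strip_unrenderable_words \
                --xsize=3600 --ysize=480 --ptsize=12 \
                </dev/null >/dev/null 2>&1 \
                || echo "text2image failed for $base" >&2
        fi
    done < "$text"
    echo "$n"                          # number of samples written
}
```

For example, make_line_samples corpus.txt jep-ground-truth "DejaVu Sans" prints how many line files were written; corpus.txt and jep-ground-truth are illustrative names.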

Jephthah Anga

Nov 17, 2023, 11:27:38 AM
to tesseract-ocr
Thank you so much for this detailed response. I have two follow up questions:

The language I am working with is based on Latin with the addition of an extra character (superscript u). With this in mind, could I use the Latin script on the tesseract GitHub repo? Would I have to modify it, or is specifying the characters by hand in the unicharset all I need to do?

Secondly, my training data are all in image files already. These images were taken from handwritten texts submitted by the communities that speak the Innu-aimun language. Is it necessary to run the text2image script when the data is already in image form? Or would I have to convert these image files to text first and then run the text2image script on the resulting lines of text?

Thank you

Des Bw

Nov 17, 2023, 11:57:53 AM
to tesseract-ocr
>The language I am working with is based on Latin with the addition of an extra character (superscript u). With this in mind, could I use the Latin script on the tesseract GitHub repo? Would I have to modify it, or is specifying the characters by hand in the unicharset all I need to do?

Yes, you can use the Latin script in the GitHub repo. You can manually remove the characters that are not used in the Innu-aimun language, and add the characters specific to the language. You can also take some written material in Innu-aimun, extract its characters, and use them for the unicharset.

Here is a shell one-liner that extracts the characters from a given text, if you want to try it:

cat mytext.txt | grep -o . | sort | uniq -c | sort -bnr > character_list_sorted_frequency.txt

You can also use it to investigate the quality of a text. If the material contains characters from other languages, you need to remove the words or sentences which contain them.
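
One caveat with a pipeline like the one above: the result depends on the locale. Without a UTF-8 locale, grep may split a character such as the superscript u into separate bytes, and in some locales sort and uniq can treat different characters as equal. A slightly more defensive sketch follows; char_freq is a made-up name, and C.UTF-8 is only an assumed locale name, so substitute whatever UTF-8 locale your system has (en_US.UTF-8, for instance) through the CHAR_LOCALE variable.

```shell
# Count how often each character occurs in a file, most frequent first.
# Assumption: a locale called C.UTF-8 exists; override it with CHAR_LOCALE.
char_freq() {
    # grep splits into characters under a UTF-8 locale; sort and uniq
    # compare raw bytes (LC_ALL=C) so distinct characters are never merged.
    LC_ALL="${CHAR_LOCALE:-C.UTF-8}" grep -o . "$1" \
        | LC_ALL=C sort | LC_ALL=C uniq -c | sort -bnr
}
```

It is used the same way as the one-liner: char_freq mytext.txt > character_list_sorted_frequency.txt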

>Secondly, my training data are all in image files already. These images were taken from handwritten texts submitted by the communities that speak the Innu-aimun language. Is it necessary to run the text2image script when the data is already in image form? Or would I have to convert these image files to text first and then run the text2image script on the resulting lines of text?

How about the transcription of the images? If you have the transcription, you can use that as the ground-truth text, and then generate the box files from the ground truth and the images. If you don't have the transcription, you need to manually transcribe each image. Personally, I find using actual images to produce training material very difficult. You need to go through convoluted ways of generating the box files. The process doesn't seem to be part of the official way of doing things. I think you need to use hOCR to generate the box files from the images.
- Furthermore, handwritten texts are very hard to train on. I don't want to discourage you, but it is always good to be ready for what will come. You should expect a lot of hard work, and probably a lot of frustration, because very few people seem to have success with handwritten material.

The official and most straightforward way is to have lines of text and generate synthetic data (images) from them. This strategy will have little success in recognizing handwritten texts (if that is your ultimate purpose). But if your purpose is to scan and OCR regular typed books, my suggestion is to go this route.

Note that I am just a dabbler, as you are; you should not take my advice too seriously. You should keep experimenting in your own ways.