How to generate .lstmf file with non-randomized lines

Ben Bongalon

Jan 5, 2021, 2:44:43 AM

to tesseract-ocr

Hello and Happy New Year,

I am training Tesseract 4 to recognize special characters in a Philippine bilingual dictionary (specifically Hanunoo -> English). I am following the "Fine Tuning" tutorial but using Spanish as the starting model, and I am getting good recognition accuracy on some characters, such as eng "ŋ", but not on others.

To improve this, I plan to experiment with various combinations of input training data sampled from the source dictionary. However, I noticed that when tesseract generates an .lstmf file, it randomly picks lines from the training file. That is, the following command

$ tesseract <my-TIF-file> <lstmf-name> --psm 6 lstm.train

produces a different .lstmf file when called again with the same input TIF file. This makes it harder to tell whether a performance difference is due to the quality of the training data itself or is simply statistical variation resulting from how tesseract happened to randomly choose the data for the .lstmf file.
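
One way to confirm that two runs really differ is a byte-for-byte comparison of the outputs. The following is only a minimal sketch: the helper name files_identical and the file names page.tif, run1 and run2 are placeholders made up for illustration, not anything from Tesseract itself. Note that a byte-level difference shows only that the files are not the same; by itself it does not prove that different lines were selected.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): prints "identical" when the
# two files match byte for byte, and "different" otherwise, including the
# case where either file cannot be read.
files_identical() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

# Assumed usage; the file names below are placeholders:
#   tesseract page.tif run1 --psm 6 lstm.train
#   tesseract page.tif run2 --psm 6 lstm.train
#   files_identical run1.lstmf run2.lstmf
```

If this prints "different" for two runs on the same TIF file, the variation comes from the generation step rather than from the input.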

My questions:

1. How can I force tesseract not to randomize the data when generating an .lstmf file?

2. Is there anything I can do to minimize the effect of the randomized .lstmf data?