Oh boy....
Well, there are some steps to do (again, I put this together from Google searches; if someone knows a better way, please let me know). I'll list them with a short description; if you need more details, we can talk later.
- Prepare the training set: you'll need some examples to work with. The more, the merrier. After that, you need to standardize the training set. I got better results with 300 dpi images in TIFF format.
- Process the training set: one of the mistakes I made was applying some filters to the images and not applying the same filter on the training set. If you use some processing or filter (I used binarization and noise removal), you need to apply that to the training set as well.
- Create the truth files: the training will be based on these truth files. In early versions of tesseract, you had to cut the images and provide some text files. It's easier now: you can create .box files from your images using tesseract itself. The command is tesseract <image>.tiff <output_name> -l <language> wordstrbox
- Change the .box files: go through the truth files (these .box files) and correct them. These files will be the base for the fine tuning. If the output was an "a" and it should be an "s", change it in these files.
- Create the training files: after correcting every box file you have for the training set, create the training files. The command is tesseract <image>.tiff <output_name> lstm.train
- Generate the training base file: no mystery here, the training requires a file with the paths of ALL the .lstmf files created in the previous step. On Linux, you can do this with the command ls -1 *.lstmf > all_lstmf.txt
- Tuning: now comes the real training. The command is:
lstmtraining \
--model_output <path_output> \
--continue_from <path_language_lstm> \
--traineddata <path_traineddata> \
--train_listfile <path_all_lstmf.txt> \
--max_iterations <max_iterations>
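Some considerations on the command above: you'll need the lstm file of the language you are fine tuning. You can get it from the tesseract GitHub (ALWAYS USE THE BEST FOLDER, i.e. tessdata_best); as far as I know, the .lstm file is not downloaded separately but extracted from that traineddata with combine_tessdata -e <path_traineddata> <path_language_lstm>. You need the traineddata of this language too. Again, use the BEST.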
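In case it helps, the preparation steps from the list above can be wrapped in a few shell functions. This is only a rough sketch, not exactly what I ran: the function names are made up, and it assumes all the (already preprocessed) images are *.tiff files in a single directory, so that each .box and .lstmf file ends up next to its image.

```shell
# Hypothetical helpers; names and directory layout are assumptions.

# Write a WordStr .box file next to every *.tiff image in a directory.
# $1 = image directory, $2 = language code (e.g. eng)
make_box_files() {
    dir=$1
    lang=$2
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue   # skip if the glob matched nothing
        tesseract "$img" "${img%.tiff}" -l "$lang" wordstrbox
    done
}

# After correcting the .box files by hand, write an .lstmf file
# next to every image.
# $1 = image directory
make_lstmf_files() {
    dir=$1
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue
        tesseract "$img" "${img%.tiff}" lstm.train
    done
}

# Collect the paths of all .lstmf files into the list file that
# lstmtraining reads through --train_listfile.
# $1 = image directory, $2 = output list file
make_list_file() {
    dir=$1
    out=$2
    ls -1 "$dir"/*.lstmf > "$out"
}

# Extract the LSTM model from a tessdata_best traineddata file,
# to be passed to lstmtraining through --continue_from.
# $1 = path to <language>.traineddata, $2 = output .lstm path
extract_lstm() {
    combine_tessdata -e "$1" "$2"
}
```

With the images in ./train, the order would be: make_box_files ./train eng, then fix the .box files by hand, then make_lstmf_files ./train and make_list_file ./train all_lstmf.txt. The manual correction in the middle is the reason these are separate functions and not one script that runs straight through.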
After the training finishes, create the traineddata for the new fine tuned language:
lstmtraining \
--stop_training \
--continue_from <path_output>_checkpoint \
--traineddata <path_traineddata> \
--model_output <path_output_new_language>.traineddata
With these steps, you'll have a new .traineddata file. Put it in your tessdata directory and you're ready to go.
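For the last part, something like the sketch below should do. The function names are hypothetical and the tessdata location depends on your install, so treat the paths as placeholders. The language name you pass to -l is the file name without the .traineddata extension.

```shell
# Hypothetical helpers; names and paths are assumptions.

# Copy a new .traineddata file into a tessdata directory and print
# the language name to use with -l.
# $1 = path to the new .traineddata, $2 = tessdata directory
install_model() {
    src=$1
    tessdata_dir=$2
    cp "$src" "$tessdata_dir/" || return 1
    basename "$src" .traineddata
}

# List the languages tesseract finds in a given tessdata directory,
# to check that the new one shows up.
# $1 = tessdata directory
list_models() {
    tesseract --tessdata-dir "$1" --list-langs
}
```

After install_model <path_output_new_language>.traineddata <tessdata_dir>, you can run tesseract <image>.tiff <output_name> --tessdata-dir <tessdata_dir> -l <new_language> and compare the result against the original language on a few images that were not in the training set.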
I could've missed something, I'm doing this from memory, but I'm almost sure that's all I did.
Hope this helps.
Best regards.