Training question

Ali hussain

Jul 19, 2023, 9:51:43 PM

to tesseract-ocr

I'm new to Tesseract and am trying to train my own fonts on Tesseract 5.3.2. If the power is cut, or I close VS Code, or the training process is interrupted in some other way, and I then run the training command again, does it start from the beginning or resume from where it was interrupted?

I have already tested this, and it starts from the beginning every time. Is there a way to handle this? Training takes a long time, and restarting from the beginning each time should not be necessary. Or is this normal?

I use this script to generate the .tif images with text2image for multiple fonts in Tesseract 5.3.2:

import os
import random
import pathlib
import subprocess

training_text_file = 'langdata/ben.training_text'
font_list = [
    'FL Badhon Ansari Rh. Unicode',
    'F Khairuddin Barbarusa Rah. Uni',
    'F Mahfuj Art Unicode Italic',
    'F Mahfuj Art Unicode',
    'FL Niribili Plain Unicode',
    'FL Niribili Plain Unicode Itali Italic',
]  # Add more fonts as needed

# Read the training text, one sample per line.
with open(training_text_file, 'r', encoding='utf-8') as input_file:
    lines = [line.strip() for line in input_file]

output_directory = 'tesstrain/data/ben-ground-truth'
os.makedirs(output_directory, exist_ok=True)

random.shuffle(lines)

# Keep only the first `count` shuffled lines.
count = 100
lines = lines[:count]

training_text_file_name = pathlib.Path(training_text_file).stem

line_count = 0
for line in lines:
    for font in font_list:
        # Write the ground-truth text for this sample.
        line_training_text = os.path.join(
            output_directory, f'{training_text_file_name}_{line_count}.gt.txt')
        with open(line_training_text, 'w', encoding='utf-8') as output_file:
            output_file.writelines([line])

        file_base_name = f'ben_{line_count}'

        # Render the line to an image with text2image.
        subprocess.run([
            'text2image',
            f'--font={font}',
            f'--text={line_training_text}',
            f'--outputbase={output_directory}/{file_base_name}',
            '--max_pages=1',
            '--strip_unrenderable_words',
            '--leading=32',
            '--xsize=3600',
            '--ysize=350',
            '--char_spacing=1.0',
            '--exposure=0',
            '--unicharset_file=langdata/ben.unicharset'
        ])

        # One index per (line, font) pair so outputs do not overwrite each other.
        line_count += 1
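
Since the question is about interruptions, here is a minimal sketch of how the generation loop could be made safe to re-run: it lists only the (line, font) pairs whose .tif output is still missing. The helper name `pending_jobs` and its `prefix` parameter are hypothetical, not part of Tesseract or tesstrain. It assumes one output index per (line, font) pair and that the line order is identical between runs, which is not true for the script as posted unless `random.shuffle` is seeded or the shuffled list is saved.

```python
import os


def pending_jobs(lines, fonts, output_directory, prefix="ben"):
    """Return (index, line, font) tuples whose .tif output is still missing.

    Hypothetical helper: assumes outputs are named <prefix>_<index>.tif and
    that indices are assigned in loop order, one per (line, font) pair.
    """
    jobs = []
    index = 0
    for line in lines:
        for font in fonts:
            tif_path = os.path.join(output_directory, f"{prefix}_{index}.tif")
            if not os.path.exists(tif_path):
                jobs.append((index, line, font))
            index += 1
    return jobs
```

The returned tuples could then be passed to the same text2image call as in the script. An interrupted run may leave a partially written image behind, so the last file produced before the interruption is worth deleting before re-running.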

And this script runs the training:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Ali hussain

Jul 25, 2023, 4:48:25 AM

to tesseract-ocr

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:

    command = f"lstmtraining --continue_from data/ben/checkpoints/ben_19.535_298_300.checkpoint --traineddata data/ben/ben.traineddata --model_output data/ben/checkpoints/ben --train_listfile data/ben/list.train --eval_listfile data/ben/list.eval --max_iterations 1000"
    subprocess.run(command, shell=True)

I fixed the problem. This code works for me: it continues training from a saved checkpoint by passing the checkpoint file to lstmtraining with --continue_from.
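
The checkpoint path in the command is hard-coded, so it has to be edited by hand after each run. A minimal sketch of one way to avoid that: pick the most recently modified `*.checkpoint` file and build the same lstmtraining arguments from it. The helpers `latest_checkpoint` and `resume_command` are hypothetical names, and choosing by modification time is an assumption, not something Tesseract prescribes. Only the flags and paths that already appear in the command are used.

```python
import glob
import os


def latest_checkpoint(checkpoint_dir):
    """Return the most recently modified *.checkpoint file, or None if there is none."""
    candidates = glob.glob(os.path.join(checkpoint_dir, "*.checkpoint"))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def resume_command(checkpoint, max_iterations, data_dir="data/ben", model="ben"):
    """Build the lstmtraining argument list (same flags as the posted command)."""
    return [
        "lstmtraining",
        "--continue_from", checkpoint,
        "--traineddata", f"{data_dir}/{model}.traineddata",
        "--model_output", f"{data_dir}/checkpoints/{model}",
        "--train_listfile", f"{data_dir}/list.train",
        "--eval_listfile", f"{data_dir}/list.eval",
        "--max_iterations", str(max_iterations),
    ]
```

With these, `latest_checkpoint("data/ben/checkpoints")` would give the file to continue from, and the list from `resume_command(...)` can be passed directly to `subprocess.run` without `shell=True`.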

Ali hussain

Jul 25, 2023, 5:01:20 AM

to tesseract-ocr

Make sure the training script is run from inside the tesstrain folder. Run the first training command (make training) to train on your data, and if you want to continue training from a checkpoint, run the command from my second post.
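
One way to make the "run it from the tesstrain folder" requirement explicit is to pass the folder and the environment variable to `subprocess.run` instead of relying on the current directory and a shell string. This is only a sketch: `make_training_call` is a hypothetical helper, the check for a `Makefile` is an assumption about how to recognise the tesstrain folder, and the make variables are copied from the training command in the first post.

```python
import os


def make_training_call(tesstrain_dir, model_name="ben", max_iterations=10000):
    """Return (args, kwargs) for subprocess.run, to be executed inside tesstrain_dir."""
    # Assumption: the tesstrain folder is recognised by its Makefile.
    if not os.path.isfile(os.path.join(tesstrain_dir, "Makefile")):
        raise FileNotFoundError(f"no Makefile found in {tesstrain_dir}")
    args = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        "START_MODEL=ben",
        "TESSDATA=../tesseract/tessdata",
        f"MAX_ITERATIONS={max_iterations}",
        "LANG_TYPE=Indic",
    ]
    # Same TESSDATA_PREFIX as in the posted command, set via the environment.
    env = dict(os.environ, TESSDATA_PREFIX="../tesseract/tessdata")
    return args, {"cwd": tesstrain_dir, "env": env, "check": True}
```

Usage would be `args, kwargs = make_training_call("tesstrain")` followed by `subprocess.run(args, **kwargs)`. The relative tessdata paths are then resolved from the tesstrain folder regardless of where the script itself is started.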

Des Bw

Sep 4, 2023, 7:34:04 AM

to tesseract-ocr

Thank you man. This is very useful.

Des Bw

Oct 30, 2023, 11:04:23 AM

to tesseract-ocr

Hi Ali,

Do you think starting and stopping at a specific line would also be possible for the actual training, just as you have done for text2image?

Today, I was very surprised to find that Tesseract always restarts from the beginning every time the process is interrupted.

https://github.com/tesseract-ocr/tesseract/issues/3954

This is very bad. It can definitely degrade the accuracy of the training, especially for larger data sets, because the training essentially runs on only some of the lines (the later text lines are ignored).

So, if you have 800,000 text lines; and you run your training step by step:

Round 1: 10,000 iterations

Round 2: 100,000 iterations

Round 3: 400,000 iterations

Round 4: 400,000 iterations

Basically, you used only 400,000 text lines. The other 400,000 text lines are not used for training. They are wasted.
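
The arithmetic behind this claim can be checked with a short sketch. It follows the post's own model, which is an assumption rather than verified Tesseract behaviour: one iteration consumes one text line, and a restarted round begins again at the first line. Round 2 is taken to be 100,000 iterations. The function name `lines_seen` is hypothetical.

```python
def lines_seen(rounds, total_lines, resume):
    """Number of distinct text lines visited under the post's model.

    rounds: iterations per training round.
    resume: True if each round continues where the previous one stopped,
            False if every round starts again from the first line.
    """
    if resume:
        reached = sum(rounds)
    else:
        reached = max(rounds)
    # Training cannot visit more distinct lines than exist.
    return min(reached, total_lines)


rounds = [10_000, 100_000, 400_000, 400_000]
total = 800_000

print(lines_seen(rounds, total, resume=False))  # 400000
print(lines_seen(rounds, total, resume=True))   # 800000
```

Under these assumptions, restarting every round never gets past the longest single round (400,000 lines), whereas resuming would reach all 800,000 lines.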

So, it would be great if we could have a similar Python script that could stop and resume the training.

Ali hussain

Oct 30, 2023, 6:47:22 PM

to tesseract-ocr

I did it by selecting the checkpoint, and I think it worked for me. I tested it several times with a small number of text lines and iterations.
I have never used "stop and resume the training" in the way you describe.

You can try modifying the training command to stop and resume training. If you succeed, please share it with me as well. I am also trying.
