Train Tesseract with my own Data

testcoal

Apr 22, 2024, 2:08:09 PM
to tesseract-ocr
Hi,
I am trying to train a Tesseract model with my own data. This is my code:
import os

# Configure paths
TRAIN_DATA_DIR = "./data1"
TRAIN_LISTFILE = "./trainingsliste.txt"
OUTPUT_DIR = "./output"
TRAINEDDATA = "./tesseract-4.1/tessdata/deu.traineddata"
# Check that the required paths exist
if not os.path.exists(TRAIN_DATA_DIR) or not os.path.exists(TRAIN_LISTFILE) or not os.path.exists(TRAINEDDATA):
    raise FileNotFoundError("One or more required directories/files are missing.")

# Create the output directory if it does not exist
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)


# Training configuration
MAX_ITERATIONS = 200
os.environ['OMP_THREAD_LIMIT'] = '16'

# Training command
command = f'lstmtraining --model_output {OUTPUT_DIR}/font_name --traineddata {TRAINEDDATA} --train_listfile {TRAIN_LISTFILE} --max_iterations {MAX_ITERATIONS}'
result = os.system(command + " > train_output.txt 2>&1")
print("Executed command:", command)

if result != 0:
    with open('train_output.txt', 'r') as file:
        output = file.read()
    print("Training failed:", output)
    raise Exception("Failed to start the training process.")

and this is the error:

Must specify an input layer as the first layer, not !!
Failed to create network from spec:

Yaofu Zhou

May 21, 2024, 5:15:42 AM
to tesseract-ocr
Hi. You seem to be missing a lot of the required input. Please take a look at Tesstrain, and particularly its Makefile, so that you know what is involved in the training process. I would go over the official Tesstrain documentation and run "make help" to see the inputs it expects. One of the items, among many, that you have not specified is the CNN-LSTM network spec, which you can ask GPT/Claude to explain to you.
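Concretely, the "Must specify an input layer as the first layer" error usually means lstmtraining had nothing to build a network from: you passed neither --continue_from (an existing .lstm model to fine-tune) nor --net_spec (a VGSL string describing a network to train from scratch). Below is a minimal sketch of the fine-tuning route, assuming you want to start from the stock deu model; the output paths are illustrative, subprocess is simply used in place of os.system, and the deu.traineddata must be a float ("best") model, since the integer ("fast") models cannot be fine-tuned.

import subprocess

TRAINEDDATA = "./tesseract-4.1/tessdata/deu.traineddata"
LSTM_MODEL = "./output/deu.lstm"

# Extract the LSTM component of the existing model (only needed once).
subprocess.run(["combine_tessdata", "-e", TRAINEDDATA, LSTM_MODEL], check=True)

# Fine-tune, continuing from the extracted model instead of an empty network spec.
subprocess.run([
    "lstmtraining",
    "--model_output", "./output/font_name",
    "--continue_from", LSTM_MODEL,
    "--traineddata", TRAINEDDATA,
    "--train_listfile", "./trainingsliste.txt",
    "--max_iterations", "200",
], check=True)

If you instead want to train from scratch, you have to pass --net_spec with a VGSL string and point --traineddata at a starter traineddata built with combine_lang_model; the Tesstrain Makefile shows how those pieces are produced.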

Furthermore, you can use GPT or Claude to digest the Makefile for you so that you know which binaries are invoked during the different steps of the training process. Once you know which binaries are involved, you can run each of them with --help (for example, "lstmtraining --help") to see the complete list of options, some of which are not set in the Tesstrain Makefile.

Once you digest the Tesstrain Makefile, it will become clear to you that, as messy as it may be, it is just an ugly wrapper that runs various Tesseract binaries in sequence, which is similar to what you were trying to achieve. Then you can (use GPT/Claude to) tailor the Makefile to your needs, or even turn it into an equivalent Python script for easier modification, along the lines of the sketch below. This is almost certainly necessary if your training set is very large.
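For instance, once the .lstmf files, the listfile, and a base model exist, the final stage of the Makefile reduces to a handful of binary invocations that you can script directly. Here is a rough sketch of that idea, assuming the fine-tuning setup from above; the eval listfile (evalliste.txt) and the output name deu_finetuned.traineddata are made-up placeholders.

import subprocess

TRAINEDDATA = "./tesseract-4.1/tessdata/deu.traineddata"
OUTPUT_DIR = "./output"
MODEL_BASE = f"{OUTPUT_DIR}/font_name"

def run(args):
    # Print and run one training binary, aborting on the first failure.
    print("Running:", " ".join(args))
    subprocess.run(args, check=True)

# 1. Train (here: fine-tune); checkpoints are written as <model_output>_checkpoint etc.
run([
    "lstmtraining",
    "--model_output", MODEL_BASE,
    "--continue_from", f"{OUTPUT_DIR}/deu.lstm",  # extracted earlier with combine_tessdata -e
    "--traineddata", TRAINEDDATA,
    "--train_listfile", "./trainingsliste.txt",
    "--max_iterations", "200",
])

# 2. Check the error rate on a held-out list (placeholder file name).
run([
    "lstmeval",
    "--model", f"{MODEL_BASE}_checkpoint",
    "--traineddata", TRAINEDDATA,
    "--eval_listfile", "./evalliste.txt",
])

# 3. Package the checkpoint into a .traineddata usable with tesseract -l.
run([
    "lstmtraining",
    "--stop_training",
    "--continue_from", f"{MODEL_BASE}_checkpoint",
    "--traineddata", TRAINEDDATA,
    "--model_output", f"{OUTPUT_DIR}/deu_finetuned.traineddata",
])

Generating the .lstmf files and, if your character set changes, the unicharset and starter traineddata are separate, earlier steps in the Makefile, but the same pattern applies to them.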