Train Tesseract 5 German for new font

testcoal

May 12, 2024, 2:52:57 PM

to tesseract-ocr

Hi,

I wanted to reach out regarding my recent attempt to train Tesseract 5 for a new font, specifically in German. I followed a tutorial I found on YouTube (https://www.youtube.com/watch?v=KE4xEzFGSU8) and initially had success when training it for English. However, after switching to German, I encountered an error that I'm struggling to resolve.

The issue arises with the file data/deu/Apex.lstm-unicharset, which appears to be missing. In langdata, I've confirmed that the file deu.unicharset exists and is correct; all German characters are present as expected. However, on further inspection, I noticed that data/Apex/my.unicharset does not seem to include all characters from the all-gt dataset.

I've reviewed the process and ensured that all steps were followed accurately, but I'm still encountering this error.

error_Tesseract5.PNG

Tom Morris

May 13, 2024, 12:19:11 PM

to tesseract-ocr

It would be much easier to quote, and comment on, your commands and errors if they were in text format rather than locked away in a picture. It would also make it possible for future users to search for them.

The error message references a different filename for input than what the previous merge command specifies for output, so that's where I'd start my search to debug this.
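
Separately, the suspicion in the original post that data/Apex/my.unicharset is missing characters from all-gt can be checked with a few lines of Python. This is only a rough sketch under several assumptions: that the unicharset file has the entry count on its first line and one space-separated entry per line starting with the character itself, that NULL, Joined and |Broken|0|1 are special entries rather than real characters, and that the ground-truth text sits at data/Apex/all-gt (that path is a guess, adjust it to your setup). The function names are made up for illustration.

```python
import unicodedata
from pathlib import Path

# Entries assumed to be special markers rather than real characters.
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def load_unichars(unicharset_text: str) -> set[str]:
    """Collect the code points that appear in a unicharset file's entries."""
    chars: set[str] = set()
    # Skip the first line (assumed to be the entry count).
    for line in unicharset_text.splitlines()[1:]:
        entry = line.split(" ", 1)[0]
        if not entry or entry in SPECIAL_ENTRIES:
            continue
        chars.update(unicodedata.normalize("NFC", entry))
    return chars


def missing_chars(gt_text: str, unichars: set[str]) -> list[str]:
    """Return the non-whitespace characters of gt_text that are not covered."""
    seen = set(unicodedata.normalize("NFC", gt_text))
    return sorted(c for c in seen if not c.isspace() and c not in unichars)


if __name__ == "__main__":
    gt_path = Path("data/Apex/all-gt")                 # assumed location
    unicharset_path = Path("data/Apex/my.unicharset")  # path from the post
    if gt_path.exists() and unicharset_path.exists():
        unichars = load_unichars(unicharset_path.read_text(encoding="utf-8"))
        for c in missing_chars(gt_path.read_text(encoding="utf-8"), unichars):
            print(f"U+{ord(c):04X} {c!r}")
```

Anything it prints is a character that occurs in the ground truth but not in the unicharset. Tesseract's own normalization may differ from the NFC used here, so treat the output as a hint about where to look, not as proof.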