Add a character to language data

koy...@googlemail.com

Jun 16, 2016, 10:01:49 AM
to tesseract-ocr
Hi all,

I'm trying to set up tesseract to scan German documents. So far everything works just fine, except tesseract won't recognize the character "§". This is slightly frustrating, since the documents in question are mostly legal stuff and the "§" is used a lot. It has the meaning of article or section and is not uncommon at all.

I tried to add it as a user-pattern or user-word, without success. I then looked through the files on GitHub in tesseract-ocr/langdata/tree/master/deu, and it seems the § is neither in the desired_characters file nor anywhere in deu.wordlist.

Does that mean that tesseract does not try to find a § in the documents at all? If so, is there a way to add the character to the language data without completely retraining tesseract? I'm not sure I could do a full training myself.

Thanks

Quan Nguyen

Jun 20, 2016, 7:09:54 PM
to tesseract-ocr
You can unpack the deu.traineddata file, modify the extracted deu.unicharambigs so that it always replaces the misrecognized character with the § symbol, and then recombine the component files. Check the Training Wiki for details on the commands.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
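For anyone following along, the unpack / edit / recombine steps can be sketched roughly as below. This is a minimal sketch under assumptions, not a tested recipe: `combine_tessdata -u` and `combine_tessdata <prefix>` are the commands described in the Training Wiki, but the helper names, the `deu.` prefix, and the assumption that the extracted deu.unicharambigs uses the tab-separated v1 layout (count, characters, count, characters, type) are only for illustration. It is also worth checking first that § is present in the extracted deu.unicharset, since a rule can presumably only produce a character the model already knows.

```python
# Sketch only: helper names and file names are illustrative assumptions.

def unpack_cmd(traineddata="deu.traineddata", prefix="deu."):
    """Command that extracts the components as deu.unicharset, deu.unicharambigs, ..."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def repack_cmd(prefix="deu."):
    """Command that recombines all files starting with the prefix into one traineddata."""
    return ["combine_tessdata", prefix]


def ambig_line(wrong, right, mandatory=True):
    """Build one unicharambigs rule, assuming the tab-separated v1 layout.

    Fields: number of source characters, the source characters separated by
    spaces, number of target characters, the target characters separated by
    spaces, and a type flag (1 = always replace, 0 = optional hint).
    """
    flag = "1" if mandatory else "0"
    return "\t".join(
        [str(len(wrong)), " ".join(wrong), str(len(right)), " ".join(right), flag]
    )


if __name__ == "__main__":
    # Print the two commands; run them by hand (or via subprocess.run)
    # once the edited deu.unicharambigs is in place.
    print(" ".join(unpack_cmd()))
    print(" ".join(repack_cmd()))
```

A rule built with `ambig_line("5", "§")` would be appended as a new line to the extracted deu.unicharambigs before recombining. Passing `mandatory=False` produces an optional rule (type 0) instead of an unconditional replacement.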

On the other hand, full training is not that difficult. There are tools available that automate the entire training process.

koy...@googlemail.com

Jun 23, 2016, 5:56:03 AM
to tesseract-ocr
I am afraid your first suggestion is not an option here, since the § is recognized as a "5", and replacing every 5 would make things even worse. So it seems I will have to accept the idea of a full training.

Thank you for answering my post

Nikolai Krot

Nov 7, 2017, 2:44:02 PM
to tesseract-ocr
Hi,

Did you manage to solve the issue with the section sign (§)? I am also working in the legal domain, and this issue bothers me. The best solution so far has been to always use the combination deu+eng. But I really want to learn how to extend the tesseract traineddata.

Best regards,
Nikolai
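
For reference, the deu+eng workaround mentioned above amounts to passing both languages to the `-l` option of the tesseract command line, joined with a plus sign. A minimal sketch of building that call follows. The function name and the file names `scan.png` / `scan` are hypothetical placeholders, and both deu.traineddata and eng.traineddata are assumed to be installed.

```python
# Sketch only: ocr_command and the file names are illustrative assumptions.

def ocr_command(image_path, output_base, langs=("deu", "eng")):
    """Build a tesseract call that loads several languages at once.

    Languages are joined with '+', e.g. 'deu+eng'; the text is written
    to <output_base>.txt when the command is run.
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run(..., check=True)
    # on a machine where tesseract is installed to actually run it.
    print(" ".join(ocr_command("scan.png", "scan")))
```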