Ciao,
Thanks for sharing!
I have the same problem with script/Fraktur.traineddata, which is far better than the plain "frk.traineddata". However, I found that both the word list and the unicharset contain all the European accented characters (French, Italian and Spanish: âêîôû, æ, œ, àèìòù, áéíóúñ, ¡ ¿, and their capital forms) as well as other useless characters (€ Þ), none of which exist in old German.
Could it be that, for Tesseract, "Fraktur" is not only meant for the German language?
I solved my problem with ">" and "<" by modifying the unicharset file and replacing these characters, in the first column only, with "ck" and "ch" (I also tried modifying the 2 fields after the # ["# ck [63 6b"], but it made no difference).
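
In case it helps someone, here is a minimal sketch of that first-column replacement in Python. It assumes the unicharset has already been extracted from the traineddata as a text file, that the first line holds only the entry count, and that each following line starts with the character followed by a space. The function name, the sample lines (simplified placeholders, not real entries) and the pairing of ">" with "ck" and "<" with "ch" are my own assumptions for illustration.

```python
def remap_unichars(lines, mapping):
    """Replace the first space-separated field of each unicharset line.

    `lines` are the lines of an extracted unicharset (without newlines);
    `mapping` maps an unwanted first-column string to its replacement.
    """
    out = []
    for i, line in enumerate(lines):
        # Assumption: the first line is only the entry count, so keep it.
        if i == 0:
            out.append(line)
            continue
        # Split off the first column; everything after it stays untouched.
        head, sep, rest = line.partition(" ")
        out.append(mapping.get(head, head) + sep + rest)
    return out


# Simplified placeholder lines, NOT real unicharset entries.
sample = [
    "4",
    "NULL 0 Common 0",
    "> 10 Common 1",
    "< 10 Common 2",
    "a 3 Latin 3",
]

# Assumed pairing: ">" becomes "ck", "<" becomes "ch".
remapped = remap_unichars(sample, {">": "ck", "<": "ch"})
for line in remapped:
    print(line)
```

Only the first column is changed, so the entry count and the remaining fields of every line are written back exactly as they were read.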
I tried the same modification on "ô" and "ó" to get "o", but it doesn't work, even with a modified word list from which I removed all words containing these letters.
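
For reference, the word list cleanup can be sketched roughly as below. This only reproduces the filtering step (it did not change the OCR result for me), and it assumes the word list is plain text with one word per line; the function name and the sample words are hypothetical.

```python
import unicodedata


def drop_words_with(words, unwanted):
    """Return the words that contain none of the characters in `unwanted`."""
    # Normalise to NFC so "o" + combining accent matches the precomposed form.
    banned = set(unicodedata.normalize("NFC", unwanted))
    kept = []
    for word in words:
        normalised = unicodedata.normalize("NFC", word)
        if not (set(normalised) & banned):
            kept.append(word)
    return kept


# Hypothetical sample entries, standing in for lines of the real word list.
words = ["Brot", "côte", "adiós", "Haus"]
kept = drop_words_with(words, "ôó")
print(kept)
```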
I also noticed that the word list seems to have absolutely no effect: replacing the "best" list with the "lstm" list doesn't change the result at all.
Best regards,
Isidore.