use of unicharambigs

Isidore Paris

Mar 13, 2023, 12:13:33 PM
to tesseract-ocr
Hi,
I'm doing some frk text recognition, and my results contain a great number of " > ". Each one should be replaced by " ck ".
I updated my frk.traineddata file (from the tessdata_best repository) with a frk.unicharambigs file (I tried both the v1 and v2 formats), but absolutely nothing changed.
I also tried the parameter "-c use_ambigs_for_adaption=1" in case it was needed, but still nothing changed, not a single character (">", "=" and "/" are all still there).

Here is the content of my v2 frk.unicharambigs file:
v2
> ck 1
= - 1
/ - 1

Does unicharambigs not work with LSTM files? Or did I miss some special step?

Andrea Rossato

Mar 20, 2023, 2:53:01 PM
to tesseract-ocr
Hi,

No, unicharambigs is not used by LSTM files. It was used in legacy mode.
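
Since the LSTM engine ignores unicharambigs, one workaround is to apply the same substitutions to the recognized text afterwards. The following is only a minimal sketch: the table is copied from the unicharambigs file quoted in the question, and the function name is made up for illustration. It assumes that ">", "=" and "/" never legitimately occur in the text, because a blind replacement would rewrite those too.

```python
# Substitution table copied from the unicharambigs file quoted in the question.
SUBSTITUTIONS = {">": "ck", "=": "-", "/": "-"}


def apply_substitutions(text, table=SUBSTITUTIONS):
    """Replace every occurrence of each key in `table` by its value.

    Assumption: none of the keys is a legitimate character in the text.
    """
    for wrong, right in table.items():
        text = text.replace(wrong, right)
    return text


print(apply_substitutions("Glü>"))
```

The same function could be run over the plain-text output of Tesseract before any further processing.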

I'm having similar problems with the ancient greek best traineddata: unfortunately it has been trained with some non standard characters (ά έ ή ί ό ύ ώ, instead of  ά έ ή ί ό ύ ώ). I tried fine tuning the grc.traineddata, but without very much success, so, for the time being, I'm producing hocr files, post-process them and then use hocr-pdf to create a searchable pdf.
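
A minimal sketch of such a post-processing step, under an assumption that the message does not confirm: that the unwanted letters are the Greek Extended "oxia" code points (U+1F71 and its siblings) and the wanted ones are the "tonos" forms (U+03AC and its siblings). In that case Unicode NFC normalization, available in Python's standard unicodedata module, folds the former into the latter. For the opposite direction an explicit translation table would be needed instead. The function names and file names are hypothetical.

```python
import unicodedata
from pathlib import Path


def normalize_text(text):
    """Return `text` in NFC form.

    NFC maps, for example, U+1F71 (alpha with oxia) to U+03AC
    (alpha with tonos). ASCII markup is left unchanged.
    """
    return unicodedata.normalize("NFC", text)


def normalize_hocr_file(src, dst):
    """Read an hOCR file, normalize its content, and write it to `dst`."""
    html = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(normalize_text(html), encoding="utf-8")


# Hypothetical usage, before handing the file to hocr-pdf:
# normalize_hocr_file("page.hocr", "page.fixed.hocr")
```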


best,
andrea

Isidore Paris

Mar 26, 2023, 5:35:06 PM
to tesseract-ocr
Ciao,

Thanks for sharing!
I have the same problem with script/Fraktur.traineddata, which is far better than the plain frk.traineddata: I found that its word list and its unicharset contain all the European accented characters (French, Italian and Spanish: âêîôû, æ, œ, àèìòù, áéíóúñ, ¡ ¿, and their capitals) as well as other useless characters (€, Þ), all of which are absolutely unknown in old German.
Could it be that, for Tesseract, "Fraktur" is not only for the German language?

I solved my problem with ">" and "<" by modifying the unicharset file: in the first column only, I replaced these characters with "ck" and "ch" (I also tried modifying the two fields after the "#", e.g. "# ck [63 6b", but it made no difference).
I tried the same modification on "ô" and "ó" to get "o", but it doesn't work, even with a modified word list from which I removed all words containing these letters.
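
The first-column edit described above can be sketched as follows. This is a rough illustration, not a tested recipe: it assumes the unicharset is a text file whose first line is an entry count and whose remaining lines start with the character followed by a space. The function name is made up and the sample lines are placeholders, not copied from a real file. Unpacking the unicharset from the traineddata and repacking it afterwards is left out.

```python
def rename_unichars(lines, renames):
    """Return a copy of `lines` with the first field renamed per `renames`.

    The first line (assumed to be the entry count) is kept as-is.
    Everything after the first space on each line is left untouched.
    """
    out = []
    for index, line in enumerate(lines):
        if index == 0:
            out.append(line)
            continue
        first, sep, rest = line.partition(" ")
        out.append(renames.get(first, first) + sep + rest)
    return out


# Placeholder lines; PROPS stands in for the real property fields.
sample = ["3", "> PROPS", "< PROPS", "a PROPS"]
for line in rename_unichars(sample, {">": "ck", "<": "ch"}):
    print(line)
```

Only exact matches on the first field are renamed, so a ">" appearing in the later fields (such as those after the "#") is not affected.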

I also noticed that the word list seems to have absolutely no effect: changing the list (replacing the "best" list with the "lstm" list) changes nothing in the result…

Best regards,
Isidore.