Make russian_with_accent traineddata file


Romain B. (Le Belge)

Feb 5, 2024, 3:53:28 AM
to tesseract-ocr
Hi,

I saw that Tesseract mistakenly turns Russian vowels with accents (ò, à, ...), which are mostly used for educational purposes, into other Russian letters. I also saw that someone with the same problem had created training data (if I understood correctly) for Russian with accents.

The problem is that I cannot find a way to turn it into a traineddata file so that I can test it and later use it in my code. I found the tesstrain Git repository, but could not make it work with the data I found.

I honestly don't know whether I am missing something, misunderstanding something, or whether training is simply no longer done with these types of files.

If you have any clue, that would help me a lot.

Thank you!

Zdenko Podobny

Feb 6, 2024, 12:51:01 PM
to tesser...@googlegroups.com
You are referring to an old issue...
Either provide steps to reproduce your problem (including an input image), or you will have to solve it yourself.

Zdenko


On Mon, Feb 5, 2024 at 9:53, Romain B. (Le Belge) <romainbar...@gmail.com> wrote:

Romain B. (Le Belge)

Feb 9, 2024, 6:03:02 AM
to tesseract-ocr
Here is all the information needed to reproduce my problem:

Here is an image from my Russian textbook (French edition):
testTes2.png
If you run Tesseract on it with the Russian and French languages, using this command: tesseract testTes2.png stdout -l rus+fra
you will get this result:
Capture d’écran du 2024-02-09 11-47-14.png

As you can see, Tesseract is not used to Russian vowels carrying accents (which, again, are only used for educational purposes) and interprets ó as б, é as ё, and so on.
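Since the goal is to call this from code later, here is a minimal sketch of the same invocation scripted in Python. It assumes the tesseract binary is on the PATH and that the rus and fra models are installed; the function names are hypothetical and only wrap the command quoted above.

```python
import subprocess


def build_tesseract_command(image_path, languages):
    """Build the argument list for: tesseract <image> stdout -l lang1+lang2."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_tesseract(image_path, languages):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract executable is on the PATH; raises
    subprocess.CalledProcessError if it exits with an error.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Same call as the command line above.
    print(run_tesseract("testTes2.png", ["rus", "fra"]))
```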

I'm trying to fix this issue. From what I have read, I think I need to retrain the Russian language in Tesseract so that it supports accents.
I found this folder in langdata, but can't find a way to use it to retrain the Russian language.

How can I use the rus_accent folder and its files to easily retrain the Russian language?

I hope my explanation was clear enough. (Sorry if I made grammatical or other English mistakes; English is not my native language.)

Tom Morris

Feb 9, 2024, 3:21:49 PM
to tesseract-ocr
Salut Romain,

On Friday, February 9, 2024 at 6:03:02 AM UTC-5 Romain B. (Le Belge) wrote:

I'm trying to fix this issue. From what I have read, I think I need to retrain the Russian language in Tesseract so that it supports accents.
I found this folder in langdata, but can't find a way to use it to retrain the Russian language.

How can I use the rus_accent folder and its files to easily retrain the Russian language?

Looking at the history [1] for that folder makes me think that it was an incomplete work in progress, and it is also for the previous OCR engine. You want to look at langdata_lstm/rus [2] for your training text and then use the fine-tuning directions [3] with the rus model from tessdata_best/rus.traineddata [4]. This would involve going through the training text, adding accents to some proportion of the vowels, and then rerunning the training. For example, there are 10 occurrences of the string балкон, and you could change some or all of them to have your accent mark (I don't know if there's a standard convention for encoding them).
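The "add accents to some proportion of the vowels" step could be scripted roughly as follows. This is only a sketch under stated assumptions: it encodes the accent as the combining acute accent U+0301 placed after the vowel (an assumption, given the open question about encoding conventions), it picks the accented vowel at random rather than by real word stress, and the function name and the proportion and seed parameters are hypothetical.

```python
import random
import re

VOWELS = "аеёиоуыэюя"        # the 10 Russian vowels
COMBINING_ACUTE = "\u0301"   # assumption: accent encoded as a combining mark


def accent_words(text, proportion=0.3, seed=0):
    """Add one accent mark to roughly `proportion` of the Cyrillic words.

    The accented vowel is chosen at random, so the result is NOT
    linguistically correct stress; it only exposes the model to the glyphs.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible

    def accent_one(match):
        word = match.group(0)
        positions = [i for i, ch in enumerate(word) if ch.lower() in VOWELS]
        if not positions or rng.random() >= proportion:
            return word  # no vowel, or word not selected this time
        i = rng.choice(positions)
        return word[: i + 1] + COMBINING_ACUTE + word[i + 1 :]

    # Match runs of Cyrillic letters; everything else is left untouched.
    return re.sub(r"[а-яёА-ЯЁ]+", accent_one, text)


if __name__ == "__main__":
    print(accent_words("балкон в доме", proportion=1.0))
```

The input here would be the training text from langdata_lstm/rus [2], read and written as UTF-8. If real stress positions matter for your use case, the random choice would need to be replaced by a stress dictionary, which this sketch does not attempt.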

As a caveat, I don't know if adding accented variants of all 10 vowels would be considered "a few characters" for the purposes of the fine-tuning instructions.

Good luck!

Tom
