Easily readable Russian not recognized in language app screenshot


d-ka

Oct 7, 2020, 8:31:26 AM
to tesseract-ocr

I’d like to process Duolingo screenshots with Tesseract, in order to have exercises worth reiterating in a searchable form (i.e. a text file). However, it just yields gibberish:

> tesseract.exe img.jpg img.jpg -l rus+eng --tessdata-dir "\tessdata"

[Attached screenshot: FXjEk.png]

Э 20:22
51МАВО\М/
Тгапз(а{е {15 5еп{епсе
Апу диес00п5
Уоч аге согтес& |"
СОМТИМЧЕ
Ч 4

  • For my inherent neural network, it’s easy to resolve: clear contrasts, easy font, no scanning artifacts.
  • It doesn’t read the actual Russian part at all (Вопросы есть?), yet I don’t find the font weight too light or thin.
  • No luck with greyscale, increased contrast, or variations of rus+eng.
  • I assume that it’s implicitly UTF-8 and that I already have appropriate trained data.
  • What could help Tesseract to properly parse this seemingly easy imagery?
Thanks so much!

Shree Devi Kumar

Oct 8, 2020, 1:08:28 AM
to tesseract-ocr
Pass each region of interest to Tesseract separately.
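
A minimal sketch of what processing regions separately could look like, not a tested recipe for this screenshot. It assumes Pillow is installed for cropping and that `tesseract` is on the PATH. The box coordinates and the file name `img.png` are hypothetical placeholders, and `--psm 7` (treat the image as a single text line) is an added assumption that was not suggested in the thread.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_cmd(image, langs="rus+eng", tessdata_dir=None, psm=None):
    """Build a tesseract command line that prints recognized text to stdout."""
    cmd = ["tesseract", str(image), "stdout", "-l", langs]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def crop_regions(image, regions, out_dir):
    """Crop each (name, (left, upper, right, lower)) box into its own PNG."""
    from PIL import Image  # Pillow; imported lazily so tesseract_cmd works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with Image.open(image) as img:
        for name, box in regions:
            path = out_dir / f"{name}.png"
            img.crop(box).save(path)
            paths.append(path)
    return paths


# Hypothetical pixel boxes; measure the real ones on an actual screenshot.
REGIONS = [
    ("prompt", (0, 300, 1080, 500)),
    ("answer", (0, 500, 1080, 700)),
]

# Only runs when tesseract and the (hypothetical) input file are both present.
if __name__ == "__main__" and shutil.which("tesseract") and Path("img.png").exists():
    for path in crop_regions("img.png", REGIONS, "crops"):
        result = subprocess.run(
            tesseract_cmd(path, psm=7),
            capture_output=True, text=True, encoding="utf-8",
        )
        print(path.name, "->", result.stdout.strip())
```

If the app uses only a handful of screen layouts, one fixed set of boxes per layout may be enough, which keeps the extra logic small.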


d-ka

Nov 2, 2020, 11:45:41 AM
to tesseract-ocr
Well, that would require a lot of additional logic, because the app’s layouts would need quite varied segmentation.

The main question is why Tesseract has such severe trouble with clean, noise-free PNGs of clear Russian text, and what could be done about it.

d-ka

Jan 7, 2021, 3:14:40 PM
to tesseract-ocr
I still fail to understand why Tesseract performs so poorly. Isn’t it made for OCR in screenshots? Doesn’t it understand Russian at all?

Shree Devi Kumar

Jan 7, 2021, 11:21:44 PM
to tesseract-ocr
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract rus.png - -l rus+eng --tessdata-dir ~/tessdata_best


D 20:22 Э
5IN AROW 5IN AROW 5IN AROW
Translate this sentence Translate this sentence Translate this sentence
(0) Вопросы есть? (0) Вопросы есть? Вопросы есть?
Апу questions Any questions Any questions
15 15 IS
You are correct |" You are correct |" You are correct |"

CONTINUE [elo] Nay 1]V] CONTINUE

A A 4
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract rus.png - -l rus+eng --tessdata-dir ~/tessdata_fast


О 20:22 9)
5 1МАВОМ/ 5INAROW 5INAROW
Translate this sentence Translate this sentence Translate this sentence
о Вопросы есть? о Вопросы есть? Вопросы есть?
Апу questions Any questions Any questions
15 15 15
You are correct [мы You are correct [мы You are correct [мы

CONTINUE CONTINUE CONTINUE

” a 7 в
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract rus.png - -l rus+eng --tessdata-dir ~/tessdata


D 20:22 °)
5IN AROW 5IN AROW 5IN AROW
Translate this sentence Translate this sentence Translate this sentence
Ф Вопросы есть? Ф Вопросы есть? Вопросы есть?
Апу questions Any questions Any questions
15 15 IS
You are correct |- You are correct |- You аге correct |-

CONTINUE СОМПИМЧЕ СОМТПИЧЧЕ

о ) 4
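
The three runs above can be scripted so the model sets are compared in one pass. This is a rough sketch: the directory names and `rus.png` are taken from the commands above, and everything else (the function names, running through `subprocess`) is illustrative and assumes `tesseract` is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path

# Directory names as used in the commands above; adjust to where the models live.
MODEL_DIRS = ["~/tessdata_best", "~/tessdata_fast", "~/tessdata"]


def build_commands(image, model_dirs=MODEL_DIRS, langs="rus+eng"):
    """Map each model directory to the tesseract command that uses it."""
    return {
        d: ["tesseract", str(image), "-", "-l", langs,
            "--tessdata-dir", str(Path(d).expanduser())]
        for d in model_dirs
    }


def run_all(image):
    """Run tesseract once per model directory and collect the text output."""
    results = {}
    for name, cmd in build_commands(image).items():
        proc = subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8")
        results[name] = proc.stdout
    return results


# Only runs when tesseract and the input image are both present.
if __name__ == "__main__" and shutil.which("tesseract") and Path("rus.png").exists():
    for name, text in run_all("rus.png").items():
        print(f"=== {name} ===")
        print(text)
```

Diffing the collected outputs makes it easier to see which model set handles the Latin-script parts (such as "CONTINUE") and the Russian parts best.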
