Diagnosing and fixing poor precision on mixed Greek-English text

319 views
Skip to first unread message

Ben Crowell

unread,
May 10, 2021, 1:14:56 AM5/10/21
to tesseract-ocr
I'm trying to OCR a book that is written in interspersed Greek and English:


Here is a sample of text from the first page:

a.jpg

I'm running tesseract 4.1.1 on linux, with the tesseract-ocr-grc package installed. Here's the command I'm using to OCR this sample:

tesseract a.jpg temp -l eng+grc

Here is the result:

1. Evverre declare wot to me, Movca Muse,
ανδρα the man voAvtpotrov of many fortunes,
os who wAayyx@n wandered μαλα πολλα very
much, eves when ewepoev he had destroyed
i d city T { Troy:
lepov troAscOpor the sacred city Tons of Troy :
we Se and saw aorea towns «at and eyvo
learnt vooy the mood roAAwy avOpwror of

Basically it almost never recognizes Greek as Greek, and instead tries to read it as English 95% of the time. Here is what I get if I just tell tesseract to treat it as Greek:

1. ἔννεπε ἀδοίατο μοι ἴο 1π0, ἴἥουσα δίαΞο,
ανδρα {11 τπᾶπι πολύτροπον οἱ ΤΩΔ}Υ ἰοτέιπο5,
ὃς ψ|ὸ πλαγχθὴ παπάθιοα μαλα πολλα νετῦ
τω ποἢ, ἐπὲῦ ΠῸπ ἐπέρσεν ᾿ἰ6 πα ἀεβίτογοά
; ἀο Τ {Ττου:
ἱερον πτολίεθρον [116 ΞΔογβα οἷἵγ Τροιης οἷ ΤτοΥ :
ἰδὲ δε ἀμ 5 αστεῶ ἰο 8 καὶ ἃπά εγνω
Ἰρατηῦ νοὸν {πῸ Ἰηοοὰ πολλων ανθρωπὼν οἵ

This seems odd to me. Although it still makes some errors, such as reading Μουσα as  ἴἥουσα on the first line, it now gets the common word ἱερον (holy) correct, whereas in the original attempt, it rendered it as lepov, which is not a word in either language. If it's capable of correctly interpreting ἱερον, which is presumably in its dictionary, then I don't understand why, when I use eng+grc, it doesn't get it right.

I tried cropping this sample so it was only the single word:

aa.jpg
When I read this using -l eng+grc, it gets it right. So it seems as though it's perfectly capable of both recognizing this word as Greek and properly OCRing it, but somehow it's reluctant to do so when some of the surrounding text is in English.

So in summary, although there are some errors that may have to do with image quality or not being trained on this font, there is also some other kind of problem where tesseract doesn't like to "switch gears" from one language to the other.

Can anyone help with diagnosing and/or fixing this problem?

Could the issue have anything to do with the fact that the Latin letters are upright, while the Greek ones are in a slanted/italic font? Does the neural network have a preference for English because the English corpus it was trained on was so huge compared to the Greek one?

Merlijn B.W. Wajer

unread,
May 10, 2021, 6:39:09 AM5/10/21
to tesser...@googlegroups.com
Hi Ben,
I took the liberty to re-run OCR for that item using the Archive.org
Tesseract stack (and also provide Greek as a language), and this is the
result of the quoted paragraph - it's not perfect, but better than what
you are seeing I think):

> i SOMERS ODYSSEY.
>
>
> BOOK I.
>
>
> 1. Έννεπε declare µοι to me, Movca Muse,
> avdpa the man πολυτροπον of many fortunes,
> os who πλαγχθη wandered pada πολλα very
> much, eves when επερσεν he had destroyed
> ἱερον πτολιεθρον the sacred city Τροιης of Troy :
> we δε and saw αστεα towns και and εγνω
> learnt vooy the mood πολλων ανθρωπων of
> many men, πολλα δε αλγεα but many sorrows
> oye he indeed παθε suffered ὁν κατα θυµον in
> his soul, apyvper'os whilst grasping ἦν τε Wyn?
> both his own life και and νοστον the return erat.
> pov of his companions. Adda but ουδε not even
> ὡς thus ερρυσατο did he save έταρους his com-
> panions, iewevos περ though bent upon it:
> ολοντο yap for they perished σφετερησιν ατασ-
> σθαλιῃσι by their own phrensies, νηπιοι fools,
> οἱ who κατα ησθιον ate up βους the oxen
> Heduovo of the Sun ὝὙπεριονος who rolls above
> Us : autap but ὁ he αφειλετο took away Tors

I wonder if the problem you were seeing was related to using Ancient
Greek (grc) as opposed to Greek (ell)? These are the parameters that
were used just now:

> ocr_parameters -l eng+ell

Hope this helps.

Cheers,
Merlijn

Ben Crowell

unread,
May 10, 2021, 9:09:02 AM5/10/21
to tesseract-ocr
Hi Merlijn,

Thanks very much for your reply. It's encouraging that you were able to get somewhat better results. However, I'm not able to reproduce them. When I use -l eng+ell, the results are still very poor:

1. Evverre declare wot to me, Movca Muse,
avopa the man voAvtpotrov of many fortunes,
ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very

much, eves when ewepoev he had destroyed
i d city T { Troy:
lepov troAscOpor the sacred city Tons of Troy :
we Se and saw aorea towns «at and eyvo
learnt vooy the mood πολλων ανθρωπων οἳ

The text uses ancient Greek vocabulary and accentuation, so it actually makes sense to use grc, not ell.

I didn't understand what you meant by "using the Archive.org Tesseract stack," but a web search on your name led me to archive-pdf-tools, which you're the author of. It's great to have help from someone who's clearly very expert. I just don't know how to diagnose what is different between your setup and mine. It looks like you did the whole first page rather than the piece I posted, so there may be a difference in how the image was prepared. I just zoomed in on the archive.org page, took a screenshot, cropped it, and changed it to grayscale. I'm running tesseract 4.1.1, which seems to be the latest official release. Are you running a version compiled from the latest source or something? My file /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata , which came from installing the debian package tesseract-ocr-grc, is dated 2017, which seems old, and is 2.2 Mb. The version at https://github.com/tesseract-ocr/tessdata is 7 Mb and looks like it was changed around 2018. I could try just replacing the file with the newer version, but I have no idea whether that's a reasonable thing to do, since I don't know anything about how the software is designed.

Ben Crowell

unread,
May 10, 2021, 9:29:56 AM5/10/21
to tesseract-ocr
I tried replacing the grc.traineddata file with the newer version, and the software still ran, but the results were identical. From the comments on git, it looks like the newer version is just optimized for speed.

Merlijn B.W. Wajer

unread,
May 10, 2021, 10:34:34 AM5/10/21
to tesser...@googlegroups.com
Hi Ben,

On 10/05/2021 15:09, Ben Crowell wrote:
> Hi Merlijn,
>
> Thanks very much for your reply. It's encouraging that you were able to get
> somewhat better results. However, I'm not able to reproduce them. When I
> use -l eng+ell, the results are still very poor:
>
> 1. Evverre declare wot to me, Movca Muse,
> avopa the man voAvtpotrov of many fortunes,
> ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very
> much, eves when ewepoev he had destroyed
> i d city T { Troy:
> lepov troAscOpor the sacred city Tons of Troy :
> we Se and saw aorea towns «at and eyvo
> learnt vooy the mood πολλων ανθρωπων οἳ
>
> The text uses ancient Greek vocabulary and accentuation, so it actually
> makes sense to use grc, not ell.

Ah, my bad.

>
> I didn't understand what you meant by "using the Archive.org Tesseract
> stack," but a web search on your name led me to archive-pdf-tools, which
> you're the author of. It's great to have help from someone who's clearly
> very expert. I just don't know how to diagnose what is different between
> your setup and mine. It looks like you did the whole first page rather than
> the piece I posted, so there may be a difference in how the image was
> prepared. I just zoomed in on the archive.org page, took a screenshot,
> cropped it, and changed it to grayscale. I'm running tesseract 4.1.1, which
> seems to be the latest official release. Are you running a version compiled
> from the latest source or something? My
> file /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata , which came
> from installing the debian package tesseract-ocr-grc, is dated 2017, which
> seems old, and is 2.2 Mb. The version
> at https://github.com/tesseract-ocr/tessdata is 7 Mb and looks like it was
> changed around 2018. I could try just replacing the file with the newer
> version, but I have no idea whether that's a reasonable thing to do, since
> I don't know anything about how the software is designed.

"using the Archive.org Tesseract stack" means that archive.org will
automatically run Tesseract OCR on uploaded content and make those
results available (so you can compare with your local results). Because
this book predates the integration of Tesseract, I submitted the content
for re-OCRing, using Tesseract, in an attempt to reproduce your results.

I'm rerunning the item now with Ancient Greek "grc" as opposed to Greek
"ell".

The version that is being used is Tesseract "5.0.0-alpha-20201231" [1],
the language packs are the latest ones from Git, I believe. Maybe it
would be worth giving the latest version a shot and see if it yields
better results. There is an ubuntu ppa [2] with development
snapshots/versions. Then, if the latest version still results in
unsatisfying results, it would be worth trying to investigate why?


Hope this helps,
Cheers,
Merlijn

[1]
https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-alpha-20201231
[2] http://ppa.launchpad.net/alex-p/tesseract-ocr-devel

Ben Crowell

unread,
May 10, 2021, 6:20:47 PM5/10/21
to tesseract-ocr
I compiled tesseract from source, which gave me version 5.0.0-alpha-20210401-102-g4374, and used the latest grc.traineddata file. To get a measure of what's going on, I decided to count the number of Greek words rendered as Greek in the first 7 lines of this text, which contain 22 actual Greek words.

tesseract 4.1.1, eng+grc -- 14% correct

tesseract 5.0.0 on my machine, eng+grc -- 41% correct

tesseract 5.0.0 on my machine, eng+ell -- 68% correct

tesseract 5.0.0 on archive.org -- 55% correct

Several things are similar in your results and mine. The incorrect scanning of ἱερον when surrounded by English words no longer seems to occur in 5.0.0. The word μοι is usually rendered incorrectly, but this may be because there seems to be broken type that causes the descender on the mu to be omitted. Μουσα is read incorrectly as Movca, which is probably because this personification of the Muse isn't in the dictionary.

One thing that I hadn't noticed previously is that the accentuation in this text is weird. Although the 18th-century typesetter included the breathing marks, which aren't used in modern Greek, they left out all the acute, grave, and circumflex accents, which would usually have been included in a modern typesetting of an ancient Greek text. So it may be that the dictionary for grc is more appropriate, but the character recognition for ell is better here. I think this can be tested by typesetting the same 7 lines with and without accents.

Ben Crowell

unread,
May 10, 2021, 7:42:12 PM5/10/21
to tesseract-ocr
Here is a version of the text that I typeset using xelatex, with the font DejaVu Serif. It has all the accents, which should make it a good typographical match to the data that tesseract was trained on to make the grc file.
tex_output.png
Here is the result:

Ἔννεπε declare pot to me, Movoa Muse,

ἄνδρα the man πολύτροπον of many fortunes,
oc who πλάγχθη wandered μάλα πολλὰ very
much, ἐπεὶ when émepoe he had destroyed
ἱερὸν πτολίεθρον the sacred city Τροίης of Troy:
ἴδε δε and saw ἄστεα towns Kai and ἔγνω
learnt voov the mood πολλῶν ἀνθρώπων of

Now 73% of Greek words are recognized as Greek. So this is quite a bit better, but still fairly poor. It seems really odd to me that tesseract is not getting the moon words μοι, ὃς, and καὶ. For comparison, it would be as if tesseract was OCRing an English text and not being able to read "me," "who," and "and."

Ben Crowell

unread,
May 12, 2021, 8:34:12 PM5/12/21
to tesseract-ocr
I made some efforts to improve the performance of tesseract on this text. I made an English dictionary consisting only of words used in a collection of 7 English translations of Homer, so that the dictionary includes words like Acheloüs but doesn't include words like the (German? Dutch?) name Kai, which was being used as a reading for the common Greek word και. I made a Greek dictionary consisting only of the words that actually occur in the Odyssey, with all acute, grave, and circumflex accents removed as in the text I'm trying to scan. So for instance, this custom dictionary contains πολλων, the form in the text, not πολλῶν. I also trained tesseract a little bit on the Greek font used in this book, although I don't know if the amount of text I provided was enough.

After this specialized fine-tuning, the accuracy is still not at all acceptable. The result on the above passage looks like this:

1. Εννεπε declare wot to me, Μουσα Muse,
ανδρα the man πολυτροποιν of many fortunes,
os who πλαγχθη wandered μαλα πολλα very
much, επεν when ezepoev he had destroyed
i d city T { Troy:
ἱερον πτολιεθρον the sacred city Τροιης of Troy :
we δε and saw αστεα towns «at and eyvo
learnt vooy the mood πολλων ανιθρωπων of

Only 68% of Greek words are correctly recognized as Greek, and even of those, some are misread. Extremely common words like μοι,  ὁς, and και are not recognized, although they are mostly recognized when I OCR the text with the language set only to Greek. So as far as I can tell, tesseract just can't really do this kind of bilingual text with a non-Latin font. Of course, there could be something I'm not understanding that would improve things.

From descriptions I've read, it seems that tesseract's neural network is designed to try to scan large blocks of text at once, not just individual words. I suspect that this makes it unwilling to read Greek as Greek when it's surrounded by English. This would help to explain why it reads ὁς correctly when in Greek-only mode, but when in English+Greek mode, it reads it as os, which isn't even a word in the English dictionary I'm using.

Training it on the book's Greek font may have done as much harm as good. It gets words like Μουσα right, which it got wrong before, but it makes errors on words like πολυτροπον and ανθρωπων, spelling them as πολυτροποιν and ανιθρωπων.

Merlijn B.W. Wajer

unread,
May 14, 2021, 8:47:39 AM5/14/21
to tesser...@googlegroups.com
Hi Ben,

On 13/05/2021 02:34, Ben Crowell wrote:
>
> Only 68% of Greek words are correctly recognized as Greek, and even of
> those, some are misread. Extremely common words like μοι, ὁς, and και are
> not recognized, although they are mostly recognized when I OCR the text
> with the language set only to Greek. So as far as I can tell, tesseract
> just can't really do this kind of bilingual text with a non-Latin font. Of
> course, there could be something I'm not understanding that would improve
> things.
>
> From descriptions I've read, it seems that tesseract's neural network is
> designed to try to scan large blocks of text at once, not just individual
> words. I suspect that this makes it unwilling to read Greek as Greek when
> it's surrounded by English. This would help to explain why it reads ὁς
> correctly when in Greek-only mode, but when in English+Greek mode, it reads
> it as os, which isn't even a word in the English dictionary I'm using.
>
> Training it on the book's Greek font may have done as much harm as good. It
> gets words like Μουσα right, which it got wrong before, but it makes errors
> on words like πολυτροπον and ανθρωπων, spelling them as πολυτροποιν and
> ανιθρωπων.

One other venue you could perhaps explore is to OCR the text in each
language separately, and somehow pick the words with the highest
confidence per word. I haven't tried this and do not know how feasible
it is.

Also - I am not sure if it helps, but you might want to consider filing
a bug report on Github: https://github.com/tesseract-ocr/tesseract/issues

Cheers,
Merlijn
Reply all
Reply to author
Forward
0 new messages