Tesseract versus template-based OCR for multilingual text

Ben Crowell

Jun 26, 2021, 11:43:15 AM
to tesseract-ocr
Back in May, I posted on this group about my attempts to use tesseract to OCR a book consisting of a mixture of English and ancient Greek. The thread was titled "Diagnosing and fixing poor precision on mixed Greek-English text." I fiddled with it off and on for about 5 days, posting about the things I'd tried and getting some helpful suggestions from Merlijn Wajer, who was very generous with his time and effort. In that thread I documented what I tried, but ultimately I was unsuccessful in getting usable results. Although there were various problems, the most serious was that tesseract was not very good at detecting which words were English and which were Greek.

Since I'm highly motivated to OCR this book as part of a retirement/hobby project, I decided to code up my own OCR software using an approach that is completely different from tesseract's, one that I thought might be intrinsically better suited to this type of specialized problem. After about 6 weeks of work, I have some encouraging results from what I would characterize as a pre-alpha version of the software, which is open source: https://github.com/bcrowell/dorcas

Here is a sample comparing the performance of tesseract with my software, dorcas. The page scan for this test, from an old public-domain book by Giles, is a bilingual passage from Homer, describing a banquet. I'm working with the image at fairly high resolution, but I'll post it here for reference at half that resolution to make it not too obnoxious to look at in a browser:

[Image: a.png, the page scan at half resolution]

The Greek portion of the text is in a somewhat weird system of accentuation, in which breathing marks are included but most accents are omitted. In the previous thread, I described some testing I did to see if this was an issue for tesseract, and I think it was at most a minor issue. Here is tesseract's result from this text, using the best training and setup that I was able to put together in May:

==============================================
tesseract output
==============================================
appearance ξεινῳ Mevrn to the stranger Mentes,
nyntopt leader Ταφιων of the Taphians. Evpe
& apa and she found then μνηστηρας aynvopas
the haughty suitors: οὗ μὲν they evera then
ereptrov were delighting θυμον their mind πεσ-
σοις with draughts mporrapowBer i m front θυραων
of the doors, 7 ἡμενοι sitting ev pivowow on skins
Bowv of bulls ovs which avto. ‘themselves
extavov slew. Knpvuxes de but heralds καὶ and
οτρήροι θεράποντες busy attendants αὐτοισιν
upon them, οἱ μὲν apa some indeed εμίσγον
Were mixing owvov wine Kat vdwp and water eve
κρητήρσιν in bowls, οἱ de but others avve again
vigov were wiping τραπεζας the tables σπογγοίσι
πολυτρητοίισι with their porous sponges, cas
and προτιθεντοὸ were putting them forwards,
καὶ and δατευντο were dealing forth πολλα
Kpéa many meats.
==============================================

Here are the results from this very early stage in the development of my own software:

==============================================
dorcas output
==============================================
appearanco ξεινω Μεντη to thoetranger Mentea
ἡγητορι leader Υαφιων of thu Taphianz Εὑρε
δ αρα und shu found then μνηστηρας αγηνορας
tha haughty suitors οἱ μευ they επειτα then
ετερπον were delighting θυμον their mind πεσ
σοις with draughts προπαροιθεν in front θυραων
of thu doors ἡμενοι sitting ευῥινοισιν un skinz
βοων of bulla οὑς whieb αυτοι themselves
εκτανον slew Κηρυκες δε but heralds και and
οτοηροι θεραποντες busy attendants αυτοισιν
upou them οἱ μευ ορα somo indeed εμισγον
wero mixing ρινον wins και ὑδωο und water ενι
κρητηρσιν in bowls ο δε but others αυτε again
νιζον were wiping τοαπεζας thotablezo πογγοισι
πολυτοητοισι with their poronz sponges και
and προτιθεντο wero putting them forwards
και and δατευντο wero dealing forth πολλα
κρεα many meats
==============================================

I'm sure I'm biased, but it seems to me that dorcas's output, while poor at this point, is fairly intelligible in both the English and the Greek portions, while tesseract's is fairly intelligible in the English but totally unintelligible for the Greek. I realize that most people here probably don't know any Greek, but I think it's easy to see that tesseract is simply failing a lot of the time to recognize the Greek as Greek, and instead renders it into meaningless sequences of Latin letters. Dorcas distinguishes the languages from one another with very high reliability. Dorcas's OCR of the Greek is actually a lot more accurate than what it does on the English right now.
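One rough way to put a number on the "Greek recognized as Greek" claim is to tag each output word by the Unicode script of its letters. The following is a minimal sketch, not part of dorcas or tesseract; the function names `script_of` and `count_scripts` are hypothetical, and the two sample strings are copied from the corresponding line of each output above (minus the hyphenated fragment at the end of the line).

```python
import unicodedata
from collections import Counter

def script_of(word):
    """Classify a word as 'greek', 'latin', 'mixed', or 'other'
    by looking at the Unicode names of its alphabetic characters."""
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue  # skip punctuation, digits, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK"):
            scripts.add("greek")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "other"
    if len(scripts) == 1:
        return next(iter(scripts))
    return "mixed"

def count_scripts(line):
    """Count how many whitespace-separated words fall into each script class."""
    return Counter(script_of(w) for w in line.split())

# Same line of the page as rendered by each program (hypothetical variable names).
tesseract_line = "ereptrov were delighting θυμον their mind"
dorcas_line = "ετερπον were delighting θυμον their mind"

print("tesseract:", dict(count_scripts(tesseract_line)))
print("dorcas:   ", dict(count_scripts(dorcas_line)))
```

On this line the page has two Greek words; a count of Greek-script words below that suggests a Greek word was rendered in Latin letters. This only measures script detection, not whether the letters themselves are right.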

Dorcas is certainly not perfect, especially at this early stage. In this book's Latin font, it has a tendency to confuse certain letters, such as s with z and u with a, e, and n. However, I'm optimistic that this can be improved quite a bit with more work. Going into this project, I was pretty naive about the science involved in machine vision and template matching, and I didn't even know many of the relevant search terms.
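The letter confusions can be tallied mechanically if a hand-corrected reference is available. Here is a minimal sketch using Python's standard `difflib`; the function name `confusions` is hypothetical, and the reference spellings in `pairs` are my own reading of a few words from the dorcas output above, not a ground truth supplied with the book.

```python
from collections import Counter
from difflib import SequenceMatcher

def confusions(pairs):
    """Count (reference_char, ocr_char) substitutions over (reference, ocr) word pairs.
    Only equal-length 'replace' spans are counted, so insertions and deletions
    are ignored rather than guessed at."""
    counts = Counter()
    for ref, ocr in pairs:
        matcher = SequenceMatcher(None, ref, ocr, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                for r, o in zip(ref[i1:i2], ocr[j1:j2]):
                    counts[(r, o)] += 1
    return counts

# (assumed correct word, word as it appears in the dorcas output)
pairs = [("the", "thu"), ("skins", "skinz"), ("upon", "upou"), ("and", "und")]

for (ref_ch, ocr_ch), n in confusions(pairs).most_common():
    print(f"{ref_ch} read as {ocr_ch}: {n}")
```

Run over a whole corrected page, a table like this would show which template pairs most need better discrimination.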

Since the github page for dorcas describes the technological approach I'm using, I don't think there's much point in recapitulating that here. Anyone who finds it of interest can take a look. In my documentation, I've tried to give a comparison of the intrinsic pros and cons of the two methods. For example, dorcas's method is likely always to be much slower than tesseract's, and it needs training on the font being used in a particular document; unlike tesseract, it can't give reasonable results on a font that it's never seen before. For these reasons, it will never be a turn-key tool like tesseract, but I'm optimistic that with more development work, I can make it into a usable niche tool for certain people, e.g., a history grad student who has thousands of pages of scans of bilingual text and needs a reasonably accurate way to do searching.

I would be grateful to hear any comments from people who have a better understanding than I do of the science behind OCR. Particularly valuable would be responses like, "Oh, have you tried...," "We tried X in tesseract but Y worked better," or "There's a paper by Smith and Zhang on ..."

It could also be fun if anyone wanted to point me to any bilingual documents that they think are particularly interesting and that I could play with. I anticipated that many use cases would be Christian or Jewish religious texts, so I tried to be careful to design the software so that it could handle RTL and would not make assumptions that would be violated by Hebrew. It would be fun to play with an English-Hebrew, Greek-Hebrew, or English-Greek-Hebrew multilingual text, although I don't read Hebrew at all.