Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia typeface)

Riel Gallant

Jun 23, 2015, 12:40:49 PM
to tesser...@googlegroups.com
Hello everyone. Greetings from Nunavut, Canada.

I'm fairly new to the technical side of OCR and Tesseract in general, so my apologies in advance.

I've been OCRing quite a bit using Adobe Acrobat. It works quite well for English, but offers no support at all for written Inuktitut. Inuktitut is native to the northeastern part of Canada and is written in a non-Roman script called "syllabics", which was introduced by missionaries in the 1800s and is still used today. Some Cree dialects also use syllabics. Here's a link to the Unified Canadian Aboriginal Syllabics Official Unicode Consortium code chart (PDF) - Wikipedia link.

Since Windows Vista, every Windows OS comes prepackaged with a font named Euphemia, which is a Unicode font that supports syllabics. When you activate the Inuktitut keyboard and hit the caps lock, you can type syllabics. Apple also supports Euphemia, and a recent app gives users an Inuktitut keyboard. Android does not support it yet. There are also many pre-Unicode typefaces that look slightly different from Euphemia syllabics, which I realize may be an issue.

I've been able to manually fix OCR errors in Adobe Acrobat under Text Recognition -> Find All Suspects -> changing the font to Euphemia -> manually typing the correct text in the red box (see attached image for instructions). Though this was a step forward, we're looking for a batch production OCR solution. OCRing Inuktitut using Acrobat gives us results like this:



Neither Adobe nor ABBYY has responded to our requests to have Inuktitut added as a language in their text recognition features.

Is there something we can try with Tesseract? I downloaded it but haven't made much progress. We'd love to be able to search our older scanned PDFs using syllabics and eventually put our historic documents on our website, which would then come up in Google search results. Any help would be greatly appreciated. I've attached a jpg of sample text from the Nunavut Land Claims Agreement (table of contents for Article 26) if anyone needs some content for testing.

ᓇᑯᕐᒦᒃ / Thank you!



https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29

[Table: Unified Canadian Aboriginal Syllabics code chart, rows U+140x to U+167x, columns 0 to F. Source: Official Unicode Consortium code chart (PDF).]
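
As a rough illustration of how the code chart can be put to use, the sketch below measures how much of an OCR result actually falls in the Unified Canadian Aboriginal Syllabics block (U+1400 to U+167F). The function names are hypothetical, and it assumes only the main block matters. A low ratio on a page known to be in syllabics would point to the kind of garbled output Acrobat produces.

```python
# Bounds of the Unified Canadian Aboriginal Syllabics block (U+1400..U+167F).
SYLLABICS_START = 0x1400
SYLLABICS_END = 0x167F


def is_syllabic(ch: str) -> bool:
    """Return True if a single character lies in the syllabics block."""
    return SYLLABICS_START <= ord(ch) <= SYLLABICS_END


def syllabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are syllabics (0.0 if none)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_syllabic(c) for c in chars) / len(chars)


if __name__ == "__main__":
    print(syllabic_ratio("ᓇᑯᕐᒦᒃ"))
    print(syllabic_ratio("Thank you!"))
```

Running this over the text copied out of a scanned PDF gives a quick way to sort a batch into pages that look correctly recognized and pages that need attention.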



manually-fix-OCR-inuktitut-adobe.jpg
test-text-Inuktitut-OCR.JPG
InuktitutOCR-Copy-Paste-Error.JPG

Art Rhyno.

Jun 23, 2015, 1:51:26 PM
to tesser...@googlegroups.com

Hi Riel,

I did some volunteer work on Inuktitut OCR for an ongoing collaboration between OurDigitalWorld.org and the Multicultural History Society of Ontario (MHSO). There is a presentation on that project here [1], but I focused only on the OCR of the scanned titles in the MHSO collection. One of these is "Inuit Today", an Inuktitut/English publication from the 1970s.

The training files I created are on GitHub [2]. I have attached the result of using the trained data set to this message, but I was relying on the English dataset for numbers, so none of the numeric characters are in the sample. Sad to say, I have no facility in the Inuktitut language, and I was dealing with one publication and one font, so I was out of my depth for much of this, but it might give you a starting point. I would be happy to walk you through the process I went through for the dataset. The ability to add your own fonts is an area where Tesseract shines, though it's sad that the companies you approached didn't step forward to add it to the commercial options, since it is a major language in Canada.

art

---

1. http://www.accessola2.com/superconference2014/sessions/329.pdf

2. https://github.com/OurDigitalWorld/odw-font-training

 

test.txt

Tom Morris

Jun 24, 2015, 3:08:09 PM
to tesser...@googlegroups.com
That's cool that there's already a starting point for the IKU language training.  

To help you understand the various files in Art's repo and the process used to create them, here's the wiki page which describes the training process: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

Tom

Tom Morris

Jun 25, 2015, 12:22:32 AM
to tesser...@googlegroups.com
In addition to Art's training data, you might also want to test the IKU language data for Tesseract 3.04 that Google released a few hours ago:


It was generated from the source language data here:


and I think this is the script data:


The fact that this is in the standard Google implementation now may also mean that you can (or soon will be able to) get IKU OCR search results for books in Google Books.  That might be worth testing at some point.
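
For anyone who wants to try the IKU data from a script, here is a minimal sketch. It assumes the `tesseract` executable is on the PATH and that a traineddata file for the language code `iku` is installed in its tessdata directory. The function names are placeholders of mine, not part of Tesseract.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str = "iku") -> List[str]:
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path: str, output_base: str, lang: str = "iku") -> str:
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling `run_ocr("test-text-Inuktitut-OCR.JPG", "out")` would write `out.txt` in the working directory. Passing a different `lang` value makes it easy to compare two sets of training data on the same page.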

Tom

Riel G

Jun 25, 2015, 10:54:55 AM
to tesser...@googlegroups.com
Thanks to both of you.

I've made some progress on this front and will update you all shortly. Art helped us with picking a box editor so we're currently correcting some non-Unicode fonts, like ProSyl, OldSyl, etc.
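
In case it helps others doing the same correction work, below is a small sketch of reading a Tesseract 3.x box file, where each line holds a character followed by its left, bottom, right, and top coordinates and a page number. The helper names are hypothetical. The check simply flags entries whose text falls outside the Unicode syllabics block (U+1400 to U+167F), which may be a useful first pass before opening the file in a box editor.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<text> <left> <bottom> <right> <top> <page>'."""
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"malformed box line: {line!r}")
    text, *numbers = parts
    return Box(text, *(int(n) for n in numbers))


def non_syllabic_boxes(lines: List[str]) -> List[Box]:
    """Return boxes containing any character outside U+1400..U+167F."""
    boxes = [parse_box_line(l) for l in lines if l.strip()]
    return [b for b in boxes if any(not 0x1400 <= ord(c) <= 0x167F for c in b.text)]
```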

Riel G

Jul 6, 2015, 2:17:54 PM
to tesser...@googlegroups.com
Thanks to Art Rhyno, we've successfully OCRed documents from our collection. He created a few training files for us and they work great. Will post updates in the future. Message us if you have questions.