ᓇᑯᕐᒦᒃ / Thank you!
https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Syllabics_%28Unicode_block%29
Unified Canadian Aboriginal Syllabics: Official Unicode Consortium code chart (PDF)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
| U+140x | ᐀ | ᐁ | ᐂ | ᐃ | ᐄ | ᐅ | ᐆ | ᐇ | ᐈ | ᐉ | ᐊ | ᐋ | ᐌ | ᐍ | ᐎ | ᐏ |
| U+141x | ᐐ | ᐑ | ᐒ | ᐓ | ᐔ | ᐕ | ᐖ | ᐗ | ᐘ | ᐙ | ᐚ | ᐛ | ᐜ | ᐝ | ᐞ | ᐟ |
| U+142x | ᐠ | ᐡ | ᐢ | ᐣ | ᐤ | ᐥ | ᐦ | ᐧ | ᐨ | ᐩ | ᐪ | ᐫ | ᐬ | ᐭ | ᐮ | ᐯ |
| U+143x | ᐰ | ᐱ | ᐲ | ᐳ | ᐴ | ᐵ | ᐶ | ᐷ | ᐸ | ᐹ | ᐺ | ᐻ | ᐼ | ᐽ | ᐾ | ᐿ |
| U+144x | ᑀ | ᑁ | ᑂ | ᑃ | ᑄ | ᑅ | ᑆ | ᑇ | ᑈ | ᑉ | ᑊ | ᑋ | ᑌ | ᑍ | ᑎ | ᑏ |
| U+145x | ᑐ | ᑑ | ᑒ | ᑓ | ᑔ | ᑕ | ᑖ | ᑗ | ᑘ | ᑙ | ᑚ | ᑛ | ᑜ | ᑝ | ᑞ | ᑟ |
| U+146x | ᑠ | ᑡ | ᑢ | ᑣ | ᑤ | ᑥ | ᑦ | ᑧ | ᑨ | ᑩ | ᑪ | ᑫ | ᑬ | ᑭ | ᑮ | ᑯ |
| U+147x | ᑰ | ᑱ | ᑲ | ᑳ | ᑴ | ᑵ | ᑶ | ᑷ | ᑸ | ᑹ | ᑺ | ᑻ | ᑼ | ᑽ | ᑾ | ᑿ |
| U+148x | ᒀ | ᒁ | ᒂ | ᒃ | ᒄ | ᒅ | ᒆ | ᒇ | ᒈ | ᒉ | ᒊ | ᒋ | ᒌ | ᒍ | ᒎ | ᒏ |
| U+149x | ᒐ | ᒑ | ᒒ | ᒓ | ᒔ | ᒕ | ᒖ | ᒗ | ᒘ | ᒙ | ᒚ | ᒛ | ᒜ | ᒝ | ᒞ | ᒟ |
| U+14Ax | ᒠ | ᒡ | ᒢ | ᒣ | ᒤ | ᒥ | ᒦ | ᒧ | ᒨ | ᒩ | ᒪ | ᒫ | ᒬ | ᒭ | ᒮ | ᒯ |
| U+14Bx | ᒰ | ᒱ | ᒲ | ᒳ | ᒴ | ᒵ | ᒶ | ᒷ | ᒸ | ᒹ | ᒺ | ᒻ | ᒼ | ᒽ | ᒾ | ᒿ |
| U+14Cx | ᓀ | ᓁ | ᓂ | ᓃ | ᓄ | ᓅ | ᓆ | ᓇ | ᓈ | ᓉ | ᓊ | ᓋ | ᓌ | ᓍ | ᓎ | ᓏ |
| U+14Dx | ᓐ | ᓑ | ᓒ | ᓓ | ᓔ | ᓕ | ᓖ | ᓗ | ᓘ | ᓙ | ᓚ | ᓛ | ᓜ | ᓝ | ᓞ | ᓟ |
| U+14Ex | ᓠ | ᓡ | ᓢ | ᓣ | ᓤ | ᓥ | ᓦ | ᓧ | ᓨ | ᓩ | ᓪ | ᓫ | ᓬ | ᓭ | ᓮ | ᓯ |
| U+14Fx | ᓰ | ᓱ | ᓲ | ᓳ | ᓴ | ᓵ | ᓶ | ᓷ | ᓸ | ᓹ | ᓺ | ᓻ | ᓼ | ᓽ | ᓾ | ᓿ |
| U+150x | ᔀ | ᔁ | ᔂ | ᔃ | ᔄ | ᔅ | ᔆ | ᔇ | ᔈ | ᔉ | ᔊ | ᔋ | ᔌ | ᔍ | ᔎ | ᔏ |
| U+151x | ᔐ | ᔑ | ᔒ | ᔓ | ᔔ | ᔕ | ᔖ | ᔗ | ᔘ | ᔙ | ᔚ | ᔛ | ᔜ | ᔝ | ᔞ | ᔟ |
| U+152x | ᔠ | ᔡ | ᔢ | ᔣ | ᔤ | ᔥ | ᔦ | ᔧ | ᔨ | ᔩ | ᔪ | ᔫ | ᔬ | ᔭ | ᔮ | ᔯ |
| U+153x | ᔰ | ᔱ | ᔲ | ᔳ | ᔴ | ᔵ | ᔶ | ᔷ | ᔸ | ᔹ | ᔺ | ᔻ | ᔼ | ᔽ | ᔾ | ᔿ |
| U+154x | ᕀ | ᕁ | ᕂ | ᕃ | ᕄ | ᕅ | ᕆ | ᕇ | ᕈ | ᕉ | ᕊ | ᕋ | ᕌ | ᕍ | ᕎ | ᕏ |
| U+155x | ᕐ | ᕑ | ᕒ | ᕓ | ᕔ | ᕕ | ᕖ | ᕗ | ᕘ | ᕙ | ᕚ | ᕛ | ᕜ | ᕝ | ᕞ | ᕟ |
| U+156x | ᕠ | ᕡ | ᕢ | ᕣ | ᕤ | ᕥ | ᕦ | ᕧ | ᕨ | ᕩ | ᕪ | ᕫ | ᕬ | ᕭ | ᕮ | ᕯ |
| U+157x | ᕰ | ᕱ | ᕲ | ᕳ | ᕴ | ᕵ | ᕶ | ᕷ | ᕸ | ᕹ | ᕺ | ᕻ | ᕼ | ᕽ | ᕾ | ᕿ |
| U+158x | ᖀ | ᖁ | ᖂ | ᖃ | ᖄ | ᖅ | ᖆ | ᖇ | ᖈ | ᖉ | ᖊ | ᖋ | ᖌ | ᖍ | ᖎ | ᖏ |
| U+159x | ᖐ | ᖑ | ᖒ | ᖓ | ᖔ | ᖕ | ᖖ | ᖗ | ᖘ | ᖙ | ᖚ | ᖛ | ᖜ | ᖝ | ᖞ | ᖟ |
| U+15Ax | ᖠ | ᖡ | ᖢ | ᖣ | ᖤ | ᖥ | ᖦ | ᖧ | ᖨ | ᖩ | ᖪ | ᖫ | ᖬ | ᖭ | ᖮ | ᖯ |
| U+15Bx | ᖰ | ᖱ | ᖲ | ᖳ | ᖴ | ᖵ | ᖶ | ᖷ | ᖸ | ᖹ | ᖺ | ᖻ | ᖼ | ᖽ | ᖾ | ᖿ |
| U+15Cx | ᗀ | ᗁ | ᗂ | ᗃ | ᗄ | ᗅ | ᗆ | ᗇ | ᗈ | ᗉ | ᗊ | ᗋ | ᗌ | ᗍ | ᗎ | ᗏ |
| U+15Dx | ᗐ | ᗑ | ᗒ | ᗓ | ᗔ | ᗕ | ᗖ | ᗗ | ᗘ | ᗙ | ᗚ | ᗛ | ᗜ | ᗝ | ᗞ | ᗟ |
| U+15Ex | ᗠ | ᗡ | ᗢ | ᗣ | ᗤ | ᗥ | ᗦ | ᗧ | ᗨ | ᗩ | ᗪ | ᗫ | ᗬ | ᗭ | ᗮ | ᗯ |
| U+15Fx | ᗰ | ᗱ | ᗲ | ᗳ | ᗴ | ᗵ | ᗶ | ᗷ | ᗸ | ᗹ | ᗺ | ᗻ | ᗼ | ᗽ | ᗾ | ᗿ |
| U+160x | ᘀ | ᘁ | ᘂ | ᘃ | ᘄ | ᘅ | ᘆ | ᘇ | ᘈ | ᘉ | ᘊ | ᘋ | ᘌ | ᘍ | ᘎ | ᘏ |
| U+161x | ᘐ | ᘑ | ᘒ | ᘓ | ᘔ | ᘕ | ᘖ | ᘗ | ᘘ | ᘙ | ᘚ | ᘛ | ᘜ | ᘝ | ᘞ | ᘟ |
| U+162x | ᘠ | ᘡ | ᘢ | ᘣ | ᘤ | ᘥ | ᘦ | ᘧ | ᘨ | ᘩ | ᘪ | ᘫ | ᘬ | ᘭ | ᘮ | ᘯ |
| U+163x | ᘰ | ᘱ | ᘲ | ᘳ | ᘴ | ᘵ | ᘶ | ᘷ | ᘸ | ᘹ | ᘺ | ᘻ | ᘼ | ᘽ | ᘾ | ᘿ |
| U+164x | ᙀ | ᙁ | ᙂ | ᙃ | ᙄ | ᙅ | ᙆ | ᙇ | ᙈ | ᙉ | ᙊ | ᙋ | ᙌ | ᙍ | ᙎ | ᙏ |
| U+165x | ᙐ | ᙑ | ᙒ | ᙓ | ᙔ | ᙕ | ᙖ | ᙗ | ᙘ | ᙙ | ᙚ | ᙛ | ᙜ | ᙝ | ᙞ | ᙟ |
| U+166x | ᙠ | ᙡ | ᙢ | ᙣ | ᙤ | ᙥ | ᙦ | ᙧ | ᙨ | ᙩ | ᙪ | ᙫ | ᙬ | ᙭ | ᙮ | ᙯ |
| U+167x | ᙰ | ᙱ | ᙲ | ᙳ | ᙴ | ᙵ | ᙶ | ᙷ | ᙸ | ᙹ | ᙺ | ᙻ | ᙼ | ᙽ | ᙾ | ᙿ |
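The chart above covers the code points U+1400 through U+167F, so it can be regenerated, and OCR output can be checked against it, with a few lines of Python. This is only a minimal sketch: the function names (`is_syllabic`, `chart_rows`, `syllabic_ratio`) are made up for illustration, and it looks at this one block only, not at the later extension blocks.

```python
# Range of the Unified Canadian Aboriginal Syllabics block shown in the chart.
BLOCK_START = 0x1400
BLOCK_END = 0x167F  # inclusive


def is_syllabic(ch):
    """True if a single character falls inside the U+1400..U+167F block."""
    return BLOCK_START <= ord(ch) <= BLOCK_END


def chart_rows():
    """Rebuild the chart rows, one row of 16 code points per line."""
    rows = []
    for base in range(BLOCK_START, BLOCK_END + 1, 16):
        label = "U+{:03X}x".format(base >> 4)  # e.g. 0x1400 -> "U+140x"
        cells = [chr(base + offset) for offset in range(16)]
        rows.append("| " + " | ".join([label] + cells) + " |")
    return rows


def syllabic_ratio(text):
    """Share of non-whitespace characters that are in the syllabics block.

    A hypothetical sanity check: OCR output for an Inuktitut page that
    scores near zero was probably recognized with the wrong language data.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_syllabic(c) for c in chars) / len(chars)


if __name__ == "__main__":
    for row in chart_rows():
        print(row)
    print(syllabic_ratio("ᓇᑯᕐᒦᒃ / Thank you!"))
```

Running the script prints the 40 rows of the chart followed by the ratio for the greeting at the top of this page.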
Hi Riel,
I did some volunteer work on Inuktitut OCR for an ongoing collaboration between OurDigitalWorld.org and the Multicultural History Society of Ontario (MHSO). There is a presentation on that project here [1], but I focused only on the OCR of the scanned titles in the MHSO collection. One of these is "Inuit Today", an Inuktitut/English publication from the 1970s.
The training files I created are on GitHub [2], and I have attached the result of using the trained data set to this message. I was relying on the English dataset for numbers, so none of the numeric characters are in the sample. Sad to say, I have no facility in the Inuktitut language, and I was dealing with one publication and one font, so I was out of my depth for much of this, but it might give you a starting point. I would be happy to walk you through the process I went through for the dataset. The ability to add your own fonts is an area where Tesseract shines, though it's sad that the companies you approached didn't step forward to add it to the commercial options, since it is a major language in Canada.
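For batch use, running Tesseract with a custom trained data file could look roughly like the sketch below. Treat the details as assumptions: the language name `iku` stands in for whatever the `.traineddata` file is actually called, the function names and the PNG-only folder layout are hypothetical, and the option order follows the Tesseract 3.x command line (`tesseract imagename outputbase -l LANG --tessdata-dir PATH`).

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="iku", tessdata_dir=None):
    """Return the argument list for one tesseract run; nothing is executed.

    "iku" is a placeholder: it must match the name of the trained data
    file, e.g. iku.traineddata, found in the tessdata directory.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Point tesseract at the folder that holds the custom trained data.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_folder(folder, lang="iku", tessdata_dir=None):
    """Run tesseract on every PNG in a folder, writing <name>.txt next to it."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    for image in sorted(Path(folder).glob("*.png")):
        output_base = image.with_suffix("")  # tesseract appends ".txt" itself
        subprocess.run(
            build_tesseract_command(image, output_base, lang, tessdata_dir),
            check=True,
        )
```

Calling `ocr_folder("scans", tessdata_dir="/path/to/tessdata")` would then process a whole folder of page images in one go.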
art
---
1. http://www.accessola2.com/superconference2014/sessions/329.pdf
From: tesser...@googlegroups.com [mailto:tesser...@googlegroups.com] On Behalf Of Riel Gallant
Sent: Tuesday, June 23, 2015 11:52 AM
To: tesser...@googlegroups.com
Subject: [tesseract-ocr] Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia typeface)
Hello everyone. Greetings from Nunavut, Canada.
I'm fairly new to the technical side of OCR and Tesseract in general, so my apologies in advance.
I've been OCRing quite a bit using Adobe Acrobat. It works quite well for English, but offers no support at all for written Inuktitut. The Inuktitut language is native to the northeastern part of Canada and uses a non-Roman script called "syllabics", which was introduced by missionaries in the 1800s and is still used today. Some Cree dialects also use syllabics. Here's a link to the Unified Canadian Aboriginal Syllabics Official Unicode Consortium code chart (PDF) - Wikipedia link.
Since Windows Vista, every Windows OS comes prepackaged with a font named Euphemia, which is a Unicode font that supports syllabics. When you activate the Inuktitut keyboard and hit Caps Lock, you can type syllabics. Apple also supports Euphemia, and an app that came out recently gives users an Inuktitut keyboard. Android does not support it yet. There are also many pre-Unicode typefaces that look slightly different from Euphemia syllabics, which I realize may be an issue.
I've been able to manually fix OCR errors in Adobe Acrobat under Text Recognition -> Find All Suspects -> changing the font to Euphemia -> manually typing the correct text in the red box (see attached image for instructions). Though this was a step forward, we're looking for a batch production OCR solution. OCRing Inuktitut using Acrobat gives us results like this:
