Indigenous Language OCR with Tesseract 4.0 in Mac OSX - unicharset_extractor "command not found"

Josh Holden

Oct 8, 2020, 3:06:02 PM

to tesseract-ocr

Dear All,

I’m looking for advice because I am stuck. I’m training Tesseract to do optical character recognition of texts in Lushootseed, an Indigenous language of Washington State with no living speakers. The language has some special characters and many diacritics, and I do not know what the font is, because the texts were typed (on a typewriter?) or printed a long time ago.

I finished editing my box files for Lushootseed, but I got stuck on the step in Section 7 of Isabell Hubert's manual (written for a previous version), which is "extracting the character set" with unicharset_extractor. When I enter this command with the information for my model in Terminal, it says: unicharset_extractor "command not found"

Isabell says that this manual may not be entirely applicable to the new 4.0 version, and that the best thing is to ask this group for advice.

From what I see online, this means that I have not installed the character set extractor or the training tools. It looks like my computer has downloaded unicharset_extractor.exe, but that is the Windows version, and I have Mac OSX. I cannot figure out how to (re)install the training tools, if that is what is needed.
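
For reference, here is a minimal sketch of how the "command not found" condition can be checked from the terminal, assuming a POSIX-style shell such as the one in Terminal on Mac. The helper name check_tool is hypothetical and not part of Tesseract, and the list of tools is only illustrative. It reports whether each command can be found on the PATH, and it does not install anything.

```shell
# Report whether a command can be found on the PATH.
# check_tool is a hypothetical helper name, not a Tesseract tool.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1 -> $(command -v "$1")"
  else
    echo "missing: $1"
    return 1
  fi
}

# Illustrative list: the main program and the training tool in question.
for tool in tesseract unicharset_extractor; do
  check_tool "$tool" || true
done
```

A "missing" line for unicharset_extractor alongside a "found" line for tesseract would be consistent with the training tools not being installed, or being installed somewhere that is not on the PATH.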

Any advice anyone has on this would be much appreciated! I am doing this for a paid contract, not my own research, so I really would like to get past this roadblock.

Thanks so much.

Sincerely,

Dr. Josh Holden

Postdoctoral researcher

ALT Lab, University of Alberta
