Yoruba OCR

Victor Williamson

Dec 3, 2014, 6:52:04 PM

to tesser...@googlegroups.com

I am working on Yoruba OCR using Tesseract 3.02. I followed the steps on the wiki, referring to Cedric, and all of the training goes through. However, running Tesseract converts my images of Yoruba text to all dashes (-), proportional to the size of the text in the image. This happens even for the image I trained on. I used a very small sample of Yoruba text, and I realize I may not meet the minimum per-character sample requirement, because during mftraining I get a bunch of warnings like these:

Warning: no protos/configs for ò in CreateIntTemplates()
Warning: no protos/configs for w in CreateIntTemplates()
Warning: no protos/configs for ú in CreateIntTemplates()
Warning: no protos/configs for à in CreateIntTemplates()
...

Is there a way to build off the existing English training data? i.e. I want to extend the existing English training data because Yoruba uses most of the English characters plus 3 dozen additional special non-English characters. The existing English characters should always be recognized. I wanted to start with a small training image so that I could finish with minimal effort, run simple tests, and expand later.

I've tried both manual commands and the training feature within jTessBoxEditor, with the same end result. It would be nice to get at least some characters in the output.
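
One way to check the suspicion about too few samples per character is to count how often each symbol appears in the box file before training. The following is a minimal sketch, assuming the Tesseract 3.x box-file layout of one box per line (`<symbol> <left> <bottom> <right> <top> <page>`). The function names and the threshold of 2 are hypothetical and only for illustration; this is not an official Tesseract tool.

```python
from collections import Counter


def count_box_samples(box_text):
    """Count how many boxes each symbol has in box-file text.

    Assumes each line looks like: <symbol> <left> <bottom> <right> <top> <page>
    """
    counts = Counter()
    for line in box_text.splitlines():
        parts = line.split(" ")
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        # Everything before the last five numeric fields is the symbol.
        symbol = " ".join(parts[:-5])
        counts[symbol] += 1
    return counts


def rare_symbols(counts, minimum):
    """Return symbols that occur fewer than `minimum` times (hypothetical threshold)."""
    return sorted(s for s, n in counts.items() if n < minimum)


# Tiny made-up box file, just to exercise the functions.
sample = (
    "ò 10 10 20 30 0\n"
    "w 22 10 34 30 0\n"
    "w 36 10 48 30 0\n"
    "ú 50 10 60 30 0\n"
)
counts = count_box_samples(sample)
print(counts)
print(rare_symbols(counts, 2))
```

Symbols reported as rare here would be candidates for more training text. Whether they are the same symbols named in the mftraining warnings is something to verify against the real box file.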

ShreeDevi Kumar

Dec 4, 2014, 3:55:01 AM

to tesser...@googlegroups.com

Try to use training text from the following and see if it helps -

https://code.google.com/r/shreeshrii-langdata/source/browse?name=asc

https://code.google.com/r/shreeshrii-langdata/source/browse?name=iast

https://code.google.com/r/shreeshrii-tessdata/source/browse?name=iast

You can use eng+your_language_code as the language argument to recognize English plus your language's text.
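
As a rough sketch of what the combined-language call looks like, the snippet below builds the `tesseract` command line with `-l eng+yor`. The language code `yor`, the file names, and the helper names are assumptions for illustration. It also assumes that both `eng.traineddata` and `yor.traineddata` are present in the tessdata directory.

```python
import subprocess


def build_ocr_command(image_path, output_base, languages):
    """Build a tesseract command that loads several languages joined with '+'."""
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


def run_ocr(image_path, output_base, languages):
    """Run tesseract; requires the tesseract binary to be on PATH."""
    subprocess.run(build_ocr_command(image_path, output_base, languages), check=True)


# Only the command is printed here; nothing is executed.
cmd = build_ocr_command("page.tif", "page", ["eng", "yor"])
print(" ".join(cmd))
```

Calling `run_ocr("page.tif", "page", ["eng", "yor"])` would write the recognized text to `page.txt`.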

ShreeDevi

Quan Nguyen

Dec 4, 2014, 8:00:16 PM

to tesser...@googlegroups.com

Check out the training sample files bundled with jTessBoxEditor, located under the samples\vie folder. The Vietnamese alphabet seems to share some characters with Yoruba, so you can certainly adapt the samples to your language.

Victor Williamson

Aug 30, 2015, 10:00:41 AM

to tesseract-ocr

The links you gave me are great. I created the tiff/box pair on a mac as follows:

training/text2image --text=yor.training_text --outputbase=yor.VerdanaMedium.exp0 --font='Verdana Medium' --fonts_dir=/Library/Fonts

Then I ran training as follows:

tesseract yor.VerdanaMedium.exp0.tif yor.VerdanaMedium.exp0 box.train.stderr

The only problem is that, after creating the tiff/box pairs, the training step reports failures like the following:

APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): FAILURE! Couldn't find a matching blob

FAIL!

APPLY_BOXES: boxfile line 2135/ ((2112,1810),(2122,1848)): FAILURE! Couldn't find a matching blob

FAIL!

...

APPLY_BOXES:

Boxes read from boxfile: 2265

Boxes failed resegmentation: 124

Found 2141 good blobs.

Leaving 3 unlabelled blobs in 0 words.

Generated training data for 986 words

Warning in pixReadMemTiff: tiff page 5 not found

I also tried using the asc.training_text example directly, i.e. without my changes, but the same errors still occur. I've Googled, but it is unclear to me what the solution is.
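
To put the failure count in perspective, the log can be summarized with a short script. This is a minimal sketch that assumes the log lines have exactly the shape shown above; the function name and the returned dictionary keys are hypothetical. It reports the share of boxes that failed resegmentation and the width and height of each failing box, which may help spot a pattern (for example, whether the failures cluster on very narrow glyphs). It does not diagnose the cause.

```python
import re

FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(.*?) \(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE"
)


def summarize_apply_boxes(log_text):
    """Summarize APPLY_BOXES output: totals, failure rate, failing box sizes."""
    total = int(re.search(r"Boxes read from boxfile:\s*(\d+)", log_text).group(1))
    failed = int(re.search(r"Boxes failed resegmentation:\s*(\d+)", log_text).group(1))
    failures = []
    for m in FAILURE_RE.finditer(log_text):
        line_no, symbol = int(m.group(1)), m.group(2)
        left, bottom, right, top = (int(m.group(i)) for i in range(3, 7))
        # Store (box-file line, symbol as logged, width, height).
        failures.append((line_no, symbol, right - left, top - bottom))
    return {
        "total": total,
        "failed": failed,
        "rate": failed / total,
        "failures": failures,
    }


# Log excerpt copied from the output quoted in this thread.
log = """APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2135/ ((2112,1810),(2122,1848)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES:
Boxes read from boxfile: 2265
Boxes failed resegmentation: 124
Found 2141 good blobs.
"""
summary = summarize_apply_boxes(log)
print("failure rate: {:.1%}".format(summary["rate"]))
for line_no, symbol, width, height in summary["failures"]:
    print(line_no, repr(symbol), width, height)
```

With the numbers from the log above, 124 of 2265 boxes fail, which is roughly 5.5% of the boxes.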
