Preparing training data for a new language


Ruwanka De Silva

Mar 15, 2015, 10:45:37 AM

to tesser...@googlegroups.com

Hi All,

I am trying to train Tesseract for the Sinhalese language, to recognize text in old Sinhalese newspapers. I am new to Tesseract and have a few questions about how to prepare training data for the best results:

1. What is the best resolution (dpi) for training data?

2. I plan to apply binarization and some enhancements as preprocessing before running OCR. Will Tesseract give better results if I train it on the preprocessed images, or on the raw images (sample attached)?

3. I don't have the font used in these images, so I can't generate training data myself. Is there any way to create training data other than using scanned images of the newspapers?

4. Sinhalese has a huge character set, including various diacritics that modify the phonetic sound or meaning of a letter. What steps should I take to increase accuracy?

Any help would be appreciated.

Regards,

Ruwanka De Silva

sin.lankadeepa76.jpg

ShreeDevi Kumar

unread,

Mar 15, 2015, 10:59:06 AM3/15/15

to tesser...@googlegroups.com

Please see

http://www.ucsc.cmb.ac.lk/sdu/research.html

http://192.248.22.122/ocrsinhala/upload.php

Here is the output from it:

ටුද්‍රණි:ල .ය්චත වැට වරීජන:: ඵාෂ්. ඨ:ර්චූකට පවන්චි:යගැ න ::න චූට කූ- එ0 දූකූ:ගයගැ

0පි පිශ්‍රීබඳව රජය:ෘන් ඉදීරිෂන් කූයරන ය:ට,රණ් ච්ඝ දූ0කට 9දාද්‍රඩා භ:තපිජං .ාරීග

ාඝන් ප්‍රශ",නය පිඝඳ: ග::චූටිට ාද්‍රංයහාර:ක්ත: වන ඛචද්‍ර තීාඝ.

ථි 9තර ඉත.න් :0ද: ::ංළක් :: ව:ග චරීජනජෙි ළශද්‍රණු අ:: බීශින් න:ර:ණු ගැ: ක:ළරන බව

කි::න අනචසූතකඅ ඝමඛන්ඩශයක් වෘන්කිළ ඝමින ඒක:බ6හ යණ්ගැංසූ 8: ත්‍රං.උළ

ඩාය පද්‍ර ගට නි::බී.

ට්‍රද්‍රන්ාඋ යහ්ච්ත ව'ඩා වජී චන ළග:ණීරණ් ක: ඝංළක්න ජන. පිශ්‍රීබදච රජය ත්‍රිභින්

භළ ගැධබ්න්ඩළ::න් තවළන් වග වරජනා::න් පසූව ජංකික ඉදිථීපන. කූංරන ඟඋ මිඝඳු

ල වීතඳු.ක් රජ:න් ඉදි5 ගප්ධපි80 ඝංශ,එ:යථී නල්පිපි:: ටික් ඝමබන්ඩාඟයන් වෘන්කි::

න් වි නතී ඛච්ළ වාන'කිය කමගළල් හට::ගැන කිගඛන යමිනි එ'"ත:බග්ඩ 0ණ්)ටලජෙි

ථීත ඒළ:ඛද්ඩ ළණ්ඩලයළ:' ාශ්චක අචූලු අභූ 884::,) තර,ණු ගල්කඛ 0ංඩාලඝ ව:ශිදුරටක්

ප්‍රතෘශකළ ::0 කාළ්ඝ. ද වෘක්ති. ඝමිති ඒඛද:ඛ ශ:තවජං යත:ප නිරණ්ළකච

 ාඝ:::ළ :ක්‍ෂය දිෘ: ළං:ාංංශ් -ංළ::ං ාංගං: එචූම්ළ,න් ළචූං ව:ක්තිළ ළටිති

%ලළ ංං:ර:ළ, ෂ 8ළෘඳා ළශන් දැහෘළන් පූංල ෂං එක:ඛළ'ඩ ංණ්"ඩාළඝ ත:ඳම්

තසූ::. ක්‍රිංෂ ට්:ජීග, තීද්‍රඩ: ාළන් ඝංකචජ: ක්‍ෂංච නෂෘ. ළටිඝ:න නිගළනළතප එළබ්ළ1ච

ක්ංංගැ ංෂ්ප:::යප ෘදූ පූද්‍රණං ංංහාෘ ා:ා ඝ- ඛළ,ඥාළථ-න්ත ළපි.

ShreeDevi


Raffael

Mar 15, 2015, 11:01:53 AM

to tesser...@googlegroups.com

Hi Ruwanka!

1. 300 to 500 dpi

2. Preprocessing is necessary. Judging by the sample you gave, the lines should be as horizontal as possible, and the text should be black on a white background.

3. The font doesn't have to be identical - just "sufficiently" (very) similar.

4. Diacritics shouldn't be a big issue, at least not the dash-above-the-character kind. Just make sure you have a sufficiently large sample (at least 10 specimens) of each character and each character-diacritic combination.
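
As a rough way to check the "at least 10 specimens" rule of thumb from point 4, here is a minimal sketch that counts glyphs in box-file lines. It assumes the classic Tesseract box layout of `<glyph> <left> <bottom> <right> <top> <page>` on each line; the function names `glyph_counts` and `undersampled` are made up for illustration and are not part of Tesseract.

```python
from collections import Counter


def glyph_counts(box_lines):
    """Count how often each glyph appears in box-file lines.

    Assumed line layout: <glyph> <left> <bottom> <right> <top> <page>
    A glyph may span several code points (base letter plus diacritics),
    so the whole leading token is treated as one unit.
    """
    counts = Counter()
    for line in box_lines:
        # Split from the right so only the five numeric fields are cut off.
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        counts[parts[0]] += 1
    return counts


def undersampled(box_lines, min_samples=10):
    """Return the glyphs that occur fewer than min_samples times, sorted."""
    counts = glyph_counts(box_lines)
    return sorted(g for g, n in counts.items() if n < min_samples)
```

To use it, read a box file with `open(path, encoding="utf-8")` and pass the file object to `undersampled`; anything it returns needs more training samples by the rule of thumb above.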

Good luck

Raffael

Ruwanka De Silva

Mar 15, 2015, 11:11:35 AM

to tesser...@googlegroups.com

Thanks, ShreeDevi, for the quick reply. I have tried that software, but as you can see the results are under 5% accuracy, which is why I am going to train Tesseract specifically for old Sinhalese newspapers. The software you mentioned has been trained on several fonts, and my target fonts are not among them.

Thanks, Raffael, for your quick reply. So I have to train Tesseract on preprocessed images, not raw ones, am I right? Also, I have scanned copies at 72 dpi; can I upscale them to create training data? Will that work as expected? As you suggested, I'll try a similar font.
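
For reference, the two preprocessing steps discussed in this thread (resampling toward 300 dpi, then making the text black on white) can be sketched in plain Python as below. This is only an illustration under assumptions: the helper names are hypothetical, the image is taken to be a flat list of 8-bit grayscale values, and Otsu's method is just one common way to pick a binarization threshold, not something the thread prescribes. Resampling adds pixels but no new detail, and the thread does not settle whether upscaled 72 dpi scans train well.

```python
def upscale_size(width_px, height_px, source_dpi=72, target_dpi=300):
    """Pixel dimensions after resampling from source_dpi to target_dpi."""
    return (round(width_px * target_dpi / source_dpi),
            round(height_px * target_dpi / source_dpi))


def otsu_threshold(pixels):
    """Otsu's threshold for 8-bit grayscale values (integers 0..255).

    Picks the first threshold that maximizes the between-class variance
    of the "dark" (<= t) and "light" (> t) pixel groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark group
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels <= threshold to 0 (black text) and the rest to 255 (white)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

For example, `upscale_size(720, 1080)` gives the pixel size a 720 x 1080 scan at 72 dpi would need to reach 300 dpi, and `binarize(pixels, otsu_threshold(pixels))` produces a black-on-white version of a grayscale page.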

Thanks and Regards,

Ruwanka De Silva.
