Tesseract training for Burmese script

Thura Hlaing

Dec 22, 2011, 10:49:14 PM
to tesser...@googlegroups.com, Ngwe Tun, Shwun Mi, thinzar...@gmail.com
Hello guys, I am trying to train Tesseract (3.01) for Burmese script. I am following the guide exactly; however, I couldn't get an acceptable accuracy rate (it is less than 50%). Although Burmese script has only 33 letters (consonants), there are a lot of consonant + diacritic combinations (ligatures), so I need to train more than 900 characters (glyphs). I have generated a TIFF image and box file, including 7 samples for infrequent characters and 20 for frequent characters.

Is it because the characters (glyphs) in the training set are quite similar to each other? (In Burmese, each consonant can have several ligatures, combined with one or more diacritics, and these ligatures are quite similar to each other.)

I have attached my training image (converted from TIFF to PNG) and box file. Any help or tips to improve accuracy would be greatly appreciated. Thanks in advance.
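
For readers who want to verify the per-glyph sample counts mentioned above, here is a minimal sketch. It assumes the Tesseract 3.x box-file layout of one box per line (`glyph left bottom right top page`); the function names and the sample data are hypothetical and are not taken from the attached files.

```python
from collections import Counter


def count_glyph_samples(box_text):
    """Count how many boxes each glyph has in box-file text.

    Assumes each line reads: glyph left bottom right top page.
    """
    counts = Counter()
    for line in box_text.splitlines():
        # Split from the right so the glyph itself may span several code points.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        counts[parts[0]] += 1
    return counts


def under_sampled(counts, minimum):
    """Return the glyphs that have fewer than `minimum` samples, sorted."""
    return sorted(glyph for glyph, n in counts.items() if n < minimum)


# Hypothetical three-line box file: two samples of one consonant, one ligature.
sample = "က 10 20 40 60 0\nက 50 20 80 60 0\nကျ 90 15 130 60 0\n"
counts = count_glyph_samples(sample)
print(under_sampled(counts, 2))
```

Running this over the real box file with `minimum=7` would show whether any glyph falls short of the intended sample count.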

mya.Tharlon.exp12.zip

Sven Pedersen

Dec 28, 2011, 7:42:31 AM
to tesser...@googlegroups.com
Hi Thura,
It looks like you're using a resolution of 600 pixels per inch. That may be
too high. Have you checked the instructions about character pixel
height? You're right that the glyphs are similar, but that should not make
such a big difference. You may be able to use character sequences and
post-processing to yield better results.
--Sven
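
One way to check the character pixel height is to read it from the box file. The sketch below again assumes the Tesseract 3.x box layout (`glyph left bottom right top page`); the function names, the sample data, and the target height of 30 pixels are placeholders, so take the recommended height from the training instructions rather than from this example.

```python
from statistics import median


def glyph_heights(box_text):
    """Return the pixel height of every box in box-file text.

    Assumes each line reads: glyph left bottom right top page.
    """
    heights = []
    for line in box_text.splitlines():
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        bottom, top = int(parts[2]), int(parts[4])
        heights.append(abs(top - bottom))
    return heights


def scale_for_target(heights, target_height):
    """Resize factor that brings the median glyph height to target_height."""
    return target_height / median(heights)


# Hypothetical boxes that are each 60 pixels tall.
sample = "က 10 20 40 80 0\nခ 50 20 80 80 0\n"
heights = glyph_heights(sample)
factor = scale_for_target(heights, 30)  # 30 is a placeholder target
print(heights, factor)
```

A factor well below 1 would suggest that the training image is larger than it needs to be and could be downsampled before retraining.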


