New Georgian (kartuli ena) traineddata for Tesseract


Derek

Apr 1, 2015, 9:28:07 PM
to tesser...@googlegroups.com
I've recently finished training tesseract 3.03-rc1 on the Georgian language, using tesstrain.sh and based off the files in the langdata repository. I created my own word list and bigrams list using Wikipedia.
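For readers who want to build similar lists, here is a minimal sketch of counting words and word bigrams from plain text. It is not the script used for this training; the function name, the sample text, and the choice to match only modern Mkhedruli letters (U+10D0 to U+10FA) are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of modern Georgian (Mkhedruli) letters,
# which occupy U+10D0..U+10FA. Everything else is treated as a separator.
WORD_RE = re.compile(r"[\u10D0-\u10FA]+")


def count_words_and_bigrams(text):
    """Return (word_counts, bigram_counts) for Georgian-script words in text."""
    words = Counter()
    bigrams = Counter()
    for line in text.splitlines():
        tokens = WORD_RE.findall(line)
        words.update(tokens)
        # Pair each token with its successor on the same line.
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


# Hypothetical two-line sample standing in for extracted Wikipedia text.
sample = "ქართული ენა\nქართული ენა და ქართული ანბანი"
word_counts, bigram_counts = count_words_and_bigrams(sample)

for word, n in word_counts.most_common():
    print(word, n)
for (first, second), n in bigram_counts.most_common():
    print(first, second, n)
```

Sorting by frequency and keeping the most common entries would give a word list and a bigram list of a manageable size; where to cut off is a judgment call this sketch does not make.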

Performance is very good on high-quality scans with modern fonts, but it doesn't do very well on older documents; I'm not sure whether this is because of differences in the font, or because the synthetic images generated by the tesstrain.sh script don't give tesseract enough training in handling degraded images.

I've uploaded the traineddata file and all training files here: https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip

I'm attaching a test image (a randomly-selected scan from Georgia's registry of corporations) and the output of running tesseract recognition on the test image. No pre-processing was done on the test image except to upsample it to 300dpi. The test image contains some Latin characters so I ran tesseract with the language selector "kat+eng".
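As a rough sketch of that invocation, the snippet below only assembles and prints the command line; it does not run tesseract. The helper name is hypothetical, and it assumes kat.traineddata and eng.traineddata are both installed in the tessdata directory.

```python
import shlex


def tesseract_command(image_path, output_base, languages):
    """Build the argv for: tesseract IMAGE OUTPUTBASE -l LANG1+LANG2."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Output base "NIKA_28" makes tesseract write its text to NIKA_28.txt.
cmd = tesseract_command("NIKA_28.png", "NIKA_28", ["kat", "eng"])
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where tesseract is on the PATH.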

The licensing for any documents to which I hold the copyright is the same as the tesseract source, i.e. the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
NIKA_28.txt
NIKA_28.png

Sven Pedersen

Apr 2, 2015, 9:35:20 AM
to tesser...@googlegroups.com
Cool! Good work. I hope that will help the others who have been asking about Georgian for a couple years. :-)
--Sven

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

ShreeDevi Kumar

Apr 2, 2015, 10:16:59 AM
to tesser...@googlegroups.com
Please see 

It may be possible to do additional training using degraded versions of the 'synthetic' images, which may improve recognition of older documents.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Derek Dohler

Apr 2, 2015, 11:04:42 PM
to tesser...@googlegroups.com
ShreeDevi,

Thanks for this -- I tried re-training tesseract with a range of exposure values passed to text2image, but didn't see improved results.
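For anyone repeating the experiment, here is a sketch of generating one text2image command per exposure level. It only builds the command lines. The font name, the exposure range, the output naming, and the training-text path are placeholders rather than the values actually used, and a real run will likely need further flags (for example a fonts directory).

```python
def text2image_commands(training_text, font, exposures, lang="kat"):
    """Return one text2image argv per exposure level."""
    commands = []
    for exposure in exposures:
        # Hypothetical naming scheme: language, font, then exposure level.
        outputbase = f"{lang}.{font.replace(' ', '_')}.exp{exposure}"
        commands.append([
            "text2image",
            f"--text={training_text}",
            f"--outputbase={outputbase}",
            f"--font={font}",
            f"--exposure={exposure}",
        ])
    return commands


# Assumed inputs: a langdata-style training text and an installed font.
commands = text2image_commands(
    "kat.training_text", "DejaVu Sans", exposures=range(-2, 3)
)
for argv in commands:
    print(argv)
```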

However, I did notice in the process that the x-heights for the document I was attempting to recognize were near the lower limit of what Tesseract can handle (~10px), so I doubled the image size. This resulted in much improved recognition; there are still errors, but fewer of them and they "make sense" now. Tesseract isn't able to segment the 5-column page layout very well, but otherwise I'm pretty happy with the results.
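The resizing decision can be written down as simple arithmetic. In this sketch the 20px target is an assumption chosen so that a 10px x-height is doubled, matching what I did; it is not a documented Tesseract threshold, and the function name is made up.

```python
import math


def scale_for_x_height(measured_px, target_px=20):
    """Smallest whole-number scale factor that lifts the x-height to target_px."""
    if measured_px <= 0:
        raise ValueError("measured x-height must be positive")
    # Never shrink the image: a factor below 1 is clamped to 1.
    return max(1, math.ceil(target_px / measured_px))


factor = scale_for_x_height(10)
print(factor)  # 2
```

Multiplying the image width and height by this factor before running recognition is then a job for whatever image tool is at hand.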

Derek


zdenko podobny

Apr 3, 2015, 4:07:03 AM
to tesser...@googlegroups.com
Can you create a repository for your training files (on SourceForge or GitHub)?

Maybe with a detailed description of how you created them, so that other people can try to improve or extend the training.


Zdenko

Derek Dohler

Apr 3, 2015, 10:40:49 PM
to tesser...@googlegroups.com
Hi Zdenko,

Sure, no problem -- I've made all the files, along with instructions, available at https://github.com/ddohler/tesseract-georgian

Cheers,
Derek

zdenko podobny

Apr 4, 2015, 3:09:13 AM
to tesser...@googlegroups.com
Thanks. I added a link to it on the AddOn wiki.

Zdenko

sibi kanagaraj

Apr 8, 2015, 10:41:00 AM
to tesser...@googlegroups.com
Hi Derek,

Excellent documentation.

A small correction: under "kat.wordlist.clean / kat.word.bigrams.clean", the documentation says

<<Run python count_stuff/word_counts.py>>

but the actual file name is wordcounts.py.

-Sibi