New Georgian (kartuli ena) traineddata for Tesseract

Derek

Apr 1, 2015, 9:28:07 PM

to tesser...@googlegroups.com

I've recently finished training Tesseract 3.03-rc1 on the Georgian language, using tesstrain.sh and starting from the files in the langdata repository. I built my own word list and bigram list from Wikipedia.
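As an illustration of how a word list and a bigram list can be derived from raw text, here is a minimal sketch. It is not the script used for this training; the function name, the sample sentence, and the choice to treat a word as a run of Georgian letters (U+10D0 to U+10FA) are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Georgian letters in U+10D0..U+10FA.
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10FA]+")


def count_words_and_bigrams(text):
    """Return (word_counts, bigram_counts) for the Georgian words in text."""
    word_counts = Counter()
    bigram_counts = Counter()
    for line in text.splitlines():
        # Non-Georgian tokens are skipped, so bigrams pair up consecutive
        # Georgian words found on the same line.
        words = GEORGIAN_WORD.findall(line)
        word_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    return word_counts, bigram_counts


# Hypothetical sample text, only to show the shape of the result.
sample = "ქართული ენა არის ქართული ენა"
words, bigrams = count_words_and_bigrams(sample)
print(words.most_common(3))
print(bigrams.most_common(3))
```

Sorting the two counters by frequency gives candidate lists; any cleaning or cut-off applied afterwards is a separate decision that this sketch does not cover.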

Performance is very good on high-quality scans with modern fonts, but it doesn't do very well on older documents; I'm not sure whether this is because of differences in the font, or because the synthetic images generated by the tesstrain.sh script don't give tesseract enough training in handling degraded images.

I've uploaded the traineddata file and all training files here: https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip

I'm attaching a test image (a randomly-selected scan from Georgia's registry of corporations) and the output of running tesseract recognition on the test image. No pre-processing was done on the test image except to upsample it to 300dpi. The test image contains some Latin characters so I ran tesseract with the language selector "kat+eng".
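For readers who want to reproduce a run like this, the sketch below builds the command line for a multi-language recognition pass. The helper name is hypothetical, and it assumes that both kat.traineddata and eng.traineddata are installed where Tesseract looks for language data.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for a tesseract run with several languages."""
    # Tesseract accepts multiple languages joined by "+" after -l.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


cmd = build_tesseract_command("NIKA_28.png", "NIKA_28", ["kat", "eng"])
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`; with the output base `NIKA_28`, Tesseract writes the recognized text to `NIKA_28.txt`.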

The licensing for any documents to which I hold the copyright is the same as the tesseract source, i.e. the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).

NIKA_28.txt

NIKA_28.png

Sven Pedersen

Apr 2, 2015, 9:35:20 AM

to tesser...@googlegroups.com

Cool! Good work. I hope that will help the others who have been asking about Georgian for a couple years. :-)

--Sven

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

“All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king.”

ShreeDevi Kumar

Apr 2, 2015, 10:16:59 AM

to tesser...@googlegroups.com

Please see

https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.h

https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.cpp

It may be possible to do additional training using degraded versions of the 'synthetic' images, which may improve recognition of older documents.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Derek Dohler

Apr 2, 2015, 11:04:42 PM

to tesser...@googlegroups.com

ShreeDevi,

Thanks for this -- I tried re-training tesseract with a range of exposure values passed to text2image, but didn't see improved results.
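A sketch of what rendering the training text at several exposure levels might look like is given below. It only builds the text2image command lines and does not run them. The training text path, font name, and fonts directory are placeholders, and the helper name is hypothetical; the exact commands used in this experiment are not given in the thread.

```python
def build_text2image_commands(text_file, font, fonts_dir, exposures, lang="kat"):
    """Return one text2image argument list per exposure level."""
    commands = []
    for exposure in exposures:
        # Output base follows the lang.fontname.expN naming convention.
        output_base = f"{lang}.{font.replace(' ', '_')}.exp{exposure}"
        commands.append([
            "text2image",
            f"--text={text_file}",
            f"--outputbase={output_base}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
            f"--exposure={exposure}",
        ])
    return commands


# Placeholder inputs: adjust the font and paths to the local setup.
cmds = build_text2image_commands(
    "kat.training_text", "DejaVu Sans", "/usr/share/fonts", [-1, 0, 1]
)
for c in cmds:
    print(" ".join(c))
```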

However, I did notice in the process that the x-heights for the document I was attempting to recognize were near the lower limit of what Tesseract can handle (~10px), so I doubled the image size. This resulted in much improved recognition; there are still errors, but fewer of them and they "make sense" now. Tesseract isn't able to segment the 5-column page layout very well, but otherwise I'm pretty happy with the results.
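The resizing decision described above can be sketched as a small calculation. The ~10px figure comes from the message; the 20px target (that is, doubling) and the function name are assumptions for illustration, not a documented Tesseract threshold.

```python
import math


def upscale_factor(x_height_px, target_x_height_px=20):
    """Return the integer factor needed to reach the target x-height."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never shrink the image: the factor is at least 1.
    return max(1, math.ceil(target_x_height_px / x_height_px))


print(upscale_factor(10))  # x-height near the ~10px lower limit
print(upscale_factor(30))  # already large enough
```

The measured x-height has to come from the scan itself, for example by inspecting a few lowercase letters in an image viewer.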

Derek

zdenko podobny

Apr 3, 2015, 4:07:03 AM

to tesser...@googlegroups.com

Can you create a repository for your training files (on SourceForge or GitHub)?

Ideally with a detailed description of how you created them, so that other people can try to improve or extend them.

Zdenko

Derek Dohler

Apr 3, 2015, 10:40:49 PM

to tesser...@googlegroups.com

Hi Zdenko,

Sure, no problem -- I've made all the files, along with instructions, available at https://github.com/ddohler/tesseract-georgian

Cheers,

Derek

zdenko podobny

Apr 4, 2015, 3:09:13 AM

to tesser...@googlegroups.com

Thanks. I've added the link to the AddOn wiki.

Zdenko

sibi kanagaraj

Apr 8, 2015, 10:41:00 AM

to tesser...@googlegroups.com

Hi Derek,

Excellent documentation.

A small correction: in the section on kat.wordlist.clean / kat.word.bigrams.clean, the instructions say

<<Run python count_stuff/word_counts.py>>

but the actual file name is wordcounts.py.

-Sibi
