Anyone working on Georgian (kartuli ena)?

La Monte H. P. Yarroll

May 17, 2012, 9:42:02 AM
to tesser...@googlegroups.com
Is anyone working on Georgian (kartuli ena)? I have a regular timeslot coming up where I could work on this.

Sven Pedersen

May 17, 2012, 10:44:28 AM
to tesser...@googlegroups.com
Someone named Derek Dohler was working on it a year ago (I've private
messaged both parties).
--Sven

--
“All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

Derek

May 17, 2012, 11:17:15 AM
to tesseract-ocr
Strangely, the spammer who was sending tons of messages to the list is
also Georgian and claims that his software works on the Georgian
language. I'm planning to download his software tonight and (after
carefully checking for viruses) test it out.

Will respond to PM momentarily.

Derek


Nick White

May 28, 2014, 11:26:55 AM
to tesser...@googlegroups.com
Hi all,

Resurrecting an old thread: has anyone got anywhere training
tesseract for Georgian? Or tried to?

There's a new comment on the TrainingTesseract3 wiki page that
implies someone is, but I don't know how to contact them, and it
would be very useful to me. Even if it isn't complete, it would be
useful to talk to anyone who's interested in Georgian OCR.

Thanks,

Nick

vov4ik829

May 31, 2014, 4:26:11 AM
to tesser...@googlegroups.com
Hi Nick,

I am working on Georgian.
I will get back to you later and share what little experience I have.

On Wednesday, May 28, 2014, 19:26:55 UTC+4, Nick White wrote:

Derek

May 31, 2014, 11:22:22 AM
to tesser...@googlegroups.com
I trained Tesseract on Georgian data a couple of years ago and got passable, but not excellent, results. I'm attaching my traineddata file here; one major known issue is that I forgot to include samples for the numeral 4. Oops.

I'm happy to pass along any of my training documents to anyone who wants to fix that, and maybe improve the recognition quality overall.

Cheers,
Derek
kat.traineddata.zip
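
A gap like the missing numeral 4 can be caught before training by checking the training text for character coverage. The following is only a rough sketch, not how the attached file was built: the function name `missing_characters` and the sample string are made up for illustration, and it assumes the text should cover the 33 modern Georgian (Mkhedruli) letters, U+10D0 to U+10F0, plus the ASCII digits.

```python
import string

# Modern Georgian (Mkhedruli) letters occupy U+10D0..U+10F0 in Unicode.
GEORGIAN_LETTERS = [chr(cp) for cp in range(0x10D0, 0x10F1)]


def missing_characters(training_text, required=None):
    """Return the required characters that never occur in training_text.

    By default (an assumption for this sketch) the required set is the
    33 Mkhedruli letters plus the digits 0-9.
    """
    if required is None:
        required = GEORGIAN_LETTERS + list(string.digits)
    present = set(training_text)
    return [ch for ch in required if ch not in present]


# Hypothetical sample text that happens to contain every digit except 4.
sample = "ტელეფონი: 555 12 30 67 89"
print(missing_characters(sample, required=list(string.digits)))  # prints ['4']
```

Running the same check with the default required set would also list any Georgian letters the sample text lacks.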

gtess...@gmail.com

Jun 2, 2014, 5:53:23 AM
to tesser...@googlegroups.com
I use Georgian in my commercial program SunnyPage v2.1 (http://www.sunnypage.ge/en/), with recognition tables. It runs tesseract-3.03-rc1 with a modified version of the database: ge_s.traineddata + ge_s_.traineddata + geo.traineddata for Georgian, and geo_old.traineddata for the old Georgian "Khutsuri" script.
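
For readers unfamiliar with combining several traineddata files: the Tesseract command line accepts multiple languages joined with `+` in the `-l` option, and the language code is the file name without `.traineddata`. Below is a minimal sketch that only builds such a command. It is not taken from SunnyPage; the helper names and the image and output names are hypothetical, and the files are assumed to already sit in the tessdata directory.

```python
from pathlib import Path


def language_spec(traineddata_files):
    """Join traineddata file names into a '-l' value such as 'ge_s+geo'."""
    # Path.stem drops the final '.traineddata' suffix, leaving the language code.
    return "+".join(Path(name).stem for name in traineddata_files)


def tesseract_command(image_path, output_base, traineddata_files):
    """Build (but do not run) a tesseract command line as an argument list."""
    return ["tesseract", image_path, output_base,
            "-l", language_spec(traineddata_files)]


files = ["ge_s.traineddata", "ge_s_.traineddata", "geo.traineddata"]
# 'page.png' and 'page' are placeholder names for this sketch.
print(" ".join(tesseract_command("page.png", "page", files)))
# prints: tesseract page.png page -l ge_s+ge_s_+geo
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and the traineddata files are installed.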


gtess...@gmail.com

Jun 2, 2014, 6:41:59 AM
to tesser...@googlegroups.com
I have a problem with segmentation on the attached file. https://drive.google.com/file/d/0B8h3BnFL4od5bklGeU5CUDV6RHc/edit?usp=sharing

Nick White

Jun 3, 2014, 12:03:43 PM
to tesser...@googlegroups.com
Hi Derek,

Thanks for this. It does indeed look pretty good, from my brief
testing (though I don't know Georgian at all, so I'm only basing it
on "those shapes look like the shapes in the scan").

If you could post the training documents somewhere, that would be
useful. I could then at least add samples for the numeral 4, if
nothing else.

Thanks again!

Nick

> Archive:  /tmp/kat.traineddata.zip
>   Length      Date    Time    Name
> ---------  ---------- -----   ----
>   6543722  2012-06-10 09:16   kat.traineddata
> ---------                     -------
>   6543722                     1 file
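
A listing like the one above can also be produced from Python before extracting anything, which is handy when checking an attachment from a public list. This is just a sketch using the standard `zipfile` module; the function name `list_archive` is made up here, and the in-memory archive is a tiny stand-in, not the real 6.5 MB file.

```python
import io
import zipfile


def list_archive(zip_source):
    """Return (name, size_in_bytes) for each member of a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Build a small in-memory archive as a stand-in for kat.traineddata.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("kat.traineddata", b"placeholder bytes")
buffer.seek(0)

for name, size in list_archive(buffer):
    print(name, size)
```

Passing a path such as `/tmp/kat.traineddata.zip` instead of the buffer would list the real archive.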

Derek

Oct 10, 2014, 10:04:31 PM
to tesser...@googlegroups.com, Nick White
Hi Nick,

Apologies for taking such a long time, but I finally got around to digging my training set out of my development machine and zipping it up for distribution. Here's a Dropbox link to the file:
https://dl.dropboxusercontent.com/u/11840441/kat_train.zip
I'll try to keep it available at that link for as long as possible since this is a public list, but it likely will not stay around forever.

Note that the source text for this training data is (unfortunately) not in the public domain. The original text is available here:
http://transparency.ge/blog/საპარტნიორო-ფონდი-ეკონომიკის-განვითარებისთვის-გამჭვირვალობის-პრობლემა?page=2
I believe that the copyright owner (Transparency International Georgia) would give permission for it to be included on the Tesseract downloads page, and I'd be happy to make that contact if the project maintainers are interested in including this training set once it has been improved somewhat.

Cheers,
Derek