Re: Contribution : Serbian Cyrillic traineddata file


ShreeDevi Kumar

Nov 3, 2014, 10:55:09 PM

to tesser...@googlegroups.com, tesser...@googlegroups.com, Ray Smith, zdenko podobny

* Changed subject to Serbian Cyrillic

* Please note that issues allow attachments only up to 10 MB. So, if the zipped traineddata is larger than that, please host it elsewhere (e.g. GitHub) and provide a link. Ray/Jeff/Zdenko, please correct me if that is not the case.
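A minimal sketch of that size check, assuming Python is available. The helper name `zip_and_check` is hypothetical, and reading "10 MB" as 10 * 1024 * 1024 bytes is an assumption; the issue tracker may count it differently.

```python
import os
import zipfile

# Limit quoted in the message above. Treating "10 MB" as
# 10 * 1024 * 1024 bytes is an assumption, not a documented value.
ATTACHMENT_LIMIT_BYTES = 10 * 1024 * 1024


def zip_and_check(path, limit=ATTACHMENT_LIMIT_BYTES):
    """Zip `path` next to itself and report whether the archive fits under `limit`.

    Returns a tuple (archive_path, archive_size_in_bytes, fits).
    """
    archive = path + ".zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store only the file name, not the full directory path.
        zf.write(path, arcname=os.path.basename(path))
    size = os.path.getsize(archive)
    return archive, size, size <= limit
```

Calling `zip_and_check("srp.traineddata")` would write `srp.traineddata.zip` beside the original file; if the third value returned is `False`, the archive is too large to attach and needs to be hosted elsewhere.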

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 4, 2014 at 7:35 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:

Thanks for clarifying and giving more details.

I am cc'ing this email to the tesseract developers group and Ray for an answer to your question, "how to submit this file to Tesseract's repository?"

Meanwhile, I suggest that you add an 'issue' and attach the traineddata.

Thanks!

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 4, 2014 at 1:08 AM, Puramoca021 <puram...@gmail.com> wrote:
Hi Devi,

Unfortunately, you are slightly misinformed as well.

The trained data file for the Serbian language that is currently in Tesseract's repository contains LATIN characters.
What I made is trained data that recognizes Serbian Cyrillic characters.

A good summary and explanation of what Serbian Cyrillic is can be found here (Wikipedia article). Please pay attention to the section "Modern alphabet" in the Wikipedia article.
What the current version of Tesseract's srp.traineddata can recognize are the letters in the column labelled "Latin" (see the Wikipedia article).
I would like to submit a trained data file that will make Tesseract recognize the letters in the column "Cyrillic" (again, see the Wikipedia article).
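To illustrate the distinction, here is a small sketch that tells the two scripts apart by Unicode character name. It is not part of Tesseract; the function names `script_counts` and `dominant_script` are hypothetical, and the sample word is only an illustration.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count the letters in `text` by the script given in their Unicode names."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["CYRILLIC"] += 1
        elif name.startswith("LATIN"):
            counts["LATIN"] += 1
        else:
            counts["OTHER"] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script label, or None if `text` has no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None
```

With these helpers, `dominant_script("Београд")` reports Cyrillic and `dominant_script("Beograd")` reports Latin, which is the difference between the two columns of the alphabet table.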

Again, I did not get a clear answer to my question: how do I submit this file to Tesseract's repository?

Shall I assume that I need to open an issue and submit trained data there? Please clarify.

Regards,
Zoltan

On Monday, November 3, 2014 at 19:45:38 UTC+1, shree wrote:
There already is language data for srp - please see

https://code.google.com/p/tesseract-ocr/source/browse/srp/?repo=langdata

and

https://code.google.com/p/tesseract-ocr/source/browse/srp.traineddata?repo=tessdata

Ray Smith, the lead developer of Tesseract at Google, is planning to release updated versions of the traineddata soon as part of the 3.04 release.

If your traineddata has something additional that is not in the existing set, then please attach it to an issue so that it can be tested.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 4, 2014 at 12:02 AM, Puramoca021 <puram...@gmail.com> wrote:

On Sunday, November 2, 2014 4:45:32 PM UTC+1, Vladimir Radnovic wrote:
Hi Zoltan,
What do you need new traineddata for? There are several ways to do the training, so if you need help, get in touch.

Regards,
Vladimir

Hi Vladimir,

I am afraid you did not understand me ... I think I was not clear enough:

- I do not need new traineddata. I made new traineddata for Serbian Cyrillic myself, and I would like to offer it to all Tesseract users who need to OCR text printed in Serbian Cyrillic.

My question is: How do I send this file (srp.traineddata) to you, Tesseract developers and maintainers?

By zipping it and sending via email?
By uploading to a file sharing service? If so, which one?
By making a torrent out of it?

Please advise.

Regards,
Zoltan

On Saturday, 1 November 2014 21:12:04 UTC+1, Puramoca021 wrote:
Hi,

I have trained the unreleased Tesseract 3.04 (available only in the Subversion repository) to recognize Serbian Cyrillic. I strictly followed the instructions for training Tesseract 3: I used the tesstrain.sh script and provided the required files.

My question is: what is the procedure for submitting new trained data so that it is available in the upcoming version of Tesseract?

Best regards,
Zoltan

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0362254d-260d-49fa-af8b-c098b50811f0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Puramoca021

Nov 4, 2014, 8:44:58 AM

to tesser...@googlegroups.com, tesser...@googlegroups.com, thera...@gmail.com, zde...@gmail.com

Hi ShreeDevi,

I opened issue 1373 and attached the Serbian Cyrillic trained data there. It is less than 4 MB in size, comparable to the trained data for other languages/alphabets.

Regards,

Zoltan

On Tuesday, November 4, 2014 at 04:55:09 UTC+1, shree wrote:
