Adding new language to Tesseract?

Puramoca021

Nov 1, 2014, 4:12:04 PM

to tesser...@googlegroups.com

Hi,

I have trained the unreleased Tesseract 3.04 (available only in the Subversion repository) to recognize Serbian Cyrillic. I strictly followed the instructions for training Tesseract 3: I used the tesstrain.sh script and provided the required files.

My question is: what is the procedure for submitting new trained data so that it is available in the upcoming version of Tesseract?

Best regards,

Zoltan

Vladimir Radnovic

Nov 2, 2014, 10:45:32 AM

to tesser...@googlegroups.com

Hi Zoltan,

What do you need the new traineddata for? There are several ways to do the training, so if you need help, just let me know.

Regards,

Vladimir

Puramoca021

Nov 3, 2014, 1:32:24 PM

to tesser...@googlegroups.com

On Sunday, November 2, 2014 4:45:32 PM UTC+1, Vladimir Radnovic wrote:

Hi Zoltan,
What do you need the new traineddata for? There are several ways to do the training, so if you need help, just let me know.
Regards,
Vladimir

Hi Vladimir,

I am afraid you did not understand me ... I think I was not clear enough:

- I do not need new traineddata. I made new traineddata for Serbian Cyrillic myself, and I would like to offer it to all Tesseract users who need to OCR text printed in Serbian Cyrillic.

My question is: How do I send this file (srp.traineddata) to you, Tesseract developers and maintainers?

By zipping it and sending via email?

By uploading to a file sharing service? If so, which one?

By making a torrent out of it?

Please advise.

Regards,

Zoltan

ShreeDevi Kumar

Nov 3, 2014, 1:45:38 PM

to tesser...@googlegroups.com

There already is language data for srp - please see

https://code.google.com/p/tesseract-ocr/source/browse/srp/?repo=langdata

and

https://code.google.com/p/tesseract-ocr/source/browse/srp.traineddata?repo=tessdata

Ray Smith, the lead developer of Tesseract at Google, is planning to release updated versions of the traineddata files soon as part of the 3.04 release.

If your traineddata has something additional that is not in the existing set, then please add it as an attachment to an issue so that it can be tested.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0362254d-260d-49fa-af8b-c098b50811f0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Puramoca021

Nov 3, 2014, 2:38:11 PM

to tesser...@googlegroups.com

Hi Devi,

Unfortunately, you are slightly misinformed as well.

The trained data file for the Serbian language that is currently in Tesseract's repository contains LATIN characters.

What I made is a set of trained data that recognizes Serbian Cyrillic characters.

A good summary and explanation of what Serbian Cyrillic is can be found here (Wikipedia article). Please pay attention to the section "Modern alphabet" in the Wikipedia article.

What the current version of Tesseract's srp.traineddata can recognize are the letters in the column labelled "Latin" (see the Wikipedia article).

I would like to submit a trained data file that will make Tesseract recognize the letters in the column "Cyrillic" (again, see the Wikipedia article).
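
To make the Latin/Cyrillic distinction concrete, here is a minimal sketch, assuming Python 3 and its standard unicodedata module. The helper name script_counts is hypothetical and is not part of Tesseract; it only shows that the same Serbian word is made of entirely different Unicode characters in each script, which is why data trained on one script cannot recognize the other.

```python
import unicodedata


def script_counts(text):
    """Count letters by script, judged by the Unicode character name prefix."""
    counts = {"CYRILLIC": 0, "LATIN": 0, "OTHER": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["CYRILLIC"] += 1
        elif name.startswith("LATIN"):
            counts["LATIN"] += 1
        else:
            counts["OTHER"] += 1
    return counts


# The same word ("Serbia") written in each of the two Serbian scripts.
print(script_counts("Србија"))
print(script_counts("Srbija"))
```

Running a check like this over OCR output or ground-truth text is one way to confirm which script a given traineddata file actually produces.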

Again, I did not get a clear answer to my question: how do I submit this file to Tesseract's repository?

Shall I assume that I need to open an issue and submit trained data there? Please clarify.

Regards,

Zoltan

ShreeDevi Kumar

Nov 3, 2014, 9:05:46 PM

to tesser...@googlegroups.com, tesser...@googlegroups.com, Ray Smith

Thanks for clarifying and giving more details.

I am cc:ing this email to the Tesseract developers group and Ray for an answer to your question, "how to submit this file to Tesseract's repository?"

Meanwhile, I suggest that you add an 'issue' and attach the traineddata.

Thanks!

ShreeDevi

Puramoca021

Nov 4, 2014, 2:43:46 AM

to tesser...@googlegroups.com, tesser...@googlegroups.com, thera...@gmail.com

Hi ShreeDevi,

Many thanks for providing support and a clear answer!

As recommended, I opened issue 1373. Let's see what happens.

Regards,

Zoltan

iram akbar

Nov 10, 2014, 5:14:45 AM

to tesser...@googlegroups.com, tesser...@googlegroups.com, thera...@gmail.com

Hi.

@Puramoca021, can you please share what tools you are using to create Tesseract training data? I am training data for the Arabic language, as Tesseract did in tessdata. I am using jtessbox builder for TIFF generation and Serak for training, but I am running into some issues, especially with Serak.

Question: what tools did you use to train the data?
