Adding a new language to Tesseract?


Puramoca021

Nov 1, 2014, 4:12:04 PM
to tesser...@googlegroups.com
Hi,

I have trained the unreleased Tesseract 3.04 (available only in the Subversion repository) to recognize Serbian Cyrillic. I strictly followed the instructions for training Tesseract 3: I used the tesstrain.sh script and provided the required files.
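For readers who have not used the script, here is a rough Python sketch of how such a tesstrain.sh call might be assembled. It only builds and prints the command line; it does not run anything. The helper name `build_tesstrain_command`, the script location, and both directory paths are hypothetical placeholders, and the three flags shown are assumed to match the tesstrain.sh in your checkout, so check the script's own usage text before relying on them.

```python
import shlex


def build_tesstrain_command(lang, langdata_dir, tessdata_dir,
                            script="training/tesstrain.sh"):
    """Return the argument list for one tesstrain.sh run (nothing is executed).

    The script path and the directories are placeholders, not values taken
    from this thread.
    """
    return [
        script,
        "--lang", lang,                  # language code of the data to train
        "--langdata_dir", langdata_dir,  # directory holding the langdata files
        "--tessdata_dir", tessdata_dir,  # directory holding existing tessdata
    ]


# Hypothetical paths; adjust them to your own checkout.
cmd = build_tesstrain_command("srp", "../langdata", "./tessdata")

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later without worrying about shell quoting.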

My question is: what is the procedure for submitting new trained data so that it is available in the upcoming version of Tesseract?


Best regards,
Zoltan

Vladimir Radnovic

Nov 2, 2014, 10:45:32 AM
to tesser...@googlegroups.com
Hi Zoltan,

What do you need new trained data for? There are several ways to do the training, so if you need help, get in touch.

Regards,
Vladimir

Puramoca021

Nov 3, 2014, 1:32:24 PM
to tesser...@googlegroups.com

On Sunday, November 2, 2014 4:45:32 PM UTC+1, Vladimir Radnovic wrote:
Hi Zoltan,

What do you need new trained data for? There are several ways to do the training, so if you need help, get in touch.

Regards,
Vladimir


Hi Vladimir,

I am afraid you misunderstood me; I think I was not clear enough.

I do not need new trained data. I created trained data for Serbian Cyrillic myself, and I would like to offer it to all Tesseract users who need to OCR text printed in Serbian Cyrillic.

My question is: how do I send this file (srp.traineddata) to you, the Tesseract developers and maintainers?

By zipping it and sending via email?
By uploading to a file sharing service? If so, which one?
By making a torrent out of it?

Please advise.

Regards,
Zoltan

ShreeDevi Kumar

Nov 3, 2014, 1:45:38 PM
to tesser...@googlegroups.com
There already is language data for srp - please see 


and


Ray Smith, the lead developer of Tesseract at Google, is planning to release updated versions of the traineddata soon as part of the 3.04 release.

If your traineddata covers something that is not in the existing set, please attach it to an issue so that it can be tested.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Puramoca021

Nov 3, 2014, 2:38:11 PM
to tesser...@googlegroups.com
Hi Devi,

Unfortunately, you are slightly misinformed as well.

The trained data file for the Serbian language that is currently in Tesseract's repository contains LATIN characters.
What I made is a set of trained data that recognizes Serbian Cyrillic characters.

A good summary and explanation of what Serbian Cyrillic is can be found here (Wikipedia article). Please pay attention to the section "Modern alphabet" in that article.
The current version of Tesseract's srp.traineddata recognizes the letters in the column labelled "Latin" (see the Wikipedia article).
I would like to submit a file with trained data that will make Tesseract recognize the letters in the column labelled "Cyrillic" (again, see the Wikipedia article).
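To make the distinction concrete, here is a small Python sketch that reports which script dominates a piece of text, for example the output of an OCR run. It is not part of Tesseract. The function names `script_counts` and `dominant_script` are made up for this illustration, and it relies only on the standard `unicodedata` module, treating the first word of each character's Unicode name as its script.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, keyed by the first word of the Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Names look like "CYRILLIC SMALL LETTER LJE" or "LATIN SMALL LETTER L".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script among the letters, or None if there are none."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


# The same Serbian words written in each of the two alphabets.
print(dominant_script("Љубав и њива"))    # letters from the "Cyrillic" column
print(dominant_script("Ljubav i njiva"))  # letters from the "Latin" column
```

A check like this, run over OCR output, quickly shows whether a given traineddata file produces Latin or Cyrillic letters.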

Again, I did not get a clear answer to my question: how do I submit this file to Tesseract's repository?

Shall I assume that I need to open an issue and submit the trained data there? Please clarify.


Regards,
Zoltan

ShreeDevi Kumar

Nov 3, 2014, 9:05:46 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com, Ray Smith
Thanks for clarifying and giving more details.

I am cc'ing this email to the Tesseract developers group and to Ray for an answer to your question, "how to submit this file to Tesseract's repository?"

Meanwhile, I suggest that you open an issue and attach the traineddata.

Thanks!

ShreeDevi

Puramoca021

Nov 4, 2014, 2:43:46 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com, thera...@gmail.com
Hi ShreeDevi,

Many thanks for your support and the clear answer!

As recommended, I opened issue 1373. Let's see what happens.

Regards,
Zoltan

iram akbar

Nov 10, 2014, 5:14:45 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com, thera...@gmail.com
Hi.

@Puramoca021, can you please share which tools you are using to create Tesseract training data? I am training data for the Arabic language, as Tesseract did in tessdata. I am using jtessbox builder for TIFF generation and Serak for training, but I am running into some issues, with Serak especially.
Question: what tools did you use to train the data?