How to train Tesseract with a new script?

Moni

Apr 4, 2019, 5:31:16 PM
to tesser...@googlegroups.com
Hi all
I am planning to train Tesseract on ancient scripts for language translation. Is there any alternative to Amazon Mechanical Turk for training the characters in stroke format, or do I have to train manually?

Thanks for taking time off your busy schedule... 

Soumik Ranjan Dasgupta

Apr 5, 2019, 10:31:42 AM
to tesser...@googlegroups.com
If you have a font for the script's alphabet, then yes, I think it is possible.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAGMnXaKf4D22zsN2S7yyPv%3DijgCBwhaqG3k3LofW_jAn9O06og%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Moni

Apr 8, 2019, 3:43:01 AM
to tesser...@googlegroups.com
Thanks for your valuable response.
Since the script doesn't have trained data, I am trying to generate it. To create the trained data, should I use TensorFlow or Tesseract for training?

Thanks for taking time off your busy schedule...

Shree Devi Kumar

Apr 8, 2019, 4:59:54 AM
to tesser...@googlegroups.com
Tesseract 4 LSTM training is done using Tesseract, not TensorFlow.

It is easiest to train using synthetic training data generated from training text and fonts. For ancient scripts, the model may need to be fine-tuned further using real-life images.
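As a rough illustration of generating synthetic data from training text and fonts, here is a minimal Python sketch that builds `text2image` command lines, one per font. The language code `bra`, the font name, the file names, and the helper functions are assumptions made for the example, not something taken from this thread. Each run of `text2image` is expected to write an image and a matching box file under the given output base. Commands are only executed when `run=True` is passed.

```python
import subprocess
from pathlib import Path


def text2image_command(text_file, font_name, fonts_dir, output_base):
    """Build the argument list for one text2image run (one font, one text file)."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
        f"--outputbase={output_base}",
    ]


def render_fonts(text_file, font_names, fonts_dir, out_dir, lang="bra", run=False):
    """Build (and optionally run) one text2image command per font.

    `lang` is a hypothetical language code chosen for this example.
    Returns the list of commands that were built.
    """
    commands = []
    for font in font_names:
        # Output base follows the lang.fontname.exp0 naming pattern
        base = Path(out_dir) / f"{lang}.{font.replace(' ', '_')}.exp0"
        cmd = text2image_command(text_file, font, fonts_dir, base)
        commands.append(cmd)
        if run:
            # Requires the Tesseract training tools to be installed
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Example font and paths are placeholders; substitute your own
    for cmd in render_fonts(
        "bra.training_text", ["Noto Sans Brahmi"], "/usr/share/fonts", "out"
    ):
        print(" ".join(cmd))
```

The resulting image and box pairs would then feed into the LSTM training steps described in the Tesseract training documentation; real inscription images can be added later for fine-tuning.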

I have tried training for Brahmi, Akkadian Cuneiform and Coptic with synthetic data.

See 

 



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Moni

Apr 8, 2019, 5:28:16 PM
to tesser...@googlegroups.com
Thanks a lot for your response.
I went through your page, but for the Brahmi script it displays an error when showing the raw data. Kindly help me with this.

Thank you for your consideration

Moni

Apr 9, 2019, 5:43:44 AM
to tesser...@googlegroups.com
Hi, good morning. I am currently a PhD scholar researching ancient Tamil inscriptions. I saw your trained data for the Brahmi script and am working with it, but I am getting the error "Failed to load the language". If possible, kindly share your language data.

Thanks for your cooperation..

suraa syss

Apr 19, 2019, 1:09:17 PM
to tesseract-ocr
Because that script is not properly trained.



shree

Jun 30, 2019, 9:23:52 AM
to tesseract-ocr
Please see https://github.com/Shreeshrii/tessdata_brahmi

I have uploaded the langdata as well as the traineddata, for both legacy Tesseract and neural net (LSTM) Tesseract. There are no wordlists/dawgs in this.


has two synthetic images and their OCR text for comparison.
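For anyone hitting a language-loading error with a downloaded traineddata file, one thing worth checking is whether the file sits in the tessdata directory Tesseract is actually using. The Python sketch below checks for `<lang>.traineddata` and builds a `tesseract` command that points at that directory explicitly. The language code, paths, and helper names are hypothetical, and the right `--oem` value (0 for the legacy engine, 1 for LSTM) depends on which kind of traineddata file was downloaded.

```python
from pathlib import Path


def find_traineddata(lang, tessdata_dir):
    """Return the path to <lang>.traineddata if it exists in tessdata_dir, else None."""
    path = Path(tessdata_dir) / f"{lang}.traineddata"
    return path if path.is_file() else None


def tesseract_command(image, output_base, lang, tessdata_dir, oem=1):
    """Build a tesseract command line that names the tessdata directory explicitly.

    oem=0 selects the legacy engine, oem=1 the LSTM engine; the traineddata
    file must contain a model for the engine that is chosen.
    """
    return [
        "tesseract", str(image), str(output_base),
        "--tessdata-dir", str(tessdata_dir),
        "-l", lang,
        "--oem", str(oem),
    ]


if __name__ == "__main__":
    # "bra" and "./tessdata" are placeholders for this example
    found = find_traineddata("bra", "./tessdata")
    if found is None:
        print("bra.traineddata not found in ./tessdata")
    else:
        print(" ".join(tesseract_command("page.png", "page", "bra", "./tessdata")))
```

Running `tesseract --list-langs --tessdata-dir <dir>` is another quick way to see which languages Tesseract can find in a given directory.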

