New language traineddata based on the existing one.


Iskander Sharipov

Jul 4, 2014, 5:15:52 AM
to tesser...@googlegroups.com
I need to create a new tessdata language whose character set is very similar to Russian.
Every time I try this by training Tesseract on a box file containing the needed letters, I get a new traineddata file
that does recognize the new symbols but, alas, forgets everything the original data was trained on.

I've read some manuals, but all I found was how to create a new language from scratch or how to
improve the results of an existing one by manipulating box files.

Any help is appreciated.

Nick White

Jul 4, 2014, 10:37:36 AM
to tesser...@googlegroups.com

Albrecht Hilker

Jul 4, 2014, 2:18:24 PM
to tesser...@googlegroups.com
I'm facing the same problem.
But I fear that merging traineddata files is not implemented.
You always have to train from scratch.


On this page:
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
you find the following sentence:
>> ... but note that there is no incremental training mode that allows you to add new training data to existing sets.


The problem is probably that every character in a traineddata file has a numeric ID (1, 2, 3, 4, ...), and that every FontInfo and every feature has an ID as well.
To merge two traineddata files you would have to renumber all these IDs and detect which characters are defined in both files.
I suppose that this is very complicated.
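
To make the renumbering idea concrete, here is a minimal Python sketch. It is only a toy: the two dictionaries stand in for the character tables of two traineddata files, the function name `merge_id_tables` is made up for illustration, and nothing here reads or writes the real traineddata format. It only shows the bookkeeping: keep the IDs of the first table, reuse them for characters defined in both, and hand out fresh IDs for the rest.

```python
def merge_id_tables(base, extra):
    """Merge two {id: character} tables, renumbering the second one.

    Hypothetical helper, not part of Tesseract. Returns (merged, remap),
    where remap maps each old ID in `extra` to its ID in the merged table.
    """
    merged = dict(base)
    # Reverse lookup so characters defined in both tables are detected.
    char_to_id = {ch: char_id for char_id, ch in base.items()}
    next_id = max(merged, default=0) + 1
    remap = {}
    for old_id in sorted(extra):
        ch = extra[old_id]
        if ch in char_to_id:
            # Defined twice: reuse the ID from the base table.
            remap[old_id] = char_to_id[ch]
        else:
            # New character: append it with a fresh ID.
            merged[next_id] = ch
            char_to_id[ch] = next_id
            remap[old_id] = next_id
            next_id += 1
    return merged, remap


# Toy data (assumed, not taken from any real traineddata file).
russian = {1: "а", 2: "б", 3: "в"}
new_lang = {1: "а", 2: "ә", 3: "в", 4: "ғ"}

merged, remap = merge_id_tables(russian, new_lang)
print(merged)
print(remap)
```

In a real merge the same `remap` would then have to be applied to everything that refers to a character ID, such as the FontInfo entries and the features, which is where I expect most of the difficulty to be.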

But it would be a great feature to have a merge function!
Maybe you could add it?
