Need help getting trained first, so that I can train Tesseract (Eng/Rus/Hindi)

Alexander Gribanov

Feb 19, 2019, 11:50:29 PM

to tesseract-ocr

Hello!

I just found Tesseract, and it seems to be a great and powerful instrument,

but as we say in Russia, equipment in the hands of a fool is scrap metal...

So please, could somebody be kind enough to give me step-by-step advice:

1. What to do
2. What to read/watch
3. Where to go next, after taking a look at my results

My situation is that I have a lot of scanned (and many not yet scanned) books in mixed languages,

such as English, Russian, Hindi, and Bengali, sometimes with diacritic symbols, etc.

For most of them, I have no idea whether the fonts they were printed with are available.

But I'm ready to start by selecting some letters, words, etc. on the image,

and then telling the program which Unicode character each selected letter corresponds to (I'm not sure what this is properly called).

Maybe it is possible to create the missing fonts this way.

As I understand it, training a neural network is a kind of spiral process:

1. We have an image.

2. We tell the network which part of the image is a symbol and what that symbol is (its character code).

This becomes training material.

3. Based on this first small experience (let's say 1 page), the network tries to recognize the second page.

4. We verify and correct the result if needed. It becomes more training material.

And so on: steps 3-4 repeat until the whole book has been recognized.

Sometimes step 2 will be invoked again for new characters or patterns, etc.
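
The loop described in steps 1-4 can be sketched in a few lines of Python. This is only an illustration of the process as pictured in this post, not a description of how Tesseract's training tools actually work. The names `bootstrap`, `recognize` and `correct` are hypothetical, and retraining is assumed to happen inside `recognize`.

```python
def bootstrap(pages, recognize, correct):
    """Grow a set of verified pages, one page at a time.

    pages     : iterable of page images (any objects)
    recognize : hypothetical callable(page, training_set) -> guessed text
    correct   : hypothetical callable(page, guess) -> verified text (a human)
    """
    training_set = []
    for page in pages:
        # Step 3: guess the text using everything verified so far.
        # On the first page the training set is empty, which corresponds
        # to step 2 (labelling from scratch).
        guess = recognize(page, training_set)
        # Step 4: a person verifies the guess and fixes it if needed.
        truth = correct(page, guess)
        # The verified pair becomes additional training material.
        training_set.append((page, truth))
    return training_set
```

With toy stand-ins such as `recognize=lambda page, seen: ""` and `correct=lambda page, guess: page.upper()`, the function simply returns each page paired with its corrected text. In practice the two callables would be the OCR engine and a manual proofreading step.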

I think this should be enough to convey my level of knowledge on the subject and my goal.

So, if anybody would like to help, I would appreciate assistance in establishing a process

for recognizing many rare books, so that it becomes possible to search and navigate among

tons of scriptures that will otherwise be lost and buried by time...

Thank you all very much,

best regards, Alexander

Shree Devi Kumar

Feb 20, 2019, 12:32:22 AM

to tesser...@googlegroups.com

Please share a couple of scanned pages for testing.

You may be able to use the existing traineddata files for English and Russian with `-l eng+rus`, or for English and Hindi with `-l eng+hin`.

For text with diacritics you can try `-l script/Latin`.

This will give you an idea of the current state of recognition. You can plan training after that.
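
As a rough illustration of trying these combinations on one page, here is a minimal Python sketch that builds and runs the standard `tesseract IMAGE OUTPUTBASE -l LANGS` command line. The helper names, the `page_` output prefix and the file name `page1.png` are assumptions made for the example. It also assumes that `tesseract` is on the PATH and that the matching traineddata files (including `script/Latin`) are installed.

```python
import shutil
import subprocess

# Language combinations suggested in the reply above.
CANDIDATE_LANGS = ["eng+rus", "eng+hin", "script/Latin"]


def build_command(image_path, output_base, langs):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANGS."""
    return ["tesseract", image_path, output_base, "-l", langs]


def output_base_for(langs):
    """Turn a -l value into a flat output name, e.g. 'eng+rus' -> 'page_eng_rus'."""
    return "page_" + langs.replace("/", "_").replace("+", "_")


def run_all(image_path):
    """Run every candidate on one image; tesseract writes OUTPUTBASE.txt each time."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    for langs in CANDIDATE_LANGS:
        command = build_command(image_path, output_base_for(langs), langs)
        subprocess.run(command, check=True)
```

Calling `run_all("page1.png")` would leave one text file per combination, which can then be compared side by side with the scanned page.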


Shree Devi Kumar

Feb 20, 2019, 2:12:21 AM

to tesser...@googlegroups.com

Actually, for English + Hindi, use `script/Devanagari.traineddata` (that is, `-l script/Devanagari`).

For English + Bengali, try `eng+ben` or `script/Bengali`.

Please check the language code for Russian.
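
One way to check which language codes are available locally is `tesseract --list-langs`. The Python sketch below parses that listing and reports which parts of a `-l` value are missing. The helper names are hypothetical, and the parser assumes that the first line of the output is a header and that every following non-empty line is one language name.

```python
import subprocess


def parse_list_langs(text):
    """Parse the text printed by `tesseract --list-langs`.

    Assumes the first line is a header and each later non-empty line
    holds exactly one language or script name.
    """
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines[1:] if line]


def missing_languages(requested, installed):
    """Split a -l value such as 'eng+rus' and return the parts not installed."""
    return [lang for lang in requested.split("+") if lang not in installed]


def installed_languages():
    """Ask the local tesseract binary which traineddata files it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Fall back to stderr in case a build prints the listing there.
    return parse_list_langs(result.stdout or result.stderr)
```

For example, `missing_languages("eng+rus", installed_languages())` returns an empty list when both English and Russian traineddata files are installed.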

