Steps to train with plenty of source files

Shawn Chen

Aug 31, 2017, 6:54:09 AM

to tesseract-ocr

Hi All,

I am new to Tesseract and want to use it to recognize a large number of image files.

Having followed the training instructions, I know how to train on a single file and generate the traineddata.

But for multiple files I am not clear on how to automate the process based on the generated traineddata.

It seems that I have to modify the box file manually to correct the wrongly recognized characters.

Is there any way to automate this process?

Thanks.

ShreeDevi Kumar

Aug 31, 2017, 9:01:31 AM

to tesser...@googlegroups.com

There are traineddata available for most languages, in different versions -

for 3.04/3.05

for 4.00.00alpha initial version from 2016

best traineddata for 4.00.00alpha released last month

best traineddata for the script for 4.00.00alpha released last month

Please see https://github.com/tesseract-ocr/tessdata

You should try training only if these do not work for you.
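
For trying the existing traineddata on many images at once, a small wrapper around the command line is usually enough. The following is only a minimal sketch, assuming the `tesseract` executable is on your PATH and that the traineddata for the chosen language (here `eng`) is already in your tessdata directory. The function names `build_commands` and `run_batch`, and the list of image suffixes, are made up for illustration and are not part of Tesseract.

```python
import subprocess
from pathlib import Path

# Assumed set of image types to pick up; adjust to your own files.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_commands(image_dir, out_dir, lang="eng"):
    """Return one tesseract command (as an argv list) per image in image_dir."""
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    commands = []
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # tesseract adds the ".txt" extension to the output base itself.
        output_base = out_dir / image.stem
        commands.append(["tesseract", str(image), str(output_base), "-l", lang])
    return commands


def run_batch(image_dir, out_dir, lang="eng"):
    """Run tesseract on every image in image_dir, writing text into out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(image_dir, out_dir, lang):
        # check=True stops the batch on the first image tesseract fails on.
        subprocess.run(cmd, check=True)
```

With this, `run_batch("scans", "ocr_output", lang="eng")` would write one text file per image (the directory names are placeholders). If the results are good enough, no box file editing or training is needed.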

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/57fa0441-4410-4228-8808-73fcd743d6fe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shawn Chen

Aug 31, 2017, 11:34:31 PM

to tesseract-ocr

Thanks Shree.

I will try these.