Regarding Tesseract OCR Training Data supporting Vietnamese

Tuan Nguyen Huy

Dec 8, 2014, 4:34:13 AM

to tesser...@googlegroups.com

Dear all,

I'm a freelance software developer from Vietnam.

I am currently working on improving the Tesseract OCR training data for the Vietnamese language.

I am having some trouble training new data for Vietnamese and have the following questions:

1. Could someone share the process and the tools that Google used to make the .tif/.box files, along with guidelines on how to use those tools, if possible?

2. Did Google add Vietnamese fonts to the current training data for Vietnamese?

If yes, could someone let me know how to check which fonts were used?

3. Could someone share some of the .tif/.box files that Google made and included in the current training data for Vietnamese?

I would like to know what the standards for those .tif/.box files are (font size, image resolution, etc.)
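
One way to compare home-made box files against reference ones is to measure the glyph boxes directly. The following is a minimal sketch, assuming the Tesseract 3.x box format in which each line holds a symbol followed by left, bottom, right, top and page number (pixel coordinates, origin at the bottom-left of the image). The function names and the sample lines are hypothetical and not taken from any real training file.

```python
from statistics import median

def parse_box_lines(lines):
    """Parse box lines into (symbol, left, bottom, right, top, page) tuples."""
    boxes = []
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 6:
            continue  # skip blank or truncated lines
        # Take the last five fields as numbers and join whatever precedes
        # them back together as the symbol.
        left, bottom, right, top, page = (int(p) for p in parts[-5:])
        symbol = " ".join(parts[:-5])
        boxes.append((symbol, left, bottom, right, top, page))
    return boxes

def median_glyph_height(boxes):
    """Median box height in pixels, a rough proxy for the rendered font size."""
    return median(top - bottom for _, _, bottom, _, top, _ in boxes)

# Hypothetical sample lines, for illustration only.
sample = ["V 10 20 30 52 0", "i 32 20 38 50 0", "ệ 40 20 58 60 0", "t 60 20 70 48 0"]
print(median_glyph_height(parse_box_lines(sample)))
```

Running the same measurement over two sets of box files gives a quick indication of whether they were rendered at comparable sizes.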

Thank you very much for taking the time to answer my questions.

Best regards.

ShreeDevi Kumar

Dec 8, 2014, 7:36:48 AM

to tesser...@googlegroups.com

Please see http://vietocr.sourceforge.net/usage.html

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/17fd7bce-0b24-4793-972c-a149229a899b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tuan Nguyen Huy

Dec 9, 2014, 2:01:40 AM

to tesser...@googlegroups.com

Dear ShreeDevi,

I already checked that out and followed the process that its author used to make trained data.

However, with the tif/box files that I made using his process, the resulting trained data is worse than the one made by Google.

Therefore I would like to know how Google made their tif/box files so that I can follow their way.

If you know how this is done, could you kindly let me know?

Thank you very much.

Tuan.

ShreeDevi Kumar

Dec 9, 2014, 8:22:21 AM

to tesser...@googlegroups.com

https://code.google.com/p/tesseract-ocr/source/browse/vie?repo=langdata

ShreeDevi

Tuan Nguyen Huy

Dec 11, 2014, 1:15:23 AM

to tesser...@googlegroups.com

Hi,

Could you tell me what these two files are used for: alphabet and vie.training-text?

One more question: when I inspected Google's vie.traineddata, I found that it includes around 500 fonts, yet the file is only ~5MB. I tried training with around 50 fonts, but my trained data is ~7MB. I don't understand why mine is bigger than Google's when I used the same word list and rules as Google; only the tif/box files are different.

Do my tif/box files account for that difference in the trained data's size?
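
One way to see where the extra size comes from is to unpack both traineddata files into their components (for example with Tesseract's `combine_tessdata -u`) and compare the component sizes. The following is a minimal sketch, assuming the components have already been unpacked into a directory and share a prefix such as `vie.`; the function name and the `unpacked` directory are hypothetical.

```python
import os

def component_sizes(directory, prefix):
    """Return (filename, size_in_bytes) for unpacked components, largest first."""
    sizes = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.startswith(prefix) and os.path.isfile(path):
            sizes.append((name, os.path.getsize(path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # "unpacked" is a hypothetical directory holding the unpacked components.
    if os.path.isdir("unpacked"):
        for name, size in component_sizes("unpacked", "vie."):
            print(f"{size:>10}  {name}")
```

Listing both sets of components side by side shows which component accounts for most of the difference.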

Thank you very much.

ShreeDevi Kumar

Dec 11, 2014, 8:19:04 AM

to tesser...@googlegroups.com

Hi,

I don't know the process that Google uses, so I can't answer questions related to that.

training-text is what is used for creating the box/tiff files in the current version of Tesseract (git source).

Please see https://code.google.com/p/tesseract-ocr/source/browse/training/tesstrain.sh for a shell script that provides an easy way to run the various phases of training.
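
As a rough illustration of driving that script, the sketch below only assembles a command line and prints it; it does not run anything. The flag names (`--lang`, `--langdata_dir`, `--fontlist`, `--output_dir`) and the plus-separated font list are assumptions based on the script's usage header and may differ between versions, so check them against your copy of tesstrain.sh. The paths and font names are hypothetical.

```python
import shlex

def build_tesstrain_command(lang, langdata_dir, fonts, output_dir,
                            script="training/tesstrain.sh"):
    """Assemble a tesstrain.sh argument list without executing it."""
    return [
        script,
        "--lang", lang,
        "--langdata_dir", langdata_dir,
        # Assumption: fonts are passed as one plus-separated string.
        "--fontlist", "+".join(fonts),
        "--output_dir", output_dir,
    ]

# Hypothetical paths and font names, for illustration only.
cmd = build_tesstrain_command(
    lang="vie",
    langdata_dir="langdata",
    fonts=["Arial", "Times New Roman"],
    output_dir="out/vie",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the command first makes it easy to review before running it by hand or passing the list to `subprocess.run`.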

ShreeDevi
