Tesseract returns empty result with custom language but not English

Brennan Nunamaker

Jul 6, 2015, 8:45:10 AM

to tesser...@googlegroups.com

Hello,

I just generated a traineddata file for an old, historical variety of Latin, but when I run Tesseract on the .tif that I used for training (as well as on other sample images), it returns an empty result. However, when I use the English language data for recognition, it produces text with a few errors, caused by specific characters it does not recognize. This suggests that the fault lies with my traineddata and not with the samples I am running it on.

Why could this be? I struggled to generate the traineddata at all, and ended up using a fairly short training text (see attachment). Do I need to use a longer training text/tif?

If anyone could point me in the right direction I would be extremely grateful.

Thanks in advance!
-Brennan

nlg.PalemonasMUFIRegular.exp0.tif
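
The comparison described above (empty output with the custom language, mostly correct output with English) can be reproduced in a small script. The following is a minimal sketch, not a definitive diagnostic: it assumes the `tesseract` binary is on the PATH, that the custom language code is `nlg` (inferred from the attachment's filename), and that the installed version accepts `stdout` as the output base. The function names are hypothetical.

```python
import subprocess


def ocr_command(image_path, lang):
    """Build a tesseract CLI call that prints recognized text to stdout."""
    # Assumption: this tesseract version accepts "stdout" as the output base.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang):
    """Run tesseract once and return (text, stderr) so warnings are kept."""
    proc = subprocess.run(
        ocr_command(image_path, lang), capture_output=True, text=True
    )
    return proc.stdout.strip(), proc.stderr.strip()


def compare_languages(image_path, langs, runner=run_ocr):
    """Return {lang: (character_count, stderr)} for each language code."""
    report = {}
    for lang in langs:
        text, err = runner(image_path, lang)
        report[lang] = (len(text), err)
    return report
```

Calling `compare_languages("nlg.PalemonasMUFIRegular.exp0.tif", ["nlg", "eng"])` would give a character count of 0 for `nlg` if the custom traineddata yields nothing. Any messages Tesseract wrote to stderr may hint at which part of the traineddata failed to load.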

ShreeDevi Kumar

Jul 6, 2015, 9:03:20 AM

to tesser...@googlegroups.com

Did you try with the Latin traineddata?

https://github.com/tesseract-ocr/tessdata/blob/master/lat.traineddata?raw=true

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/29355c0a-deeb-4f65-a176-9abae60bcb9c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brennan Nunamaker

Jul 6, 2015, 9:07:36 AM

to tesser...@googlegroups.com

I need to use my own trained data, because in the future we will be using it on text that has no trained data, so we will have to generate it ourselves. If I don't understand what I am doing wrong, I won't be able to...

Thank you anyway

Brennan Nunamaker

Jul 6, 2015, 9:09:06 AM

to tesser...@googlegroups.com

To clarify: by "text", I meant languages.

ShreeDevi Kumar

Jul 6, 2015, 9:12:19 AM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/langdata/tree/master/lat

which has the language data used for Latin. You can use this as the basis for creating your own traineddata file for an old historical version of Latin.

ShreeDevi

ShreeDevi Kumar

Jul 6, 2015, 9:17:50 AM

to tesser...@googlegroups.com

You may also find it helpful to read Training Tesseract for Ancient Greek OCR by Nick White - http://ancientgreekocr.org/e29-a01.pdf

ShreeDevi

Brennan Nunamaker

Jul 6, 2015, 10:00:45 AM

to tesser...@googlegroups.com

This is very helpful, thank you!

ShreeDevi Kumar

Jul 6, 2015, 11:19:25 AM

to tesser...@googlegroups.com

You may also want to look at the latest code and the tesstrain.sh script, which reflect the newer developments in training:

https://github.com/tesseract-ocr/tesseract/tree/master/training

Also see the release history at http://ancientgreekocr.org/

The article is older, and Nick has since updated the software to account for the changes in Tesseract.
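
For orientation, a tesstrain.sh invocation might be assembled as in the sketch below. It is only an illustration under stated assumptions: the flag names (`--lang`, `--langdata_dir`, `--fonts_dir`, `--output_dir`) are those documented in the usage header of the script around this time and may differ in other versions, so they should be checked against the copy in your checkout. The helper name and all paths are hypothetical.

```python
def tesstrain_command(lang, langdata_dir, fonts_dir, output_dir,
                      script="training/tesstrain.sh"):
    """Build an argument list for tesstrain.sh (flag names are assumptions;
    verify them against the usage comment at the top of the script)."""
    return [
        script,
        "--lang", lang,                  # language code, e.g. "lat"
        "--langdata_dir", langdata_dir,  # checkout of the langdata repository
        "--fonts_dir", fonts_dir,        # directory containing the font files
        "--output_dir", output_dir,      # where the traineddata is written
    ]


# Hypothetical paths, shown only to illustrate the shape of the call.
cmd = tesstrain_command("lat", "../langdata", "/usr/share/fonts", "./out")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` from the root of a Tesseract source checkout.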

ShreeDevi

Tom Morris

Jul 6, 2015, 11:44:18 AM

to tesser...@googlegroups.com

Be sure to check https://github.com/tesseract-ocr/langdata before assuming that the language that you need isn't supported. Dozens of new languages were added a couple of weeks ago.

Tom
