Tesseract returns empty result with custom language but not English


Brennan Nunamaker

Jul 6, 2015, 8:45:10 AM
to tesser...@googlegroups.com
Hello,

I just generated a traineddata file for an old historical form of Latin, but when I run Tesseract on the .tif that I used for training (as well as on other sample images), it returns an empty result. When I use the English language data instead, it produces text with a few errors caused by specific characters it does not recognize. So the fault seems to lie with my traineddata and not with the samples I am running it on.

Why could this be? I have been struggling to even generate the traineddata, and ended up using a fairly short training text (see attachment). Do I need to use a longer training text/tif?

If anyone could point me in the right direction I would be extremely grateful.

Thanks in advance!
-Brennan
nlg.PalemonasMUFIRegular.exp0.tif
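
One way to narrow down an empty result like the one described above is to check that every training step actually produced its output file before the files were combined. The sketch below is only a rough aid under stated assumptions: it assumes the classic 3.x training pipeline, the list of component suffixes is taken from that pipeline's documentation rather than from this thread, and the helper names `missing_components` and `tesseract_command` are hypothetical. The language code `nlg` is taken from the attachment name.

```python
import os
import tempfile

# Component files that the classic (3.x) training steps are expected to
# produce before combine_tessdata packs them into <lang>.traineddata.
# ASSUMPTION: this list follows the 3.x training documentation; it is not
# stated anywhere in this thread.
REQUIRED_SUFFIXES = ("unicharset", "inttemp", "pffmtable", "normproto")


def missing_components(directory, lang):
    """Return the required component suffixes that are absent or empty."""
    missing = []
    for suffix in REQUIRED_SUFFIXES:
        path = os.path.join(directory, f"{lang}.{suffix}")
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(suffix)
    return missing


def tesseract_command(image, output_base, lang):
    """Build the argument list for a plain recognition run with one language."""
    return ["tesseract", image, output_base, "-l", lang]
```

Calling `missing_components("path/to/training/dir", "nlg")` returns a list of suffixes to investigate; an empty list only means the files exist and are non-empty, not that the training data is good. The list returned by `tesseract_command` can be passed to `subprocess.run` to repeat the run with the custom language and with `eng` for comparison.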

ShreeDevi Kumar

Jul 6, 2015, 9:03:20 AM
to tesser...@googlegroups.com

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Brennan Nunamaker

Jul 6, 2015, 9:07:36 AM
to tesser...@googlegroups.com
I need to use my own trained data, because in the future we will be using it on text that has no trained data, so we will have to generate it ourselves. If I don't understand what I am doing wrong, I won't be able to...

Thank you anyway

Brennan Nunamaker

Jul 6, 2015, 9:09:06 AM
to tesser...@googlegroups.com
To clarify: by "text" I meant languages.

ShreeDevi Kumar

Jul 6, 2015, 9:12:19 AM
to tesser...@googlegroups.com

which has the language data used for Latin. You can use this as the basis to create your own traineddata file for an old historical version of Latin.

ShreeDevi

ShreeDevi Kumar

Jul 6, 2015, 9:17:50 AM
to tesser...@googlegroups.com
You may also find it helpful to read "Training Tesseract for Ancient Greek OCR" by Nick White: http://ancientgreekocr.org/e29-a01.pdf

ShreeDevi

Brennan Nunamaker

Jul 6, 2015, 10:00:45 AM
to tesser...@googlegroups.com
This is very helpful, thank you!

ShreeDevi Kumar

Jul 6, 2015, 11:19:25 AM
to tesser...@googlegroups.com
You may also want to see the latest code and the tesstrain.sh script for the newer developments in training at

Also see the release history on http://ancientgreekocr.org/, since Nick has updated the software to keep up with the changes in Tesseract, and the article is older.


ShreeDevi

Tom Morris

Jul 6, 2015, 11:44:18 AM
to tesser...@googlegroups.com
Be sure to check https://github.com/tesseract-ocr/langdata before assuming that the language that you need isn't supported.  Dozens of new languages were added a couple of weeks ago.

Tom