Are traineddata files platform dependent?

Gary

Jan 8, 2022, 1:11:43 AM

to tesseract-ocr

I'm attempting to train a new model with tesstrain on an embedded device (armv7-k2.6) on which tesseract was cross-compiled using Debian for OpenWRT/Entware.

Are traineddata files platform dependent? If so, is it possible to take a similar route as was taken to cross-compile the language traineddata files?

Respectfully,

Gary

Zdenko Podobny

Jan 9, 2022, 1:54:42 PM

to tesser...@googlegroups.com

I would expect that, before posting to the forum, you did your part: testing.

So what is the output of your tests? What error/problem did you encounter?

Zdenko

On Sat, Jan 8, 2022 at 7:11, 'Gary' via tesseract-ocr <tesser...@googlegroups.com> wrote:

Gary

Jan 9, 2022, 3:18:08 PM

to tesseract-ocr

Zdenko,

I thought this might be a common-knowledge question, so I asked in the hope of a quick answer before pursuing a specific route.

I'm in the process of creating a cross-compiler build environment to test for myself.

Thank you for your response.

Respectfully,

Gary

Jan 11, 2022, 8:14:22 PM

to tesseract-ocr

I can confirm that traineddata files are NOT platform dependent. However, you need to ensure that tesseract is using the Neural nets LSTM engine.

I verified this by training with tesstrain on a Debian Live DVD (without a cross-compiler environment). I then copied the resulting traineddata file to a tesseract installation on an embedded platform that uses Entware. Originally, I had configured tesseract to use the Legacy engine, as it initially produced the best results. However, I discovered that my custom traineddata file requires the Neural nets LSTM engine. Additionally, because I trained my traineddata file from tessdata_best and my ground truth, I found it more efficient to use my traineddata file alone, without additional models.
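For readers who want to try the same setup, here is a minimal sketch of how the command line might be assembled from Python. It assumes the `tesseract` binary is on the PATH; the model path `/opt/share/tessdata/custom.traineddata`, the image name `page.png`, and the helper names are placeholders, not taken from the setup described above. `--oem 1` selects the LSTM engine, `--tessdata-dir` points at the directory holding the model, and `-l` names the model without its `.traineddata` suffix.

```python
import subprocess
from pathlib import PurePosixPath


def build_tesseract_cmd(image, model_path, oem=1):
    """Build a tesseract command that uses one custom traineddata file.

    oem=1 requests the LSTM engine only; oem=0 would request the Legacy
    engine, which a model that contains only LSTM data cannot serve.
    """
    model = PurePosixPath(model_path)
    return [
        "tesseract",
        str(image),
        "stdout",                             # write recognized text to stdout
        "--tessdata-dir", str(model.parent),  # directory containing the model
        "-l", model.stem,                     # model name without .traineddata
        "--oem", str(oem),
    ]


def run_ocr(image, model_path):
    """Run tesseract and return the recognized text (hypothetical helper)."""
    cmd = build_tesseract_cmd(image, model_path)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Example call (requires tesseract and the placeholder files to exist):
# text = run_ocr("page.png", "/opt/share/tessdata/custom.traineddata")
```

Building the argument list separately from running it makes it easy to print and inspect the exact command before trying it on the device.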

I have to say that my tesseract is a quick study. Using tessdata_best with my ground truth, my tesseract went from 25% accuracy to 75%, compared to the initial Legacy, English configuration.

The evidence that traineddata files are NOT platform dependent is that you can download the language traineddata files directly from GitHub and use them without modification.
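One way to confirm that a file really is used without modification is to compare its checksum on the machine where it was trained or downloaded with its checksum on the target device. The following is only a sketch, assuming Python is available on both machines; the function name `sha256_of` and the example path are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large models do not need to fit in RAM.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (placeholder path); run on both machines and compare:
# print(sha256_of("/opt/share/tessdata/custom.traineddata"))
```

If the two digests match, the same bytes are being loaded on both platforms.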

It would have been nice to have known that traineddata files are NOT platform dependent before wasting time trying to configure the cross-compiler for Entware.

I hope this post clarifies this for future tesseract operators who might have the same question.