Installing tessdata

Peter Kronenberg

Jan 21, 2021, 12:58:56 PM

to tesser...@googlegroups.com

I see that the default tessdata just has English and OSD. I see all the other data at https://github.com/tesseract-ocr/tessdata. Do I just copy those to the same tessdata directory? The repo has a much larger version of eng.traineddata than what comes with Tesseract. Can I just replace it?

And how do the ones in the script directory differ?

In the directory from the initial install, not only do I have eng.traineddata, but there is also user-patterns, user-words and other files. Do those files exist for the other languages as well?
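
A minimal sketch of the copy step being asked about, assuming that installing a language amounts to placing its .traineddata file in the tessdata directory, and that script models sit in a script/ subdirectory and are named like script/Latin. The helper names `install_traineddata` and `list_models` and all paths are hypothetical, not part of Tesseract.

```python
import shutil
from pathlib import Path


def install_traineddata(src_file, tessdata_dir):
    """Copy one .traineddata file into tessdata_dir and return the new path."""
    src = Path(src_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {src}")
    dest_dir = Path(tessdata_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 overwrites an existing file with the same name,
    # e.g. a smaller eng.traineddata from the initial install.
    return Path(shutil.copy2(src, dest_dir / src.name))


def list_models(tessdata_dir):
    """Return model names as they would be passed to -l (assumed layout)."""
    root = Path(tessdata_dir)
    names = []
    for path in root.rglob("*.traineddata"):
        # "script/Latin.traineddata" becomes "script/Latin"
        names.append(path.relative_to(root).with_suffix("").as_posix())
    return sorted(names)
```

Running `tesseract --list-langs` afterwards is a quick way to check that the new file is picked up.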

Peter Kronenberg

Jan 27, 2021, 4:50:18 PM

to tesser...@googlegroups.com

Hi, can someone help with these questions? I'm just trying to better understand how the language models are used and what the differences between them are.

Thanks

Peter

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/MN2PR20MB268642993B65C83511CFAF88E7A19%40MN2PR20MB2686.namprd20.prod.outlook.com.

Shree Devi Kumar

Jan 27, 2021, 8:41:06 PM

to tesseract-ocr

Please see

https://tesseract-ocr.github.io/tessdoc/Data-Files.html

Also see the readme files in the three repos (tessdata, tessdata_best, tessdata_fast), for example:

https://github.com/tesseract-ocr/tessdata_fast

Peter Kronenberg

Jan 28, 2021, 9:43:13 AM

to tesser...@googlegroups.com

Thanks for those links. I think what I’m looking for is a more practical understanding of some of the differences rather than the technical details, which I don’t fully understand since I’m not a domain expert.

For instance, I understand that there are 2 types of models, the LSTM OCR engine and the legacy engine. What is the practical difference between the two? In other words, if I go with the ‘best’ or ‘fast’ models, which only do LSTM OCR, what am I missing out on by not having legacy? Is there any reason I would stick with the legacy models at https://github.com/tesseract-ocr/tessdata?
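
One way to see the practical difference is to run the same page through both engines. The sketch below only builds the command lines; it assumes the tesseract CLI options `-l`, `--tessdata-dir` and `--oem` (0 for the legacy engine, 1 for LSTM). The function name `tesseract_cmd`, the file names and the directory names are hypothetical.

```python
def tesseract_cmd(image, output_base, lang="eng", oem=None, tessdata_dir=None):
    """Build an argument list for the tesseract CLI without running it."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    if oem is not None:
        if oem not in (0, 1, 2, 3):
            raise ValueError("--oem must be 0, 1, 2 or 3")
        cmd += ["--oem", str(oem)]
    return cmd


# Legacy engine: needs a model that still carries legacy data,
# assumed here to be a checkout of the tessdata repo.
legacy = tesseract_cmd("page.png", "out_legacy", oem=0, tessdata_dir="tessdata")

# LSTM engine: the only mode the 'fast' and 'best' models support.
lstm = tesseract_cmd("page.png", "out_lstm", oem=1, tessdata_dir="tessdata_fast")
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where tesseract and the model directories are present, and the two output files compared by eye.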

As for the difference between ‘fast’ and ‘best’, is there any quantitative difference that someone can point me to? In other words, how much better is ‘best’ and how much more time does it take? I guess I’m trying to decide the best one (no pun intended) for my application.
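
The thread gives no numbers for this, so one option is to measure on your own documents. The following is a rough sketch, assuming you have a hand-checked transcription of a sample page: it computes a character error rate from the Levenshtein distance and times an arbitrary callable. The names `edit_distance`, `char_error_rate` and `timed` are illustrative helpers, not Tesseract APIs.

```python
import time


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping one OCR run per model set in `timed` and feeding each output text to `char_error_rate` gives a time and an error figure for ‘fast’ and ‘best’ on the same page.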

For the scripts, I haven’t found much definitive documentation on those. If I use a script model, is that equivalent to just specifying all the languages that use that script? Is there any downside? Do all the scripts contain English? For example, if the language I’m dealing with is German, could I just specify Latin? Or would it be more accurate to specify ‘deu’? For something like Arabic, if I specified a script of Arabic, would that include Arabic, Farsi and other similar languages that use the same alphabet? Would it be just as accurate as specifying the specific language? And does the Arabic script contain English as well, so it could handle a mixed document?
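
For reference, the alternatives in this question differ only in the value given to tesseract's `-l` option. The sketch below assumes that several models are combined with `+` and that script models are addressed as script/Latin or script/Arabic; the helper `lang_arg` is hypothetical. It says nothing about which choice is more accurate.

```python
def lang_arg(*models):
    """Join model names into the value for tesseract's -l option."""
    if not models:
        raise ValueError("at least one model name is required")
    for name in models:
        # '+' is the separator, so it cannot appear inside a name
        if "+" in name or not name.strip():
            raise ValueError(f"invalid model name: {name!r}")
    return "+".join(models)


german_only = lang_arg("deu")             # language model
latin_script = lang_arg("script/Latin")   # script model (assumed path)
arabic_mixed = lang_arg("ara", "eng")     # explicit mix for a two-language page
```

Running the same page with each value and comparing the output is a direct way to check whether the script model or the language model works better for a given document.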

Thank you

Peter
