tesseract data files


Simon Eigeldinger

Mar 2, 2018, 3:48:48 PM
to tesser...@googlegroups.com
Hi all,

I just looked at the git commits for tesseract and read that there have
been changes to the OCR modes.
Are the 3 tessdata sets still valid?
tessdata_fast and tessdata_best have been updated, so I guess those
reflect the latest developments, but tessdata hasn't had an update since
September.
Is that 3rd set still usable, or should that one not be used anymore?
On the wiki
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
it's still listed as usable.

Any suggestions?

Greetings and thanks,
Simon


ShreeDevi Kumar

Mar 2, 2018, 11:12:57 PM
to tesser...@googlegroups.com
Hi Simon,

If you are planning to package using 4.00alpha from master branch, please use traineddata files from tessdata_fast. These are the files that have been shipped for Ubuntu 18.04 and included in Debian. See https://github.com/tesseract-ocr/tesseract/wiki for some links.

You can update the wiki page re cygwin.

FYI - tessdata repo supports both --oem 0 and --oem 1, but the files are older and may NOT be fully compatible with current code.

tessdata_best has files which can be used for further finetune/plusminus type training.

tessdata_fast has faster integer models and is the recommended one to be used for OCR. 
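
To make the distinction concrete, the three repositories can be summarized as a small lookup. This is only a sketch: `TESSDATA_SETS` and `pick_tessdata` are hypothetical names made up for illustration, and the entries simply restate this message. The `oem` values for tessdata_fast and tessdata_best assume those sets contain LSTM models only.

```python
# Hypothetical summary table; it restates the advice in this thread
# and is not part of any tesseract API.
TESSDATA_SETS = {
    "tessdata": {
        "oem": (0, 1),  # supports both --oem 0 and --oem 1
        "note": "older files, may not be fully compatible with current code",
    },
    "tessdata_best": {
        "oem": (1,),  # assumption: LSTM-only
        "note": "usable as a base for fine-tune / plus-minus training",
    },
    "tessdata_fast": {
        "oem": (1,),  # assumption: LSTM-only
        "note": "faster integer models, recommended for OCR",
    },
}


def pick_tessdata(purpose):
    """Return the repository name suggested in this thread for a purpose."""
    choices = {
        "ocr": "tessdata_fast",       # recommended for plain OCR
        "training": "tessdata_best",  # needed for further fine-tuning
        "legacy": "tessdata",         # only set listed with --oem 0 support
    }
    if purpose not in choices:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return choices[purpose]
```

For packaging, `pick_tessdata("ocr")` gives the set that ships with Debian and Ubuntu according to this message.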

ShreeDevi

ShreeDevi Kumar

Mar 2, 2018, 11:16:16 PM
to tesser...@googlegroups.com
> tessdata repo supports both --oem 0 and --oem 1, but the files are older and may NOT be fully compatible with current code.

The results may vary depending on language and oem used. I have NOT tested this much, since newer traineddata give better accuracy for Indian languages.
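
For anyone who wants to check this on their own language, here is a minimal sketch of how the two engine modes could be compared from Python. The helper name `build_tesseract_cmd` and all paths are hypothetical placeholders. The sketch only builds the argument list and assumes a `tesseract` 4.x binary that accepts `--tessdata-dir`, `-l`, and `--oem`.

```python
def build_tesseract_cmd(image, output_base, tessdata_dir, lang="eng", oem=1):
    """Build (but do not run) a tesseract command line for one engine mode."""
    if oem not in (0, 1, 2, 3):
        raise ValueError("oem must be 0, 1, 2 or 3")
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,  # directory holding <lang>.traineddata
        "-l", lang,
        "--oem", str(oem),               # 0 = legacy engine, 1 = LSTM engine
    ]


# Placeholder paths: one run per engine mode against the older tessdata set.
commands = [
    build_tesseract_cmd("page.png", f"out_oem{oem}", "/path/to/tessdata", oem=oem)
    for oem in (0, 1)
]
```

Each list can be passed to `subprocess.run(cmd, check=True)`, and the resulting `out_oem0.txt` and `out_oem1.txt` can then be compared by hand or with a diff tool.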


ShreeDevi

Simon Eigeldinger

Mar 4, 2018, 12:26:01 PM
to tesser...@googlegroups.com
Hi ShreeDevi,

I have scrapped the cygwin builds.
I am now using the builds I get from AppVeyor, which just
need me to repackage the resulting files.

So tessdata_best isn't, as the wiki says, for better accuracy?

greetings,
Simon


ShreeDevi Kumar

Mar 4, 2018, 12:43:56 PM
to tesser...@googlegroups.com
The traineddata files in tessdata_best are larger in size and OCR takes more time. They are supposedly slightly more accurate, but there are no definitive results provided by Ray.

tessdata_fast is what has been shipped for Debian and Ubuntu, so that seems the way to go for doing OCR. These however cannot be used for fine-tune training. 

Those who want to do training need to use files from tessdata_best.

Simon Eigeldinger

unread,
Mar 4, 2018, 1:34:57 PM3/4/18
to tesser...@googlegroups.com
Hm.
I guess I'll just ship all 3 of them *lol*
and add the text of the wiki to the readme.
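
A rough sketch of what that packaging step might look like, assuming one subdirectory per data set next to a plain-text readme. The layout, the `DESCRIPTIONS` text, and the `write_readme` helper are all hypothetical. The descriptions paraphrase this thread, not the wiki.

```python
from pathlib import Path

# Hypothetical one-line descriptions, paraphrased from this thread.
DESCRIPTIONS = {
    "tessdata": "older files, supports --oem 0 and --oem 1",
    "tessdata_best": "larger and slower, needed for fine-tune training",
    "tessdata_fast": "faster integer models, recommended for OCR",
}


def write_readme(bundle_root):
    """Create one directory per data set and a README.txt describing them."""
    root = Path(bundle_root)
    lines = ["Bundled tesseract traineddata sets", ""]
    for name, description in DESCRIPTIONS.items():
        # The traineddata files themselves would be copied into this directory.
        (root / name).mkdir(parents=True, exist_ok=True)
        lines.append(f"{name}: {description}")
    readme = root / "README.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling `write_readme("bundle")` leaves `bundle/tessdata`, `bundle/tessdata_best`, and `bundle/tessdata_fast` ready to receive the traineddata files.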

Greetings,
Simon