tessdata on github


John Muccigrosso

Aug 16, 2016, 3:06:56 PM
to tesseract-ocr
Do I understand correctly that the files installed into the tessdata dir are split into two groups on GitHub: the traineddata files in the tesseract-ocr/tessdata repository, and everything else at tesseract-ocr/tesseract/tessdata?

I discovered this because I was using the --tessdata-dir option to point to my local mirror of the traineddata from GitHub and, of course, ran into problems without the pdf config file.
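
A rough sketch of how one might check a directory before pointing tesseract at it. It assumes the Tesseract 3.x layout, in which PDF output needs `configs/pdf` and `pdf.ttf` next to the traineddata files; the function name `check_tessdata_dir` and the paths in the usage comment are invented for illustration.

```shell
# Report what a directory lacks to serve as a complete tessdata dir.
# Assumed layout: DIR/*.traineddata, DIR/configs/pdf, DIR/pdf.ttf
check_tessdata_dir() {
    dir=$1
    missing=0

    # At least one language file, e.g. eng.traineddata.
    set -- "$dir"/*.traineddata
    if [ ! -e "$1" ]; then
        echo "missing: *.traineddata"
        missing=1
    fi

    # Files that ship with the engine repository, not with the language data.
    for f in configs/pdf pdf.ttf; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done

    return "$missing"
}

# Hypothetical usage:
#   check_tessdata_dir "$HOME/src/tessdata" &&
#       tesseract page.png out --tessdata-dir "$HOME/src/tessdata" -l eng pdf
```

Run against a bare clone of the tessdata repository, this should report the config and font files as missing, which matches the problem described above.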

Zdenko Podobný

Aug 17, 2016, 3:54:29 AM
to tesser...@googlegroups.com
The tesseract library/engine[1] is kept separate from the language trained data[2].
The main reason for this split is the size of the trained data: most users need only a few of the files.
Trained data should be placed in the same tessdata directory where tesseract looks for config files (although config files are not needed if the user uses the API or the proper command-line options).

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/89ccb328-6237-4e35-931d-d36834048ab8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

John Muccigrosso

Aug 17, 2016, 9:32:29 AM
to tesseract-ocr
On Wednesday, August 17, 2016 at 3:54:29 AM UTC-4, zdenop wrote:
> The tesseract library/engine[1] is kept separate from the language trained data[2].
> The main reason for this split is the size of the trained data: most users need only a few of the files.
> Trained data should be placed in the same tessdata directory where tesseract looks for config files (although config files are not needed if the user uses the API or the proper command-line options).

Thanks. So it was as I had thought.

I understand the motive, but I think it's worth noting that this means it's not possible to just point the tessdata dir at a local clone of the tessdata repository. I'll probably symlink the data files into my tessdata dir, but that'll mean recreating the links every time there's an update. Pesky.

Zdenko Podobný

Aug 17, 2016, 9:37:12 AM
to tesser...@googlegroups.com
If there is another way to separate the "must" part of the project from the "optional" data on github.com, please share it.

Zdenko


John Muccigrosso

Aug 18, 2016, 10:26:36 AM
to tesseract-ocr
On Wednesday, August 17, 2016 at 9:37:12 AM UTC-4, zdenop wrote:
> If there is another way to separate the "must" part of the project from the "optional" data on github.com, please share it.

The issue is that this separation creates the problem I mentioned: you can't simply clone the GitHub repository and use it directly as your tessdata dir. Combining the two (optional and "must", as you say) would mean that some people would probably delete files from their clone to keep disk-space usage down. At least that's what I did.

My own process is to install tesseract via Homebrew. That gives me a minimal set-up with respect to the trained data files, and I get updates whenever a major release reaches Homebrew. Then I use the data files from GitHub. This means that whenever Homebrew updates tesseract, I have to recreate the symlinks. Not a big deal, but not nothing either.

So it's a trade-off. Some people would likely modify their set-up in either case, either to copy or link files as now, or to delete them. My current thinking is that the latter would be preferable for me, but I recognize that not everyone will agree with that. I assume it's possible to have an installation via homebrew (or whatever) that ignores the "extra" data files, or possibly two separate installations, a minimal and a full one.
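
The re-linking step described above could be scripted along these lines. This is only a sketch: the function name `link_traineddata` is made up, and the Homebrew location in the usage comment is an assumption, so check where your install actually keeps its tessdata dir.

```shell
# Symlink every *.traineddata file from a local clone into the tessdata
# directory used by the installed tesseract.
link_traineddata() {
    src=$1    # local clone of tesseract-ocr/tessdata (use an absolute path,
              # otherwise the links will dangle)
    dest=$2   # tessdata directory of the installed tesseract

    for f in "$src"/*.traineddata; do
        [ -e "$f" ] || continue               # glob matched nothing
        target=$dest/$(basename "$f")

        # Leave real files (e.g. the ones the package installed) alone.
        if [ -e "$target" ] && [ ! -L "$target" ]; then
            continue
        fi

        # -s: symbolic link; -f: replace a stale link of the same name.
        ln -sf "$f" "$target"
    done
}

# Hypothetical usage after an upgrade (the share/tessdata path is an assumption):
#   link_traineddata "$HOME/src/tessdata" "$(brew --prefix)/share/tessdata"
```

Because existing links are simply replaced, the function can be rerun after each upgrade without cleaning up first.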

ShreeDevi Kumar

Aug 18, 2016, 6:23:50 PM
to tesser...@googlegroups.com
I am wondering whether it would be possible to download only the needed traineddata files from tessdata repo (optional) into the designated tessdata-dir (which has the required tessdata files).

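I found the following options but haven't been able to try them out yet:

1. https://coderwall.com/p/o2fasg/how-to-download-a-project-subdirectory-from-github
using svn export

2. https://github.com/intezer/GithubDownloader
python script to download multiple files from a repo
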
ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar

Aug 18, 2016, 7:29:58 PM
to tesser...@googlegroups.com
Perhaps someone more familiar with git and GitHub can say whether submodules would be a good option for langdata and tessdata.



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Marco Atzeri

Aug 19, 2016, 2:07:45 AM
to tesser...@googlegroups.com
On 19/08/2016 00:22, ShreeDevi Kumar wrote:
> I am wondering whether it would be possible to download only the needed
> traineddata files from tessdata repo (optional) into the designated
> tessdata-dir (which has the required tessdata files).
>
> I found the following options but haven't been able to try them out yet ..
>
> 1. https://coderwall.com/p/o2fasg/how-to-download-a-project-subdirectory-from-github
> using svn export
>
> 2. https://github.com/intezer/GithubDownloader
> python script to download multiple files from a repo

You can download any single file with wget or curl:

wget https://raw.githubusercontent.com/tesseract-ocr/tessdata/master/COPYING

Regards
Marco
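
For several files, the wget command above could be wrapped in a small loop. A sketch only: the base URL is taken from the example above, the function names `tessdata_url` and `fetch_tessdata` are invented here, and the file names have to be listed explicitly because raw URLs do not expand wildcards such as hin.*.

```shell
# Base URL taken from the single-file wget example.
TESSDATA_RAW=https://raw.githubusercontent.com/tesseract-ocr/tessdata/master

# Print the raw-download URL for one file in the tessdata repository.
tessdata_url() {
    printf '%s/%s\n' "$TESSDATA_RAW" "$1"
}

# Download each named file into a destination directory.
# Usage: fetch_tessdata DEST FILE...
fetch_tessdata() {
    dest=$1
    shift
    for name in "$@"; do
        # -P sets the directory wget saves into.
        wget -P "$dest" "$(tessdata_url "$name")" || return 1
    done
}

# Hypothetical usage (destination path is an assumption; add whatever other
# hin.* files the repository lists):
#   fetch_tessdata /usr/local/share/tessdata hin.traineddata
```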

ShreeDevi Kumar

Aug 19, 2016, 3:28:39 AM
to tesser...@googlegroups.com

Marco,
For certain languages, multiple data files are required in the tessdata directory, e.g. eng, ara, hin, etc. Is there an easy way to get, e.g., hin.*?

- sent from my phone. excuse the brevity.

