git mirror of tesseract-ocr


Will Manley

Feb 25, 2014, 12:13:15 PM
to tesser...@googlegroups.com
For anyone who's interested: I've created a git mirror of tesseract-ocr.  It includes history from both googlecode SVN and sourceforge.net CVS.  I've tidied up the commit/author information where I could.

The git repo is here:

    https://github.com/wmanley/tesseract-ocr

The script I wrote to create it is here:

    https://gist.github.com/wmanley/9213023

Thanks

Will

Nick White

Mar 3, 2014, 11:29:01 AM
to tesser...@googlegroups.com
Hi Will,

On Tue, Feb 25, 2014 at 09:13:15AM -0800, Will Manley wrote:
> For anyone who's interested: I've created a git mirror of tesseract-ocr. It
> includes history from both googlecode SVN and sourceforge.net CVS. I've tidied
> up the commit/author information where I could.

That's great, thanks for that. Is it automatically updated when
there are new SVN commits?

It would be handy if we could switch to using git as part of the
main project; having proper local git branching is incredibly
useful :)

Does anyone have an objection to that?

Nick

Will Manley

Mar 5, 2014, 9:42:54 AM
to tesser...@googlegroups.com

On Monday, 3 March 2014 16:29:01 UTC, Nick White wrote:
> On Tue, Feb 25, 2014 at 09:13:15AM -0800, Will Manley wrote:
> > For anyone who's interested: I've created a git mirror of tesseract-ocr.  It
> > includes history from both googlecode SVN and sourceforge.net CVS.  I've tidied
> > up the commit/author information where I could.
>
> That's great, thanks for that. Is it automatically updated when
> there are new SVN commits?

Not at the moment, no. But the script is written so it can easily be re-run whenever desired.

> It would be handy if we could switch to using git as part of the
> main project; having proper local git branching is incredibly
> useful :)

So I had kind-of hoped that there would be some desire within the project for such a migration when I did the conversion - this was the reason I went to the trouble of pulling in the CVS history and fixing up the authors.  The hope was that there was some latent desire to move to git within the project and it was just a lack of time/expertise to make it happen.  The theory was that if I were to do a good enough job it would make the choice to migrate to git much "cheaper" and thus maybe the tesseract devs would choose to migrate.

I didn't want to suggest it myself as it would be rather presumptuous of me, having not yet submitted even a single patch to tesseract.

Nick White

Mar 5, 2014, 10:20:28 AM
to tesser...@googlegroups.com
On Wed, Mar 05, 2014 at 06:42:54AM -0800, Will Manley wrote:
> So I had kind-of hoped that there would be some desire within the project for
> such a migration when I did the conversion - this was the reason I went to the
> trouble of pulling in the CVS history and fixing up the authors. The hope was
> that there was some latent desire to move to git within the project and it
> was just a lack of time/expertise to make it happen. The theory was that if I
> were to do a good enough job it makes the choice to migrate to git much
> "cheaper" and thus maybe the tesseract devs would choose to migrate.

Good theory. I certainly would love the project to move to git.

I suspect it would make it a lot easier for Ray to merge the changes
he makes back into the project, but there would probably be a
learning curve.

Ray, what say you? As I said, it would be really helpful to me as
I'm starting to get more interested in digging deeper into the code.

Nick

Ray Smith

Mar 5, 2014, 8:42:43 PM
to tesser...@googlegroups.com
Well I'm ambivalent about it, as I would have to learn git from scratch, although that may be useful. I currently only use svn on a very basic level anyway as most of the time I use a perforce-like system, and merge with the svn repository in a separate svn client.

I have been told that the merge facility with git is "really good" but I currently use meld for all my merges and it works very well for that. Anyone with experience of both like to comment on that?

If we were to switch to git, there is a button in the admin pages that could just make it happen. I have no idea how easy it would be to add the sourceforge change data to it.

Any more comments either way?
Ray.



--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Will Manley

Mar 6, 2014, 8:20:51 AM
to tesser...@googlegroups.com
On Thursday, 6 March 2014 01:42:43 UTC, Ray wrote:
> Well I'm ambivalent about it, as I would have to learn git from scratch, although that may be useful. I currently only use svn on a very basic level anyway as most of the time I use a perforce-like system, and merge with the svn repository in a separate svn client.
>
> I have been told that the merge facility with git is "really good" but I currently use meld for all my merges and it works very well for that. Anyone with experience of both like to comment on that?

`git mergetool` is the command that lets you use meld to resolve merges.  I would also recommend setting `merge.conflictstyle` to `diff3`, as this also shows the original (common ancestor) text in each conflict, giving you the information you need to do a manual 3-way merge.  You can set that option with:

    git config --global merge.conflictstyle diff3
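
For anyone who wants to try these settings without touching their global configuration, they can be scoped to a single repository. This is only a sketch, assuming meld is installed; the temporary repository is purely illustrative:

```shell
# Throwaway repository, so the global configuration is left untouched.
repo=$(mktemp -d)
git init -q "$repo"

# Have `git mergetool` launch meld, and include the common ancestor
# in the conflict markers.
git -C "$repo" config merge.tool meld
git -C "$repo" config merge.conflictstyle diff3

# Show what was stored.
git -C "$repo" config --get merge.tool
git -C "$repo" config --get merge.conflictstyle
```

With these set, running `git mergetool` after a conflicted merge opens each conflicted file in meld.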

> If we were to switch to git, there is a button in the admin pages that could just make it happen. I have no idea how easy it would be to add the sourceforge change data to it.

I don't know what that button will do in detail but if you were to click it I could splice the sourceforge data on and fix the authorship fairly easily afterwards.

git clones made before the splicing won't be compatible with after so if you are going to do it let me know when and I can do the splicing and update quickly before others have a chance to clone the new git repo.
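
To illustrate what "splicing" can mean in git terms, here is a rough sketch using `git replace --graft` on a throwaway repository. This is not necessarily what the conversion script does; the repository, branch name, and file are made up for the example:

```shell
splice=$(mktemp -d)
git init -q "$splice"
git -C "$splice" config user.name "Example"
git -C "$splice" config user.email "example@example.invalid"

# "Old" history, standing in for the SourceForge-era commits.
echo "v1" > "$splice/file.txt"
git -C "$splice" add file.txt
git -C "$splice" commit -q -m "old: imported history"
old_tip=$(git -C "$splice" rev-parse HEAD)

# "New" history with an unrelated root, standing in for a fresh conversion.
git -C "$splice" checkout -q --orphan converted
echo "v2" > "$splice/file.txt"
git -C "$splice" add file.txt
git -C "$splice" commit -q -m "new: first converted commit"
new_root=$(git -C "$splice" rev-parse HEAD)

# Graft: make the new root appear to have the old tip as its parent.
git -C "$splice" replace --graft "$new_root" "$old_tip"

# Count the commits now reachable from HEAD (both histories are walked).
git -C "$splice" rev-list --count HEAD
```

A graft like this only exists as a local replacement ref. Making it permanent means rewriting history, which gives every descendant commit a new id; that is why clones made before and after a splice do not line up.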

Thanks

Will

Tom Morris

Mar 6, 2014, 9:26:26 AM
to tesser...@googlegroups.com
There are two semi-independent changes which could be made:

1. Switch from svn to git for source control
2. Switch from Google Code to Github for project hosting

I've used both SVN and Git extensively and greatly prefer Git.  It takes a little while to understand some of the fundamental differences, but for simple stuff the learning required is minimal and I think Ray would be up to speed in no time.  I think using Git would make it easier for outsiders to contribute (in addition to all its other advantages like lightweight branches, off-line development, etc.).

Although Google Code supports Git now, most people associate Git with hosting on Github.  This brings some additional social aspects like being able to track who's forking, submit pull requests, etc.  I find these really useful and Github supports file distribution for kit download, unlike Google Code, but, on the other hand, I find the Github issue tracker to be much weaker than the Google Code tracker.  Github probably has slightly better wiki support, but not by much.

I've migrated a couple of projects from Google Code to Github and, overall, have found it to be a win, although there are annoyances like the issue tracker.

Tom

Will Manley

Mar 9, 2014, 7:37:42 PM
to tesser...@googlegroups.com
On Thursday, 6 March 2014 14:26:26 UTC, Tom Morris wrote:
> There are two semi-independent changes which could be made:
>
> 1. Switch from svn to git for source control
> 2. Switch from Google Code to Github for project hosting

I think for now it would be best to keep the two independent.  This is not to preclude moving to github in the future, but I'm concerned that by trying to answer both at the same time we could get bogged down in discussion while just moving to git is seeming fairly uncontroversial at this time.

If we are to do this I'm keen to not lose momentum.  I guess the next step is for Ray to decide definitively whether he wants to make this change and what the time-scale would be.  I would suggest within a week if that's acceptable.  And then at a mutually acceptable time (I'm in GMT timezone) Ray can click the button, I'll re-run the script and provide replacement commits on github that Ray can then push to google-code.

Thanks

Will

Nick White

Mar 10, 2014, 10:27:22 AM
to tesser...@googlegroups.com
On Sun, Mar 09, 2014 at 04:37:42PM -0700, Will Manley wrote:
> I think for now it would be best to keep the two independent. This is not to
> preclude moving to github in the future, but I'm concerned that by trying to
> answer both at the same time we could get bogged down in discussion while just
> moving to git is seeming fairly uncontroversial at this time.

I agree. Moving from Google Code may be reasonable at some point,
but it's a different discussion, and ultimately moving to git is far
more important.

Ray, I recently decided to learn git properly, and found these two
resources really useful:
* https://github.com/pluralsight/git-internals-pdf/releases
* https://www.youtube.com/watch?v=1ffBJ4sVUb4

They both emphasise learning the basics of the git internals first,
which I now agree is absolutely the right way to learn it.

Nick White

May 31, 2014, 12:08:51 AM
to tesser...@googlegroups.com
When we move to git we should get rid of everything in tessdata/
except the tessconfigs/ and configs/ directories.

I've just been reading the manpage for 'git filter-branch' for the
first time, and it looks like the right way to do that would be a
command like this:

git filter-branch --index-filter "git rm --cached --ignore-unmatch tessdata/*traineddata tessdata/*cube*" HEAD

Then any repository cloned from that should be lovely and svelte.
This command could be added to Will's script just before the git
push.
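
The effect of that command can be tried out on a throwaway repository first. This is only a sketch under assumptions: the file names are invented, and `FILTER_BRANCH_SQUELCH_WARNING` merely silences the warning that newer git versions print before running filter-branch:

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1

demo=$(mktemp -d)
git init -q "$demo"
git -C "$demo" config user.name "Example"
git -C "$demo" config user.email "example@example.invalid"

# One stand-in for a large data file and one config file we want to keep.
mkdir -p "$demo/tessdata/configs"
echo "binary blob" > "$demo/tessdata/eng.traineddata"
echo "config" > "$demo/tessdata/configs/digits"
git -C "$demo" add .
git -C "$demo" commit -q -m "Add tessdata"

# Rewrite history, dropping the data files from every commit's index.
git -C "$demo" filter-branch --index-filter \
  "git rm -q --cached --ignore-unmatch tessdata/*traineddata tessdata/*cube*" HEAD

# List what is left in the rewritten HEAD.
git -C "$demo" ls-tree -r --name-only HEAD
```

The rewritten repository still holds the old objects (filter-branch keeps a backup under `refs/original/`), so the size benefit shows up in fresh clones of the pushed result rather than in the repository where the rewrite was run.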

If we wanted to preserve the history of the tessdata stuff we'd just
destroyed, we could do something like this on another copy of the
repository, to discard all but that directory:

git filter-branch --subdirectory-filter tessdata/ -- --all

And then push that to another repository, named say
"tessdata-historical".
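
The same kind of dry run works for the subdirectory split. Again a sketch on a made-up repository; the directory and file names other than `tessdata` are placeholders, and the filter is given here without the trailing slash:

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1

hist=$(mktemp -d)
git init -q "$hist"
git -C "$hist" config user.name "Example"
git -C "$hist" config user.email "example@example.invalid"

# A little code plus a little data.
mkdir -p "$hist/tessdata" "$hist/ccmain"
echo "blob" > "$hist/tessdata/eng.traineddata"
echo "int main(void) { return 0; }" > "$hist/ccmain/main.c"
git -C "$hist" add .
git -C "$hist" commit -q -m "Add code and data"

# Keep only the history of tessdata, which becomes the new project root.
git -C "$hist" filter-branch --subdirectory-filter tessdata -- --all

# List what is left in the rewritten HEAD.
git -C "$hist" ls-tree -r --name-only HEAD
```

After the rewrite the files that used to live under `tessdata/` sit at the top level, which is what you would want before pushing to a separate data-only repository.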

We should probably wait until the 3.03 release to bother with this,
I'm just writing this now so we (I) remember.

Nick

Max Pole

Jun 3, 2014, 8:53:38 PM
to tesser...@googlegroups.com

> When we move to git we should get rid of everything in tessdata/
> except the tessconfigs/ and configs/ directories.

That's a very important point that has not been stressed enough, IMHO. Currently, the tessdata/ directory within the SVN repository contains a huge amount of binary data - language files. They won't change often (once in two years) but tesseract won't work without them.

Checking out the current SVN repository is a pain because it wastes a significant amount of time and storage - one rarely needs all these files during code development. Therefore it would be smarter to have the option to "load" language files on demand instead of getting the whole bunch of stuff at once.

Switching to git could make things even more complicated because git wasn't designed to cope with large binary files. Git repositories will most likely become bloated rather quickly, not to mention the performance degradation. SVN has an advantage here because it doesn't store revision history on the client side.

Maybe we could start searching for a better solution for this problem. There are several options to consider: separate repository, git-annex (unfortunately, this does not work on Windows because of weird dependencies), git-submodule or git-fat.

Best regards
Maxim

Jan Ruzicka

Jun 29, 2014, 11:56:35 PM
to tesser...@googlegroups.com
Hi
Has anybody looked into using reposurgeon [1] for the repository surgery and/or cleanup?
Jan

Jeff Breidenbach

Jun 30, 2014, 4:47:33 PM
to tesser...@googlegroups.com
I hear that git is slowly getting better with large binary files. No direct experience, though.

David Arnold

Jul 30, 2014, 11:21:07 AM
to tesser...@googlegroups.com
I think Tom Morris made an important point from an ecosystem perspective: "Git[hub] would make it easier for outsiders to contribute" [and familiarize]. I don't think this is controversial, but I think this (my) outsider perspective might have some marginal added value. 2cts.

Janusz S. Bien

Jul 30, 2014, 11:51:22 AM
to tesser...@googlegroups.com
Quote/Cytat - David Arnold <dgx.a...@gmail.com> (Wed 30 Jul 2014
05:21:07 PM CEST):

> I think Tom Morris made an important point from an *ecosystem perspective*:
> "Git[hub] would make it easier for outsiders to contribute" [and
> familiarize]. I don't think this is controversial, but I think this (my)
> outsider perspective might have some marginal added value. 2cts.

Just out of pure curiosity, what about mercurial?

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Nick White

Aug 6, 2014, 10:28:10 AM
to tesser...@googlegroups.com
On Wed, Jul 30, 2014 at 05:45:01PM +0200, Janusz S. Bien wrote:
> Quote/Cytat - David Arnold <dgx.a...@gmail.com> (Wed 30 Jul 2014
> 05:21:07 PM CEST):
>
> >I think Tom Morris made an important point from an *ecosystem perspective*:
> >"Git[hub] would make it easier for outsiders to contribute" [and
> >familiarize]. I don't think this is controversial, but I think this (my)
> >outsider perspective might have some marginal added value. 2cts.
>
> Just out of pure curiosity, what about mercurial?

Naa, fewer people are familiar with it, so it would make it a little
harder for outsiders to contribute.

Though I disagree that github would make it easier for outsiders to
contribute. Anybody who knows how git works shouldn't care if it's
hosted on github or elsewhere; it makes no difference. And anybody
who does not should probably learn the basics, rather than just rely
on prodding some web-based gui until it appears to behave as they
expect.

Ray Smith

Aug 8, 2014, 6:51:56 PM
to tesser...@googlegroups.com
OK, after much time spent doing little about this, I now have a plan:
1. Switch tesseract-ocr to git.
2. Create 2 new repositories - tessdata and langdata.
3. Add new language source data to langdata, and .traineddata files to tessdata. (configs and tessconfigs stay with the source code.)
4. Updates to the new repositories are the releases of the big data blobs - they can be tagged with versions quite easily for clarity.
5. tarballs probably have to still go to Google drive.
6. Syncing and updating the code to fix more issues will be my learning experience with git, but it looks very similar to svn in many ways.
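
As a rough illustration of step 4, tagging a state of a data repository as a release could look like the following. The repository, file, and tag name (`data-1.0`) are invented for the example:

```shell
data=$(mktemp -d)
git init -q "$data"
git -C "$data" config user.name "Example"
git -C "$data" config user.email "example@example.invalid"

# Placeholder standing in for a real .traineddata file.
echo "placeholder" > "$data/eng.traineddata"
git -C "$data" add eng.traineddata
git -C "$data" commit -q -m "Add eng.traineddata"

# Annotated tag marking this state of the data as a release.
git -C "$data" tag -a data-1.0 -m "Data release 1.0"

# Report which tag describes the current commit.
git -C "$data" describe --tags
```

Anyone cloning the data repository can then check out `data-1.0` to get exactly the files that belonged to that release.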

This makes the main source code repository a lot smaller, making it easier to dabble with it in a single language.

Big new language pushes are still relatively easy.

If it should become necessary, a switch to github in the future would be possible, but for now quotas seem to be more generous at Google code.




Thomas G.

Aug 9, 2014, 2:20:55 PM
to tesser...@googlegroups.com

Great!

I just learnt that this topic (github repo) exists (after having created my own git repo clone of tesseract).

I fully support a move to Github, or at least an official clone, the owner (Github user) should be "Tesseract", if this is possible, and the repo should be marked as the official Github repo (or clone) of Tesseract.

Jim O'Regan

Aug 9, 2014, 7:24:24 PM
to tesser...@googlegroups.com
On 8 August 2014 23:51, Ray Smith <thera...@gmail.com> wrote:
> OK, after much time spent doing little about this, I now have a plan:
> 1. Switch tesseract-ocr to git.
> 2. Create 2 new repositories - tessdata and langdata.
> 3. Add new language source data to langdata, and .traineddata files to
> tessdata. (configs and tessconfigs stay with the source code.)
> 4. Updates to the new repositories are the releases of the big data blobs -
> they can be tagged with versions quite easily for clarity.
> 5. tarballs probably have to still go to Google drive.
> 6. Syncing and updating the code to fix more issues will be my learning
> experience with git, but it looks very similar to svn in many ways.
>
> This makes the main source code repository a lot smaller, making it easier
> to dabble with it in a single language.
>

Full history, minus language data, is around 30M. I did a conversion,
here: https://github.com/jimregan/tesseract/
wiki, here: https://github.com/jimregan/tesseract-wiki
tessdata is done, but my flaky connection doesn't seem to want to let
it go through.

Importing those, with git turned on, should go like this:
git clone https://github.com/jimregan/tesseract.git
cd tesseract
git remote add googlecode https://tesseract-ocr.googlecode.com/git
git push googlecode master:master
cd ..

git clone https://github.com/jimregan/tesseract-wiki.git
cd tesseract-wiki
git remote add googlecode https://wiki.tesseract-ocr.googlecode.com/git
git push googlecode master:master


> Big new language pushes are still relatively easy.
>
> If it should become necessary, a switch to github in the future would be
> possible, but for now quotas seem to be more generous at Google code.

The great thing about distributed VCS is that you don't really need to
pick one - as long as they share the same origin, you can have
multiple mirrors, commits made to one will apply to the others.
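
A small sketch of the multiple-mirror idea, using two local bare repositories as stand-ins for Google Code and Github (all paths and remote names are made up). Note that the mirrors only stay in step if someone pushes each change to both of them:

```shell
base=$(mktemp -d)

# Two bare repositories standing in for two hosting sites.
git init -q --bare "$base/mirror-a.git"
git init -q --bare "$base/mirror-b.git"

# A working repository with one commit.
git init -q "$base/work"
git -C "$base/work" config user.name "Example"
git -C "$base/work" config user.email "example@example.invalid"
echo "hello" > "$base/work/README"
git -C "$base/work" add README
git -C "$base/work" commit -q -m "Initial commit"

# Register both mirrors and push the same commit to each.
git -C "$base/work" remote add mirror-a "$base/mirror-a.git"
git -C "$base/work" remote add mirror-b "$base/mirror-b.git"
git -C "$base/work" push -q mirror-a HEAD:refs/heads/master
git -C "$base/work" push -q mirror-b HEAD:refs/heads/master

# Both mirrors now point at the same commit id.
git -C "$base/mirror-a.git" rev-parse refs/heads/master
git -C "$base/mirror-b.git" rev-parse refs/heads/master
```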

Git tags on Github can take care of #5

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

Jan Ruzicka

Aug 10, 2014, 12:12:45 PM
to tesser...@googlegroups.com

Hi,

On the question of the Github user and repositories: could somebody request a Tesseract organization account? That way there can be administrators etc.

Jan

Jim O'Regan

Aug 10, 2014, 1:56:24 PM
to tesser...@googlegroups.com
The git repository will be on Google Code, so there won't be much need
for an organisation on Github.

That said, I misremembered some details of Github's SVN import, and
created one along the way. When Ray has made the change to git, I'll
mirror the repositories there, but there's nothing to see in the
meantime.

Jim O'Regan

Aug 12, 2014, 7:52:02 AM
to tesser...@googlegroups.com
Tessdata: git clone https://github.com/jimregan/tessdata.git

Nick White

Aug 12, 2014, 10:36:44 AM
to tesser...@googlegroups.com
Hi Ray,

On Fri, Aug 08, 2014 at 03:51:55PM -0700, Ray Smith wrote:
> OK, after much time spent doing little about this, I now have a plan:
> 1. Switch tesseract-ocr to git.
> 2. Create 2 new repositories - tessdata and langdata.
> 3. Add new language source data to langdata, and .traineddata files to
> tessdata. (configs and tessconfigs stay with the source code.)
> 4. Updates to the new repositories are the releases of the big data blobs -
> they can be tagged with versions quite easily for clarity.
> 5. tarballs probably have to still go to Google drive.
> 6. Syncing and updating the code to fix more issues will be my learning
> experience with git, but it looks very similar to svn in many ways.

That plan sounds great to me, thanks for this.

A couple of "nice to have, but not worth spending much extra time
over" items:

- If you could use git tags to make clear exactly which langdata
commit the .traineddata in tessdata was built from, that would be
a nice thing.

- Add a line at the top of lang.config in the .traineddata that
makes clear the git revision it's built from, so add a rule doing
that to whatever make process you currently have.
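
The second item could be sketched roughly as below. Everything here is hypothetical: the repository, the file name `eng.config`, the parameter inside it, and the assumption that a leading `#` line is treated as a comment when the config is read:

```shell
work=$(mktemp -d)
git init -q "$work"
git -C "$work" config user.name "Example"
git -C "$work" config user.email "example@example.invalid"

# Placeholder config file standing in for a real lang.config.
echo "some_param 1" > "$work/eng.config"
git -C "$work" add eng.config
git -C "$work" commit -q -m "Add eng.config"

# Abbreviated id of the commit the data is being built from.
rev=$(git -C "$work" rev-parse --short HEAD)

# Write a stamped copy with the revision on the first line.
{
  echo "# built from langdata revision $rev"
  cat "$work/eng.config"
} > "$work/eng.config.stamped"

head -n 1 "$work/eng.config.stamped"
```

A build rule would run something like this just before the config is packed into the .traineddata, so the stamp always matches the commit that was checked out at build time.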

Also, what do you want to do regarding training developed by others?
Could appropriately automated and self-contained trainings make it
into the langdata repository? That seems sensible to me.

Thanks again for this :)

Nick

Ray Smith

Aug 12, 2014, 1:54:16 PM
to tesser...@googlegroups.com
Stage 1 now complete!
Git is a git! The documentation stinks! Getting rid of the langdata and .traineddata files was not possible using the official documentation and most of the Stack Overflow posts on the topic! Don't believe any of the examples on the git-filter-branch manpage, and instead go to: https://help.github.com/articles/remove-sensitive-data which actually worked.

The new git repository contains all the revisions and tags from svn, yet is 1/10 the size it could have been with the traineddata and langdata files.



On Tue, Aug 12, 2014 at 7:35 AM, Nick White <nick....@durham.ac.uk> wrote:
> Hi Ray,
>
> On Fri, Aug 08, 2014 at 03:51:55PM -0700, Ray Smith wrote:
> > OK, after much time spent doing little about this, I now have a plan:
> > 1. Switch tesseract-ocr to git.
> > 2. Create 2 new repositories - tessdata and langdata.
> > 3. Add new language source data to langdata, and .traineddata files to
> > tessdata. (configs and tessconfigs stay with the source code.)
> > 4. Updates to the new repositories are the releases of the big data blobs -
> > they can be tagged with versions quite easily for clarity.
> > 5. tarballs probably have to still go to Google drive.
> > 6. Syncing and updating the code to fix more issues will be my learning
> > experience with git, but it looks very similar to svn in many ways.
>
> That plan sounds great to me, thanks for this.
>
> A couple of "nice to have, but not worth spending much extra time
> over" items:
>
> - If you could use git tags to make clear exactly which langdata
>   commit the .traineddata in tessdata was built from, that would be
>   a nice thing.

Yes, it's a nice thing, but keeping them in sync could be really hard.
There was a lot of manual effort involved in purging the data of anything that could be considered sensitive data, so it might not get updated very often.

> - Add a line at the top of lang.config in the .traineddata that
>   makes clear the git revision it's built from, so add a rule doing
>   that to whatever make process you currently have.

That sounds like a nice idea and could be possible.

> Also, what do you want to do regarding training developed by others?
> Could appropriately automated and self-contained trainings make it
> into the langdata repository? That seems sensible to me.

Good question. Yes, if the other developers are willing to provide their source data as well.

> Thanks again for this :)
>
> Nick

Nick White

Aug 12, 2014, 2:50:45 PM
to tesser...@googlegroups.com
> The documentation stinks! Getting rid of the langdata and
> .traineddata files was not possible using the official documentation and most
> of the stack overflow posts on the topic! Don't believe any of the examples on
> the git-filter-branch manpage, and instead go to :
> https://help.github.com/articles/remove-sensitive-data which actually worked.

Eugh, sorry to hear that, and thanks for battling through it!

> Stage 1 now complete!
> Git is a git!
> ...
> The new git repository contains all the revisions and tags from svn, yet is 1/
> 10 the size it could have been with the traineddata and langdata files.

Fantastic, thanks Ray. I just pulled a fresh copy, and it all looks
good.

I'll re-apply my alternative Makefile stuff over that, and then I
can put up a clone of it on my own server, in case anyone prefers it
- I'd been wanting to do that for ages, and now it's easy! (I did
already have my own branch doing that using git-svn, but it was
way too large to bother putting anywhere.)

Nick

Shree

Aug 15, 2014, 4:50:13 AM
to tesser...@googlegroups.com
Hi Ray,

When I try to access


I get the following error

Cannot connect to the real tessdata.tesseract-ocr.googlecode.com

Something is currently interfering with your secure connection to tessdata.tesseract-ocr.googlecode.com.

Try to reload this page in a few minutes or after switching to a new network. If you have recently connected to a new Wi-Fi network, finish logging in before reloading.

If you were to visit tessdata.tesseract-ocr.googlecode.com right now, you might share private information with an attacker. To protect your privacy, Chrome will not load the page until it can establish a secure connection to the real tessdata.tesseract-ocr.googlecode.com.

zdenko podobny

Aug 15, 2014, 5:30:07 AM
to tesser...@googlegroups.com
FYI, I am able to browse repositories using below mentioned links (in Google Chrome)

Zdenko


Shree Devi Kumar

Aug 15, 2014, 6:57:29 AM
to tesser...@googlegroups.com
Zdenko,

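I am also able to browse the pages; the error comes when trying to download a .zip.

e.g. when I tried to download zip from https://code.google.com/p/tesseract-ocr/source/browse/hin.cube.params?repo=tessdata
the error was displayed - see image

[Inline image 1]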

zdenko podobny

Aug 15, 2014, 9:00:25 AM
to tesser...@googlegroups.com
I followed your steps and I got the error too (using tar.gz or raw produces the same error).

Zdenko



Nick White

Aug 15, 2014, 9:04:57 AM
to tesser...@googlegroups.com
On Fri, Aug 15, 2014 at 04:26:46PM +0530, Shree Devi Kumar wrote:
> I am also able to browse the pages, error comes when trying to download a .zip.
>
> e.g. when I tried to download zip from https://code.google.com/p/tesseract-ocr/
> source/browse/hin.cube.params?repo=tessdata

I can confirm this. An example failing URL is
https://tessdata.tesseract-ocr.googlecode.com/archive/bf82613055ebc6e63d9e3b438a5c234bfd638c93.zip

Jeff Breidenbach

Aug 25, 2014, 2:48:58 PM
to tesser...@googlegroups.com

Shree Devi Kumar

Aug 26, 2014, 5:01:08 AM
to tesser...@googlegroups.com
Jeff,

Command-line access

Get a local copy of the tesseract-ocr tessdata repository with this command:

But, this picks up the whole repository for all languages. Is there a way to just download traineddata for one language?

Thanks,
Shree 

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



 


Sriranga(80yrs)

Aug 26, 2014, 6:44:03 AM
to tesser...@googlegroups.com
Yes I also wanted to download only one set of kan.langdata 



Shree

Aug 29, 2014, 2:15:30 AM
to tesser...@googlegroups.com
On Tuesday, August 26, 2014 4:14:03 PM UTC+5:30, sriranga(80yrsold) wrote:
> Yes I also wanted to download only one set of kan.langdata

Srirangaji,

Git does not have an easy way of downloading just one file or one subdirectory. However, it offers a sparse checkout, which can be used for this. Please see http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/ for instructions.

I gave it a try just now and was able to selectively download files. I think this is supposed to work on git versions 1.7 and later.
Please see details below:

$ which git
/usr/bin/git
$ git --version
git version 2.1.0

$ mkdir mylangdata

$ cd mylangdata

$ git init
Initialized empty Git repository in /home/User/mylangdata/.git/

Updating langdata
remote: Counting objects: 846, done.
Receiving objects: 100% (846/846), 135.39 MiB | 1.62 MiB/s, done.
Resolving deltas: 100% (49/49), done.
 * [new branch]      bel_tarask -> langdata/bel_tarask
 * [new branch]      master     -> langdata/master
 
$ git config core.sparsecheckout true

$ echo eng/ >> .git/info/sparse-checkout

$ echo common.punc  >> .git/info/sparse-checkout

$ echo common.unicharambigs  >> .git/info/sparse-checkout

$ echo font_properties  >> .git/info/sparse-checkout

$ echo forbidden_characters_default  >> .git/info/sparse-checkout

$ echo kan/  >> .git/info/sparse-checkout

$ echo Kannada.unicharset >> .git/info/sparse-checkout

$ echo Kannada.xheights  >> .git/info/sparse-checkout

$ git pull langdata master
 * branch            master     -> FETCH_HEAD



Shree

Aug 29, 2014, 2:21:30 AM
to tesser...@googlegroups.com
However, it is possible that the above actually downloads all the files in git but only makes a few available to the filesystem. So, I think, this method will not help in avoiding large downloads.
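
This can be checked on a small local example. The sketch below uses made-up repositories and file names, and uses `git fetch` plus `git checkout` in place of the `git pull` shown earlier. It sets up a sparse checkout of `kan/` only and then reads a file from `eng/` straight out of the object database, which only works if that object was transferred:

```shell
base=$(mktemp -d)
src="$base/langdata-src"
dst="$base/mylangdata"

# Source repository with two language directories.
git init -q "$src"
git -C "$src" config user.name "Example"
git -C "$src" config user.email "example@example.invalid"
mkdir -p "$src/eng" "$src/kan"
echo "english" > "$src/eng/eng.training_text"
echo "kannada" > "$src/kan/kan.training_text"
git -C "$src" add .
git -C "$src" commit -q -m "Add eng and kan"

# Destination repository that only wants kan/ in its working tree.
git init -q "$dst"
git -C "$dst" config core.sparsecheckout true
mkdir -p "$dst/.git/info"
echo "kan/" >> "$dst/.git/info/sparse-checkout"
git -C "$dst" fetch -q "$src" HEAD
git -C "$dst" checkout -q FETCH_HEAD

# Working tree contents, then a file that was never checked out.
ls "$dst"
git -C "$dst" cat-file -p HEAD:eng/eng.training_text
```

So the sparse checkout limits what appears on disk in the working tree, but the fetch still brings over the objects for every directory.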

Shree

Aug 29, 2014, 3:11:58 AM
to tesser...@googlegroups.com, Ray Smith

Paul Vorbach

Aug 30, 2014, 5:59:27 AM
to tesser...@googlegroups.com
You can use the Git repository browser to download single files. Navigate to the file and then hit "View raw file" to download it.


Maybe not exactly what you want, but it circumvents downloading the complete repository.

Sriranga(80yrs)

Aug 30, 2014, 6:24:51 AM
to tesser...@googlegroups.com
Paul Vorbach.
Thanks for the valuable guidance. When I tried it, an error message was displayed (see the attached screenshot).
I waited 10 minutes and tried again, but I was still unable to download the zip file for kan; the same error was displayed.

I also navigated to the file and hit "View raw file" to download it, but the same error message was displayed (see the attached screenshot).
Where did I make a mistake?
sriranga(81)


unable to download.png

Paul Vorbach

Aug 30, 2014, 8:38:36 AM
to tesser...@googlegroups.com
There's a bug in Google Code, I think. Remove the "s" from https://langdata.tesseract-ocr...., so the new address starts with http://langdata.tesseract-ocr.... and reload the page. It will either display the plain text file or start the download directly (depending on the file). If it shows the file you need to right click -> Save As...

Paul

Tom Morris

Aug 30, 2014, 11:42:51 AM
to tesser...@googlegroups.com
On Sat, Aug 30, 2014 at 8:38 AM, Paul Vorbach <pa...@vorb.de> wrote:
> There's a bug in Google Code, I think. Remove the "s" from https://langdata.tesseract-ocr...., so the new address starts with http://langdata.tesseract-ocr....

The bug isn't the https: (it should be downloadable via https), it's the fact that their SSL cert doesn't match the domain.

Until they fix that though, downloading using http: is the only workaround.

Tom

Paul Vorbach

Aug 30, 2014, 11:45:27 AM
to tesser...@googlegroups.com
Yes, I meant the bug is that it's linking to HTTPS even though the SSL
certificate is not valid for the domain.

Paul

Shree

Aug 31, 2014, 7:24:07 AM
to tesser...@googlegroups.com
Thanks!

Yes, I  was able to download the files by changing from https: to http: 

e.g. for tamil traineddata

Sriranga(80yrs)

Aug 31, 2014, 9:47:37 AM
to tesser...@googlegroups.com
Paul Vorbach,
thanks. I was able to download only Kan.traineddata file using as 

Sriranga(80yrs)

Aug 31, 2014, 11:35:21 PM
to tesser...@googlegroups.com
Forwarding for clarification, as requested by a member of tesseract-ocr.

---------- Forwarded message ----------
From: Sathyanarayanarao Magadi Nanjappa <mns...@gmail.com>
Date: Mon, Sep 1, 2014 at 5:01 AM
Subject: Re: [tesseract-dev] Re: git mirror of tesseract-ocr
To: "Sriranga(80yrs)" <withblessing....@gmail.com>


How can I tell that it is a new revision?
Would it be useful to replace the existing traineddata file in FreeOCR/VietOCR/gImageReader?


On Sun, Aug 31, 2014 at 7:12 PM, Sriranga(80yrs) <withblessing....@gmail.com> wrote:
rao,
I was able to download the kan.traineddata and have attached it for your use. Please let me know whether it is the same as the old traineddata file or an improved one.

Ray Smith

Sep 5, 2014, 12:25:28 PM
to tesser...@googlegroups.com
These aren't new traineddata files (yet).
This is just the switch-over to git (which is strongly requested by users).

The plan is to move ahead with 3.04. After we have worked through these teething problems, and 3.04 is released, we can consider whether it would be better to switch to github.


Jeff Breidenbach

Sep 5, 2014, 8:42:46 PM
to tesser...@googlegroups.com
I'm glad to hear that people have found a workaround to downloading a single
language file: use "http" instead of "https".

I was actually just about to suggest another short term workaround. It sounds 
like some folks have already (or could) mirror the relevant repository on GitHub 
which might have a better user interface for this type of thing.
