Tesseract moved to github.com

zdenko podobny

Jun 14, 2015, 10:12:06 AM
to tesser...@googlegroups.com
Hello,

The main tesseract-ocr repository is now located at https://github.com/tesseract-ocr/tesseract
The source training data for Tesseract languages is at https://github.com/tesseract-ocr/langdata

It looks like we are close to the 3.04 release (though it would be great if the OpenCL issues were fixed first), so please test the code and packaging so that there are no surprises after 3.04.



Zdenko

gtess...@gmail.com

Jun 18, 2015, 4:18:13 AM
to tesser...@googlegroups.com
Please upload the new tessdata.

Tom Morris

Jun 24, 2015, 2:57:21 PM
to tesser...@googlegroups.com
That's excellent! It should make contributing much easier. I'm
willing to help out with updating the docs for the new home.

It would be useful to add a prominent "This project has moved to
Github" to the pages on Google Code. I think there's a way to get
Google Code to do it for every page using the admin settings.

On Github, tessdata is its own repo, but there's also a stub tessdata
directory in the main repo with some of the smaller config files, etc.
What's the recommended layout for development environments? Do we
copy and/or softlink the language files (and add them to .gitignore)
or do something else?

Tom
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-de...@googlegroups.com.
> To post to this group, send email to tesser...@googlegroups.com.
> Visit this group at http://groups.google.com/group/tesseract-dev.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-dev/CAJbzG8xLELNnCmLVh0vbcOHwmTOPci-p1MzTM65romjS3D7Dgw%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.

Tom Morris

Jun 24, 2015, 3:39:13 PM
to tesser...@googlegroups.com
On Wed, Jun 24, 2015 at 2:57 PM, Tom Morris <tfmo...@gmail.com> wrote:

> It would be useful to add a prominent "This project has moved to
> Github" to the pages on Google Code. I think there's a way to get
> Google Code to do it for every page using the admin settings.

I just checked the settings on another Google Code project that I
administer and any admin should be able to go to:

https://code.google.com/p/tesseract/adminAdvanced

and add the Github URL to the "Project Moved" section of the page.

Tom

Ray Smith

Jun 24, 2015, 7:03:24 PM
to tesser...@googlegroups.com
I just checked in 98 new .traineddata files into tessdata, ready for 3.04, and the corresponding source training data into langdata. Yippee!!

IMHO, this makes 3.04 ready to go, unless there are any more pressing outstanding issues, like OpenCL?

NOTE: the ara, eng, hin, and kor traineddata are NOT updated because of regressions. The other regressions are mostly fixed, with some dramatic improvements, particularly for Indic (about 20% for kan, for example).

> What's the recommended layout for development environments?

When Jeff and I discussed this, I think we concluded that even the majority of developers wouldn't want to clutter their systems with all the traineddata files. (Including the .git directory, it is now up to almost 2GB.)
We therefore decided that the best solution would be for developers to symlink the few .traineddata files from their git repository to the stub directory, and just pull the few files that you want into your local git repository.
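
A minimal sketch of the symlink approach described above, in Python. The function name and the example paths are hypothetical and not part of the project; it assumes a local checkout of the tessdata repository that already contains the wanted files, and a filesystem that supports symlinks.

```python
import os
from pathlib import Path


def link_traineddata(tessdata_repo, stub_dir, languages):
    """Symlink <lang>.traineddata from a tessdata checkout into the stub directory.

    All names here are illustrative; adjust the paths to your own layout.
    """
    tessdata_repo = Path(tessdata_repo).resolve()
    stub_dir = Path(stub_dir).resolve()
    linked = []
    for lang in languages:
        source = tessdata_repo / f"{lang}.traineddata"
        if not source.is_file():
            raise FileNotFoundError(f"{source} is not in the tessdata checkout")
        target = stub_dir / source.name
        if target.is_symlink():
            target.unlink()  # replace a stale link from an earlier run
        elif target.exists():
            # Never overwrite a real file that someone copied in by hand.
            raise FileExistsError(f"{target} exists and is not a symlink")
        os.symlink(source, target)
        linked.append(target)
    return linked
```

For example, `link_traineddata("../tessdata", "tesseract/tessdata", ["eng", "osd"])` would link just those two files (both paths are made up for illustration). The links would then need to be listed in .gitignore so that they are not committed to the main repository.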

Opinions wanted. Does this work well enough? Better suggestions?
Also, is it even worth having an official release location with the traineddata files in them, or should we just recommend that people get them directly from github?
I still have that tesseract-ocr account with a Google Drive folder if anyone thinks that is any use.

The langdata repository is also very large (1GB), but there is no stub directory, so it is easier to pull only part of it into the correct place without symlinks.

I put a big title about the project having moved to github on the old home page.
If you hit the "Project Moved" button, the entire project basically disappears, and I am not sure what stage we are at with porting the issues. (Actually I haven't even found the issues list for github yet!)
We can hit the moved button when I am sure everything is moved over.

ShreeDevi Kumar

Jun 24, 2015, 10:59:37 PM
to tesser...@googlegroups.com
Ray,

Glad to see that we are getting closer to 3.04 release.

The old issues are visible on github at :

Maybe they need to be merged in some way to show up.

I just checked on the 'bihari' langdata - https://github.com/tesseract-ocr/langdata/tree/master/bih

It looks like the training text has NOT been updated in response to
originally

It would be great if these issues can be addressed.

Looking forward to testing the improvements in Indic.

Thanks!

Shree




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar

Jun 24, 2015, 11:48:51 PM
to tesser...@googlegroups.com
Regarding the traineddata files,

I would suggest that at least OSD and Eng be included in the stub directory under tesseract so that the project is 'functional' without requiring extra downloads.




zdenko podobny

Jun 25, 2015, 4:46:17 AM
to tesser...@googlegroups.com
OpenCL would be nice to fix.

There is a patch in issue 1488[1] that fixes some memory leaks (I am not sure whether it still applies to the recent GitHub code).

Jim has submitted the issues that have patches to GitHub as pull requests[2]. It would be great if you could check them and decide which of them make sense to include in 3.04 and which should be postponed to the next version (e.g. because they need more testing).


Zdenko

Tom Morris

Jun 25, 2015, 11:41:05 AM
to tesser...@googlegroups.com
On Wed, Jun 24, 2015 at 7:03 PM, Ray Smith <thera...@gmail.com> wrote:
> I just checked in 98 new .traineddata files into tessdata, ready for 3.04,
> and the corresponding source training data into langdata. Yippee!!

Yay, new trained data! That includes not only updates, but almost 40 new languages: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid.


> What's the recommended layout for development environments?
> When Jeff and I discussed this, I think we concluded that the majority of
> (even) developers wouldn't want to clutter their systems with all the
> traineddata files. (Including the .git it is now upto almost 2GB)

Github actually says that they want repos kept under 1 GB, but it doesn't sound like a hard limit:

> We therefore decided that the best solution would be for developers to
> symlink the few .traineddata files from their git repository to the stub
> directory, and just pull the few files that you want into your local git
> repository.
>
> Opinions wanted. Does this work well enough? Better suggestions?

While I definitely appreciate having the large binary files in a separate repo, I agree with Shree that an easier mechanism is needed to get the development environment configured. Perhaps it means including a copy of the English data in the main repo, but another possibility would be a simple download script. Having to sync the entire tessdata repo so that it can be softlinked (or downloading individual files by hand) is kind of onerous. I just added a phony target to the ScrollView makefile to download the required JARs: https://github.com/tfmorris/tesseract/commit/baed0f07fa62477142c9b458401814d5eb98b716#diff-908634a168da5548101eae83eb1fc5b3R51
Perhaps we could do something similar for the core language files.
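
As a rough illustration of what such a download script could look like, here is a sketch in Python. It is not the makefile target linked above. The function names are hypothetical, and it assumes that the files are served through GitHub's "raw" URLs for the tessdata repository and that the branch is called "master". The `fetch` parameter exists only so that the download step can be swapped out, for instance in a dry run.

```python
import urllib.request
from pathlib import Path

# Assumption: GitHub "raw" URL layout for the tessdata repository.
RAW_URL = "https://github.com/tesseract-ocr/tessdata/raw/{branch}/{name}"


def traineddata_url(lang, branch="master"):
    """Build the download URL for one language (the branch name is an assumption)."""
    return RAW_URL.format(branch=branch, name=f"{lang}.traineddata")


def download_languages(languages, dest_dir, branch="master",
                       fetch=urllib.request.urlretrieve):
    """Download the missing <lang>.traineddata files into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for lang in languages:
        target = dest / f"{lang}.traineddata"
        if target.exists():
            continue  # already present, nothing to do
        fetch(traineddata_url(lang, branch), str(target))
        downloaded.append(target)
    return downloaded
```

Calling `download_languages(["eng", "osd"], "tessdata")` would fetch only the files that are not already in the directory, so re-running it is cheap.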


> Also, is it even worth having an official release location with the
> traineddata files in them, or should we just recommend that people get them
> directly from github?
> I still have that tesseract-ocr account with a Google Drive folder if anyone
> thinks that is any use.

Github releases can include arbitrary binary files, so you could use https://github.com/tesseract-ocr/tessdata/releases or https://github.com/tesseract-ocr/tesseract/releases. It would be useful to group the files so that you don't have to make multiple downloads for a single language (i.e. the cube-trained ones), and it may make sense to group together languages that are commonly used together.
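
One way the per-language grouping could be sketched in Python is shown below. It assumes that every file belonging to a language starts with the language code followed by a dot (for example hin.traineddata and hin.cube.lm); that naming rule and the function names are assumptions made for illustration, not a description of the project's tooling.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def group_by_language(filenames):
    """Group file names by the language code before the first dot."""
    groups = defaultdict(list)
    for name in filenames:
        lang = name.split(".", 1)[0]
        groups[lang].append(name)
    return {lang: sorted(files) for lang, files in sorted(groups.items())}


def write_bundles(src_dir, out_dir):
    """Write one <lang>.zip per language so each language is a single download."""
    src = Path(src_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = [p.name for p in src.iterdir() if p.is_file()]
    bundles = []
    for lang, files in group_by_language(names).items():
        archive = out / f"{lang}.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            for name in files:
                zf.write(src / name, arcname=name)
        bundles.append(archive)
    return bundles
```

Bundles for languages that are commonly used together could be built the same way by mapping several language codes to one archive name.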


> I put a big title about the project having moved to github on the old home
> page.
> If you hit the "Project Moved" button, the entire project basically
> disappears, and I am not sure what stage we are at with porting the issues.

In some ways that's actually a good thing because it prevents people from creating new issues on Google Code, but it looks like there are currently two disjoint issues lists.

There's a tool which will migrate the issues provided by Google Code: https://code.google.com/p/support-tools/wiki/IssueExporterTool
Another that I've used successfully in the past for a large project:  https://github.com/arthur-debert/google-code-issues-migrator

> (Actually I haven't even found the issues list for github yet!)


> We can hit the moved button when I am sure everything is moved over.

It looks like sources, wiki, and releases have all been migrated.  Issues should be the last missing piece.

Tom

Jan Ruzicka

Jun 25, 2015, 10:50:38 PM
to tesser...@googlegroups.com
Hi

Thanks to all of you for the GitHub migration!

Is there a better way to link to or view the HTML documentation in the repository than the long links[1]?

Some wiki pages still have links to google code SVN repository.
For example the page TrainingTesseract3.md [2], links to doc/combine_tessdata.1.html [3].

Jan

[1] http://htmlpreview.github.io/?https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.html
OR
https://rawgit.com/tesseract-ocr/tesseract/master/doc/combine_tessdata.1.html

[2] https://github.com/tesseract-ocr/wiki/blob/master/TrainingTesseract3.md

[3] http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/combine_tessdata.1.html

zdenko podobny

Jun 26, 2015, 4:54:14 AM
to tesser...@googlegroups.com
On Fri, Jun 26, 2015 at 4:50 AM, Jan Ruzicka <ruzic...@gmail.com> wrote:
> Hi
>
> Thanks to all of you for the GitHub migration!
>
> Is there a better way to link to or view the HTML documentation in the repository than the long links[1]?
>
> Some wiki pages still have links to google code SVN repository.
> For example the page TrainingTesseract3.md [2], links to doc/combine_tessdata.1.html [3].

That wiki repository will be removed. The correct wiki location is within the tesseract repository:
 

Jeff Breidenbach

Jun 27, 2015, 8:46:11 PM
to tesser...@googlegroups.com
I am currently doing some ***test*** packaging for Ubuntu / Debian / etc. 
Don't worry, I'm absolutely, positively, not going to ship anything without 
careful coordination.  Just for fun, I'm doing all the builds on an old ARM
laptop using the sources from GitHub.

First, the autogen.sh script is throwing an error with #!/bin/sh. Switching
it over to #!/bin/bash does the trick. Second, 'make dist' is missing some
documentation files, such as MOCRadaptingtesseract2.pdf. That prevents 
the tarball from being built.
MOCRadaptingtesseract2.pdf \
        PageLayoutAnalysisICDAR2.pdf tesseracticdar2007.pdf

Other than those minor quibbles, this is looking pretty good. I will report
more as things progress.

Jeff

Jeff Breidenbach

Jun 27, 2015, 9:22:37 PM
to tesser...@googlegroups.com
Looks like ccstruct/imagedata.h does not make it into the 'make dist' tarball, which
breaks the build.
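
Omissions like this can be caught before building from the tarball. The Python sketch below lists which expected files are absent from a 'make dist' archive. The function name is hypothetical, and it assumes the usual automake layout in which everything sits under a single top-level package directory inside the archive.

```python
import tarfile


def missing_from_tarball(tar_path, required):
    """Return the required paths that are not present in the dist tarball.

    Paths in `required` are relative to the source root, e.g. "ccstruct/imagedata.h".
    """
    with tarfile.open(tar_path) as tar:
        # Drop the leading top-level directory that 'make dist' adds to every member.
        members = {
            m.name.split("/", 1)[1]
            for m in tar.getmembers()
            if "/" in m.name
        }
    return sorted(set(required) - members)
```

Calling `missing_from_tarball(path, ["ccstruct/imagedata.h"])` on an archive (where `path` stands for wherever the tarball was written) returns an empty list when the header is included.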

zdenko podobny

Jun 28, 2015, 4:12:29 PM
to tesser...@googlegroups.com
Will you only report them, or will you also make a pull request (or send a patch)?

Regarding language data I am not sure how to handle its distribution:
  • I have no clue about Mac ;-)
  • The Windows installer could do it as in the past: offer to download selected languages. This could be easy to implement.
  • I do not believe that any Linux distribution will package all languages (that would consume too much space). Maybe there could be a simple tool for downloading them and installing them into tessdata (I put together some quick code[1], but IMO it is not smart enough).



Zdenko


Jeff Breidenbach

Jun 29, 2015, 1:40:31 AM
to tesser...@googlegroups.com
This is what was required to get 'make dist' to work. I left autogen alone
since it works, albeit with an error message. My practice packages appear 
to work fine.
make-dist.diff

ShreeDevi Kumar

Jun 29, 2015, 4:08:31 AM
to tesser...@googlegroups.com

Zdenko,

In the past there was a language install option that allowed installing a single language. Is it possible to add it back to the install script?

Then people can install their choice of language traineddata.

Thanks.
