Plans for 3.04 release

Ray Smith

Oct 30, 2014, 12:43:10 AM
to tesser...@googlegroups.com
What will be in it:
A bunch of fixes for issues, including 1245,1205,1241,899,1229,1246,1243,1264, 1207.

Language-specific issues: 792,865,758,969,1254. If there are any more like these, an email summarizing them would be really useful, as I am preparing to do some major retraining this week/next week.

I have fixed a bunch of problems with our internal tools for generating those langdata wordlist files. They will be totally refreshed for 3.04, and should be a lot better for a lot of languages, and include more languages.

The new release will include a refreshed set of traineddata files. The idea is to use regressions from training to flush out accuracy bugs, so it may take a while longer yet, but it ought to lead to at least some improvements.

That spreadsheet looked awfully long. I really appreciate Zdenko's efforts in summarizing the important issues, so if anyone else wants to help out with that, it would help. There is a trade-off though between fixing issues and getting the next release out...



On Wed, Oct 29, 2014 at 8:10 PM, Shree <shree...@gmail.com> wrote:
Hello Zdenko,

Thanks for the update. You may want to edit the subject line to reflect the discussion - 'Plans for Tesseract-3.04rc'

Yes, it is possible that a number of issues would get resolved with new 'traineddata' files from Ray. However, if he is only planning to release the source language data files, then we may also have the extra task of building traineddata files from them. I hope Ray will clarify and also let us know the timeline of the expected release.

Is https://docs.google.com/spreadsheets/d/1ePMcP1f6ot0fMbBlZ40llC_7PX_1yib4RTN6N4G5OlI/edit#gid=0 the NEW issue tracker? If so, I would suggest adding columns for the Tesseract version, the OS affected, and the date the issue was filed.

I can test on msys2, Windows 8 and maybe Windows 7. My interest is improving the training for Indic languages, so I would be testing the training tools also. My focus will be on Devanagari-based Indic languages such as Hindi, Sanskrit, Marathi etc. I can also review Gujarati and Tamil for basic info. Srirangaji can test for Kannada.

Shree

On Thursday, October 30, 2014 2:55:19 AM UTC+5:30, Zdenko Podobný wrote:
I am sorry for the late reply, but I am overloaded with my "regular tasks", so there is no time for free projects...

IMO these steps should be done before the next release:
  1. Release of new language data - this was promised by Ray in the past and looks like the key open topic to me.
  2. Check the wikis and other project documentation files (INSTALL, README etc.) - first check/improve the content, and then have it checked by a native English speaker for grammar etc.
  3. Check the issue tracker :-):
    • check whether the issue is still valid with the current code
    • check whether there is a test case that helps reproduce the error (if not, ask for or create a simple test case)
    • group issues with extended info[1] (I can grant edit rights to those who want to collaborate) => this will trigger other actions: what shall be fixed for this release (e.g. issues related to language files), what will be postponed... IMO issues <= 1066 are commented well, but a double check will help.
Besides the above, it would be great to have a community testing team, e.g. for each platform (Linux, Windows, iOS) or maybe for each compiler (gcc, clang, msys2, msys, VS 2008, VS 2010...). It is important that the testers use it on a regular basis. The reason for such a team is e.g. issue 1354[1]: I guess that the usage of uintptr_t will break support for VS 2008 (which is needed for Python 2.x on Windows).

Windows packagers are also especially welcome (building the library, creating the installer etc.).
There are more things to be done (also after the release of the source code)... So if somebody wants to help, just ask for a task on this forum.




Zdenko

On Sat, Oct 25, 2014 at 8:49 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:
Zdenko,

Do you know what milestones we are waiting for before the next release? 

Is there anything that the tess community can do to help?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Oct 25, 2014 at 3:44 AM, Jeff Breidenbach <breid...@gmail.com> wrote:
Yesterday's Ubuntu release fixed the training tools omission,
as documented in the Tesseract FAQ. Otherwise it is the exact 
same code as Ubuntu 14.10.

The mixed language PDF improvement I mentioned in the 
previous post is complete. But it won't ship with Ubuntu until after
Tesseract has made a formal release. The next Ubuntu release
will be April 2015.


--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-dev/6dbe79f7-206e-4eae-a5dc-ec6e7edf3af7%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

zdenko podobny

Oct 30, 2014, 4:03:06 AM
to tesser...@googlegroups.com
Thanks Ray,

please have a look at (filter) the issues with Group "Patch" and "Language data".
Part of "Language data" is language-specific, but some symbols/letters (e.g. the euro sign and the rupee symbol, and IMO other common currency symbols like $ and £) could/should be included in all languages. (I have not yet checked in which languages they are missing :-( )

Zdenko

ShreeDevi Kumar

Oct 30, 2014, 5:51:21 AM
to tesser...@googlegroups.com
Thanks for the update, Ray. Looking forward to 3.04 ...

On Thu, Oct 30, 2014 at 10:13 AM, Ray Smith <thera...@gmail.com> wrote:
What will be in it:
A bunch of fixes for issues, including 1245,1205,1241,899,1229,1246,1243,1264, 1207.

Interesting, none of these are listed under open issues.
 

Language-specific issues: 792,865,758,969,1254. If there are any more like these, an email summarizing them would be really useful, as I am preparing to do some major retraining this week/next week.

Here are the language related issues.

I have fixed a bunch of problems with our internal tools for generating those langdata wordlist files. They will be totally refreshed for 3.04, and should be a lot better for a lot of languages, and include more languages.

The new release will include a refreshed set of traineddata files. The idea is to use regressions from training to flush out accuracy bugs, so it may take a while longer yet, but it ought to lead to at least some improvements.

That spreadsheet looked awfully long. I really appreciate Zdenko's efforts in summarizing the important issues, so if anyone else wants to help out with that, it would help. There is a trade-off though between fixing issues and getting the next release out...

You could probably do an alpha/beta release with all the new traineddata and then decide which of the remaining issues get fixed in the final 3.04 release.

Thanks!
 
 

ShreeDevi Kumar

Oct 30, 2014, 5:57:01 AM
to tesser...@googlegroups.com
Correction - Issue 721 should read

721:Chinese OCR improvement using character frequency database

Shree

Nov 4, 2014, 11:53:17 PM
to tesser...@googlegroups.com
Ray,

Good to see the status change to 'Started' and 'Accepted' on many of these issues.

Please also see
1362: Add support for sanskrit transliteration in latin/roman script

Shree

Nov 8, 2014, 1:23:24 PM
to tesser...@googlegroups.com
Also see

1376:Remove archaic letters from Georgian training_text

Shree

Dec 17, 2014, 3:09:58 AM
to tesser...@googlegroups.com, Ray Smith
Hello Ray,

Any update on this?
Should we expect a 3.04 release in 2014?

Thanks!
Shree

Gene Chan

Jan 12, 2015, 10:33:16 AM
to tesser...@googlegroups.com, thera...@gmail.com
I recently discovered this project and am impressed - thanks for the great work! I am looking forward to a new official release as well - should we expect it in the near term?

Eren VELİBASA

Feb 5, 2015, 5:38:19 AM
to tesser...@googlegroups.com
Hello Ray,
Have you determined the release date?
On Thursday, October 30, 2014 at 06:43:10 UTC+2, Ray wrote:

Shree

Mar 3, 2015, 1:27:37 AM
to tesser...@googlegroups.com
Zdenko/Jeff/Ray,

Ubuntu 15.04 | Vivid Vervet | Rel | April 2015 | January 2016

Is it possible for the next release of Tesseract-ocr to be included in the above?

If the new traineddata files are not ready, can we at least package the current Git version, which would offer the improvements of the last year to a larger audience?

Thanks!

Jeff Breidenbach

Apr 9, 2015, 4:35:02 PM
to tesser...@googlegroups.com
The important cutoff dates for Ubuntu are marked "DebianImportFreeze".
We obviously missed the Feb 19 deadline for 15.04. The deadline for 
15.10 is August 7.

https://wiki.ubuntu.com/UtopicUnicorn/ReleaseSchedule

ShreeDevi Kumar

Apr 11, 2015, 4:36:51 AM
to tesser...@googlegroups.com
Thanks, Jeff.

Is it possible to set some milestones for the next Tesseract release so that we don't miss the next "DebianImportFreeze" deadline?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Jeff Breidenbach

Jul 3, 2015, 11:23:25 PM
to tesser...@googlegroups.com
I talked to Ray very briefly, and he expressed interest in doing more development
before the next Tesseract release. I speculate that this will push 3.04 beyond the
August 7 deadline for inclusion into the next Ubuntu release.

About 15 months ago, I took a snapshot of HEAD and shipped it on a whole
bunch of Linux distributions. I caused a whole lot of confusion with version 
numbers (sorry! sorry! sorry!) On the other hand, this newer code helped a
whole bunch of users.

Zdenko, I wonder if it makes sense to ship another code snapshot. What 
do you think? I've confirmed that the code at HEAD basically works, and 
is compatible with older training data. I haven't done any performance 
comparisons vs what Debian / Ubuntu ship right now.

If the answer is yes, please let me know what you would like it to be
called, so as not to repeat my naming fiasco of 2014. Some 
traditional possibilities include:

 3.03.YYYYMMDD
 3.03.02+YYYYMMDD
 3.04~YYYYMMDD
 3.04~rc1
 3.04~rcYYYYMMDD
 3.04~beta1
 3.04~betaYYYYMMDD
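
For readers unfamiliar with the `~` in these candidates: in Debian version ordering a tilde sorts before everything, including the end of the string, so `3.04~rc1` is treated as older than a later `3.04`. The sketch below checks this with `dpkg --compare-versions`. It assumes a Debian-based system with `dpkg` available; the helper name `version_lt` and the date `20150704` are illustrative, not something proposed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: check how Debian orders some of the candidate version strings.
# Assumes dpkg is installed; version_lt is a hypothetical helper name.

# Succeeds (exit status 0) when version $1 sorts strictly before version $2.
version_lt() {
  dpkg --compare-versions "$1" lt "$2"
}

if command -v dpkg >/dev/null 2>&1; then
  for candidate in "3.03.20150704" "3.04~20150704" "3.04~beta1" "3.04~rc1"; do
    if version_lt "$candidate" "3.04"; then
      echo "$candidate sorts before 3.04"
    else
      echo "$candidate does NOT sort before 3.04"
    fi
  done
else
  echo "dpkg not found; run this on a Debian-based system" >&2
fi
```

A snapshot named with `~` can therefore be replaced by the final release through a normal package upgrade.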

I will not do anything without explicit blessing and consensus. Especially
from Zdenko & Ray. Thank you for consideration.

Jeff


ShreeDevi Kumar

Jul 4, 2015, 2:42:56 AM
to tesser...@googlegroups.com
I think that there are enough improvements in the code to warrant another snapshot if the official release is not forthcoming. The last snapshot provided the pdf functionality, much appreciated by users!

My preference would be for naming to include 3.04 (since that's what the code from GIT has been reporting for a while) and also to include the YYYYMMDD date.

Since Ray is planning more development before release, I don't think RC will be appropriate for the current snapshot. I would suggest going with beta or beta1 along with the date.

Thanks!

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Sriranga(81+yrsold)

Jul 4, 2015, 3:00:03 AM
to tesser...@googlegroups.com
goog suggestion. I prefer as "3.04~beta1YYYYMMDD" instead of '3.04~betaYYYYMMDD"

Sriranga(81+yrsold)

Jul 4, 2015, 3:03:22 AM
to tesser...@googlegroups.com
correction:
Good suggestion. I prefer "3.04~beta1YYYYMMDD" over "3.04~betaYYYYMMDD".

zdenko podobny

Jul 5, 2015, 11:23:09 AM
to tesser...@googlegroups.com
I would suggest releasing 3.04:
  • Releasing rc1-xyz or rc2 IMO does not make sense. The code is stable, the desired fixes are committed (or will be soon, e.g. there is a promise that we will receive patches for the opencl issues this week), the language data are updated...
  • We need a stable release. I expect that it will trigger some issues (e.g. which Visual Studio versions will need support, how language data will be distributed etc.). I am afraid that without a stable release nobody will pay attention to them...
  • If there are some improvements (in a short time), we can easily release version 3.04.01.
  • There are several interesting code contributions (tsv output, monitor API extension, DAWG_TYPE_HFST used by OCRicola) that IMO make sense to integrate into the master branch after the stable release (to avoid another round of testing and therefore postponing the release).

Zdenko

Tom Morris

Jul 5, 2015, 11:41:32 AM
to tesser...@googlegroups.com
I agree with Zdenko.  They're just numbers.  They're not in short supply.  The only significance is the semantic value that people associate with them.

I'd use 3.04 for this release, 3.04.1 for a bug-fix release if needed, and 3.05 for the next release if Ray's changes are features or enhancements.

The new language support is a huge deal, in my opinion, and something worth both releasing ASAP and highlighting in the release announcement.

Tom

Jeff Breidenbach

Jul 8, 2015, 8:23:53 PM
to tesser...@googlegroups.com
Is there any chance I could ship on Friday to Debian Unstable? 
I have some time now, which will evaporate on July 20th. Shipping
sooner rather than later gives us time to iterate on packaging.
Debian Unstable is released daily, so we can update whenever
we wish. Hopefully, Zdenko or Ray will say: "Yes, and please call 
it XXXX"

Jeff

PS. I am currently exercising tesstrain.sh with mixed results so far.
I suspect there will be code iteration no matter what.
 

zdenko podobny

Jul 9, 2015, 7:06:44 AM
to tesser...@googlegroups.com
I think it is Ray's right...

Zdenko

Jeff Breidenbach

Jul 10, 2015, 3:06:21 AM
to tesser...@googlegroups.com
Meeting with Ray in about 12 hours, fingers crossed.

By the way, the known packaging problems are fairly small: 

 * missing manpages for classifier_tester, set_unicharset_properties, text2image
 * I haven't even tried to package tesstrain.sh yet
 * lots of font challenges with tesstrain.sh 
      - can't find some of the fonts
      - others like 'noto' are opentype instead of truetype and don't seem to work

Also, where is the new location for tarballs now that things are on github? The old
  

  


 

zdenko podobny

Jul 10, 2015, 3:57:03 AM
to tesser...@googlegroups.com
GitHub creates a tarball/zip automatically for each release or tag[1], or you can get a copy of master via links like [2] or [3].
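
As a rough illustration of what such links look like (not necessarily the links referenced above), GitHub serves an archive of any branch or tag at `https://github.com/OWNER/REPO/archive/REF.tar.gz`. The helper name `github_archive_url` is hypothetical, and the owner, repository and ref values below are assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Sketch: build the URL of the tarball GitHub generates for a branch or tag.
# github_archive_url is a hypothetical helper; the arguments used below
# (tesseract-ocr, tesseract, master, 3.04.00) are illustrative assumptions.
github_archive_url() {
  local owner="$1" repo="$2" ref="$3"
  printf 'https://github.com/%s/%s/archive/%s.tar.gz\n' "$owner" "$repo" "$ref"
}

# Archive of the current master branch.
github_archive_url tesseract-ocr tesseract master
# Archive of a tag, assuming the tag is named after the release.
github_archive_url tesseract-ocr tesseract 3.04.00
```

The printed URL can be passed to a downloader such as `curl -L -O`.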
 

 
Zdenko  

Jim O'Regan

Jul 10, 2015, 4:57:10 AM
to tesser...@googlegroups.com
When you push a tag, Github marks it as a release. It creates zip
files and tarballs, and allows you to upload additional files. These
generated tarballs will be missing the configure script, etc., so it
might be best to push those to a branch, and tag the commit on the
branch.

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

ShreeDevi Kumar

Jul 10, 2015, 8:39:39 AM
to tesser...@googlegroups.com
 * I haven't even tried to package tesstrain.sh yet
 * lots of font challenges with tesstrain.sh 
      - can't find some of the fonts
      - others like 'noto' are opentype instead of truetype and don't seem to work


Lines 23-27

if [ "$(uname)" == "Darwin" ];then
    FONTS_DIR="/Library/Fonts/"
else
    FONTS_DIR="/usr/share/fonts/truetype/"
fi

This could be why the opentype fonts are NOT found: on Linux the script only looks under the truetype directory.
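
One possible way around this, sketched under the assumption that fonts live in sibling subdirectories such as `truetype/` and `opentype/` under `/usr/share/fonts/` (as is common on Debian and Ubuntu), is to point the search at the parent directory. The helper name `fonts_dir_for` is hypothetical, and this is not necessarily how tesstrain.sh should or will be changed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a font search root from the kernel name.
# Using /usr/share/fonts/ rather than .../truetype/ lets a recursive
# search also see sibling directories such as opentype/.
fonts_dir_for() {
  case "$1" in
    Darwin) echo "/Library/Fonts/" ;;
    *)      echo "/usr/share/fonts/" ;;
  esac
}

FONTS_DIR="$(fonts_dir_for "$(uname)")"
echo "Searching for fonts under: ${FONTS_DIR}"
```

Whether this is enough depends on the font tooling searching the directory recursively, which is worth verifying before relying on it.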


Ray Smith

Jul 10, 2015, 7:15:45 PM
to tesser...@googlegroups.com
OK, I just committed a change that fixes the opentype fonts problem, and gets rid of an annoying warning message during training.

I think we should now go with what we have and call it 3.04.00

If we get any more fixes soon, we can call that 3.04.01 etc, and that way Jeff will have plenty of time to iterate on the packaging for Debian.

Possible fixes that might be ready for 3.04.01:
OpenCL.
Font naming issues. I have been using noto fonts from a Google web page (https://www.google.com/get/noto/) and Jeff has found fonts with different names in Ubuntu. Jeff is looking into which are the most up-to-date "official" names. The resolution might be for me to get the fonts from Ubuntu and change the names in the script, but again this might negatively impact Windows users, although training isn't really supported on Windows. (Is it?)

Things that most likely will not be ready for 3.04.
How to solve the multiple repositories/tessdata overlay problem. That could turn into a big can of worms, but I welcome suggestions for how we could do it better for next time.
In particular, looking for suggestions from the Windows community, as Jeff says it isn't a big problem for him shipping to Debian.




Jeff Breidenbach

Jul 10, 2015, 9:04:52 PM
to tesser...@googlegroups.com
Awesome. Someone with write permission, please tag.

zdenko podobny

Jul 11, 2015, 4:00:19 AM
to tesser...@googlegroups.com
Let's try:

I hope I caught and fixed all problems.


Zdenko

On Sat, Jul 11, 2015 at 3:04 AM, Jeff Breidenbach <breid...@gmail.com> wrote:
Awesome. Someone with write permission, please tag.

Jeff Breidenbach

Jul 13, 2015, 1:21:26 AM
to tesser...@googlegroups.com
Congratulations on the release, so far! The new code will appear in Debian 
Unstable tomorrow along with upgrades for all existing languages. I have a bit
more work to do on the new languages, which I will tackle this coming week. If 
there are any major problems I expect to find out quite quickly.

Over time this will automatically propagate to many other Linux distributions, 
including the upcoming Ubuntu release in October. For those that can't wait 
it is pretty easy to install Debian Unstable inside a chroot jail.  I've done this 
recently from both Ubuntu 14.04 and also a Chromebook running ChromeOS. 


That's all for now. In general, things seem to be going well.




ShreeDevi Kumar

Jul 13, 2015, 2:56:21 AM
to tesser...@googlegroups.com

What will be the process for releasing this on other platforms?

- sent from my phone. excuse the brevity

Jeff Breidenbach

Jul 13, 2015, 5:18:50 PM
to tesser...@googlegroups.com
All the new languages in tessdata are being submitted to Debian Unstable
right now. Because they are new, there is a manual approval process. So
it may take some time (weeks?) before they reach users.

https://github.com/tesseract-ocr/tessdata

Shree

Oct 26, 2015, 4:11:03 AM
to tesseract-dev

October 22nd: FinalRelease of Ubuntu 15.10


Tesseract 3.04 is included in the above release, see


Thanks, Jeff.

Zdenko, 
Are there links to 3.04 binaries for other OSes that can be shared too?
Thanks!

Sriranga(83yrsold)

Oct 26, 2015, 8:04:36 AM
to tesser...@googlegroups.com

Just now I downloaded from source  as "

I would be grateful if you could tell me the step-by-step procedure for installing tesseract-ocr on Ubuntu 15.10.

with regards, sriranga

ShreeDevi Kumar

Oct 26, 2015, 10:53:39 AM
to tesser...@googlegroups.com

Tesseract can be installed directly on Ubuntu using apt-get.

Command: sudo apt-get install tesseract-ocr

You need the source only if you want the latest changes made after the 3.04 release.
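
To confirm which version the package manager actually installed, something like the following may help. It is a minimal sketch that assumes the first line of `tesseract --version` has the form `tesseract 3.04.00`; the helper name `parse_tesseract_version` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `tesseract --version`-style text on stdin and
# print the version number, assuming the first line looks like
# "tesseract 3.04.00".
parse_tesseract_version() {
  awk 'NR == 1 { print $2 }'
}

if command -v tesseract >/dev/null 2>&1; then
  # Merge stderr into stdout in case the version banner is printed there.
  tesseract --version 2>&1 | parse_tesseract_version
else
  echo "tesseract is not installed" >&2
fi
```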


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Sriranga(83yrsold)

Oct 26, 2015, 12:41:14 PM
to tesser...@googlegroups.com
Thanks for the valuable suggestion. Successfully installed as suggested.

Jeff Breidenbach

Feb 3, 2016, 7:12:16 PM
to tesseract-dev
I should probably update the Debian Tesseract packages before 
Feb 18, because of a compatibility problem with Apple software.
It most likely requires a change to the invisible PDF font to fix. 
Feb 18 is the cutoff date to make the next Ubuntu release.


My question is:

 a) Do you want any code changes included?
 b) What version number would you like me to use?

If no code changes are desired, my temptation is to call the 
package 3.04.00-6, which means the 6th packaging revision 
of 3.04.00. The font change would be considered a patch 
applied at packaging time.

However, I've learned from hard experience to not just go do 
things like this without checking. Please tell me what you would 
prefer.

Cheers,
Jeff

zdenko podobny

Feb 4, 2016, 3:04:53 AM
to tesser...@googlegroups.com
I think that we should release a 3.04.01 (bug-fix) version:
  1. We planned it when we made the 3.04.00 release, because we expected the OpenCL fix to come soon...
  2. There are a lot of other fixes in the 3.05 branch that need to be transferred to the 3.04 branch (AFAIK only the "monitor" and "cmake" patches could be considered new features => they should stay in the 3.05 branch)
IMO it would also be nice to have these issues solved ASAP:
  • API compatibility with version 3.02 (namely tesseract::TessBaseAPI::ProcessPages and tesseract::TessBaseAPI::ProcessPage), whose breakage causes some Tesseract wrappers to stop working
  • OpenCL
  • check/fix docs (e.g. all examples on the wiki should be tested with the latest code)
  • and of course close as many open issues as possible[1]




Zdenko

ShreeDevi Kumar

Feb 5, 2016, 5:45:35 AM
to tesser...@googlegroups.com
It would be great if someone who has built a Windows binary from the latest code with Visual Studio would package it for Windows users, so that it can be included as part of this release. Thanks.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Tom Morris

Feb 5, 2016, 9:18:23 AM
to tesser...@googlegroups.com
On Thu, Feb 4, 2016 at 3:04 AM, zdenko podobny <zde...@gmail.com> wrote:
I think that we should release a 3.04.01 (bug-fix) version:
  1. We planned it when we made the 3.04.00 release, because we expected the OpenCL fix to come soon...
  2. There are a lot of other fixes in the 3.05 branch that need to be transferred to the 3.04 branch (AFAIK only the "monitor" and "cmake" patches could be considered new features => they should stay in the 3.05 branch)
I don't see a 3.05 branch. Is 'master' effectively the 3.05 branch?
 
IMO it would also be nice to have these issues solved ASAP:
  • API compatibility with version 3.02 (namely tesseract::TessBaseAPI::ProcessPages and tesseract::TessBaseAPI::ProcessPage), whose breakage causes some Tesseract wrappers to stop working
  • OpenCL
  • check/fix docs (e.g. all examples on the wiki should be tested with the latest code)
  • and of course close as many open issues as possible[1]
That all sounds like good stuff, but it sounds like it could be a lot to squeeze into the two week window that Jeff is talking about.

Tom 

zdenko podobny

Feb 5, 2016, 9:32:39 AM
to tesser...@googlegroups.com
On Fri, Feb 5, 2016 at 3:18 PM, Tom Morris <tfmo...@gmail.com> wrote:
On Thu, Feb 4, 2016 at 3:04 AM, zdenko podobny <zde...@gmail.com> wrote:
I think that we should release a 3.04.01 (bug-fix) version:
  1. We planned it when we made the 3.04.00 release, because we expected the OpenCL fix to come soon...
  2. There are a lot of other fixes in the 3.05 branch that need to be transferred to the 3.04 branch (AFAIK only the "monitor" and "cmake" patches could be considered new features => they should stay in the 3.05 branch)
I don't see a 3.05 branch. Is 'master' effectively the 3.05 branch?

Yes ;-)
 
IMO it would also be nice to have these issues solved ASAP:
  • API compatibility with version 3.02 (namely tesseract::TessBaseAPI::ProcessPages and tesseract::TessBaseAPI::ProcessPage), whose breakage causes some Tesseract wrappers to stop working
  • OpenCL
  • check/fix docs (e.g. all examples on the wiki should be tested with the latest code)
  • and of course close as many open issues as possible[1]
That all sounds like good stuff, but it sounds like it could be a lot to squeeze into the two week window that Jeff is talking about.

We can make 3.04.02 soon, if there are people who would like to contribute... 

Tom 

Jeff Breidenbach

Feb 9, 2016, 3:09:14 PM
to tesseract-dev
Yes, I'm very sorry about the short window. I think Feb 18 
is the hard deadline and we would need to be ready before 
then. The date comes from the Ubuntu release schedule.

We don't have to work around the Apple / Tesseract PDF 
compatibility problem this cycle, but it seems like a good 
idea. Release cycles are every six months.

I suspect the majority of Tesseract users are on Ubuntu 
or similar, due to ease of installation. Statistical sampling 
suggests that 2.6% of all such systems have Tesseract 
installed and 0.5% of all systems have run it recently. 
Ubuntu was estimated at 25 million total users in 2014.
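
To put those percentages in absolute terms, here is a back-of-the-envelope calculation. It assumes the percentages can be applied directly to the 25 million user estimate, treating one user as one system, which the figures above do not strictly guarantee.

```shell
#!/usr/bin/env bash
# Rough estimate from the figures quoted above.
# Assumption: one user corresponds to one system.
total_users=25000000

# Percentages expressed per thousand so the arithmetic stays in integers.
installed_permille=26   # 2.6%
recent_permille=5       # 0.5%

installed=$(( total_users * installed_permille / 1000 ))
recent=$(( total_users * recent_permille / 1000 ))

echo "Systems with Tesseract installed: ${installed}"
echo "Systems that ran it recently:     ${recent}"
```

Under that assumption, roughly 650,000 systems have Tesseract installed and about 125,000 have run it recently.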


If for some reason it is impossible to get a real 3.04.01 
together before deadline, what should I do? Skip the
release cycle? Declare 3.04.01 = 3.04.00 + font? 
Other?

zdenko podobny

Feb 10, 2016, 2:46:51 AM
to tesser...@googlegroups.com
I will try to regenerate the Makefile.in/autotools files in the 3.04 branch on Friday, and then I will change the version number.
So please commit everything you want to have in 3.04.01 by that time.
Then you can take any code snapshot and use it as 3.04.01. I will wait maybe another week for any reports regarding the build system (I would expect the first one from you :-) ) so we can fix issues before setting the 3.04.01 tag/official release.



Zdenko

zdenko podobny

Feb 12, 2016, 6:10:53 PM
to tesser...@googlegroups.com
Done.

@all: please test the build system on your preferred OS/platform. There should be no significant changes in comparison to 3.04.00 (e.g. cmake is not included), but improvements are welcome.

Zdenko

Jeff Breidenbach

Feb 16, 2016, 1:42:53 PM
to tesseract-dev
I have no trouble building from the 3.04 branch and am ready to ship.
If there are any objections or concerns, now is a good time.

Last commit: 8473e5a2628efc70fe24b253eff79c3d7ebcddee

P.S. I don't think this should matter to anyone here, but I am bumping 
the soversion in the Debian packaging. This is purely because of an 
earlier Debian packaging mistake with 3.04.00.



Tom Morris

Feb 16, 2016, 1:56:46 PM
to tesser...@googlegroups.com
Builds and runs on Mac OS X.

I'd like to lobby for the inclusion of the fix for #225 because it's a breaking change (if you consider the hOCR output to be part of the public interface) to new functionality which was just introduced in 3.04.00 (adding line height parameters).

I don't think it was widely advertised, so hopefully no one or not many people are using it, but the less exposure the broken version gets, the less likely people are to depend on the way it's coded.

Of course, the more people who can review the fix, the better.

Tom

zdenko podobny

unread,
Feb 16, 2016, 3:20:59 PM2/16/16
to tesser...@googlegroups.com
Fix for #225 was merged... 
Can you have a look at #223 and #224 (there is a message "This branch has conflicts that must be resolved")?

Zdenko

Jeff Breidenbach

unread,
Feb 16, 2016, 3:47:25 PM2/16/16
to tesseract-dev
I should probably ship some time within the next 24 hours....

zdenko podobny

unread,
Feb 16, 2016, 4:35:22 PM2/16/16
to tesser...@googlegroups.com
OK. So I made 3.04.01 release live:


Zdenko

On Tue, Feb 16, 2016 at 9:47 PM, Jeff Breidenbach <breid...@gmail.com> wrote:
I should probably ship some time within the next 24 hours....


Tom Morris

unread,
Feb 16, 2016, 5:53:44 PM2/16/16
to tesser...@googlegroups.com
On Tue, Feb 16, 2016 at 3:20 PM, zdenko podobny <zde...@gmail.com> wrote:
Fix for #225 was merged... 

Thanks!
 
Can you have a look at #223 and #224 (there is a message "This branch has conflicts that must be resolved")?

I'll rebase them against the current head. I made them independent of each other in case some got accepted and not others, but they're all in the same area of the code.

They're not super critical for this release.

Tom 

Jeff Breidenbach

unread,
Feb 16, 2016, 7:54:13 PM2/16/16
to tesseract-dev
Thanks everyone. I have uploaded 3.04.01 (as per Zdenko's release) to 
Debian Unstable. Debian Unstable users should get it tomorrow. If all 
goes well, Ubuntu users will get it in April, as part of Ubuntu 16.04.

Cheers,
Jeff

Quan Nguyen

unread,
Feb 17, 2016, 8:47:55 PM2/17/16
to tesseract-dev
Tom,

Unfortunately, #225 has broken my builds on VS2013 with the following error:

Error 2 error C2057: expected constant expression tesseract-ocr\api\baseapi.cpp 1384 1 libtesseract304

It occurred on the line: char id_buffer[bufsize];

The compiler requires the array size to be a compile-time constant, which bufsize is not. Is there a workaround for that?

Thanks,
Quan

Tom Morris

unread,
Feb 17, 2016, 9:16:07 PM2/17/16
to tesser...@googlegroups.com
Hi Quan,

Sorry for the trouble. I never build on Windows and didn't notice that the CI job failed. I inadvertently took advantage of a g++ extension that the Microsoft compiler doesn't support, but the next commit fixes the issue:


Unfortunately, I think it missed the train for 3.04.01, but my understanding was that that release was cut principally for Debian, which won't be affected.

If you build from the HEAD of master, you'll get the fix -- or you can cherry-pick it into whatever branch you're working on.

Apologies again for introducing the instability.

Tom

On Wed, Feb 17, 2016 at 9:08 PM, Quan Nguyen <nguy...@gmail.com> wrote:
I changed it as below to get it compiled:

char* id_buffer = new char[bufsize];


Quan Nguyen

unread,
Feb 17, 2016, 9:19:33 PM2/17/16
to tesseract-dev
I changed it as below to get it compiled:

char* id_buffer = new char[bufsize];

And delete id_buffer (with delete[], since it was allocated with new[]) after use at the end of the function.
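
As a hedged alternative to the manual new/delete pair above, the buffer can be held in a std::vector, which is sized at run time and freed automatically on every return path. The sketch below is not the actual baseapi.cpp code: MakeId, its parameters and the "prefix_index" id format are hypothetical names chosen only to illustrate the technique.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the real baseapi.cpp code): builds "<prefix>_<index>"
// in a scratch buffer whose size is only known at run time.
std::string MakeId(const std::string& prefix, int index) {
  const std::string number = std::to_string(index);
  // Prefix characters, one '_' separator, then the digits.
  const std::size_t bufsize = prefix.size() + 1 + number.size();

  // `char id_buffer[bufsize];` would be a variable-length array here: a g++
  // extension that MSVC rejects with error C2057. std::vector allocates the
  // same storage at run time and releases it when it goes out of scope.
  std::vector<char> id_buffer(bufsize);

  auto out = std::copy(prefix.begin(), prefix.end(), id_buffer.begin());
  *out++ = '_';
  out = std::copy(number.begin(), number.end(), out);
  assert(out == id_buffer.end());  // the buffer was sized exactly

  return std::string(id_buffer.begin(), id_buffer.end());
}
```

If a raw new char[bufsize] is kept instead, it has to be released with delete[] rather than plain delete, and on every exit path of the function.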


Quan Nguyen

unread,
Feb 17, 2016, 9:54:04 PM2/17/16
to tesseract-dev
Oh, that should fix it too, but the fix must also be applied to the 3.04 branch for VS2013 to build successfully.

Thank you.

Jeff Breidenbach

unread,
Feb 24, 2016, 8:41:52 PM2/24/16
to tesseract-dev
Not urgent, but please take a look at this discussion:

It suggests moving a three line function from baseapi.h to 
baseapi.cc for better ABI compatibility. Supposedly no 
downside. Thoughts?

Tom Morris

unread,
Feb 26, 2016, 12:48:27 AM2/26/16
to tesser...@googlegroups.com
On Wed, Feb 24, 2016 at 8:41 PM, Jeff Breidenbach <breid...@gmail.com> wrote:
Not urgent, but please take a look at this discussion:

It suggests moving a three line function from baseapi.h to 
baseapi.cc for better ABI compatibility.

That function (with a single line body) being, I think:


  PageIterator* AnalyseLayout() {
    return AnalyseLayout(false);
  }
 
Supposedly no 
downside. Thoughts?

It's a little bit difficult to follow the discussion over the months (and to know the back story for pissy little people like Julien Cristau "happy to remove your package" - seriously?), but it sounds like the supposed ABI breakage happened between 3.03 and 3.04 and the 3.04.01 change, whether it be a new SO name, as is current, or the revised header file, as is proposed, is designed to address that. Did I follow that all correctly?

It also sounds like there are only two known dependent packages, at least in the Debian ecosystem, and one of them has already patched around the problem on their end.

If the proposed header file change restores 3.03 ABI compatibility, that sounds like a no-brainer, although I suspect it's mostly moot at this point. Bumping the shareable version from 3 to 4 is definitely something that I think should be avoided if it can be.
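
For readers following along, here is a minimal sketch of what "moving the function out of the header" means. It assumes (the thread does not confirm this) that the breakage came from the zero-argument overload no longer being emitted as a symbol in the shared library once its body lived inline in the header. LayoutApi and the one-field PageIterator are hypothetical stand-ins, not the real Tesseract classes.

```cpp
// Hypothetical stand-in; the real PageIterator lives in Tesseract's headers.
struct PageIterator {
  bool merge_similar_words;
};

// --- what would sit in the header (baseapi.h): declarations only ---
class LayoutApi {
 public:
  PageIterator* AnalyseLayout();
  PageIterator* AnalyseLayout(bool merge_similar_words);
};

// --- what would sit in the source file: out-of-line definitions ---
// The body is compiled into the library instead of being inlined into each
// caller, so the zero-argument overload has its own exported symbol that
// previously built binaries can still resolve.
PageIterator* LayoutApi::AnalyseLayout() {
  return AnalyseLayout(false);
}

PageIterator* LayoutApi::AnalyseLayout(bool merge_similar_words) {
  // Caller owns the returned object in this sketch.
  return new PageIterator{merge_similar_words};
}
```

Source code that calls AnalyseLayout() compiles the same way in both layouts; only the set of symbols the shared library exports differs.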

Those are my thoughts from a long, long way away, but hopefully they will spur others more knowledgeable to offer an opinion as well.

Tom

Jeff Breidenbach

unread,
Feb 27, 2016, 1:34:49 AM2/27/16
to tesseract-dev
Basically right. It looks like a small accidental ABI breakage between 3.03 and 3.04.00. I tried to deal with this by bumping the package soversion in 3.04.01, but was advised to undo that and instead repair the ABI break. So we're trying that now. We'll know in about a week if it works. If successful, then it probably makes sense to make the same small change in the primary repository. No urgency at all.

Regarding social etiquette, I've been lucky to always work with very reasonable people in person. And that's also been the case for the vast majority of online collaborations. Once in a while you get a real head scratcher, though. I try not to worry about it, and focus on how amazing it is that I can (sometimes) technically collaborate with someone despite that enormous barrier.

Rahul Yadav

unread,
Mar 8, 2016, 4:51:25 AM3/8/16
to tesseract-dev
Hi Zdenko,

In Tesseract, all the spaces between two words are skipped and only one space is kept to separate them.
I went through blogs to see how we can keep the exact spacing of the PDF, so that the converted text file looks like the PDF file.
Please help me with how we can change BaseAPI.cpp to achieve this.
If anyone has already achieved this functionality, kindly provide the updated code.

Regards,
Rahul

zdenko podobny

unread,
Mar 8, 2016, 5:15:41 AM3/8/16
to tesser...@googlegroups.com
This is not true. It can easily be verified with 'tesseract testing\eurotext.tif eurotext'.
For support questions ("how to use tesseract") please use the tesseract user forum.
And don't forget to describe what you are doing.

Zdenko
