Fwd: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)

ShreeDevi Kumar

Jul 12, 2017, 1:41:10 AM

to tesser...@googlegroups.com, tesser...@googlegroups.com

Forwarding update by Ray.

---------- Forwarded message ----------
From: theraysmith <notifi...@github.com>
Date: Wed, Jul 12, 2017 at 5:55 AM
Subject: Re: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)
To: tesseract-ocr/tesseract <tess...@noreply.github.com>

I'm about ready to update the traineddatas. I have a training run almost
complete, and with accuracy that meets with my satisfaction.
There are a few regressions, but not too serious.
First though, I have to get some code reviewed in Google, and then make
some commits to github to match the new traineddatas.
Before that, there is the matter of a major pull...

Here's what's coming:

- Fix to issue 653: New components in traineddata file for the
  unicharset, recoder and version string. Backwards compatible change, so the
  LSTM component can still read older files.
- Change in training system. The above change makes open source training
  impossible. Will add a new program to build a starter traineddata from a
  unicharset and optional word lists.
- New "normalization" code to clean corpus text in all languages. That
  was a big part of the work.
- Improvements to the trained networks to improve accuracy on single
  characters and single words.
- 2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the
  speed of legacy Tesseract in real time, provided you have the required
  parallelism components, and in total CPU only slightly slower for English.
  Way faster for most non-Latin languages, while being <5% worse than "best".
  Only "best" will be retrainable, as "fast" will be integer.

I have other stuff that is still incomplete, but that is a good list for
now.

BTW, in case you hadn't noticed, there was a breaking change that made old
lstmf files unusable. That was needed to fix LSTM for OSD. It has to know
the language of each training sample.
The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.

--
Ray.

Jeff Breidenbach

Aug 27, 2017, 6:21:35 PM

to tesseract-dev, tesser...@googlegroups.com

Alexander Pozdnyakov has done a really good job packing Tesseract in his
Personal Package Archive (PPA). I think it is getting to be time for wider usage,
so I'm working with him to promote these to official packages. First step is
Debian Experimental. That's a good place to work out problems, and hopefully
something can be ready for real users within a few weeks.

https://packages.qa.debian.org/t/tesseract.html

Jeff Breidenbach

Sep 6, 2017, 3:48:21 PM

to tesseract-dev, tesser...@googlegroups.com

There are new LSTM train data files available, but we need to
reorganize things on GitHub to make them manageable.

Right now there are 5 repositories on GitHub for tesseract-ocr.
They are tesseract, langdata, tessdata, tesseract-ocr.github.io,
and docs.

Please make two new ones. I suggest we call them
lstm-best and lstm-fast but other choices are possible.
Please give me write permissions to both repositories.
I will add the files for lstm-fast once the repository
is created. For lstm-best please migrate these files:

https://github.com/tesseract-ocr/tessdata/tree/master/best

At the end of the day, we will have three sets of .traineddata
files on GitHub in three separate repositories. Most users
will want LSTM Fast and that is what will be shipped as
part of Linux distributions. LSTM Best is for people willing
to trade a lot of speed for slightly better accuracy. It is also
better for certain retraining scenarios for advanced users.
The third set is for the legacy recognizer.

I do not have sufficient permission to do this myself, as you
can see from the attached screenshot. Thank you.

screenshot.png

Jeff Breidenbach

Sep 10, 2017, 1:56:35 PM

to tesseract-dev

Ping

Zdenko Podobný

Sep 10, 2017, 2:41:01 PM

to tesser...@googlegroups.com

Also the Ray... But we have the same rights if I understood it correctly:

Zdenko

On Sun, Sep 10, 2017 at 7:56 PM, Jeff Breidenbach <breid...@gmail.com> wrote:

> Ping
--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-dev+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-dev/2cc671c5-c33f-4c91-a357-57098e590f21%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Sep 12, 2017, 2:58:58 AM

to tesser...@googlegroups.com

I notice that Ray has created the new repos, but they have not been populated with the traineddata files yet.

https://github.com/tesseract-ocr/tessdata_best

https://github.com/tesseract-ocr/tessdata_fast

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Jeff Breidenbach

Sep 14, 2017, 6:17:03 PM

to tesseract-dev

Populated the new repositories, and removed the LSTM files from tessdata.

I'm sure documentation needs updating.

ShreeDevi Kumar

Sep 15, 2017, 2:02:50 AM

to tesser...@googlegroups.com

Thanks!

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Sep 15, 2017 at 3:47 AM, Jeff Breidenbach <breid...@gmail.com> wrote:

> Populated the new repositories, and removed the LSTM files from tessdata.
> I'm sure documentation needs updating.


ShreeDevi Kumar

Sep 15, 2017, 2:47:55 AM

to tesseract-dev

> Most users
> will want LSTM Fast and that is what will be shipped as
> part of Linux distributions.

tessdata_fast only has the fast LSTM models. That means that --oem 0 and --oem 2 will NOT work with these.

However, there are a number of cases where the legacy engine performs better than LSTM. So, shouldn't the version being shipped as part of Linux distributions include traineddata files that support both engines?

I would suggest traineddata files with fast LSTM model and legacy model combined - similar to the 4.0 traineddata files from Nov 2016.

Jeff Breidenbach

Dec 20, 2017, 6:26:05 PM

to tesseract-dev

The packages have made progress and will likely ship with Ubuntu 18.04.
I'll probably re-synchronize with Alex's latest PPA which is a Dec 15 git
snapshot. Does it have any known big problems?

https://packages.qa.debian.org/t/tesseract.html

ShreeDevi Kumar

Dec 21, 2017, 2:07:53 AM

to tesser...@googlegroups.com

Jeff,

Do you know if Ray is planning to update tessdata before the debian release?

Which flavor of tessdata will be included in the package? Fast?


Jeff Breidenbach

Dec 31, 2017, 1:47:48 PM

to tesseract-dev

Sorry for the slow reply; it was due to the holidays.

> Do you know if Ray is planning to update tessdata before the debian release?

He will not.

> Which flavor of tessdata will be included in the package? Fast?

Yes, fast. I think this is the best choice for most users, and I worry
about causing confusion by providing multiple options.

To give a small update, a Dec 15 git snapshot is now shipping as
part of Debian Unstable and Debian Testing. I expect it to be part
of Ubuntu 18.04 (releasing in April 2018), but it has not yet been
integrated there. Thank you again to Alexander for doing 99%
of the work with his PPA.

If I am reading these survey numbers right, Tesseract is installed on
8% of Debian systems, and executed recently on 2% of them. There
are now 347 packages that depend on Tesseract, with 6 of them being
direct dependencies.

https://qa.debian.org/popcon.php?package=tesseract

If anyone notices any problems with any of these packages, this is
a very good time to speak up.

ShreeDevi Kumar

Jan 1, 2018, 8:53:29 AM

to tesser...@googlegroups.com

Thanks for the update, Jeff. Happy New Year.

Does 'fast' traineddata support --oem 0 for all languages? If not, an appropriate user-friendly error message should be given.

Best wishes for the new year to the Tesseract-ocr team.


Jeff Breidenbach

Jan 1, 2018, 1:47:43 PM

to tesseract-dev

'Fast' refers to an LSTM neural net that uses integer math instead of floating point.
Therefore it supports neither --oem 0 nor --oem 2. The current error message is
below. I agree that this message isn't wonderful, but I'm not sure what would be
better.
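
To illustrate what "integer math instead of floating point" can mean, here is a generic sketch of 8-bit weight quantization in Python. It is not Tesseract's implementation: the function names, the symmetric int8 scheme, and the sample numbers are all assumptions chosen only for illustration.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(v / scale) for v in values], scale


def int_dot(q_weights: List[int], q_inputs: List[int]) -> int:
    """Dot product computed with integer arithmetic only."""
    return sum(w * x for w, x in zip(q_weights, q_inputs))


# Hypothetical weights and inputs, not taken from any real model.
weights = [0.50, -1.00, 0.25]
inputs = [1.00, 0.50, -0.75]

q_w, w_scale = quantize(weights)
q_x, x_scale = quantize(inputs)

# Accumulate in integers, then rescale to a float once at the end.
approx = int_dot(q_w, q_x) * w_scale * x_scale
exact = sum(w * x for w, x in zip(weights, inputs))
print(f"float: {exact:.4f}  int8: {approx:.4f}")
```

The integer result is close to the floating-point one but not identical, because rounding discards some information. That loss may be part of why Ray describes "fast" as slightly less accurate than "best" and not retrainable, though the thread does not spell out the mechanism.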

--

$ tesseract --oem 0 phototest.tif - -
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

ShreeDevi Kumar

Jan 2, 2018, 3:46:06 AM

to tesser...@googlegroups.com

I would suggest an error message mentioning the real issue. Something on the following lines:

OEM 0 or OEM 2 are NOT supported by this traineddata file.

It could also change the value of OEM to 1 and run the OCR.

I think there is a similar message for the opposite case where LSTM is requested but does not exist.
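
A minimal sketch of the behaviour suggested above, written in Python for brevity rather than as a patch to Tesseract's C++ code. The function name `resolve_oem` and the two boolean flags are hypothetical; only the --oem values (0 for legacy only, 1 for LSTM only, 2 for legacy plus LSTM) come from the tesseract command line.

```python
from typing import Optional, Tuple

OEM_LEGACY_ONLY = 0   # --oem 0
OEM_LSTM_ONLY = 1     # --oem 1
OEM_LEGACY_LSTM = 2   # --oem 2

FALLBACK_MESSAGE = (
    "OEM 0 or OEM 2 are NOT supported by this traineddata file; "
    "using OEM 1 (LSTM only) instead."
)
UNAVAILABLE_MESSAGE = "The requested engine is not available in this traineddata file."


def resolve_oem(requested: int, has_legacy: bool, has_lstm: bool) -> Tuple[Optional[int], str]:
    """Return (engine mode to use, message).

    has_legacy / has_lstm are hypothetical flags saying which components
    the loaded traineddata file contains. The mode is None when nothing
    usable is available. Values other than 0, 1, 2 are passed through.
    """
    needs_legacy = requested in (OEM_LEGACY_ONLY, OEM_LEGACY_LSTM)
    needs_lstm = requested in (OEM_LSTM_ONLY, OEM_LEGACY_LSTM)

    legacy_ok = has_legacy or not needs_legacy
    lstm_ok = has_lstm or not needs_lstm
    if legacy_ok and lstm_ok:
        return requested, ""

    # Something is missing. If LSTM is present, the legacy component must be
    # the missing one, so fall back to LSTM only and say why.
    if has_lstm:
        return OEM_LSTM_ONLY, FALLBACK_MESSAGE

    return None, UNAVAILABLE_MESSAGE


# A "fast" file has an LSTM model but no legacy component.
mode, message = resolve_oem(OEM_LEGACY_ONLY, has_legacy=False, has_lstm=True)
print(mode, message)
```

Whether to fall back silently, fall back with a warning, or stop with an error is a design choice; the sketch takes the middle option so the user still gets output but learns why the engine changed.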

I notice that this thread is in the Dev mailing list. I will add a link to this discussion on GitHub.


Jeff Breidenbach

Feb 8, 2018, 12:09:42 PM

to tesseract-dev

Ubuntu 18.04 contains a git snapshot from Dec 15 (commit cdc35338).
Realistically, there are about two weeks left to fiddle around. Any requests?

https://wiki.ubuntu.com/BionicBeaver/ReleaseSchedule

https://launchpad.net/ubuntu/+source/tesseract
