Fwd: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)


ShreeDevi Kumar

Jul 12, 2017, 12:03:04 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Forwarding update by Ray.


---------- Forwarded message ----------
From: theraysmith <notifi...@github.com>
Date: Wed, Jul 12, 2017 at 5:55 AM
Subject: Re: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)
To: tesseract-ocr/tesseract <tess...@noreply.github.com>


I'm about ready to update the traineddatas. I have a training run that is
almost complete, with accuracy I am satisfied with.
There are a few regressions, but nothing too serious.
First, though, I have to get some code reviewed at Google, and then make
some commits to GitHub to match the new traineddatas.
Before that, there is the matter of a major pull...

Here's what's coming:

- Fix to issue 653: new components in the traineddata file for the
unicharset, recoder, and version string. This is a backwards-compatible
change, so the LSTM component can still read older files.
- Change in training system. The above change would make open-source
training impossible as it stands, so a new program will be added to build a
starter traineddata from a unicharset and optional word lists (see the
sketch after this list).
- New "normalization" code to clean corpus text in all languages. That
was a big part of the work.
- Improvements to the trained networks to improve accuracy on single
characters and single words.
- Two parallel sets of tessdata: "best" and "fast". "Fast" will exceed the
speed of legacy Tesseract in real time, provided you have the required
parallelism components, and will be only slightly slower in total CPU for
English. It is way faster for most non-Latin languages, while being <5%
worse than "best". Only "best" will be retrainable, as the "fast" models
will be integer.

I have other stuff that is still incomplete, but that is a good list for
now.

BTW, in case you hadn't noticed, there was a breaking change that made old
lstmf files unusable. That was needed to fix LSTM-based OSD, which has to
know the language of each training sample.
The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.


--
Ray.



Jeff Breidenbach

Aug 27, 2017, 6:21:42 PM
to tesseract-dev, tesser...@googlegroups.com
Alexander Pozdnyakov has done a really good job packaging Tesseract in his
Personal Package Archive (PPA). I think it is getting to be time for wider
usage, so I'm working with him to promote these to official packages. The
first step is Debian Experimental. That's a good place to work out problems,
and hopefully something will be ready for real users within a few weeks.
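
For anyone who wants to try those builds before the official packages land,
a minimal sketch of installing from the PPA follows; the archive name
ppa:alex-p/tesseract-ocr is an assumption here, so check Launchpad for the
current name:

    sudo add-apt-repository ppa:alex-p/tesseract-ocr   # PPA name assumed; verify on Launchpad
    sudo apt-get update
    sudo apt-get install tesseract-ocr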


Jeff Breidenbach

Sep 6, 2017, 3:48:29 PM
to tesseract-dev, tesser...@googlegroups.com
There are new LSTM traineddata files available, but we need to
reorganize things on GitHub to make them manageable.
Right now there are five repositories on GitHub for tesseract-ocr:
tesseract, langdata, tessdata, tesseract-ocr.github.io,
and docs.

Please make two new ones. I suggest we call them
lstm-best and lstm-fast, but other names are possible.
Please give me write permission to both repositories.
I will add the files for lstm-fast once the repository
is created. For lstm-best, please migrate these files:

At the end of the day, we will have three sets of .traineddata
files on GitHub in three separate repositories. Most users
will want LSTM Fast, and that is what will be shipped as
part of Linux distributions. LSTM Best is for people willing
to trade a lot of speed for slightly better accuracy; it is also
better for certain retraining scenarios for advanced users.
The third set is for the legacy recognizer.
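
As a rough sketch of how an end user would choose between the two LSTM
sets once the repositories exist (the lstm-fast and lstm-best names are
only suggestions at this point, so the URLs below are assumptions):

    # Fetch the proposed repositories (names and URLs assumed, not yet created)
    git clone https://github.com/tesseract-ocr/lstm-fast.git
    git clone https://github.com/tesseract-ocr/lstm-best.git

    # Point tesseract at one set or the other with --tessdata-dir
    tesseract page.png out_fast -l eng --tessdata-dir ./lstm-fast
    tesseract page.png out_best -l eng --tessdata-dir ./lstm-best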

I do not have sufficient permission to do this myself, as you
can see from the attached screenshot. Thank you.

screenshot.png