distribute legacy train data files in Debian & friends?


Jeff Breidenbach

Dec 4, 2018, 12:55:01 PM
to tesseract-dev
There are 3 sets of trained data: best, fast and legacy. Right now only "fast" ships
with Debian. We did that because it seemed like a good balance between solving
most people's OCR needs, and minimizing confusion. But some people ask for more;
for example, this request is for the legacy data. What do people think? Yes or no?

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913951

Shree Devi Kumar

Dec 4, 2018, 2:27:59 PM
to tesser...@googlegroups.com
The traineddata format allows both LSTM and legacy models to coexist in one file; the engine is selected by passing the appropriate --oem value (0 for legacy, 1 for LSTM).
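A minimal sketch of how that engine choice might be scripted, assuming a traineddata file that contains both models. The helper name, the image name and the output names are hypothetical; only the `tesseract` arguments themselves come from the command line tool.

```python
import shlex

# Engine modes as listed by `tesseract --help-oem`:
# 0 = legacy engine only, 1 = LSTM engine only.
OEM_LEGACY = 0
OEM_LSTM = 1


def build_tesseract_command(image, output_base, lang="eng", oem=OEM_LSTM):
    """Return the argv list for one OCR run with an explicit engine mode."""
    if oem not in (OEM_LEGACY, OEM_LSTM):
        raise ValueError("this sketch only handles --oem 0 and --oem 1")
    return ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]


if __name__ == "__main__":
    # page.png and the output base names are made up for illustration.
    for oem in (OEM_LEGACY, OEM_LSTM):
        cmd = build_tesseract_command("page.png", f"out_oem{oem}", oem=oem)
        print(shlex.join(cmd))
```

Running the same page through both modes is also a cheap way to check whether a given traineddata file really carries the legacy model: with an LSTM-only file, the `--oem 0` run is expected to fail.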

The tessdata repo has both LSTM and legacy models.

tessdata_best and tessdata_fast have only LSTM models.

It should be possible to use the combine_tessdata command to add the legacy models to the files from the best and fast repos.
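One way this might look, as an untested sketch: unpack the combined file from the tessdata repo with `combine_tessdata -u`, then write the non-LSTM components into the fast (or best) file with `combine_tessdata -o`. The helper names, the paths and the rule used to pick the legacy components (everything not starting with `lstm`, and skipping `version`) are assumptions, not a documented procedure.

```python
from pathlib import Path


def is_legacy_component(path, lang):
    """Guess whether an unpacked file is a legacy component.

    Assumption: unpacked files are named '<lang>.<component>' and the
    LSTM-specific components all start with 'lstm'. The 'version'
    component is skipped so the target file keeps its own.
    """
    name = Path(path).name
    prefix = lang + "."
    if not name.startswith(prefix):
        return False
    component = name[len(prefix):]
    return not component.startswith("lstm") and component != "version"


def unpack_command(traineddata, prefix):
    """argv for unpacking all components to files named <prefix><component>."""
    return ["combine_tessdata", "-u", str(traineddata), str(prefix)]


def overwrite_command(target, unpacked_files, lang):
    """argv for writing the legacy components into an existing traineddata."""
    legacy = sorted(str(p) for p in unpacked_files if is_legacy_component(p, lang))
    return ["combine_tessdata", "-o", str(target), *legacy]


if __name__ == "__main__":
    # Hypothetical paths; the commands are only printed, not executed.
    print(unpack_command("tessdata/eng.traineddata", "unpacked/eng."))
    files = ["unpacked/eng.unicharset", "unpacked/eng.inttemp", "unpacked/eng.lstm"]
    print(overwrite_command("tessdata_fast/eng.traineddata", files, "eng"))
```

The result would still need to be checked with `--oem 0` on a test image before shipping anything built this way.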

There is a known issue though with --oem 2.


--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-dev/554b878b-d6e2-4864-ac82-6c7ff5b13c7d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Dec 4, 2018, 4:14:02 PM
to tesser...@googlegroups.com
Also note that for certain languages the lstm models provide much improved accuracy.

If comparative accuracy data is available for different languages for the different versions, it would help in making an informed decision.

stjo...@googlemail.com

Mar 5, 2019, 6:32:51 AM
to tesseract-dev
I recently modified Tesseract to get image files by URL (https://github.com/tesseract-ocr/tesseract/pull/2134) and was already thinking about extending that to models.

It would be possible to extend the `-l` command line option to allow URLs. Tesseract could also download missing models on demand and cache them locally. It could require well defined prefixes for the language, so `-l eng` or `-l script/Latin` would use installed languages, while `-l tessdata/eng` or `-l tessdata_best/script/Latin` would use models from GitHub.
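A rough sketch of how such prefixes could be resolved, only to make the idea concrete. Nothing here exists in Tesseract: the function name, the list of recognized prefixes, the cache layout and the GitHub URL pattern (including the `main` branch name) are all assumptions.

```python
from pathlib import Path

# Assumed for illustration: which prefixes mean "fetch from GitHub"
# and how the download URL is built.
REMOTE_REPOS = ("tessdata", "tessdata_best", "tessdata_fast")
URL_TEMPLATE = "https://github.com/tesseract-ocr/{repo}/raw/main/{model}.traineddata"


def resolve_language(spec, tessdata_dir, cache_dir):
    """Map a -l value to (local_path, download_url).

    download_url is None for installed languages such as 'eng' or
    'script/Latin'. For 'tessdata/eng' or 'tessdata_best/script/Latin'
    the local path points into the cache and the URL says where a
    missing file would be fetched from.
    """
    repo, _, model = spec.partition("/")
    if repo in REMOTE_REPOS and model:
        local = Path(cache_dir) / repo / (model + ".traineddata")
        return local, URL_TEMPLATE.format(repo=repo, model=model)
    return Path(tessdata_dir) / (spec + ".traineddata"), None


if __name__ == "__main__":
    # Directory names are hypothetical.
    for spec in ("eng", "script/Latin", "tessdata/eng", "tessdata_best/script/Latin"):
        print(spec, "->", resolve_language(spec, "/usr/share/tessdata", "/tmp/tess-cache"))
```

A caller would download only when the URL is not None and the local path does not exist yet, which gives the "download on demand and cache" behaviour.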

Some models load additional models automatically. This currently works with a fixed path and would have to be changed to relative paths.

Zdenko Podobny

Mar 23, 2019, 4:25:08 AM
to tesser...@googlegroups.com
My opinion/suggestion is to have a "lean" OCR library (libtesseract) with minimal dependencies, so that it can be easily integrated by other projects, maybe with a simple example of how to use the library.

Then it would be great to have a (feature-rich) tool that would help with standard usage problems like the ones you mention above, e.g. downloading missing data files, fixing DPI, image preprocessing (fixing rotation, deskewing...). More external dependencies can be expected there, in exchange for a better user experience.

With this scenario training tools would be separated too... ;-)

I believe this can bring more flexibility, because:
  • a more user-friendly frontend can be rapidly developed/released
  • adding new features will bring more problems (e.g. for downloading data: using a proxy, parsing JSON data from the GitHub API) that are not related to OCR itself
  • more advanced users can focus on improving the API and the OCR library (e.g. for Python, Java, C# usage)
  • not to forget: others could focus on training and look for improvements in this area (also from a coding point of view: e.g. using CUDA or OpenCL)
Zdenko

