Calling all language issues for 4.00!


Ray

Jan 12, 2017, 12:06:39 PM
to tesseract-dev
Update on progress of 4.00 alpha:

In a training session over the holiday break, I tried 17 different network architectures to experiment with smaller, faster networks.
The news is good!

Exactly how this will be packaged in 4.00 is still up for debate, but I now have a set of traineddata files that deliver a ~3x speed-up with almost no loss in accuracy for most languages!
On a sufficiently modern machine with multiple cores and SSE/AVX-like SIMD instructions, these networks beat baseline Tesseract for speed, even on Latin-script languages.
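For anyone who wants to verify the speed-up locally once the files are up, here is a minimal benchmark sketch (Python; it assumes a tesseract 4 binary on PATH, and the image and tessdata paths are placeholders to fill in):

import subprocess
import time

IMAGE = "sample_page.png"                    # placeholder test image
TESSDATA_DIRS = {
    "baseline": "/path/to/old/tessdata",     # placeholder paths
    "fast": "/path/to/new/tessdata",
}

# Time the LSTM engine (--oem 1) against each tessdata directory.
for name, tessdata in TESSDATA_DIRS.items():
    start = time.perf_counter()
    subprocess.run(
        ["tesseract", IMAGE, "stdout",
         "--oem", "1", "-l", "eng",
         "--tessdata-dir", tessdata],
        check=True, capture_output=True)
    print(f"{name}: {time.perf_counter() - start:.2f}s")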

This may be provided as a second tessdata repo for those who want speed, or the current traineddata files may simply be replaced with the faster ones, since the accuracy and speed are so good.

Thanks to everyone who has contributed language-specific issues so far!
The main purpose of this post is a rallying cry for more.
Since the training cycle takes about 2 weeks, I'd like to fix as many language issues as possible before going back to training.

ShreeDevi Kumar

Jan 13, 2017, 12:42:22 AM
to tesser...@googlegroups.com
Ray,
You could update the tessdata repo with the new traineddata files now and tag it as 4.0.0-alpha2. That way we can test and report issues against the latest, faster traineddata.
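If such a tag existed, testers could fetch a single language file from it and point tesseract at the download, roughly like this (the tag name is hypothetical until the release is actually made; the URL layout mirrors the current tessdata repo):

import subprocess
import urllib.request
from pathlib import Path

TAG = "4.0.0-alpha2"    # hypothetical tag, proposed above
LANG = "eng"
url = (f"https://raw.githubusercontent.com/tesseract-ocr/tessdata/"
       f"{TAG}/{LANG}.traineddata")

# Download the traineddata file into a local test directory.
tessdata = Path("tessdata-test")
tessdata.mkdir(exist_ok=True)
urllib.request.urlretrieve(url, str(tessdata / f"{LANG}.traineddata"))

# Run the LSTM engine against the downloaded file.
subprocess.run(["tesseract", "page.png", "stdout",
                "--oem", "1", "-l", LANG,
                "--tessdata-dir", str(tessdata)], check=True)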
Thanks for all your work!

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com


ShreeDevi Kumar

Jan 13, 2017, 5:41:46 AM
to tesser...@googlegroups.com
Please also review the pending issues and pull requests on both the langdata and tessdata repos.

- excuse the brevity, sent from mobile

ShreeDevi Kumar

Jan 15, 2017, 12:57:00 PM
to tesser...@googlegroups.com
Ray,


It seems that the deu training_text is missing ë, and ü is not being recognized.
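A quick way to confirm what a training text actually covers, for anyone who wants to check other languages too (the path assumes a local checkout of the langdata repo):

from pathlib import Path

# Report which expected characters are missing from a training text.
text = Path("langdata/deu/deu.training_text").read_text(encoding="utf-8")
expected = "äöüÄÖÜßëé"    # characters German text can be expected to contain
missing = [c for c in expected if c not in text]
print("missing:", missing or "none")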

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

Shree Devi Kumar

Feb 7, 2017, 2:35:15 AM
to tesseract-dev
Ray,

Any update on the training? When do you expect to upload the new traineddata?

Thanks!

James R Barlow

Feb 8, 2017, 2:17:54 AM
to tesseract-dev
Hi Ray,

The French letter ï (i with diaeresis) seems to be read as a regular i, e.g. in the word "ovoïde".
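A reproducible test case could be built with text2image, roughly like this (the font name is an assumption; any installed font that covers ï should do):

import subprocess
from pathlib import Path

# Render a word containing ï, then OCR it with the fra traineddata.
Path("ovoide.txt").write_text("ovoïde\n", encoding="utf-8")
subprocess.run(["text2image", "--text=ovoide.txt",
                "--outputbase=ovoide", "--font=Arial"], check=True)
result = subprocess.run(["tesseract", "ovoide.tif", "stdout",
                         "--oem", "1", "-l", "fra"],
                        check=True, capture_output=True, text=True)
print("expected: ovoïde  got:", result.stdout.strip())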

Stefan Weil

Mar 9, 2017, 3:05:58 PM
to tesseract-dev
While looking for the cause of some OCR errors with German text, I noticed several weaknesses of the training data (see https://github.com/tesseract-ocr/langdata/issues/55 for examples). What is the best way to get those fixed in updates of tessdata? Where do systematic errors (like ii instead of ü or training words which don't exist in the language) come from, and how can we avoid them?

I also observe that most (all?) Latin-based languages use the whole range of Latin characters in real texts, but only part of those characters are present in the Tesseract training data. https://github.com/tesseract-ocr/langdata/pull/58 is an example of a missing character in English, but there are more prominent deficits, too. For example, the word café (written with the accent) can be found in many (all?) European languages, and of course the accented e is available in the French training data and even in the English data, but it is missing in German and other languages that also use it. As there is more and more exchange between countries, I see an increase in foreign names and spellings in German texts, and I expect the same for other languages.

So maybe all Latin-based training data should include all variants of Latin characters (ASCII, umlauts, accented characters, ...), and the training data for the different languages would differ only in word lists or in the probabilities of certain characters and their combinations. Could that also reduce the time needed for training?
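One way to quantify those gaps would be to diff the character sets of the training texts, for example (paths assume a local langdata checkout; this is only a sketch):

from pathlib import Path

# Collect the set of letters used in a language's training text.
def charset(lang):
    text = Path(f"langdata/{lang}/{lang}.training_text").read_text(
        encoding="utf-8")
    return {c for c in text if c.isalpha()}

fra, deu = charset("fra"), charset("deu")
print("in fra but not deu:", "".join(sorted(fra - deu)))
print("in deu but not fra:", "".join(sorted(deu - fra)))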
