Update on progress of 4.00 alpha:
In a training session over the holiday break, I tried 17 different network architectures to experiment with smaller, faster networks.
The news is good!
Exactly how it will work in 4.00 is still up for debate, but I now have a set of traineddata files that deliver a ~3x speed-up with almost no loss in accuracy for most languages!
On a modern enough multi-core machine with SSE/AVX-like SIMD instructions, these networks are faster than baseline Tesseract, even for Latin-script languages.
These may be provided as a second tessdata repo for those who want speed, or the current traineddata files may simply be replaced with the faster ones, since the accuracy and speed are so good.
Thanks to everyone who has contributed language-specific issues so far!
The main purpose of this post is a rallying cry for more such reports.
Since the training cycle takes about 2 weeks, I'd like to fix as many language issues as possible before going back to training.