Tesseract 3.01 Released

zdenko podobny

Oct 30, 2011, 10:24:13 AM

to tesser...@googlegroups.com, tesser...@googlegroups.com

Hello all,

Tesseract 3.01 has been released. You can find it in the download section [1] or in the "Featured" section of the project page.

The Windows installer was built on Windows XP SP3 with VC++ 2008 Express, so you may need the Microsoft Visual C++ 2008 SP1 Redistributable Package (x86) [2]. Tesseract.exe is a static build from the VS2008 solution, so no additional libraries are needed.

Language data files created for Tesseract 3.00 can be used with 3.01. Language data files created with Tesseract 3.01 will not work with Tesseract 3.00.

Tesseract release notes - V3.01

Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. There is no training module for Cube yet.
OcrEngineMode in Init replaces AccuracyVSpeed to control Cube.
Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
Added PageIterator and ResultIterator as cleaner ways to get the full results out of Tesseract, which are not currently provided by any of the TessBaseAPI::Get* methods. All other methods, the ETEXT_STRUCT in particular, are deprecated and will be deleted in the future.
ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you must have already bootstrapped the language with character boxes. This is a "cyclic dependency" on traineddata.
Auto orientation and script detection added to page layout analysis.
Deleted lots of dead code.
Fixxht module replaced with scalable data-driven module.
Output font characteristics accuracy improved.
Removed the double conversion at each classification.
Upgraded oldest structs to be classes and deprecated PBLOB.
Removed non-deterministic baseline fit.
Added fixed length dawgs for Chinese.
Handling of vertical text improved.
Handling of leader dots improved.
Table detection greatly improved.
Fixed a couple of memory leaks.
Fixed font labels on output text. (Not perfect, but a lot better than before.)
Cleanup and more bug fixes.
Special treatments for Hindi.
Support for building in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz).
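
To illustrate the thread-safety note at the top of the list, here is a minimal sketch of the general pattern it describes: working state lives in class members instead of globals, so each thread can own its own instance. ToyRecognizer and RecognizeInParallel are hypothetical names made up for this example. They are not part of the Tesseract API, and the "recognition" is just a character count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recognizer; NOT the Tesseract API.
// All working state is a member, so instances do not interfere.
class ToyRecognizer {
 public:
  // "Recognizes" a page by counting its non-space characters.
  int Recognize(const std::string& page) {
    char_count_ = 0;  // per-instance scratch state (would be unsafe as a global)
    for (char c : page) {
      if (c != ' ') ++char_count_;
    }
    return char_count_;
  }

 private:
  int char_count_ = 0;
};

// Processes each page in its own thread with its own recognizer instance.
std::vector<int> RecognizeInParallel(const std::vector<std::string>& pages) {
  std::vector<int> results(pages.size(), 0);
  std::vector<std::thread> workers;
  workers.reserve(pages.size());
  for (std::size_t i = 0; i < pages.size(); ++i) {
    workers.emplace_back([&pages, &results, i] {
      ToyRecognizer rec;                     // one instance per thread, never shared
      results[i] = rec.Recognize(pages[i]);  // each thread writes only its own slot
    });
  }
  for (std::thread& t : workers) t.join();
  assert(results.size() == pages.size());
  return results;
}
```

If char_count_ were a global instead of a member, the threads above would race on it. Per the release note, some control parameters in 3.01 are still global, so changing them affects every thread.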

[1] http://code.google.com/p/tesseract-ocr/downloads/list

[2] http://www.microsoft.com/download/en/details.aspx?id=5582&WT.mc_id=MSCOM_EN_US_DLC_DETAILS_121LSUS007998

Cong Nguyen

Oct 30, 2011, 11:16:15 AM

to tesser...@googlegroups.com, tesser...@googlegroups.com

Thanks man!!!!!