Tesseract 3.02.02 Released

zdenko podobny

Nov 3, 2012, 10:50:00 AM
to tesser...@googlegroups.com
Hello all,

Tesseract OCR 3.02 has been released (as 3.02.02). You can find it in the download section [1] or in the "Featured" section of the project page.


Tesseract release notes - V3.02
  • Moved ResultIterator/PageIterator to ccmain.
  • Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
  • Added paragraph detection in layout analysis/post OCR.
  • Fixed inconsistent xheight during training and over-chopping.
  • Added simultaneous multi-language capability.
  • Refactored top-level word recognition module.
  • Added experimental equation detector.
  • Improved handling of resolution from input images.
  • Blamer module added for error analysis.
  • Cleaned up externally used namespace by removing includes from baseapi.h.
  • Removed dead memory management code.
  • Tidied up constraints on control parameters.
  • Added support for ShapeTable in classifier and training.
  • Refactored class pruner.
  • Fixed training leaks and randomness.
  • Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
  • Improved line detection and removal.
  • Added fixed pitch chopper for CJK.
  • Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
  • Fixed problems with internally scaled images.
  • Added page and bbox to string in tr files to identify source of training data better.
  • Fixes to Hindi Shirorekha splitter.
  • Added word bigram correction.
  • Reduced stack memory consumption and eliminated some ugly typedefs.
  • Added new uniform classifier API.
  • Added new training error counter.
  • Fixed endian bug in dawg reader.
  • C API (thanks to Tobias Müller)
  • New solution for VS 2008 (thanks to Tom Powers)
  • Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
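
As an illustration of the simultaneous multi-language capability listed above: on the command line, several languages are requested by joining their codes with "+" in the -l option (for example -l eng+fra). The Python sketch below only builds such a command. The helper name build_tesseract_command and the file names are made up for this example, and it assumes that the tesseract executable and the data files for every listed language are installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Return an argv list that runs tesseract with one or more languages.

    Hypothetical helper for illustration only; it is not part of Tesseract.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Several languages are combined into a single -l value with '+'.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    cmd = build_tesseract_command("page.tif", "page", ["eng", "fra"])
    print(" ".join(cmd))
    # To actually run OCR (needs tesseract on PATH and both language files):
    # subprocess.run(cmd, check=True)
```

The recognized text would be written to page.txt, since tesseract appends the extension to the output base name.
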
The Windows installer was built on Windows XP SP3 with the NSIS tool. Tesseract.exe (and the training tools) is a 32-bit static build made with VC++ 2008 Express, so you may need the Microsoft Visual C++ 2008 SP1 Redistributable Package (x86) [2].

All Google-generated language data files were updated (the community language data files have not been updated yet).
New languages available from Google: afr, aze, bel, ben, chr, enm, epo, est, eus, frm, glg, ita_old, kan, mal, mkd, mlt, msa, spa_old, sqi, swa, tam, tel.
Cube data files are now also available for ita, fra, rus, and spa.
Data for the experimental equation detector (equ) was added.
There is also a new community language, Ancient Greek (grc), thanks to Nick White.

Language data files created for 3.00 and 3.01 can be used in 3.02. Language data files created with Tesseract OCR 3.02 will not work in previous versions.

Thanks to all of you who shared your know-how and tested Tesseract 3.02 in svn.
Thanks to Google for supporting this project!

[1] http://code.google.com/p/tesseract-ocr/downloads/list
[2] http://www.microsoft.com/en-us/download/details.aspx?id=5582&WT.mc_id=MSCOM_EN_US_DLC_DETAILS_121LSUS007998

--
Zdenko Podobný
Community project contributor

Nick White

Nov 5, 2012, 7:51:23 AM
to tesser...@googlegroups.com
Great news, thanks everyone for the hard work, great job!

Nick
Message has been deleted

Nick White

Jan 22, 2013, 4:41:04 AM
to tesser...@googlegroups.com
Hi Alex,

On Mon, Jan 21, 2013 at 10:19:41AM -0800, alex...@yahoo.com wrote:
> Is there any GUI for training? I got box files, but then it got complicated. A GUI
> would make it easier and more convenient, with fewer errors after all.
> Thanks.

No, there is no training GUI at the moment, I'm afraid.

The training process is a bit long, but hopefully shouldn't be too
hard to follow. Let us know if you get stuck and we'll be able to
help you (and update the documentation if necessary to make things
clearer).
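
For orientation, here is a rough sketch of the command sequence for one training image, as I understand the 3.02 training documentation. Treat the details as assumptions and check them against the documentation: the function name training_commands and the language and font names are invented for this example, the font_properties file has to be written by hand beforehand, and the optional dictionary (DAWG) steps are left out.

```python
def training_commands(lang, font, exp=0):
    """List the training tool invocations for a single image/box pair.

    Hypothetical helper: it only builds argv lists and runs nothing.
    """
    base = f"{lang}.{font}.exp{exp}"
    return [
        # 1. Produce the .tr feature file from the image and its box file.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # 2. Collect the character set from the box file.
        ["unicharset_extractor", f"{base}.box"],
        # 3. Build the shape table (new in 3.02).
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset",
         f"{base}.tr"],
        # 4. Create the shape prototypes and the output unicharset.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        # 5. Create the character normalization prototypes.
        ["cntraining", f"{base}.tr"],
        # 6. After renaming the outputs (shapetable, normproto, inttemp,
        #    pffmtable) with the "<lang>." prefix, pack them into one file.
        ["combine_tessdata", f"{lang}."],
    ]


if __name__ == "__main__":
    for cmd in training_commands("xyz", "myfont"):
        print(" ".join(cmd))
```

With several training images, steps 1 and 2 would be repeated or extended to every image/box pair, and the later tools would be given all of the resulting .tr files.
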

Nick

iram akbar

Nov 7, 2014, 5:56:12 AM
to tesser...@googlegroups.com
Hi,
 
I want to make my own tessdata files for Arabic. Can anyone tell me how the Tesseract team makes tessdata files for different languages? Are they using tools for that, and if so, which tools? See the attached sample file (made by Tesseract).
Question: How can I make a tessdata file for Arabic?
ara.cube.word-freq

ShreeDevi Kumar

Nov 7, 2014, 6:35:38 AM
to tesser...@googlegroups.com

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar

Nov 7, 2014, 6:36:49 AM
to tesser...@googlegroups.com

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
