The answer to that question depends on several factors. Tesseract is
fairly mature and works on arbitrary binary documents. Before 3.0,
Tesseract didn't work well on isolated lines, and it also didn't have
much in the way of layout analysis. Tesseract 3.0 offers layout
analysis, a neural network recognizer, and improved language modeling.
OCRopus has not had a stable release yet. Its layout analysis is
probably better than Tesseract's. Its text recognition isn't as good
as Tesseract's yet, but it's rapidly improving. OCRopus also contains
a whole range of new technologies for page segmentation,
preprocessing, and language modeling. Our long-term plan is to make
Tesseract available through OCRopus as well, once the 3.0 release and
APIs are stable.
OCRopus has largely moved to Python now, which has sped up
development and makes it easier to create custom solutions.
The upshot is: both solutions are going to be a lot of work, and they
both have their limitations. If Tesseract gets your job done, just
use it for the time being.
Tom
On May 13, 6:06 pm, Christoph <christoph.m.g.m...@googlemail.com> wrote: