Re: Tesseract vs OCRopus

Tom Morris

May 4, 2010, 12:40:10 PM

to ocr...@googlegroups.com

Anyone?

How about suggestions for a better forum in which to ask the question?

Tom

On Wed, Apr 28, 2010 at 3:30 PM, Tom Morris <tfmo...@gmail.com> wrote:
> Dennis' (m00tpoint) message reminded me of something I've been meaning
> to ask for a while.
>
> What exactly is the relationship between Tesseract and OCRopus these
> days? Are they just competitors to each other?
>
> My original understanding was that OCRopus was using the Tesseract
> recognition engine and was focusing on higher order issues like page
> segmentation/layout analysis, system integration, etc, but more
> recently I believe the Tesseract recognition engine has been replaced
> with either one built from scratch or one derived from a different
> source. Is this an accurate summary?
>
> Does anyone have a block diagram of the processing pipeline with the
> alternatives available for each stage in the pipe? Even better, one
> which includes an analysis of the strengths and weaknesses of the
> components relative to each other? (languages supported, error rates,
> etc)
>
> Thanks in advance for any info.
>
> Tom
>


Tom

May 14, 2010, 3:52:53 PM

to ocropus

Please see this thread:

http://groups.google.com/group/ocropus/browse_thread/thread/3bbb28b8e6ba6947/722eeaa44b0288ab?lnk=gst&q=tesseract#722eeaa44b0288ab

Tom

Tom

May 14, 2010, 4:08:38 PM

to ocropus

> > My original understanding was that OCRopus was using the Tesseract
> > recognition engine and was focusing on higher order issues like page
> > segmentation/layout analysis, system integration, etc, but more
> > recently I believe the Tesseract recognition engine has been replaced
> > with either one built from scratch or one derived from a different
> > source. Is this an accurate summary?

Yes, roughly. We couldn't use straight Tesseract because it didn't
work well on isolated lines, so we added some wrappers around it that
allowed it to do so. These wrappers are broken now because the
Tesseract APIs have changed. Tesseract 3.0 has new APIs that are
supposed to be stable, so we will be building new interfaces to
Tesseract 3.0 once it is available to us. At that point, you will
again be able to choose between Tesseract and the built-in OCRopus
recognizers.

> > Does anyone have a block diagram of the processing pipeline with the
> > alternatives available for each stage in the pipe? Even better, one
> > which includes an analysis of the strengths and weakness of the
> > components relative to each other? (languages supported, error rates,
> > etc)

You can get a list of all OCRopus components with the "ocropus
components" command. The major components are:

ICleanupGray -- image preprocessing (default: StandardPreprocessing)
ISegmentPage -- page layout analysis (default: SegmentPageByRAST)
IRecognizeLine -- text line recognition (linerec; decided during
training)
IGenericFst -- language modeling (functionally, all implementations
are the same; you build language models with pyopenfst)
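
To make the data flow between those four stages concrete, here is a
minimal Python sketch. It is not OCRopus code: every function below is
a hypothetical stand-in that works on plain strings instead of images,
and only the interface names in the comments come from the list above.

```python
from typing import Callable, List

# Hypothetical stand-ins for the four component interfaces. Real OCRopus
# components operate on images; these use strings purely for illustration.

def cleanup_gray(page: List[str]) -> List[str]:
    """Stand-in for ICleanupGray: strip trailing whitespace from each row."""
    return [row.rstrip() for row in page]

def segment_page(page: List[str]) -> List[str]:
    """Stand-in for ISegmentPage: treat each non-empty row as a text line."""
    return [row for row in page if row]

def recognize_line(line: str) -> List[str]:
    """Stand-in for IRecognizeLine: return candidate readings of a line."""
    return [line, line.lower()]

def language_model(candidates: List[str]) -> str:
    """Stand-in for IGenericFst: pick one candidate (here, the first)."""
    return candidates[0]

def run_pipeline(
    page: List[str],
    cleanup: Callable[[List[str]], List[str]] = cleanup_gray,
    segment: Callable[[List[str]], List[str]] = segment_page,
    recognize: Callable[[str], List[str]] = recognize_line,
    lm: Callable[[List[str]], str] = language_model,
) -> List[str]:
    """Run the stages in order; each stage can be swapped independently."""
    cleaned = cleanup(page)
    lines = segment(cleaned)
    return [lm(recognize(line)) for line in lines]
```

Passing a different callable for any one argument swaps that stage
without touching the others, which is the idea behind choosing an
alternative implementation per component.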

For each component, you can modify parameters. For example,

ocropus-pages -P SegmentPageByRAST:gap_factor=10 ...

will run ocropus-pages with an instance of SegmentPageByRAST and the
gap_factor set to 10. You can see all the available parameters with
"ocropus params SegmentPageByRAST".

(You can get usage information for the Python commands with the "-h"
argument: "ocropus-pages -h")

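The override string has the shape Component:param=value. As a rough
illustration of that shape only, here is a small Python sketch that
splits such a string and merges it into a table of defaults. The
function names and the default value of 5 are hypothetical placeholders;
this is not how ocropus-pages itself parses the option.

```python
from typing import Dict, List, Tuple

def parse_override(spec: str) -> Tuple[str, str, str]:
    """Split 'Component:param=value' into its three parts (hypothetical helper)."""
    component, colon, assignment = spec.partition(":")
    name, equals, value = assignment.partition("=")
    if not (colon and equals and component and name):
        raise ValueError(f"expected Component:param=value, got {spec!r}")
    return component, name, value

def apply_overrides(
    defaults: Dict[str, Dict[str, str]], specs: List[str]
) -> Dict[str, Dict[str, str]]:
    """Return a copy of defaults with each override applied in order."""
    params = {comp: dict(values) for comp, values in defaults.items()}
    for spec in specs:
        component, name, value = parse_override(spec)
        params.setdefault(component, {})[name] = value
    return params

# Placeholder default, not the real SegmentPageByRAST setting.
defaults = {"SegmentPageByRAST": {"gap_factor": "5"}}
settings = apply_overrides(defaults, ["SegmentPageByRAST:gap_factor=10"])
```

Values are kept as strings here; converting them to numbers would be up
to whichever component reads them.
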
For ICleanupGray and ISegmentPage, there are a few useful alternatives
and useful changes to parameter settings, since preprocessing and
segmentation are the most common sources of recognition problems. To
see what is happening during those stages, you can run
ocropus-binarize, ocropus-pseg, and ocropus-pages with the "-d" argument,
which will show you the output of binarization and/or segmentation.

IRecognizeLine is not settable in the recognizer because it is simply
a property of the model that you load. Once we have an interface to
Tesseract 3.0, you will be able to just load Tesseract for that
component.

The language models are generated in PyOpenFST; have a look at
ocropus-linefst and ocropy.fstutils.load_text_file_as_fst for a simple
example of how to construct those.

If you want to see how all the components play together, have a look
at ocropus-pages; it is fairly well commented now.

If you want to see how the line recognizer itself works, have a look
at ocropy/simplerec.py; again, it is fairly well commented. However,
that's the Python version of the line recognizer, which still lacks
some important functionality (statistical space models, size models)
that is in the C++ recognizer.

sup...@docit-solutions.com

Oct 25, 2014, 12:21:23 AM

to ocr...@googlegroups.com

Tom, are you still working on OCR projects? If so, please contact us about a potential consulting opportunity.
