Does OCRopus still use Tesseract for line recognition and...

john.d...@gmail.com

Aug 16, 2009, 9:58:58 PM

to ocropus

Hi all,

I was wondering if OCRopus still uses Tesseract for line recognition?
From what I gather in the release notes for 0.4 (and from what I have
determined from putting print statements in the code to follow the
execution path), OCRopus no longer uses Tesseract, but rather a new
line recognizer created by you guys. If this is the case, could you
provide an overview of the changes required to have it call Tesseract
again? I thought it would be a simple one-line change in
ocr-commands.cc: include the Tesseract header and, in the
main_lines2fsts() method, change:

linerec = glinerec::make_Linerec();

to

linerec = make_TesseractRecognizeLine();

but quite a few hours spent trying to get OCRopus to compile after
making this change have proved me wrong. On the other hand, if I am completely
mistaken and OCRopus still uses Tesseract, could you point me to the
file where OCRopus is calling Tesseract?

The group I am working with has been comparing OCRopus and Tesseract,
and we noticed that the text generated by OCRopus when following the
whole OCR process (i.e. ocropus book2pages dir book.tif; ocropus
pages2lines dir; ocropus lines2fsts dir; ocropus fsts2text dir)
differs from what we get when taking the lines produced by following
the procedure up to pages2lines and then feeding them into Tesseract,
so we were wondering what is up.

Currently Tesseract is providing us better results for our images than
OCRopus is, but we would like to see the results that OCRopus gives
when it is using Tesseract.

Thanks,

John
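
For reference, the two procedures compared in the post above can be
sketched as a small shell script. The four ocropus stages are taken
directly from the post. Everything else is an assumption made for
illustration: the run helper, the DRY_RUN variable, the function names,
the "$dir"/*/*.png pattern for the line images written by pages2lines,
and the .tess output base. It also assumes the installed Tesseract can
read the line images directly; older releases may need them converted
to TIFF first.

```shell
#!/bin/sh
# Sketch only: helper names and the line-image layout are assumptions.

# run CMD ARGS...: execute the command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Procedure A: the four OCRopus stages quoted in the post,
# stopping at the first stage that fails.
ocropus_pipeline() {
    dir=$1
    tif=$2
    run ocropus book2pages "$dir" "$tif" &&
    run ocropus pages2lines "$dir" &&
    run ocropus lines2fsts "$dir" &&
    run ocropus fsts2text "$dir"
}

# Procedure B: after pages2lines, hand each line image to Tesseract.
# ASSUMPTION: line images match "$dir"/*/*.png; adjust to your layout.
tesseract_on_lines() {
    dir=$1
    for img in "$dir"/*/*.png; do
        [ -e "$img" ] || continue    # the pattern matched nothing
        # tesseract <image> <outputbase> writes its text to <outputbase>.txt
        run tesseract "$img" "${img%.png}.tess"
    done
}
```

Calling ocropus_pipeline book book.tif runs procedure A. For procedure
B, run the first two ocropus stages by hand and then call
tesseract_on_lines book. Setting DRY_RUN=1 first prints the commands
without executing them, which is a cheap way to check the paths.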

Thomas Breuel

Aug 17, 2009, 8:29:54 AM

to ocr...@googlegroups.com

> I was wondering if OCRopus still uses Tesseract for line recognition?
> From what I gather in the release notes for 0.4 (and from what I have
> determined from putting print statements in the code to follow the
> execution path), OCRopus no longer uses Tesseract, but rather a new
> line recognizer created by you guys.

Correct.

> If this is the case, could you
> provide an overview of the changes required to have it call Tesseract
> again? I thought it would be a simple one-line change in
> ocr-commands.cc: include the Tesseract header and, in the
> main_lines2fsts() method, change:
>
> linerec = glinerec::make_Linerec();
>
> to
>
> linerec = make_TesseractRecognizeLine();

Unfortunately, interfacing with Tesseract isn't easy; that's why we
don't have it in the default build anymore.

There is a separate subproject for a Tesseract interface called ocrotess here:

http://iupr1.cs.uni-kl.de/cgi-bin/hgwebdir.cgi/ocrotess/

> Currently Tesseract is providing us better results for our images than
> OCRopus is, but we would like to see the results that OCRopus gives
> when it is using Tesseract.

It's pointless to carry out performance comparisons between OCRopus
and Tesseract right now; the models shipping with the OCRopus
recognizer have been trained on only a small number of characters and
styles. They will perform well on some styles and poorly on others,
depending on resolution and fonts.

Furthermore, for book recognition, you should use book-adaptive
recognition with OCRopus, which results in substantial improvements in
recognition rates.

Tom

patrickq

Sep 16, 2009, 7:15:58 AM

to ocropus

Hi Tom,

I just started using TesseractExtractResult() with Tesseract version
3.0, an API described in the header file as being part of the "OCRopus
add-on" but as far as I can tell, it is using the same training data
as Tesseract (eng.traineddata) and appears to use Tesseract. Yet you
seem to say that OCRopus is not using Tesseract - please clarify.

Thanks,
Patrick

Thomas Breuel

Sep 18, 2009, 12:19:35 AM

to ocr...@googlegroups.com

On Wed, Sep 16, 2009 at 04:15, patrickq <patrick.q...@gmail.com> wrote:
> I just started using TesseractExtractResult() with Tesseract version
> 3.0, an API described in the header file as being part of the "OCRopus
> add-on" but as far as I can tell, it is using the same training data
> as Tesseract (eng.traineddata) and appears to use Tesseract. Yet you
> seem to say that OCRopus is not using Tesseract - please clarify.

The OCRopus line recognition command is "ocropus lines2fsts"; by
default, it uses the line recognizer defined in ocr-line, which has
nothing to do with Tesseract. OCRopus doesn't even link with
Tesseract by default anymore.

After the beta release, we hope to restore Tesseract as an optional
line recognizer as a plug-in (since OCRopus now supports plugins).

Tom

Bob Gustafson

Mar 8, 2010, 8:50:10 AM

to ocr...@googlegroups.com

On Mon, Aug 17, 2009 at 6:29 AM, Thomas Breuel <tmb...@gmail.com> wrote:

> ...
>
> Unfortunately, interfacing with Tesseract isn't easy; that's why we
> don't have it in the default build anymore.
>
> There is a separate subproject for a Tesseract interface called ocrotess here:
>
> http://iupr1.cs.uni-kl.de/cgi-bin/hgwebdir.cgi/ocrotess/

When trying to access this link, I get:

The specified repository "ocrotess" is unknown, sorry. Please go back to the main repository list page.

Is there another name for this project?

Thomas Breuel

Mar 8, 2010, 9:34:26 AM

to ocr...@googlegroups.com

You don't need ocrotess right now and it's broken anyway. There will
be a new interface to it, but that will have to wait for the Tesseract
3.0 release.

Tom
