djvu2hocr and hocr2djvused


Janusz S. Bień

Mar 22, 2010, 11:57:42 AM
to ho...@googlegroups.com

I would like to inform you about two tools available with ocrodjvu:

http://jwilk.net/software/ocrodjvu

Best regards

Janusz

--
,
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Janusz S. Bień

Mar 23, 2010, 6:34:04 AM
to ho...@googlegroups.com
On Mon, 22 Mar 2010 jsb...@mimuw.edu.pl (Janusz S. Bień) wrote:

> I would like to inform you about two tools available with ocrodjvu:
>
> http://jwilk.net/software/ocrodjvu

I have just noticed that one of these tools, namely djvu2hocr, has
already been reported on the hocr-tools site

http://code.google.com/p/hocr-tools/issues/detail?id=3

but is still included in the "Planned converters" list...

Moreover, there is no solution to the almost two-year-old issue

http://code.google.com/p/hocr-tools/issues/detail?id=2

and the work-around proposed does not seem to be valid for Debian.

I'm primarily interested in hocr-check. Do you have any suggestions
how to run it on Debian Squeeze?

Tom

Mar 26, 2010, 5:39:19 AM
to hOCR
> Moreover, there is no solution to the almost two-year-old issue
>
> http://code.google.com/p/hocr-tools/issues/detail?id=2
>
> and the work-around proposed does not seem to be valid for Debian.

Not much has changed there... there are several different Python XML
and HTML processing packages floating around and what is supported on
what platforms keeps changing. Do you have any idea which Python
libraries are going to be supported long term, support generating a
DOM tree and support xpath queries?

Tom

Jakub Wilk

Mar 26, 2010, 10:21:39 AM
to ho...@googlegroups.com
* Tom <tmb...@gmail.com>, 2010-03-26, 02:39:

I would recommend lxml[0]. While it does not support DOM[1],
it offers a few other document models, it supports XPath queries and it
appears to be actively maintained (two releases this year).

[0] http://codespeak.net/lxml/
[1] http://mail.python.org/pipermail/xml-sig/2007-July/011742.html
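
To illustrate the kind of XPath query lxml allows, here is a minimal sketch. The hOCR fragment is made up for illustration and is not taken from any of the tools discussed in this thread; lxml is assumed to be installed.

```python
from lxml import html

# Hypothetical hOCR fragment, invented for this example.
SAMPLE = """<html>
<head><meta name="ocr-system" content="example 0.0"></head>
<body>
<div class="ocr_page" title="bbox 0 0 1000 1500">
<span class="ocr_line" title="bbox 10 20 500 60">first line</span>
<span class="ocr_line" title="bbox 10 80 500 120">second line</span>
</div>
</body>
</html>"""

# Parse the text into lxml's own element tree (not a W3C DOM).
doc = html.document_fromstring(SAMPLE)

# XPath query: every element whose class attribute is exactly "ocr_line".
lines = doc.xpath('//*[@class="ocr_line"]')

texts = [line.text_content() for line in lines]
bboxes = [line.get("title") for line in lines]
print(texts)
print(bboxes)
```

The tree is navigated through lxml's element API (`get`, `text_content`, `xpath`) rather than DOM methods, which is the trade-off mentioned above.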

--
Jakub Wilk

Janusz S. Bień

Mar 29, 2010, 1:30:31 AM
to ho...@googlegroups.com
On Fri, 26 Mar 2010 Jakub Wilk <jw...@jwilk.net> wrote:

[...]

> I would recommend lxml[0]. While it does not support DOM[1],
> it offers a few other document models, it supports XPath queries and it
> appears to be actively maintained (two releases this year).
>
> [0] http://codespeak.net/lxml/
> [1] http://mail.python.org/pipermail/xml-sig/2007-July/011742.html

There is already an lxml version of hocr-combine

http://hocr-tools.googlecode.com/issues/attachment?aid=-5488035238836815653&name=hocr-combine

Would it be difficult to adapt to lxml the remaining tools, especially
hocr-check?

Jim Garrison

Mar 30, 2010, 2:46:19 AM
to ho...@googlegroups.com
> There is already an lxml version of hocr-combine
>
> http://hocr-tools.googlecode.com/issues/attachment?aid=-5488035238836815653&name=hocr-combine
>
> Would it be difficult to adapt to lxml the remaining tools, especially
> hocr-check?

I made a quick effort at porting hocr-check so that it now uses lxml.
My branch is available at:

http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c&r=5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c

Regards,
Jim

Janusz S. Bień

Mar 30, 2010, 2:38:49 PM
to ho...@googlegroups.com
On Mon, 29 Mar 2010 Jim Garrison <j...@garrison.cc> wrote:

[...]

> I made a quick effort at porting hocr-check so that it now uses lxml.
> My branch is available at:
>
> http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c&r=5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c

Thanks.

It started working on Debian Squeeze after changing META to lower case
(this was Jakub Wilk's suggestion; I don't know Python).
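
A plausible explanation, assuming the script looked elements up by an upper-case name (the exact change is not shown in this thread): lxml's HTML parser stores tag names in lower case, and XPath matching is case-sensitive. A minimal sketch with a made-up document:

```python
from lxml import html

# Upper-case markup, invented for this example.
doc = html.document_fromstring(
    "<HTML><HEAD><META NAME='ocr-system' CONTENT='example 0.0'></HEAD>"
    "<BODY></BODY></HTML>"
)

upper = doc.xpath("//META")  # finds nothing: parsed tag names are lower case
lower = doc.xpath("//meta")  # finds the element that was written as <META>

print(len(upper), len(lower))
```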

Your sample file contains the element

<meta name='ocr-id' value='OCRopus Revision: 312'>

It seems that hocr-check treats ocr-id as obligatory.

However I can't find any mention of this element in the hOCR
specification... Am I missing something?

Tom

Mar 30, 2010, 5:23:13 PM
to hOCR
It's called "ocr-system" in the spec. hocr-check should be changed to
reflect that.
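
A check along these lines could look like the sketch below. It is not the hocr-check code: it uses only the Python standard library, and the helper names `MetaCollector` and `missing_metadata` as well as the default list of required names are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the name attributes of all <meta> elements."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.add(name)


def missing_metadata(hocr_text, required=("ocr-system",)):
    """Return the required meta names that are absent from hocr_text."""
    collector = MetaCollector()
    collector.feed(hocr_text)
    return [name for name in required if name not in collector.names]


# The element from the sample file quoted earlier in the thread.
old_style = (
    "<html><head>"
    "<meta name='ocr-id' value='OCRopus Revision: 312'>"
    "</head></html>"
)
print(missing_metadata(old_style))
```

A file that carries only `ocr-id` is reported as lacking `ocr-system`; other names can be passed through the `required` parameter.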

Tom

On Mar 30, 11:38 am, jsb...@mimuw.edu.pl (Janusz S. Bień) wrote:
> On Mon, 29 Mar 2010  Jim Garrison <j...@garrison.cc> wrote:
>
> [...]
>
> > I made a quick effort at porting hocr-check so that it now uses lxml.
> > My branch is available at:
>

> >http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5b...

Tom

Mar 30, 2010, 5:27:31 PM
to hOCR
Great, thanks! Please let me know when I can look at merging it.

Tom

On Mar 29, 11:46 pm, Jim Garrison <j...@garrison.cc> wrote:
> > There is already an lxml version of hocr-combine
>

> >      http://hocr-tools.googlecode.com/issues/attachment?aid=-5488035238836...


>
> > Would it be difficult to adapt to lxml the remaining tools, especially
> > hocr-check?
>
> I made a quick effort at porting hocr-check so that it now uses lxml.
> My branch is available at:
>

> http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5b...
>
> Regards,
> Jim

Janusz S. Bień

Apr 23, 2010, 5:18:53 AM
to ho...@googlegroups.com
On Tue, 30 Mar 2010 Tom <tmb...@gmail.com> wrote:

> It's called "ocr-system" in the spec. hocr-check should be changed to
> reflect that.

Is it supposed to be obligatory?

Jakub Wilk pointed out to me that hOCR created by OCRopus 0.3 does not
contain this field.

Best regards

Janusz




Thomas Breuel

Apr 23, 2010, 8:38:20 AM
to ho...@googlegroups.com
It should be there (just like ocr-capabilities); the interpretation of the output just becomes more difficult otherwise.

It's not an issue for just generating a little data, but it becomes important in applications where you keep around OCR output for a long time and may need to script/fix it.

OCRopus 0.3 is alpha, and so is 0.4, so they are not intended for production use yet.  However, 0.4 generates it now.

Tom
--
Secretary: tmb+se...@iupr.com ++49 631 205 3456
Home Page: http://www.iupr.com  
Schedule Appointments: http://tungle.me/tmb

Janusz S. Bień

Feb 13, 2011, 3:24:12 PM
to ho...@googlegroups.com
On Tue, 23 Mar 2010 jsb...@mimuw.edu.pl (Janusz S. Bień) wrote:

> On Mon, 22 Mar 2010 jsb...@mimuw.edu.pl (Janusz S. Bień) wrote:
>
>> I would like to inform you about two tools available with ocrodjvu:
>>
>> http://jwilk.net/software/ocrodjvu
>
> I have just noticed that one of these tools, namely djvu2hocr, has
> already been reported on the hocr-tools site
>
> http://code.google.com/p/hocr-tools/issues/detail?id=3
>
> but is still included in the "Planned converters" list...

Nothing changed after almost a year :-(

Regards

JSB

Thomas Breuel

Feb 14, 2011, 3:06:17 PM
to ho...@googlegroups.com
We've been busy with converting the main part of OCRopus to working with Unicode and ligatures, introducing new training and adaptation tools, etc.  That's why we haven't done much with the output for the time being.

Tom
-- Contact info, appointments: www.iupr.com/tmb


2011/2/13 Janusz S. Bień <jsb...@mimuw.edu.pl>

Janusz S. Bień

Feb 16, 2011, 2:33:37 PM
to ho...@googlegroups.com
On Mon, 14 Feb 2011 Thomas Breuel <t...@cs.uni-kl.de> wrote:

> We've been busy with converting the main part of OCRopus to working with
> Unicode and ligatures, introducing new training and adaptation tools, etc.
> That's why we haven't done much with the output for the time being.
>
> Tom

I'm afraid you've misunderstood my message. It was not about your work
(which I appreciate very much), but about the misleading information
on the Web page.

Best regards

Janusz
