http://jwilk.net/software/ocrodjvu
Best regards
Janusz
--
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
> I would like to inform you about two tools available with ocrodjvu:
>
> http://jwilk.net/software/ocrodjvu
I have just noticed that one of these tools, namely djvu2hocr, has
already been reported on the hocr-tools site
http://code.google.com/p/hocr-tools/issues/detail?id=3
but is still included in the "Planned converters" list...
Moreover, there is still no solution to the almost two-year-old issue
http://code.google.com/p/hocr-tools/issues/detail?id=2
and the proposed workaround does not seem to work on Debian.
I'm primarily interested in hocr-check. Do you have any suggestions
on how to run it on Debian Squeeze?
Not much has changed there... there are several different Python XML
and HTML processing packages floating around and what is supported on
what platforms keeps changing. Do you have any idea which Python
libraries are going to be supported long term, support generating a
DOM tree and support xpath queries?
Tom
I would recommend using lxml[0]. While it does not support DOM[1],
it offers a few other document models, it supports XPath queries and it
appears to be actively maintained (two releases this year).
[0] http://codespeak.net/lxml/
[1] http://mail.python.org/pipermail/xml-sig/2007-July/011742.html
--
Jakub Wilk
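
To make the recommendation above concrete, here is a minimal sketch of a path query through the ElementTree-style API, one of the document models lxml offers in place of DOM. The hOCR-like fragment and the function name line_texts are made up for illustration and do not come from hocr-tools. Because platform support was the concern in this thread, the sketch falls back to the standard library's xml.etree.ElementTree when lxml is not installed. The fallback understands only a subset of XPath, whereas lxml elements also offer an .xpath() method for full XPath 1.0.

```python
try:
    from lxml import etree  # preferred; elements also provide .xpath()
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, XPath subset only

# Hypothetical, well-formed hOCR-like fragment (not taken from the thread).
SAMPLE = """<html>
<head><title>sample</title></head>
<body>
<div class="ocr_page" title="bbox 0 0 1000 1500">
<span class="ocr_line" title="bbox 10 10 900 40">first line</span>
<span class="ocr_line" title="bbox 10 50 900 80">second line</span>
</div>
</body>
</html>"""


def line_texts(source):
    """Return the text of every span whose class attribute is exactly 'ocr_line'."""
    root = etree.fromstring(source)
    # This path expression is valid in both lxml and the stdlib ElementTree.
    return [el.text for el in root.findall(".//span[@class='ocr_line']")]


lines = line_texts(SAMPLE)
print(lines)
```

Real XHTML files that declare a default namespace would need namespace-qualified names in the path, which this sketch leaves out.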
[...]
> I would recommend you to use lxml[0]. While it does not support DOM[1],
> it offers a few other document models, it supports XPath queries and it
> appears to be actively maintained (two releases this year).
>
> [0] http://codespeak.net/lxml/
> [1] http://mail.python.org/pipermail/xml-sig/2007-July/011742.html
There is already an lxml version of hocr-combine
http://hocr-tools.googlecode.com/issues/attachment?aid=-5488035238836815653&name=hocr-combine
Would it be difficult to adapt the remaining tools, especially
hocr-check, to lxml?
I made a quick effort at porting hocr-check so that it now uses lxml.
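My branch is available at:

http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c&r=5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c
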
Regards,
Jim
[...]
> I made a quick effort at porting hocr-check so that it now uses lxml.
> My branch is available at:
>
> http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c&r=5bfa2d6bdbe50c52b939fe606914eee7cf1cdb2c
Thanks.
It started working on Debian Squeeze after changing META to lower case
(this was Jakub Wilk's suggestion; I don't know Python).
Your sample file contains the element
<meta name='ocr-id' value='OCRopus Revision: 312'>
It seems that hocr-check treats ocr-id as obligatory.
However I can't find any mention of this element in the hOCR
specification... Am I missing something?
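
The thread does not say where exactly META was changed. One plausible reading is that an element name in a path query was written in upper case. Element names in such queries are case-sensitive, so the sketch below shows an upper-case query missing a lower-case element, followed by a simple presence check for the ocr-id entry. This is only an illustration under that assumption, not the actual hocr-check code. The meta element is the one quoted above, closed with "/>" so that the fragment parses as XML, and the variable names are hypothetical.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback

# The meta element is quoted from the thread; "/>" was added so it is well-formed XML.
SAMPLE = """<html>
<head>
<meta name='ocr-id' value='OCRopus Revision: 312'/>
</head>
<body/>
</html>"""

root = etree.fromstring(SAMPLE)

# Element names in path queries are matched case-sensitively.
upper_hits = root.findall(".//META[@name='ocr-id']")
lower_hits = root.findall(".//meta[@name='ocr-id']")

print("upper-case query:", len(upper_hits), "match(es)")
print("lower-case query:", len(lower_hits), "match(es)")

# A checker that treats ocr-id as obligatory could report its absence like this.
if not lower_hits:
    print("missing required meta element: ocr-id")
else:
    print("ocr-id value:", lower_hits[0].get("value"))
```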
Tom
On Mar 30, 11:38 am, jsb...@mimuw.edu.pl (Janusz S. Bień) wrote:
> On Mon, 29 Mar 2010 Jim Garrison <j...@garrison.cc> wrote:
>
> [...]
>
> > I made a quick effort at porting hocr-check so that it now uses lxml.
> > My branch is available at:
>
> > http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5b...
Tom
On Mar 29, 11:46 pm, Jim Garrison <j...@garrison.cc> wrote:
> > There is already a lxml version of hocr-combine
>
> > http://hocr-tools.googlecode.com/issues/attachment?aid=-5488035238836...
>
> > Would it be difficult to adapt to lxml the remaining tools, especially
> > hocr-check?
>
> I made a quick effort at porting hocr-check so that it now uses lxml.
> My branch is available at:
>
> http://code.google.com/r/jim-lxml/source/browse/hocr-check?spec=svn5b...
>
> Regards,
> Jim
> On Mon, 22 Mar 2010 jsb...@mimuw.edu.pl (Janusz S. Bień) wrote:
>
>> I would like to inform you about two tools available with ocrodjvu:
>>
>> http://jwilk.net/software/ocrodjvu
>
> I have just noticed that one of these tools, namely djvu2hocr, has
> been already reported on the hocr-tools site
>
> http://code.google.com/p/hocr-tools/issues/detail?id=3
>
> but is still included in the "Planned converters" list...
Nothing has changed after almost a year :-(
Regards
JSB
> We've been busy converting the main part of OCRopus to work with
> Unicode and ligatures, introducing new training and adaptation tools, etc.
> That's why we haven't done much with the output for the time being.
>
> Tom
I'm afraid you've misunderstood my letter. It was not about your work
(which I appreciate very much), but about the misleading information
on the Web page.
Best regards
Janusz