hocr-tools on Debian sid

jsbien

May 7, 2013, 11:03:11 AM
to ho...@googlegroups.com
I am trying to run some hOCR tools on Debian sid.

The official site

http://code.google.com/p/hocr-tools/

does not seem to be maintained. There are 9 forks listed at

http://code.google.com/p/hocr-tools/source/clones

and an unknown number of other forks, e.g.

https://github.com/CUSAT/hocr-tools

which looks quite recent.

I am experimenting with the versions from https://github.com/CUSAT/hocr-tools and http://code.google.com/r/zdenop-hocr-tools-lxml/source/browse.

When I try to run hocr-eval lines, I get

  File "./hocr-eval-geom", line 8, in <module>
    from xml.dom.ext.reader import HtmlLib
ImportError: No module named ext.reader

Where can I find the missing module for Python 2.7.3 on Debian sid?
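
For reference, a quick way to see which XML modules a given Python installation can actually import. This is only a minimal sketch: the helper name `importable` and the list of module names are illustrative choices, not part of hocr-tools.

```python
import importlib


def importable(module_names):
    """Map each module name to True if it can be imported, else False."""
    result = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            result[name] = True
        except ImportError:
            # Raised when the module (or one of its parent packages) is absent.
            result[name] = False
    return result


# xml.dom.minidom ships with Python; xml.dom.ext.reader is the one from the traceback.
status = importable(["xml.dom.minidom", "xml.dom.ext.reader"])
for name in sorted(status):
    print("%s: %s" % (name, "available" if status[name] else "missing"))
```

If `xml.dom.ext.reader` is reported as missing, the installation has no package providing it, which matches the ImportError above.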

Best regards

Janusz

Janusz S. Bień

May 7, 2013, 11:09:27 AM
to ho...@googlegroups.com
On Tue, 7 May 2013 jsbien <jsb...@mimuw.edu.pl> wrote:

> When I try to run hocr-eval lines, I get

Of course I mean hocr-eval-geom.


--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Jim Garrison

May 7, 2013, 12:45:49 PM
to ho...@googlegroups.com
On 05/07/13 08:09, Janusz S. Bień wrote:
[...]

>> File "./hocr-eval-geom", line 8, in <module>
>> from xml.dom.ext.reader import HtmlLib
>> ImportError: No module named ext.reader
>>
>> Where I can find on Debian sid the missing module for Python 2.7.3?

Unfortunately the tool is trying to import PyXML, which is a dead
project [1]. This is filed as issue #2 in the hocr-tools bug tracker
[2]. The second fork you mention is actually a fork of my fork, within
which I had ported both hocr-check and hocr-combine to use lxml, a more
widely used Python XML library. Unfortunately, I do not believe that
anybody has ported the remaining modules to use a recent library (but I
have not looked at all the forks).
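
As a rough illustration of what such a port could look like, the sketch below reads line bounding boxes from hOCR without PyXML. It is not code from hocr-tools or any of the forks: the function name `line_bboxes` and the sample markup are made up, and it assumes well-formed XHTML input with a single `ocr_line` class per element. It uses lxml when installed and otherwise falls back to the standard library's `xml.etree.ElementTree`, which offers a similar API for this purpose.

```python
try:
    from lxml import etree  # used when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback


def line_bboxes(hocr_text):
    """Return (x0, y0, x1, y1) tuples for every element with class 'ocr_line'."""
    root = etree.fromstring(hocr_text)
    boxes = []
    for element in root.iter():
        if element.get("class") != "ocr_line":
            continue
        # hOCR stores properties in the title attribute, separated by ';'.
        for prop in (element.get("title") or "").split(";"):
            fields = prop.split()
            if len(fields) == 5 and fields[0] == "bbox":
                boxes.append(tuple(int(value) for value in fields[1:]))
    return boxes


# Hypothetical sample input, kept free of namespaces for brevity.
SAMPLE = """<html><body><div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocr_line" title="bbox 10 20 300 60">first line</span>
<span class="ocr_line" title="bbox 10 70 280 110; baseline 0 -3">second line</span>
</div></body></html>"""

print(line_bboxes(SAMPLE))
```

Real hOCR files often declare the XHTML namespace and are not always well-formed, so an actual port would probably need an HTML-tolerant parser and more careful attribute handling.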

It should be a manageable amount of work to port everything, one module
at a time, to lxml or another library. There are a number of such
libraries listed at [3], but it would be nice to standardize on one and
have the entire code base use it for consistency. So I think the next
steps forward are:

1. Collect the code from the various forks, pulling all existing
improvements into a common repository;
2. Create a wiki page where people vote on which tools they would like
to see ported;
3. Encourage volunteers to fork the repository and port any of the
existing tools to lxml (or whatever library we standardize on).
4. Any time somebody is generous enough to do such volunteer work, pull
it into the common repository as quickly as possible.

What do people think?

Cheers,
Jim

[1] http://georgik.sinusgear.com/2011/01/10/dead-project-warning-pyxml-does-not-work-with-python2-6/
    http://mail.python.org/pipermail/xml-sig/2010-November/012245.html
[2] https://code.google.com/p/hocr-tools/issues/detail?id=2
[3] http://wiki.python.org/moin/PythonXml

Janusz S. Bień

May 8, 2013, 1:11:52 PM
to ho...@googlegroups.com, Jakub Wilk
On Tue, 07 May 2013 Jim Garrison <j...@garrison.cc> wrote:

[...]

> It should be a manageable amount of work to port everything, one module
> at a time, to lxml or another library. There are a number of such
> libraries listed at [3], but it would be nice to standardize on one and
> have the entire code base use it for consistency. So I think the next
> steps forward are:
>
> 1. Collect the code from the various forks, pulling all existing
> improvements into a common repository;
> 2. Create a wiki page where people vote on which tools they would like
> to see ported;
> 3. Encourage volunteers to fork the repository and port any of the
> existing tools to lxml (or whatever library we standardize on).
> 4. Any time somebody is generous enough to do such volunteer work, pull
> it into the common repository as quickly as possible.

Perhaps hocr-combine can be replaced by Jakub Wilk's hocr-concat,
available at https://bitbucket.org/jwilk/marasca-wbl/ in the wbl
branch.

Regards

Janusz

Janusz S. Bień

Jul 6, 2013, 7:22:48 AM
to ho...@googlegroups.com
On Tue, 07 May 2013 Jim Garrison <j...@garrison.cc> wrote:

[...]

> It should be a manageable amount of work to port everything, one module
> at a time, to lxml or another library.

[...]

In the meantime, a student of mine has made an attempt to work around the
problems:

https://bitbucket.org/pjte/hocr-tools-tesseract-fix

Best regards

Janusz