hOCR vs. ALTO


Janusz S. Bień

Apr 19, 2010, 7:43:54 AM
to ho...@googlegroups.com

I've just learned about ALTO:

http://www.loc.gov/standards/alto/

Do you have any opinion about this standard?

I have not yet had time to read the standard or to browse its
list archive (http://listserv.loc.gov/archives/alto.html);
nevertheless, I would like to pose the question:

As ALTO is supported by an organizational structure (Editorial board
with monthly teleconferences etc.) and Library of Congress, does hOCR
have any chances of long term survival?

Best regards

JSB

--
,
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Tom

Apr 19, 2010, 8:56:52 AM
to hOCR
> As ALTO is supported by an organizational structure (Editorial board
> with monthly teleconferences etc.) and Library of Congress, does hOCR
> have any chances of long term survival?

Think of METS/ALTO as DocBook or SGML and hOCR as HTML. I don't see
why these standards can't exist side-by-side.

hOCR can be converted to METS/ALTO. But unlike ALTO, hOCR can also be
viewed, edited, and scripted using standard desktop tools since it is
valid HTML. You can view hOCR on your iPad and even switch between
paged and continuous display.
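To illustrate the scripting point, here is a minimal sketch that pulls word text and bounding boxes out of an hOCR fragment using only Python's standard HTML parser. The sample markup, the helper names (parse_bbox, HocrWords, extract_words), and the assumption that every word is a span with class "ocrx_word" are illustrative and do not come from any particular OCR engine.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for spans with class "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: words are <span> elements, as in typical hOCR output.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, for illustration only.
SAMPLE = """<div class="ocr_page" title="bbox 0 0 600 800">
<span class="ocr_line" title="bbox 10 20 300 45">
<span class="ocrx_word" title="bbox 10 20 110 45; x_wconf 95">Hello</span>
<span class="ocrx_word" title="bbox 120 20 300 45">world</span>
</span></div>"""

for text, bbox in extract_words(SAMPLE):
    print(text, bbox)
```

Because the file is ordinary HTML, the same markup opens unchanged in a browser; the OCR metadata simply rides along in the class and title attributes.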

Technically, I think ALTO has a number of problems. The way ALTO has
been grafted on top of METS makes files harder to manipulate. For
rendering, you need a new rendering pipeline. A lot of the OCR-related
properties that ALTO provides tags for don't seem to be well defined.
And there are features in ALTO that seem to serve little practical
purpose and were only stuck in because someone thought they might be
nice to have--not a good approach to standardization in my view.

Long term, I think there's a good chance that standards like METS/ALTO
will just go away, no matter how much large institutions hang on to
them. METS/ALTO requires the creation and maintenance of a toolset
separate from the one most people use day-to-day. Or, if the content
is stored in METS/ALTO format in some backend system and intended to
be made available to normal desktop users, it needs to be converted to
HTML for display anyway, and then they might as well embed hOCR
metadata in the conversion output.
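A rough sketch of what embedding hOCR metadata in the conversion output could look like: each ALTO String element becomes an HTML span carrying an hOCR bbox. It assumes the ALTO coordinates are already in pixels (ALTO also allows other measurement units, which would need scaling first), ignores the METS wrapper entirely, and matches elements by local name so that it does not depend on a particular ALTO namespace. The function name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def alto_strings_to_hocr(alto_xml):
    """Turn every ALTO <String> into an hOCR word span (pixel units assumed)."""
    root = ET.fromstring(alto_xml)
    spans = []
    for el in root.iter():
        # Strip any "{namespace}" prefix and keep only the local tag name.
        if el.tag.rsplit("}", 1)[-1] != "String":
            continue
        hpos = float(el.get("HPOS", "0"))
        vpos = float(el.get("VPOS", "0"))
        width = float(el.get("WIDTH", "0"))
        height = float(el.get("HEIGHT", "0"))
        # ALTO stores origin plus size; hOCR stores two corners.
        x0, y0 = round(hpos), round(vpos)
        x1, y1 = round(hpos + width), round(vpos + height)
        spans.append(
            '<span class="ocrx_word" title="bbox %d %d %d %d">%s</span>'
            % (x0, y0, x1, y1, escape(el.get("CONTENT", "")))
        )
    return spans


# Hypothetical, heavily trimmed ALTO document, for illustration only.
SAMPLE_ALTO = """<alto><Layout><Page><PrintSpace><TextBlock><TextLine>
<String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="100" HEIGHT="25"/>
<String CONTENT="world" HPOS="120" VPOS="20" WIDTH="180" HEIGHT="25"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""

for span in alto_strings_to_hocr(SAMPLE_ALTO):
    print(span)
```

A real converter would also carry over blocks, lines, and page dimensions; this only shows the coordinate mapping for words.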

In any case, have a look at the METS/ALTO documentation and at the
hOCR documentation and then pick whichever you think is easier to use
and work with. Since hOCR is intended to contain the complete OCR
output, you don't lose anything by storing your output in hOCR format.

Tom

Janusz S. Bień

Apr 19, 2010, 9:02:28 AM
to ho...@googlegroups.com
On Mon, 19 Apr 2010 Tom <tmb...@gmail.com> wrote:

>> As ALTO is supported by an organizational structure (Editorial board
>> with monthly teleconferences etc.) and Library of Congress, does hOCR
>> have any chances of long term survival?
>
> Think of METS/ALTO as DocBook or SGML and hOCR as HTML. I don't see
> why these standards can't exist side-by-side.
>
> hOCR can be converted to METS/ALTO.


[...]

> In any case, have a look at the METS/ALTO documentation and at the
> hOCR documentation and then pick whichever you think is easier to use
> and work with. Since hOCR is intended to contain the complete OCR
> output, you don't lose anything by storing your output in hOCR format.

Thanks for the explanation. I'm glad there is no essential conflict
between the standards.

Best regards

Janusz

--
,
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

