> As ALTO is supported by an organizational structure (Editorial board
> with monthly teleconferences etc.) and Library of Congress, does hOCR
> have any chances of long term survival?
Think of METS/ALTO as DocBook or SGML and hOCR as HTML. I don't see
why these standards can't exist side-by-side.
hOCR can be converted to METS/ALTO. But unlike ALTO, hOCR can also be
viewed, edited, and scripted using standard desktop tools since it is
valid HTML. You can view hOCR on your iPad and even switch between
paged and continuous display.
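To illustrate the "scripted using standard desktop tools" point, here
is a minimal sketch that pulls words and bounding boxes out of hOCR
with nothing but Python's standard HTML parser. It assumes words are
marked up as span elements with class ocrx_word and a title of the
form "bbox x0 y0 x1 y1; ..."; the names HocrWordParser, parse_bbox,
extract_words, and SAMPLE are made up for this example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word span.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Hypothetical hOCR fragment, for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>
<span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 120 20 300 50; x_wconf 91'>world</span>
</span></div>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

The same file opens unchanged in a browser, which is the point: no
separate toolchain is needed just to read or inspect the OCR output.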
Technically, I think ALTO has a number of problems. The way ALTO has
been grafted on top of METS makes files harder to manipulate. For
rendering, you need a new rendering pipeline. A lot of the OCR-related
properties that ALTO provides tags for don't seem to be well defined.
And some features in ALTO seem to serve little practical purpose and
were only stuck in because someone thought they might be nice to have,
which is not a good approach to standardization in my view.
Long term, I think there's a good chance that standards like METS/ALTO
will just go away, no matter how much large institutions hang on to
them. METS/ALTO requires the creation and maintenance of a toolset
separate from the one most people use day-to-day. Or, if the content
is stored in METS/ALTO format in some backend system and intended to
be made available to normal desktop users, it needs to be converted to
HTML for display anyway, and then they might as well embed hOCR
metadata in the conversion output.
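As a rough sketch of what "embed hOCR metadata in the conversion
output" could look like: if a backend already emits HTML for display,
wrapping each word and line in a span with an hOCR class and a bbox
title is a small addition. The helpers hocr_word and hocr_line are
hypothetical names, and the input is assumed to be a list of
(text, (x0, y0, x1, y1)) pairs taken from whatever the backend stores.

```python
from html import escape


def hocr_word(text, bbox):
    """Wrap one word in an hOCR ocrx_word span carrying its bounding box."""
    x0, y0, x1, y1 = bbox
    return (
        f'<span class="ocrx_word" title="bbox {x0} {y0} {x1} {y1}">'
        f"{escape(text)}</span>"
    )


def hocr_line(words):
    """Wrap a list of (text, bbox) pairs in an ocr_line span.

    The line's bounding box is taken as the union of the word boxes.
    """
    x0 = min(b[0] for _, b in words)
    y0 = min(b[1] for _, b in words)
    x1 = max(b[2] for _, b in words)
    y1 = max(b[3] for _, b in words)
    inner = " ".join(hocr_word(t, b) for t, b in words)
    return f'<span class="ocr_line" title="bbox {x0} {y0} {x1} {y1}">{inner}</span>'


if __name__ == "__main__":
    print(hocr_line([("Hello", (10, 20, 110, 50)), ("world", (120, 22, 300, 48))]))
```

The result still renders as ordinary HTML text; the layout information
just rides along in the class and title attributes.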
In any case, have a look at the METS/ALTO documentation and at the
hOCR documentation and then pick whichever you think is easier to use
and work with. Since hOCR is intended to contain the complete OCR
output, you don't lose anything by storing your output in hOCR format.
Tom