That's a good point. At least using an XML variant of HTML (like XHTML) would be nice; I proposed this several years ago on the OCRopus list.
> So, I suggest we establish a definition now so that the engines that
> create OCR data have a known format to work towards during their
> roadmaps.
>
> hOCR is a good starting point so extending the idea from just lines of
> data to logical words and characters (both levels storing x-y position
> data) and switching to XML would, I think, cover many requirements.
>
> Does anyone have any suggestions or comments? Perhaps you see a need
> for additional metadata?
My advice usually is: Stick with existing standards and try to enhance them.
Please have a look at (from the field of digital libraries):
* ALTO http://www.loc.gov/standards/alto/
* textMD http://www.loc.gov/standards/textMD/
* And if you need to link the full text to the source images: METS http://www.loc.gov/standards/mets/
From the research community
* PAGE http://www.primaresearch.org/papers/ICPR2010_Pletschacher_PAGE.pdf
There were some more on the public pages of the OCRopus Google group but these seem to have vanished.
Best,
Christian
Can you say more about the forms processing application market that is
growing?
Thomas L. Packer
~~~~~~~~~~~~~~~~~~~~
[...]
> From the research community
> * PAGE http://www.primaresearch.org/papers/ICPR2010_Pletschacher_PAGE.pdf
Just for your information:
Sample PAGE files:
http://dl.psnc.pl/activities/projekty/impact/results/
A quick and dirty converter from PAGE to hOCR (called pageparser):
https://bitbucket.org/jwilk/marasca-wbl
For me PAGE is completely unacceptable. For example, it allows storing
information about the color but not about the font...
Best regards
JSB
--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Well, I tend to agree, but my experience is that people confuse feature richness with complexity.
> Bear in mind that standards set for archival use have to cater for
> long-term metadata (modification information, versions and so on),
> what we're doing with OCR data is feeding information into other
> formats: it's purely a transition format and so it doesn't need these
> overheads.
As long as such "overhead" is optional, it's not really overhead. For interoperability between systems, it is sometimes necessary to take the use cases (or requirements) of others into account as well.
> Alto comes closer than the others you mention in terms of fitting what
> we need but I think it is overkill in many regards.
Interesting: all the complaints about ALTO that I know of (from digital library people) claim the opposite.
Maybe we should gather a list of requirements for a new format...
Best,
Christian
> Best,
> Dave