Logical progression for hOCR?

David Evans

Feb 7, 2012, 2:28:22 PM
to hOCR
Would this be the right place to propose an extension to the hOCR
format?
There is an increasing requirement to understand a document's
construction down to the character level - think of forms processing
applications as a perfect (and growing) market - and HTML is no longer
the most appropriate data exchange format: XML has taken over.

So, I suggest we establish a definition now so that the engines that
create OCR data have a known format to work towards on their
roadmaps.

hOCR is a good starting point so extending the idea from just lines of
data to logical words and characters (both levels storing x-y position
data) and switching to XML would, I think, cover many requirements.
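
To make the idea concrete, here is a minimal sketch of what reading such an XML-flavoured file could look like. It assumes an XHTML serialisation in which every element carries an hOCR-style `title="bbox x0 y0 x1 y1"` property. The class names `ocr_line` and `ocrx_word` follow hOCR naming; `x_char` is a made-up class used purely to illustrate character-level boxes, and the sample document and helper names are hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML sample. "ocr_line" and "ocrx_word" follow hOCR naming;
# "x_char" is an invented class, used only to illustrate character-level boxes.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<span class="ocr_line" title="bbox 10 10 120 30">
<span class="ocrx_word" title="bbox 10 10 40 30"><span class="x_char" title="bbox 10 10 25 30">H</span><span class="x_char" title="bbox 26 10 40 30">i</span></span>
</span>
</body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR-style title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


def elements_by_class(root, cls):
    """Yield (text, bbox) for every span that carries the given class."""
    for el in root.iter(XHTML + "span"):
        if cls in el.get("class", "").split():
            text = "".join(el.itertext()).strip()
            yield text, parse_bbox(el.get("title", ""))


root = ET.fromstring(SAMPLE)
words = list(elements_by_class(root, "ocrx_word"))
chars = list(elements_by_class(root, "x_char"))
print(words)
print(chars)
```

Because the input is well-formed XML, a stock XML parser is enough here; no HTML-specific tag-soup handling is needed, which is the practical gain of the switch.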

Does anyone have any suggestions or comments? Perhaps you see a need
for additional metadata?

Christian Mahnke

Feb 7, 2012, 2:43:21 PM
to ho...@googlegroups.com
Hi,

>
> Would this be the right place to propose an extension to the hOCR
> format?
> There is an increasing requirement to understand a document's
> construction to a character level - think of forms processing
> applications as a perfect (and growing) market - and HTML is no longer
> the most appropriate data exchange: XML has taken over.

That's a good point. At least using an XML variant of HTML (like XHTML) would be nice; I proposed this several years ago on the OCRopus list.

> So, I suggest we establish a definition now so that the engines that
> create OCR data have a known format to work towards during their
> roadmaps.
>
> hOCR is a good starting point so extending the idea from just lines of
> data to logical words and characters (both levels storing x-y position
> data) and switching to XML would, I think, cover many requirements.
>
> Does anyone have any suggestions or comments? Perhaps you see a need
> for additional metadata?

My advice usually is: Stick with existing standards and try to enhance them.

Please have a look at (from the field of digital libraries):
* ALTO http://www.loc.gov/standards/alto/
* textMD http://www.loc.gov/standards/textMD/
* And if you have the requirement to link the fulltext and the source images: METS http://www.loc.gov/standards/mets/

From the research community
* PAGE http://www.primaresearch.org/papers/ICPR2010_Pletschacher_PAGE.pdf

There were some more on the public pages of the OCRopus Google group but these seem to have vanished.

Best,
Christian

Thomas Packer

Feb 7, 2012, 2:42:11 PM
to ho...@googlegroups.com
Good topic. I hope you get feedback. I wish there were a uniform,
data-rich OCR output format with at least bounding box coordinates--perhaps
also font size, style, alignment, etc.

Can you say more about the forms processing application market that is
growing?

Thomas L. Packer
~~~~~~~~~~~~~~~~~~~~

David Evans

Feb 8, 2012, 8:32:27 AM
to hOCR
Hi Christian,

I know these standards exist today and, while some lack the detail we
need for OCR positional data, others are vastly over-complex.
Bear in mind that standards set for archival use have to cater for
long-term metadata (modification information, versions and so on).
What we're doing with OCR data is feeding information into other
formats: it's purely a transition format, so it doesn't need these
overheads.

ALTO comes closer than the others you mention to fitting what
we need, but I think it is overkill in many regards.

Best,
Dave

David Evans

Feb 8, 2012, 8:57:15 AM
to hOCR
Thomas,

Good points - perhaps we should invite people to submit a wish-list of
requirements. With this information, we could possibly look at the
alternatives that exist today and see if anything fits the remit we
have. If there's nothing available then at least we know what we
should put in a new proposal! ;)
I'm certainly with you on bounding box co-ords. for line, word and
character; font size (at the character level, ideally) and style. I'm
not so convinced, personally, of the value of alignment to forms
processing / data extraction companies because it usually offers
nothing in the way of identifying the importance of an item of text -
but I recognise others may well find this a useful attribute.
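
As a starting point for such a wish-list, the attributes mentioned so far could be sketched as a small data model. This is only an illustration: the class and field names below are hypothetical, the pixel-based `(x0, y0, x1, y1)` box convention is an assumption, and alignment is deliberately left out until someone makes the case for it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) in pixels; an assumed convention, not part of any standard.
BBox = Tuple[int, int, int, int]


@dataclass
class Char:
    text: str
    bbox: BBox
    font_size: Optional[float] = None  # stored per character, as wished for above
    style: Optional[str] = None        # e.g. "bold"; the vocabulary is an assumption


@dataclass
class Word:
    bbox: BBox
    chars: List[Char] = field(default_factory=list)

    @property
    def text(self) -> str:
        # A word's text is derived from its characters.
        return "".join(c.text for c in self.chars)


@dataclass
class Line:
    bbox: BBox
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Words are joined with single spaces; real spacing would need the boxes.
        return " ".join(w.text for w in self.words)


line = Line(
    bbox=(0, 0, 100, 20),
    words=[
        Word(
            bbox=(0, 0, 40, 20),
            chars=[
                Char("O", (0, 0, 20, 20), font_size=12.0),
                Char("K", (21, 0, 40, 20), font_size=12.0, style="bold"),
            ],
        )
    ],
)
print(line.text)
```

Anything a contributor adds to the wish-list would show up here as one more optional field, which keeps the "optional, not overhead" question visible.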

The forms market? I've been involved with data extraction from images
since 1989 and have seen the market change, grow and generally become
more 'useful' to businesses. The big forms companies (we all know who
they are, right?) have produced excellent packages that address
almost any type of business document one might see - but in doing so,
they have made the configuration of the applications rather complex -
and the costs of getting these into businesses have risen to the
point that there is now a large gap at the lower end of the market.
As a result of this, I'm seeing a growing number of specialist
software houses offering solutions that address a vertical or niche
market. These guys are all building their own logic to 'understand'
the content of a document, and absolute positional data is obviously
key to furthering this understanding. Most deploy commercial OCR
packages that offer their own versions of this data (mostly in XML
format), but without a standard for it, users are finding it hard to
exchange one OCR engine for another without substantial code changes.
This is crippling innovation, and they're unable to look at open
source engines because these lack the detail in their metadata.

A new open standard would, I believe, extend the reach of engines like
Tesseract and OCRopus, making them usable by many more organisations
and offering users a genuinely open exchange for data.

--Dave

Janusz S. Bień

Feb 8, 2012, 9:46:01 AM
to ho...@googlegroups.com
On Tue, 7 Feb 2012 Christian Mahnke <cma...@googlemail.com> wrote:

[...]

Just for your information:

Sample PAGE files:

http://dl.psnc.pl/activities/projekty/impact/results/

A quick and dirty converter from PAGE to hOCR (called pageparser):

https://bitbucket.org/jwilk/marasca-wbl

For me PAGE is completely unacceptable. For example, it allows
storing information about the color but not about the font...

Best regards

JSB

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Christian Mahnke

Feb 8, 2012, 9:58:18 AM
to ho...@googlegroups.com
Hi Dave,

> I know these standards exist today and, while some lack the detail we
> need for OCR positional data, others are vastly over-complex.

Well, I tend to agree, but my experience is that people confuse feature richness with complexity.

> Bear in mind that standards set for archival use have to cater for
> long-term metadata (modification information, versions and so on),
> what we're doing with OCR data is feeding information into other
> formats: it's purely a transition format and so it doesn't need these
> overheads.

As long as such "overhead" is optional, it's not really overhead. For interoperability between systems, it's sometimes necessary to take the use cases (or requirements) of others into account as well.


> Alto comes closer than the others you mention in terms of fitting what
> we need but I think it is overkill in many regards.

Interesting, all the complaints about ALTO that I know of (from digital library guys) claim the opposite.

Maybe we should gather a list of requirements for a new format...


Best,
Christian

> Best,
> Dave
