hOCR format spec

julien

Oct 2, 2009, 5:21:27 AM
to ocropus
I am currently using the OCR system Cuneiform. For flexibility I want
to use the hOCR format.

In order to standardize the output from Cuneiform, I want to follow
the standard as closely as possible.
Ocropus refers to this page for the standard:
http://docs.google.com/View?docid=dfxcv4vc_67g844kf

I have not been able to find any other spec so I suppose this is still
the official standard (last update 2007).
Who would be the owner of the hOCR spec? Are any changes
foreseen/planned?


Thanks in advance,
Regards
Julien

Thomas Breuel

Oct 5, 2009, 1:06:40 AM
to ocr...@googlegroups.com
> I am currently using the ocr system Cuneiform. For flexibility I want
> to use the hocr format.

Great!

> In order to standardize the output from Cuneiform, I want to follow
> the standard as closely as possible.
> Ocropus refers to this page for the standard:
> http://docs.google.com/View?docid=dfxcv4vc_67g844kf
>
> I have not been able to find any other spec so I suppose this is still
> the official standard (last update 2007).

Yes, that's the official document.

> Who would be the owner of the hocr spec?

I maintain it.

> Are any changes foreseen/planned?

No; most of the hard parts of OCR output formats (styles, fonts,
script-dependent issues) are taken care of by the HTML spec. hOCR
just describes how to denote OCR-specific information like bounding
boxes.
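
To illustrate the bounding-box point, here is a minimal sketch of reading boxes back out of hOCR. It assumes the usual hOCR convention of a class name starting with "ocr" plus a title attribute holding "bbox x0 y0 x1 y1", with further properties separated by semicolons. The sample markup and the names BBoxCollector and extract_boxes are made up for this example and come from neither Cuneiform nor OCRopus.

```python
from html.parser import HTMLParser

# Hypothetical sample, not real engine output. The class names and the
# "bbox x0 y0 x1 y1" title syntax follow the hOCR convention.
SAMPLE = """<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocr_line" title="bbox 10 20 300 50">Hello world</span>
</div>"""


class BBoxCollector(HTMLParser):
    """Collects (class_name, (x0, y0, x1, y1)) for elements carrying a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        title = attrs.get("title") or ""
        # Properties in the title attribute are separated by semicolons.
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                box = tuple(int(p) for p in parts[1:])
                # Record the box once per OCR-related class on the element.
                for cls in classes:
                    if cls.startswith("ocr"):
                        self.boxes.append((cls, box))


def extract_boxes(hocr_text):
    collector = BBoxCollector()
    collector.feed(hocr_text)
    collector.close()
    return collector.boxes
```

Calling extract_boxes(SAMPLE) returns one entry for the page and one for the line. Everything else in the file (fonts, styles, text direction) stays ordinary HTML, so a regular HTML parser is all that is needed.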

If there is something completely different you need (e.g.,
bibliographic markup, etc.), just use and/or define a separate
microformat to represent it.

If there is something engine-specific you need, pick an ocrx_... tag
that doesn't conflict with an existing one.

ocr_... tags are intended to represent engine-independent information,
so for that, it's probably a good idea to talk about it before picking
a new tag.
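
As a rough sketch of the naming rule above, a tool could sort names by prefix and check a proposed name against the ones already in use. The function names and the example names in the usage note are hypothetical; only the ocr_ and ocrx_ prefixes come from the convention described here.

```python
def classify_hocr_name(name):
    """Classify a name by its hOCR prefix."""
    if name.startswith("ocrx_"):
        return "engine-specific"
    if name.startswith("ocr_"):
        return "engine-independent"
    return "other"


def is_free_engine_specific_name(candidate, existing_names):
    """True if candidate is an ocrx_ name that is not already taken.

    existing_names is whatever list of names the caller already knows
    about; this sketch does not ship an authoritative list.
    """
    return (
        classify_hocr_name(candidate) == "engine-specific"
        and candidate not in set(existing_names)
    )
```

For example, is_free_engine_specific_name("ocrx_myengine_score", ["ocrx_word"]) is accepted, while a new name starting with ocr_ is rejected by this check, since engine-independent names should be discussed first.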

Tom
