hocr to ALTO XML converter?


Christian Pietsch

Nov 15, 2013, 8:39:10 AM
to tesser...@googlegroups.com
Dear devs,

I would like to use Tesseract for processing digitized books in an academic library. Unfortunately, our IT pipeline expects ALTO XML or TEI XML word coordinates and does not know about hocr.

Is there a tool or code snippet for converting the hocr output produced by Tesseract or OCRopus to ALTO or (partial) TEI XML? I have searched the interwebs for traces of such a tool, but all I could find was a statement that it should be possible to create it. If nothing turns up, I would be prepared to hack an XSLT stylesheet that does the job, or, failing that, some code in a scripting language.

Cheers!
Christian

Nick White

Nov 18, 2013, 5:58:04 AM
to tesser...@googlegroups.com
Hi Christian,

On Fri, Nov 15, 2013 at 05:39:10AM -0800, Christian Pietsch wrote:
> I would like to use Tesseract for processing digitized books in an academic
> library. Unfortunately, our IT pipeline expects ALTO XML or TEI XML word
> coordinates and does not know about hocr.

I've wanted the same thing, actually, and when I get the time would
like to write a proper ALTO XML export option. But "when I get the
time" is unlikely to be particularly soon :(

> Is there a tool or code snippet for converting the hocr output produced by
> Tesseract or OCRopus to ALTO or (partial) TEI XML? I have searched the
> interwebs for traces of such a tool, but all I could find was a statement that
> it should be possible to create it.

I don't think there is such a tool, unfortunately. If there was, I'd
expect it to be on the tools site of the hOCR project:
https://code.google.com/p/hocr-tools/

> If nothing turns up, I would be prepared to
> hack an XSLT stylesheet that does the job, or, failing that, some code in a
> scripting language.

That would be great - do share what you come up with!

If you're comfortable with C++ an alternative to converting hOCR
would be to write the ALTO export code directly. That would be more
work for sure, but not that difficult. Take a look at the
GetHOCRText function in baseapi.cpp if you're curious.

I've used TEI a little in the past, but hadn't considered using it
directly in OCR output. It's an intimidatingly massive XML spec; is
there a good reason for outputting OCR results directly to TEI? I
would have thought it wouldn't be particularly useful until the TEI
had been manually marked up anyway, but maybe I'm missing something.

Thanks, and please keep me informed of how you get on (also, I'm
sure the hOCR project people would be interested to hear about any
conversion scripts, however hacky).

Nick

Stefan Weil

May 21, 2016, 2:23:46 PM
to tesseract-dev
On Monday, November 18, 2013 at 11:58:04 UTC+1, Nick White wrote:
> Hi Christian,
>
> On Fri, Nov 15, 2013 at 05:39:10AM -0800, Christian Pietsch wrote:
> > I would like to use Tesseract for processing digitized books in an academic
> > library. Unfortunately, our IT pipeline expects ALTO XML or TEI XML word
> > coordinates and does not know about hocr.
>
> I've wanted the same thing, actually, and when I get the time would
> like to write a proper ALTO XML export option. But "when I get the
> time" is unlikely to be particularly soon :(
>
> > Is there a tool or code snippet for converting the hocr output produced by
> > Tesseract or OCRopus to ALTO or (partial) TEI XML? I have searched the
> > interwebs for traces of such a tool, but all I could find was a statement that
> > it should be possible to create it.

https://github.com/UB-Mannheim/ocr-fileformat supports transformations
between hOCR and ALTO in both directions.
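
For anyone who would rather script the conversion discussed in this thread, here is a rough Python sketch of the general idea. It is not the ocr-fileformat implementation. It assumes the hOCR is well-formed XHTML that uses the ocr_page, ocr_line and ocrx_word classes with "bbox x0 y0 x1 y1" in the title attribute, and it assumes the ALTO v2 namespace. The helper names (parse_bbox, set_box, hocr_to_alto) are made up for illustration, all lines of a page go into a single TextBlock, and the output has not been validated against the ALTO schema.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: target the ALTO v2 namespace; adjust for other ALTO versions.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def has_class(element, name):
    """True if the element's class attribute contains the given hOCR class."""
    return name in element.get("class", "").split()


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(v) for v in match.groups()) if match else None


def set_box(element, bbox):
    """Write an hOCR corner-style bbox as ALTO HPOS/VPOS/WIDTH/HEIGHT."""
    x0, y0, x1, y1 = bbox
    element.set("HPOS", str(x0))
    element.set("VPOS", str(y0))
    element.set("WIDTH", str(x1 - x0))
    element.set("HEIGHT", str(y1 - y0))


def hocr_to_alto(hocr_text):
    """Convert hOCR pages, lines and words to a bare-bones ALTO string."""
    root = ET.fromstring(hocr_text)
    alto = ET.Element("alto", {"xmlns": ALTO_NS})
    description = ET.SubElement(alto, "Description")
    ET.SubElement(description, "MeasurementUnit").text = "pixel"
    layout = ET.SubElement(alto, "Layout")

    pages = [e for e in root.iter() if has_class(e, "ocr_page")]
    for page_no, page in enumerate(pages, start=1):
        alto_page = ET.SubElement(
            layout, "Page",
            {"ID": f"page_{page_no}", "PHYSICAL_IMG_NR": str(page_no)})
        page_box = parse_bbox(page.get("title"))
        if page_box:
            alto_page.set("WIDTH", str(page_box[2]))
            alto_page.set("HEIGHT", str(page_box[3]))
        space = ET.SubElement(alto_page, "PrintSpace")
        # Simplification: one TextBlock per page instead of mapping ocr_carea.
        block = ET.SubElement(space, "TextBlock", {"ID": f"block_{page_no}"})

        for line in page.iter():
            if not has_class(line, "ocr_line"):
                continue
            text_line = ET.SubElement(block, "TextLine")
            line_box = parse_bbox(line.get("title"))
            if line_box:
                set_box(text_line, line_box)
            for word in line.iter():
                word_box = parse_bbox(word.get("title"))
                if not has_class(word, "ocrx_word") or word_box is None:
                    continue
                content = "".join(word.itertext()).strip()
                string = ET.SubElement(text_line, "String",
                                       {"CONTENT": content})
                set_box(string, word_box)

    return ET.tostring(alto, encoding="unicode")


# Hand-written sample input, not real Tesseract output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="image 'p1.png'; bbox 0 0 800 600">
<span class="ocr_line" title="bbox 100 200 400 230">
<span class="ocrx_word" title="bbox 100 200 180 230; x_wconf 93">Hello</span>
<span class="ocrx_word" title="bbox 200 200 400 230; x_wconf 90">world</span>
</span></div></body></html>"""

print(hocr_to_alto(SAMPLE))
```

The main point is the coordinate mapping: hOCR stores two corners, while ALTO wants an origin plus width and height. Real hOCR files with a DOCTYPE or HTML entities may need a more forgiving parser than ElementTree.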
