Hi Christian,
On Fri, Nov 15, 2013 at 05:39:10AM -0800, Christian Pietsch wrote:
> I would like to use Tesseract for processing digitized books in an academic
> library. Unfortunately, our IT pipeline expects ALTO XML or TEI XML word
> coordinates and does not know about hocr.
I've wanted the same thing, actually, and when I get the time would
like to write a proper ALTO XML export option. But "when I get the
time" is unlikely to be particularly soon :(
> Is there a tool or code snippet for converting the hocr output produced by
> Tesseract or OCRopus to ALTO or (partial) TEI XML? I have searched the
> interwebs for traces of such a tool, but all I could find was a statement that
> it should be possible to create it.
I don't think there is such a tool, unfortunately. If there were, I'd
expect it to be on the tools site of the hOCR project:
https://code.google.com/p/hocr-tools/
> If nothing turns up, I would be prepared to
> hack an XSLT stylesheet that does the job, or, failing that, some code in a
> scripting language.
That would be great - do share what you come up with!
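
As a rough starting point for the scripting-language route, here is a
minimal sketch in Python. It assumes the hOCR file is well-formed XHTML
in which lines carry class "ocr_line" and words carry class "ocrx_word",
each with a "bbox x0 y0 x1 y1" entry in the title attribute. The function
names (bbox_of, hocr_to_alto) are made up for illustration, and the
output is only an ALTO-like skeleton: a schema-valid ALTO file needs more
than this (namespace declaration, page and line dimensions, and so on).

```python
import re
import xml.etree.ElementTree as ET

# Matches the "bbox x0 y0 x1 y1" entry in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def bbox_of(element):
    """Return (x0, y0, x1, y1) parsed from the title attribute, or None."""
    match = BBOX_RE.search(element.get("title", ""))
    return tuple(int(v) for v in match.groups()) if match else None


def hocr_to_alto(hocr_text):
    """Convert hOCR markup to a simplified, ALTO-like XML string."""
    root = ET.fromstring(hocr_text)
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page")
    block = ET.SubElement(page, "TextBlock")

    # Checking the class attribute directly avoids having to deal
    # with the XHTML namespace on the element tags.
    for line in root.iter():
        if line.get("class") != "ocr_line":
            continue
        text_line = ET.SubElement(block, "TextLine")
        for word in line.iter():
            if word.get("class") != "ocrx_word":
                continue
            box = bbox_of(word)
            if box is None:
                continue  # no coordinates, nothing useful to emit
            x0, y0, x1, y1 = box
            ET.SubElement(text_line, "String", {
                "HPOS": str(x0),
                "VPOS": str(y0),
                "WIDTH": str(x1 - x0),
                "HEIGHT": str(y1 - y0),
                # itertext() also picks up text nested in <strong>/<em>.
                "CONTENT": "".join(word.itertext()).strip(),
            })
    return ET.tostring(alto, encoding="unicode")
```

Something like hocr_to_alto(open("page.hocr", "rb").read()) would then
give you one String element per word, grouped into a TextLine per hOCR
line. An XSLT stylesheet could do the same mapping; the Python version is
just quicker to experiment with.
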
If you're comfortable with C++, an alternative to converting hOCR
would be to write the ALTO export code directly. That would be more
work for sure, but not that difficult. Take a look at the
GetHOCRText function in baseapi.cpp if you're curious.
I've used TEI a little in the past, but hadn't considered using it
directly in OCR output. It's an intimidatingly massive XML spec; is
there a good reason for outputting OCR results directly to TEI? I
would have thought it wouldn't be particularly useful until the TEI
had been manually marked up anyway, but maybe I'm missing something.
Thanks, and please keep me informed of how you get on (also, I'm
sure the hOCR project people would be interested to hear about any
conversion scripts, however hacky).
Nick