Unclear about alternative readings

Nick White

Apr 29, 2014, 12:45:08 PM
to ho...@googlegroups.com
Hi all,

I'm interested in adding some alternative readings output to the
hOCR generated by Tesseract, so was pleased to find that section 10
of the hOCR specification seems to provide for that.

However, I have a few questions:

- Is it envisaged that these alternatives should be within an
ocrx_word tag?
- What's meant by the "x_cost" property? It isn't mentioned
elsewhere in the spec. Do you mean x_wconf? That would seem
reasonable.
- It would be nice to also include the most probable
interpretation directly within the ocrx_word span, so that parsers
that aren't expecting the alternatives don't end up treating the
word as blank.

So I imagine producing output like the following:

<span class='ocrx_word' title='bbox 183 552 280 614; x_wconf 70'>
cat
<span class='alternatives'>
<ins class='alt' title='x_wconf 70'>cat</ins>
<del class='alt' title='x_wconf 50'>cab</del>
</span>
</span>
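
To make the third point concrete, here is a rough Python sketch of how a
parser might read the output above. It only assumes the markup proposed in
this message (the x_wconf property on the alt elements is my suggestion,
not something the spec defines), and the WordParser class and its
attribute names are made up for illustration. It uses only the standard
library's html.parser module.

```python
from html.parser import HTMLParser

# The proposed output from this message, verbatim.
HOCR = """<span class='ocrx_word' title='bbox 183 552 280 614; x_wconf 70'>
cat
<span class='alternatives'>
<ins class='alt' title='x_wconf 70'>cat</ins>
<del class='alt' title='x_wconf 50'>cab</del>
</span>
</span>"""


class WordParser(HTMLParser):
    """Hypothetical parser: collects the text sitting directly inside an
    ocrx_word span, plus any readings marked with class 'alt'."""

    def __init__(self):
        super().__init__()
        self.stack = []          # class attribute of each open element
        self.word = ""           # text directly inside ocrx_word
        self.alternatives = []   # (text, x_wconf) pairs
        self._alt_conf = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        self.stack.append(cls)
        if cls == "alt":
            # Assumes a title of the form "x_wconf 70", as proposed above.
            props = attrs.get("title", "").split()
            if len(props) == 2 and props[0] == "x_wconf":
                self._alt_conf = int(props[1])
            else:
                self._alt_conf = None

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1] == "ocrx_word":
            # A parser that knows nothing about alternatives stops here.
            self.word += text
        elif self.stack[-1] == "alt":
            self.alternatives.append((text, self._alt_conf))


parser = WordParser()
parser.feed(HOCR)
parser.close()
print(parser.word)
print(parser.alternatives)
```

With the most probable reading placed directly in the ocrx_word span,
a parser that ignores the alternatives still sees "cat" as the word text,
while one that understands them can pick up both readings and their
confidences.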

Does that all sound reasonable and good?

Nick