Nick White
Apr 29, 2014, 12:45:08 PM
to ho...@googlegroups.com
Hi all,
I'm interested in adding some alternative readings output to the
hOCR generated by Tesseract, so was pleased to find that section 10
of the hOCR specification seems to provide for that.
However, I have a few questions:
- Is it envisaged that these alternatives should be within an
ocrx_word tag?
- What's meant by the "x_cost" property? It isn't mentioned
elsewhere in the spec. Do you mean x_wconf? That would seem
reasonable.
- It would be nice to also include the most probable
interpretation directly within the ocrx_word span, so that parsers
that aren't expecting the alternatives don't treat the word as
blank.
So I imagine producing output like the following:
<span class='ocrx_word' title='bbox 183 552 280 614; x_wconf 70'>
cat
<span class='alternatives'>
<ins class='alt' title='x_wconf 70'>cat</ins>
<del class='alt' title='x_wconf 50'>cab</del>
</span>
</span>
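
To illustrate how a consumer might read this, here is a rough Python sketch using only the standard library's html.parser. It is not taken from Tesseract or any existing hOCR tool; the names WordParser, parse_title and parse_word are made up for the example, and it assumes the markup is exactly as proposed above (class 'alt' on each ins/del, confidence in x_wconf). A parser that only looks at text directly inside ocrx_word still sees "cat", while one that knows about the alternatives can also collect them.

```python
from html.parser import HTMLParser

# The fragment proposed above, reproduced as-is.
FRAGMENT = """
<span class='ocrx_word' title='bbox 183 552 280 614; x_wconf 70'>
cat
<span class='alternatives'>
<ins class='alt' title='x_wconf 70'>cat</ins>
<del class='alt' title='x_wconf 50'>cab</del>
</span>
</span>
"""


def parse_title(title):
    """Split a title like 'bbox 1 2 3 4; x_wconf 70' into {name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordParser(HTMLParser):
    """Hypothetical reader for a single ocrx_word with nested alternatives."""

    def __init__(self):
        super().__init__()
        self.stack = []         # class attribute of each currently open element
        self.word_text = ""     # text found directly inside ocrx_word
        self.alternatives = []  # (text, x_wconf) pairs, in document order
        self._alt_conf = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        self.stack.append(cls)
        if cls == "alt":
            # Assumption: confidence is carried in x_wconf, as in the proposal.
            conf = parse_title(attrs.get("title") or "").get("x_wconf") or [None]
            self._alt_conf = conf[0]

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1] == "ocrx_word":
            self.word_text += text
        elif self.stack[-1] == "alt":
            self.alternatives.append((text, self._alt_conf))


def parse_word(html):
    """Return (most probable text, list of (alternative, x_wconf))."""
    parser = WordParser()
    parser.feed(html)
    parser.close()
    return parser.word_text, parser.alternatives


if __name__ == "__main__":
    word, alts = parse_word(FRAGMENT)
    print(word)
    print(alts)
```

The sketch keeps the x_wconf values as strings and ignores the ins/del distinction; a real consumer would probably want both.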
Does that all sound reasonable and good?
Nick