Hyphens and punctuation in ALTO output


Jean-Luc Arvers

Sep 18, 2020, 12:59:24 PM
to tesseract-dev
Hello:

Is it possible to get ALTO output that marks hyphenation at the end of a line?

At the moment, for a hyphenated word I get:

<String ID = "string_15" HPOS = "1002" VPOS = "606" WIDTH = "88" HEIGHT = "30" WC = "0.92" CONTENT = "quin -" />
</TextLine>
<TextLine ID = "line_3" HPOS = "493" VPOS = "624" WIDTH = "560" HEIGHT = "48">
<String ID = "string_16" HPOS = "493" VPOS = "624" WIDTH = "51" HEIGHT = "38" WC = "0.92" CONTENT = "tal," /> <SP WIDTH = "21" VPOS = "624" HPOS = "544" />

What I would need instead is:

<String ID = "string_15" HPOS = "1002" VPOS = "606" WIDTH = "88" HEIGHT = "30" WC = "0.92" CONTENT = "quin-" SUBS_TYPE = "HypPart1" SUBS_CONTENT = "quintal" />
<HYP CONTENT = "" WIDTH = "14" HPOS = "..." VPOS = "..." />
</TextLine>
<TextLine ID = "line_3" HPOS = "493" VPOS = "624" WIDTH = "560" HEIGHT = "48">
<String ID = "string_16" HPOS = "493" VPOS = "624" WIDTH = "51" HEIGHT = "38" WC = "0.92" CONTENT = "tal," SUBS_TYPE = "HypPart2" SUBS_CONTENT = "quintal" />
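
For illustration, this kind of markup could be approximated as a post-processing step on the ALTO file. The following is only a minimal sketch, not a Tesseract feature: `mark_hyphenation` is a hypothetical helper, it assumes a line-final String ending in "-" always continues on the next TextLine, and it joins the two halves naively without any dictionary check. It also follows the usual ALTO convention of putting the hyphen in `HYP CONTENT` and removing it from the String, which differs slightly from the example above, and it leaves out the HYP geometry because that is not known.

```python
import xml.etree.ElementTree as ET

def mark_hyphenation(root):
    """Annotate words that are split across consecutive TextLine elements."""
    # Reuse the root's namespace prefix, if any (for example "{http://...}").
    ns = root.tag[:root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    lines = list(root.iter(ns + "TextLine"))
    for cur, nxt in zip(lines, lines[1:]):
        cur_words = cur.findall(ns + "String")
        nxt_words = nxt.findall(ns + "String")
        if not cur_words or not nxt_words:
            continue
        first, second = cur_words[-1], nxt_words[0]
        text = first.get("CONTENT", "")
        if not text.endswith("-"):
            continue
        stem = text[:-1].rstrip()  # "quin -" and "quin-" both give "quin"
        if not stem:
            continue  # a lone dash is not a hyphenated word
        # Assumption: the whole word is simply both halves glued together,
        # minus any punctuation trailing the second half.
        whole = stem + second.get("CONTENT", "").rstrip(".,;:!?")
        first.set("CONTENT", stem)
        first.set("SUBS_TYPE", "HypPart1")
        first.set("SUBS_CONTENT", whole)
        second.set("SUBS_TYPE", "HypPart2")
        second.set("SUBS_CONTENT", whole)
        # HPOS, VPOS and WIDTH of the hyphen are unknown here, so they are omitted.
        hyp = ET.SubElement(cur, ns + "HYP")
        hyp.set("CONTENT", "-")
    return root

SAMPLE = """<Page>
  <TextLine ID="line_2"><String ID="string_15" CONTENT="quin -"/></TextLine>
  <TextLine ID="line_3"><String ID="string_16" CONTENT="tal,"/></TextLine>
</Page>"""

print(ET.tostring(mark_hyphenation(ET.fromstring(SAMPLE)), encoding="unicode"))
```

A real document would need a check that the joined word actually exists, since a line can also end in a dash that is not a hyphenation.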

I have the same question about isolating punctuation: "." at the end of a line, ",", ";", etc. When these characters are attached to the word (as in "tal," above), a search for the word itself no longer matches.
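
To make the punctuation request concrete, here is a rough post-processing sketch. Everything in it is an assumption rather than existing Tesseract behaviour: `split_trailing_punctuation` is a hypothetical helper, the `_p` ID suffix is invented, coordinates are taken to be integer pixels as in the snippets above, and the width of the punctuation is estimated by treating every character as equally wide, because the String element carries no per-glyph boxes.

```python
import xml.etree.ElementTree as ET

TRAILING = ".,;:!?"

def split_trailing_punctuation(line, ns=""):
    """Move trailing punctuation of each String into its own String element."""
    # Walk backwards so that inserting a sibling does not shift pending indices.
    for idx in range(len(line) - 1, -1, -1):
        word = line[idx]
        if word.tag != ns + "String":
            continue
        text = word.get("CONTENT", "")
        core = text.rstrip(TRAILING)
        tail = text[len(core):]
        if not core or not tail:
            continue  # nothing to split, or the token is punctuation only
        hpos = int(word.get("HPOS", "0"))
        width = int(word.get("WIDTH", "0"))
        # Assumption: all characters have the same width (a crude estimate).
        core_width = round(width * len(core) / len(text))
        shared = {k: v for k, v in word.attrib.items() if k in ("VPOS", "HEIGHT")}
        punct = ET.Element(ns + "String", shared)
        punct.set("ID", word.get("ID", "") + "_p")  # hypothetical ID scheme
        punct.set("HPOS", str(hpos + core_width))
        punct.set("WIDTH", str(width - core_width))
        punct.set("CONTENT", tail)
        word.set("CONTENT", core)
        word.set("WIDTH", str(core_width))
        line.insert(idx + 1, punct)
    return line

LINE = ('<TextLine ID="line_3"><String ID="string_16" HPOS="493" VPOS="624" '
        'WIDTH="51" HEIGHT="38" WC="0.92" CONTENT="tal,"/></TextLine>')

print(ET.tostring(split_trailing_punctuation(ET.fromstring(LINE)), encoding="unicode"))
```

With the word and its punctuation in separate String elements, a search index built from `CONTENT` would see "tal" and "," as two tokens.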

Thank you for your feedback, and apologies if this has already been answered in an earlier discussion.