Output of Tesseract is not as useful without font baseline information?

ch...@sc3.net

Sep 11, 2013, 11:14:22 AM

to tesser...@googlegroups.com

I'm trying to OCR some PDFs I have, and it's mostly successful - I'm using GhostScript to convert my PDF pages into images, and I'm feeding those images into Tesseract, and the magic is happening.
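A rough sketch of that two-step pipeline, for readers who want to reproduce it. The post does not give the actual commands, so everything here is an assumption: the `png16m` device, the 300 dpi resolution, the `hocr` output config, the executable names (on Windows the Ghostscript console binary is usually `gswin64c`), and the helper function names.

```python
import subprocess


def ghostscript_command(pdf_path, output_pattern, dpi=300, gs_exe="gs"):
    """Build a Ghostscript command that renders every PDF page to a PNG.

    output_pattern may contain a printf-style page counter, e.g. "page-%03d.png".
    """
    return [
        gs_exe,
        "-dNOPAUSE",            # do not pause between pages
        "-dBATCH",              # exit when the input is exhausted
        "-sDEVICE=png16m",      # 24-bit colour PNG output device
        "-r%d" % dpi,           # rendering resolution in dots per inch
        "-sOutputFile=%s" % output_pattern,
        pdf_path,
    ]


def tesseract_command(image_path, output_base, config="hocr"):
    """Build a Tesseract command; the hocr config writes text plus bounding boxes."""
    return ["tesseract", image_path, output_base, config]


def run(cmd):
    """Run one of the commands above, raising if the tool reports failure."""
    subprocess.run(cmd, check=True)
```

Keeping the rendering resolution explicit matters later, because it is the only link between Tesseract's pixel coordinates and positions on the PDF page.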

What I'm struggling with is that the position information Tesseract gives me doesn't seem to let me position the characters, either for display or for creating a stream to insert back into my PDF file.

If I want to create a PDF stream that draws the OCR'd text in place, I need to know where the baseline of each character is. If I want to display the OCR output on the Windows screen I have more flexibility, but only if I assume that Windows has generated the exact same font that Tesseract was trained on, which is probably not a safe assumption.

So I've created an image that contains the text "Will o the wisp", and the OCR is working well. For the "W" I get back the bounds of that glyph, and for that particular character it's probably safe to assume the bottom of the glyph is on the baseline. However, for the "p" I also get back the bounds of the glyph, so the bottom y-coordinate is baseline minus descent, which leaves me with no way to determine where its baseline is. So how do I draw it accurately?
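To make the problem concrete: in a PDF content stream, the `Td` operator places the origin of the first glyph, and that origin lies on the baseline, so a bounding-box bottom cannot be used directly for a glyph with a descender. The sketch below only shows the coordinate mapping once a baseline is known. It assumes the page image was rendered at a known dpi, covers the whole unrotated page, uses a top-left origin with y growing downward, and that a font is registered under the hypothetical resource name `F1`; the function names are illustrative.

```python
def pixel_to_pdf_point(x_px, y_px, dpi, page_height_pt):
    """Map image pixels (origin top-left, y down) to PDF points (origin bottom-left, y up)."""
    x_pt = x_px * 72.0 / dpi                    # 72 PDF points per inch
    y_pt = page_height_pt - y_px * 72.0 / dpi   # flip the vertical axis
    return x_pt, y_pt


def text_operator(text, x_pt, y_pt, font_name="F1", font_size=12):
    """Return a content-stream fragment that draws text with its baseline origin at (x_pt, y_pt)."""
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "BT /%s %g Tf %.2f %.2f Td (%s) Tj ET" % (
        font_name, font_size, x_pt, y_pt, escaped)
```

For example, a baseline point at pixel (300, 300) on a 300 dpi rendering of a US Letter page (792 pt tall) maps to (72, 720) in PDF points.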

I can ask Windows for the bounds of each glyph in its font, and I can use that information to estimate the baseline in the Tesseract-generated data, but I'm finding that is rather inaccurate.

I assume other people have solved this problem already, is there something obvious I'm missing?

Thanks,

Chris

p.s. I realize there is software out there that will OCR PDFs and do this work for me - for my project, the OCR is part of a larger process and so I really need to have more manual control.

Tom Morris

Sep 12, 2013, 2:17:14 PM

to tesser...@googlegroups.com

On Wednesday, September 11, 2013 11:14:22 AM UTC-4, ch...@sc3.net wrote:

> What I'm struggling with is that the position information Tesseract gives me doesn't seem to let me position the characters, either for display or for creating a stream to insert back into my PDF file.

> If I want to create a PDF stream that draws the OCR'd text in place, I need to know where the baseline of each character is. If I want to display the OCR output on the Windows screen I have more flexibility, but only if I assume that Windows has generated the exact same font that Tesseract was trained on, which is probably not a safe assumption.

> So I've created an image that contains the text "Will o the wisp", and the OCR is working well. For the "W" I get back the bounds of that glyph, and for that particular character it's probably safe to assume the bottom of the glyph is on the baseline. However, for the "p" I also get back the bounds of the glyph, so the bottom y-coordinate is baseline minus descent, which leaves me with no way to determine where its baseline is. So how do I draw it accurately?

> I can ask Windows for the bounds of each glyph in its font, and I can use that information to estimate the baseline in the Tesseract-generated data, but I'm finding that is rather inaccurate.

It seems like you're doing half of one thing and half of another. If you want to display the extracted glyphs at their original location, all you need is the bounding box (or even simpler, you could just show the original image). If you want to draw a line of text using a new font on a different display, you're going to want to use the font metrics associated with the new font, not the old font. From a line of characters with their bounding boxes it should be possible to compute an approximate baseline, but I really wonder if that's what you want to be doing rather than positioning text at the block level (or not at all).
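One way the approximate baseline mentioned above could be computed is sketched here; it is an illustration, not something taken from Tesseract itself. It assumes each character comes with a `(char, left, top, right, bottom)` box in image coordinates (y growing downward), ignores characters that normally descend below the baseline, and fits a least-squares line through the remaining box bottoms so that slight page skew is tolerated. The descender set and the function name are assumptions, and round letters that overshoot the baseline slightly will add a little noise.

```python
# Characters that normally extend below the baseline in Latin fonts (an assumed set).
DESCENDERS = set("gjpqy,;")


def estimate_baseline(boxes):
    """Fit baseline_y = slope * x + intercept through the bottoms of non-descending glyphs.

    boxes: iterable of (char, left, top, right, bottom), y growing downward.
    """
    points = [((left + right) / 2.0, float(bottom))
              for ch, left, top, right, bottom in boxes
              if ch not in DESCENDERS and not ch.isspace()]
    if not points:
        raise ValueError("no non-descending characters to fit a baseline to")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # A single usable glyph (or identical x positions): assume a horizontal line.
        return 0.0, mean_y

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

With the baseline known, the descent of a glyph such as "p" is simply its box bottom minus the baseline value at its x position.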

If you display the original image and just keep the text for search, cut & paste, etc., like most PDF viewers do, I suspect your users will be happier.

Tom

William Xue

Mar 14, 2015, 4:53:09 PM

to tesser...@googlegroups.com, ch...@sc3.net

On Wednesday, September 11, 2013 at 11:14:22 PM UTC+8, ch...@sc3.net wrote:

So, I wonder, have you figured it out? And how?
