I'm trying to OCR some PDFs, and it's mostly successful: I'm using Ghostscript to convert the PDF pages into images and feeding those images into Tesseract, and the magic is happening.
What I'm struggling with is that the position information Tesseract gives me doesn't seem to be enough to position the characters, either for display or for creating a stream to insert back into my PDF file.
If I want to create a PDF stream that draws the OCR'd text in place, I need to know where the baseline of each character is. If I want to display the OCR output on screen in Windows I have more flexibility, but only if I assume that Windows renders exactly the same font that Tesseract was trained on, which is probably not a safe assumption.
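To make the baseline requirement concrete, here is a minimal Python sketch of the placement step. It assumes an unrotated page in default PDF user space, a page image rendered at a known DPI, and a font already registered in the page resources under the hypothetical name F1. The helper names are made up for illustration, and the baseline is taken as an input, which is exactly the value I don't have.

```python
def pixel_to_pdf(x_px, y_px, dpi, page_height_pt):
    """Map image pixels (origin top-left, y down) to PDF points
    (origin bottom-left, y up). Assumes an unrotated, uncropped page."""
    scale = 72.0 / dpi  # PDF user space defaults to 72 units per inch
    return x_px * scale, page_height_pt - y_px * scale


def text_op(text, x_px, baseline_y_px, font_size_pt, dpi, page_height_pt,
            font_name="F1"):
    """Build a content-stream fragment that draws `text` with its origin
    on the baseline. `font_name` is a hypothetical resource name."""
    x, y = pixel_to_pdf(x_px, baseline_y_px, dpi, page_height_pt)
    # Backslashes and parentheses must be escaped inside PDF literal strings.
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    # Tm sets the text matrix; the glyph origin it positions lies on the baseline.
    return (f"BT /{font_name} {font_size_pt:g} Tf "
            f"1 0 0 1 {x:.2f} {y:.2f} Tm ({escaped}) Tj ET")
```

The point of the sketch is that the y value passed to `Tm` is the baseline, not the bottom of the glyph box, so the box bottom of a "p" cannot be used directly.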
So I've created an image that contains the text "Will o the wisp", and the OCR is working well. For the "W" I get back the bounds of that glyph, and for that particular character it's probably safe to assume the bottom of the glyph is on the baseline. However, for the "p" I also get back only the bounds of the glyph, so the bottom y-coordinate lies below the baseline by the glyph's descent, and the box alone gives me no way to determine where the baseline is. So how do I draw it accurately?
I can ask Windows for the bounds of each glyph in its font, and I can use that information to estimate the baseline in the Tesseract-generated data, but I'm finding that the result is rather inaccurate.
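A font-independent variant of that estimate, which is not the Windows-metrics approach, would be to treat the bottoms of glyphs without descenders (like the "W") as baseline samples and take their median. The sketch below is only an illustration under assumptions: the box tuple layout, the descender set, and the function name are all hypothetical, the coordinates are image pixels with y increasing downward, and the text line is assumed to be horizontal.

```python
from statistics import median

# Characters assumed to extend below the baseline in typical Latin fonts.
# This set is an assumption; it varies by font (e.g. "Q", "f" in italics).
DESCENDERS = set("gjpqy,;")


def estimate_baseline(boxes):
    """Estimate the baseline y of one horizontal text line.

    `boxes` is a list of (char, left, top, right, bottom) tuples in image
    pixels, y increasing downward (a hypothetical layout, not a Tesseract
    data structure).
    """
    bottoms = [bottom for ch, _l, _t, _r, bottom in boxes
               if ch not in DESCENDERS and not ch.isspace()]
    if not bottoms:
        raise ValueError("no non-descender glyphs to anchor the baseline")
    # The median tolerates round letters ("o", "e", "s") that overshoot
    # the baseline by a pixel or two.
    return median(bottoms)


# Made-up boxes for part of a line; the numbers are purely illustrative.
sample = [("W", 0, 10, 20, 40), ("i", 22, 12, 26, 40),
          ("l", 28, 10, 31, 40), ("p", 33, 20, 45, 49)]
baseline_y = estimate_baseline(sample)
descent_of_p = sample[-1][4] - baseline_y  # how far "p" hangs below the line
```

This only works when a line contains at least one glyph without a descender, and it ignores skew; a tilted scan would need a fitted line rather than a single y value.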
I assume other people have solved this problem already. Is there something obvious I'm missing?
Thanks,
Chris
p.s. I realize there is software out there that will OCR PDFs and do this work for me - for my project, the OCR is part of a larger process and so I really need to have more manual control.