Hi Banti,
I've been working on something similar, but it doesn't yet do exactly
what you want. Basically, I have a tool that converts the text layers
of a PDF to hOCR, one of the output formats of Tesseract. If you run
that, and then also OCR the entire PDF with Tesseract, you could try to
"merge" the two hOCR files into one, preferring the extracted text over
the Tesseract text where the two overlap, or choosing between them
based on word confidence.
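
To make the merge idea a bit more concrete, here is a rough, untested
sketch of the "prefer the extracted text where boxes overlap" rule. It
works on plain word records rather than real hOCR, so the Word type,
the overlaps() helper and the 0.5 overlap threshold are all made up for
illustration; parsing and writing hOCR is left out entirely.

```python
from typing import NamedTuple, Tuple


class Word(NamedTuple):
    # Hypothetical record for one word; not part of any existing tool.
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image pixels
    conf: float = 100.0


def overlaps(a, b, min_ratio=0.5):
    """True if the intersection covers at least min_ratio of the smaller box.

    The 0.5 default is an arbitrary assumption and would need tuning.
    """
    ix = min(a[2], b[2]) - max(a[0], b[0])
    iy = min(a[3], b[3]) - max(a[1], b[1])
    if ix <= 0 or iy <= 0:
        return False
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return smaller > 0 and (ix * iy) / smaller >= min_ratio


def merge_words(pdf_words, ocr_words):
    """Keep every PDF word; add OCR words only where no PDF word overlaps."""
    merged = list(pdf_words)
    for w in ocr_words:
        if not any(overlaps(w.bbox, p.bbox) for p in pdf_words):
            merged.append(w)
    # Rough reading order: top to bottom, then left to right.
    merged.sort(key=lambda w: (w.bbox[1], w.bbox[0]))
    return merged
```

A confidence-based variant would compare conf on the overlapping pair
instead of always keeping the PDF word. The overlap check here is
quadratic in the number of words, which is probably fine per page.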
Of course, you'll have to figure out the right scale: Tesseract needs
each PDF page rendered to an image, and the pixels of that image have
to line up with the hOCR coordinates extracted from the PDF.
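
As a starting point for the scale, here is a minimal sketch that
assumes the extracted coordinates are in PDF points (1/72 inch, the
default PDF user space unit) with a top-left origin, and that the page
was rendered at a known DPI. Whether that matches what a given tool
outputs is something you would need to check; points_to_pixels() is
just an illustrative helper, not part of any existing tool.

```python
def points_to_pixels(bbox, dpi=300):
    """Scale an (x0, y0, x1, y1) box from PDF points to image pixels.

    Assumes 72 points per inch and that both coordinate systems share
    the same origin; no axis flip or page offset is applied.
    """
    return tuple(round(v * dpi / 72) for v in bbox)
```

For example, at 300 DPI a box one inch from the left edge starts at
pixel column 300, because 72 points * 300 / 72 = 300.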
You can find the tool here (but keep in mind I'm still actively working
on it / breaking things):
https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr
I don't (yet) have a tool to merge hOCR files.
Regards,
Merlijn