Hello
I'm looking for a C++ OCR library for recognizing difficult-to-parse text in PDF files, and I'm wondering whether Tesseract OCR is used for this kind of thing.
Basically, some PDF files are corrupted or use non-standard encodings, and I can't parse them with existing parsing tools built in C++. What I normally do then is convert each PDF page, one at a time, into an image file and re-print it as a PDF file. I then run Adobe's OCR Text Recognition function on it and go on to parse the resulting PDF file.
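To make the workflow concrete, here is a rough sketch of the control flow I have in mind. Every name in it (Page, Image, Parser, Rasterizer, OcrEngine, extractText) is a placeholder I made up for illustration; none of them comes from a real library. The OcrEngine slot is the piece I would like an OCR library to fill, replacing the manual Adobe step. It assumes C++17 for std::optional.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Placeholder types for illustration only; not taken from any real PDF or OCR library.
struct Page  { int number; std::string rawBytes; };
struct Image { int pageNumber; };

// Hypothetical hooks: the normal parser returns std::nullopt when a page
// is corrupted or uses an encoding it cannot handle.
using Parser     = std::function<std::optional<std::string>(const Page&)>;
using Rasterizer = std::function<Image(const Page&)>;
using OcrEngine  = std::function<std::string(const Image&)>;

// Extract text one page at a time. Pages the normal parser handles are used
// directly; pages it rejects are rendered to an image and sent to OCR.
std::vector<std::string> extractText(const std::vector<Page>& pages,
                                     const Parser& parse,
                                     const Rasterizer& rasterize,
                                     const OcrEngine& ocr) {
    std::vector<std::string> out;
    out.reserve(pages.size());
    for (const Page& page : pages) {
        if (auto text = parse(page)) {
            out.push_back(*text);                 // normal parsing worked
        } else {
            out.push_back(ocr(rasterize(page)));  // fallback: image, then OCR
        }
    }
    return out;
}
```

The point of passing the three steps in as callables is that the OCR step could be swapped for whatever library turns out to be suitable without touching the per-page loop.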
Can Tesseract be used for this kind of thing? I need an OCR library in C++ that I can incorporate into my programs, and I'm unsure whether Tesseract is such a library.
Thanks