Q: I am using TextExtractor/pdf2text and its output contains garbled, unreadable text. What does that mean?
A: Some PDF files have a garbled encoding (built-in or in the PDF font
dictionary) and others have an incorrect ToUnicode mapping. In general,
there is no ‘perfect’ solution, and trusting either the encoding or the
ToUnicode mapping can be error prone. In v5.9.2, at the request of some
users (who asked for text output that is more consistent with Acrobat),
we switched to using the font encoding first during Unicode mapping. The
downside is that some files which PDFNet previously processed without
‘issues’ now produce garbled text.
Since v5.9.2.0 we have made further progress and can now extract correct text from even more documents (without running OCR). Unfortunately, not all documents can be recovered.
Q: Is there anything else I can do to extract garbled text?
A: If you simply want to extract textual data from the document, you can integrate the pdf2image tool or the PDFDraw class with any OCR solution.
If you want to recreate a document with correct text information, you can integrate PDFNet with any OCR output (e.g. Tesseract, ABBYY, etc.) by creating 'Searchable PDF Images' from the scanned PDF.
Q: Can I automatically detect documents with a missing Unicode mapping?
A: Unfortunately, there is no simple PDF property that can be checked to
identify which files are garbled.
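That said, a rough heuristic can be applied to the text after it has been extracted. The sketch below is not part of PDFNet; the function names and the 0.2 threshold are assumptions chosen for illustration. It counts characters that commonly appear when a Unicode mapping is missing (the U+FFFD replacement character, private-use, unassigned and control code points) and flags the text if they make up a large share of it. It will not catch text that is mapped to valid but wrong letters, so treat it only as a first filter before falling back to OCR.

```python
import unicodedata

# Unicode general categories that rarely occur in correctly extracted text:
# Co = private use, Cn = unassigned, Cc = control characters.
SUSPICIOUS_CATEGORIES = ("Co", "Cn", "Cc")


def suspicious_ratio(text):
    """Return the fraction of non-whitespace characters that look unmapped."""
    total = 0
    bad = 0
    for ch in text:
        if ch.isspace():
            continue  # whitespace says nothing about the mapping quality
        total += 1
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            bad += 1
    return bad / total if total else 0.0


def looks_garbled(text, threshold=0.2):
    """Hypothetical helper: flag text whose suspicious share exceeds threshold."""
    return suspicious_ratio(text) > threshold
```

A typical use would be to run looks_garbled() on the text extracted from each page and to send only the flagged pages to OCR. The threshold should be tuned on a sample of your own documents.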