Extracting garbled text

Anatoly Kudrevatukh

Nov 5, 2013, 2:00:44 PM
to pdfne...@googlegroups.com
Q: I am using TextExtractor/pdf2text and its output contains garbled, unreadable text. What does that mean?
A: Some PDF files have garbled encoding (built-in or in the PDF font dictionary) and others have an incorrect ToUnicode mapping. In general, there is no ‘perfect’ solution, and trusting either the encoding or ToUnicode can be error-prone. In v5.9.2, in response to users asking for text output that is more consistent with Acrobat, we switched to using the font encoding first during Unicode mapping. The downside is that some files which PDFNet had previously processed without ‘issues’ came out garbled. Since v5.9.2.0 we have made further progress and can now extract correct text from even more documents (without running OCR). Unfortunately, not all documents can be recovered.

Q: Is there anything else I can do to extract garbled text?
A: If you simply want to extract textual data from the document, you can integrate the pdf2image tool or the PDFDraw class with any OCR solution.
If you want to recreate a document with correct text information, you can integrate PDFNet with the output of any OCR engine (e.g. Tesseract, ABBYY, etc.) by creating 'Searchable PDF Images' from the scanned PDF.
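
As a rough illustration of the OCR route, here is a minimal Python sketch that hands an already rendered page image to the Tesseract command-line tool. It assumes the page was rasterized beforehand (for example with pdf2image or PDFDraw) and that the `tesseract` executable is on the PATH. The function names and file names are hypothetical, and the `pdf` option below produces Tesseract's own searchable PDF, which is not the same thing as PDFNet's 'Searchable PDF Images' workflow.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang="eng", searchable_pdf=False):
    """Build the argument list for the Tesseract CLI.

    Tesseract is invoked as: tesseract IMAGE OUTPUTBASE [-l LANG] [CONFIG]
    With no config it writes OUTPUTBASE.txt; the 'pdf' config writes
    OUTPUTBASE.pdf with an invisible text layer over the image.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if searchable_pdf:
        cmd.append("pdf")
    return cmd


def run_ocr(image_path, output_base, lang="eng", searchable_pdf=False):
    """Run Tesseract on one page image and return the output file name."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = tesseract_command(image_path, output_base, lang, searchable_pdf)
    subprocess.run(cmd, check=True)
    return output_base + (".pdf" if searchable_pdf else ".txt")
```

For example, `run_ocr("page1.png", "page1")` would write `page1.txt` next to the image (the file names are placeholders).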

Q: Can I automatically detect documents with a missing Unicode mapping?
A: Unfortunately there is no simple PDF property that can be checked to identify which files are garbled.
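
Although no PDF property can be checked, a crude heuristic on the extracted text itself can flag some suspicious files for manual review or OCR. The Python sketch below is an assumption, not a PDFNet feature: it takes text already produced by TextExtractor/pdf2text and measures the share of characters that are control codes, private-use or unassigned code points, or the U+FFFD replacement character. The function names and the 0.2 threshold are hypothetical and would need tuning on real documents.

```python
import unicodedata

# Unicode general categories that rarely appear in correctly mapped text:
# Cc = control, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}


def suspicious_ratio(text):
    """Return the fraction of non-whitespace characters that look unmapped."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)


def looks_garbled(text, threshold=0.2):
    """Flag text whose suspicious-character ratio exceeds the threshold."""
    return suspicious_ratio(text) > threshold
```

This only catches output that lands on obviously invalid code points. A font whose mapping yields valid but wrong letters would pass the check, so the result should be treated as a hint rather than a verdict.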