PDF throws exceptions accessing DocInfo

Ryan

Aug 9, 2016, 3:53:47 PM
to PDFTron PDFNet SDK
Question:

We found a few documents that throw an exception when accessing doc.GetDocInfo(). Here are the top 4 exceptions we get:

PDFNetException - Code:0
File:    ObjParser.cpp
Func:    trn::SDF::ObjParser::GetObj
Line:    340
Expr:    m_operand_stack.size() >= 1
Message: Operator endobj expects a single argument

PDFNetException - Code:0
File:    ObjParser.cpp
Func:    trn::SDF::ObjStmParser::ObjStmParser
Line:    40
Expr:    GetObj()
Message: Compressed object is corrupt

PDFNetException - Code:0
File:    Parser.cpp
Func:    trn::SDF::Parser::LexDict
Line:    349
Expr:    num_elements>=0 && (num_elements % 2 == 0)
Message: the number of key-value elements should be even

PDFNetException - Code:0
File:    Parser.cpp
Func:    trn::SDF::Parser::LexDict
Line:    358
Expr:    !m_operand_stack[i-1]->IsIndirect() && m_operand_stack[i-1]->IsName()
Message: Bad key



In these cases where the document info is corrupt, should we assume that the entire document is corrupt? Should we avoid reading/rendering/writing to the file?

Answer:

There are essentially two types of "corruption" in a PDF: a bad XRef table, and malformed content.

The first thing that happens when opening a PDF is reading the XRef table, which provides the exact byte offsets of objects. If this turns out to be incorrect, the table is "repaired". For an example XRef, see the red section here: https://www.pdftron.com/pdfnet/intro.html#pdf_intro

To detect this case, see this post.

Ideally, in this case the file gets saved with the e_remove_unused flag.

Either way, at this point the XRef has been repaired but no objects have actually been accessed. Once you start performing operations on the file, such as viewing, any object can throw an exception like the ones you see above; this is the second kind of "corruption". PDFNet has over 10 years of experience dealing with malformed PDF files and does its best to complete each action, but when that is not possible an exception is thrown.

In these cases where the document info is corrupt, should we assume that the entire document is corrupt? Should we avoid reading/rendering/writing to the file?

Generally, you can keep using these files and interacting with them. It is possible that writing to them might make things worse, but reading and viewing should be fine.

For example, if you are trying to populate a UI element with the document info (such as Author), then just catch the PDFNet exceptions, and leave the fields blank. Typically PDF viewers would not report errors in this case.
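
The "catch and leave blank" approach could be sketched roughly as follows. This is only an illustration, not PDFNet sample code: the helper names safe_field and read_doc_info are made up for this example, and the two stub classes stand in for a real PDFDoc and its document info object so the snippet runs without the SDK. In a real application you would pass in your PDFDoc and would probably catch the SDK's own exception type rather than a bare Exception.

```python
def safe_field(getter, default=""):
    """Call getter() and return its value, or default if it raises."""
    try:
        value = getter()
    except Exception:  # assumption: narrow this to the SDK's exception type
        return default
    return value if value is not None else default


def read_doc_info(doc):
    """Collect a few info fields for display, leaving unreadable ones blank."""
    blank = {"Title": "", "Author": "", "Subject": ""}
    try:
        info = doc.GetDocInfo()
    except Exception:
        # The info dictionary itself could not be read.
        return blank
    # Each field is read separately so one bad entry does not hide the others.
    return {
        "Title": safe_field(info.GetTitle),
        "Author": safe_field(info.GetAuthor),
        "Subject": safe_field(info.GetSubject),
    }


# Hypothetical stand-ins for a document whose Author entry is malformed.
class StubInfo:
    def GetTitle(self):
        return "Report"

    def GetAuthor(self):
        raise RuntimeError("Bad key")

    def GetSubject(self):
        return "Q3"


class StubDoc:
    def GetDocInfo(self):
        return StubInfo()


class StubBrokenDoc:
    def GetDocInfo(self):
        raise RuntimeError("Compressed object is corrupt")


fields = read_doc_info(StubDoc())
print(fields)
```

Reading each field on its own means a single malformed entry only blanks that one UI element, and the rest of the document can still be opened and viewed as usual.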
