Portable Document Format (PDF) forensic analysis is a type of request we encounter often in our computer forensics practice. The requests usually entail PDF forgery analysis or intellectual property related investigations. In virtually all cases, I have found that the PDF metadata contained in metadata streams and the document information dictionary have been instrumental. I will provide a brief overview of these metadata sources and then provide an example of how they can be useful during PDF forensic analysis.
In the computer forensics context, PDF files can be a treasure trove of metadata. Metadata can be stored in a PDF document in a document information dictionary and/or in one or more metadata streams. A metadata stream, whose contents are represented in Extensible Markup Language (XML), may contain metadata for an entire document, and for components within a document.
Here is how the document information dictionary looked for the sample file. Please note that the values below match those of their semantically equivalent counterparts in the metadata stream for the document.
An update has been released for the first quarter 2017 Inpatient /Outpatient data dictionary. The previous released version in the 1Q2017 CD/DVD discs (Last Updated: December, 2017) has been found that the coding scheme for the data element PAT_STATUS was misaligned. The updated data dictionary contains the correct code scheme for PAT_STATUS.
The PDF version is determined by the data specified in the PDF header and theVersion key of the document catalog dictionary.In the event that these two values do not match, the Versionkey is taken as the authoritative value.
PDF logical structure shares basic features with standard document markup languages such as HTML, SGML, and XML. A document's logical structure is expressed as a hierarchy of structure elements, each represented by a dictionary object. Like their counterparts in other markup languages, PDF structure elements can have content and attributes. In PDF, rendered document content takes over the role occupied by text in HTML, SGML, and XML.
The logical structure of a document is described by a hierarchy of objects called the structure hierarchy or structure tree. At the root of the hierarchy is a dictionary object called the structure tree root, located by means of the StructTreeRoot entry in the document catalog. See Section 14.7.2, ("Structure Hierarchy") in PDF 1.7 (ISO 32000-1): Table 322 shows the entries in the structure tree root dictionary. The K entry specifies the immediate children of the structure tree root, which are structure elements.
The objective of this technique is to notify the user when a field that must be completed has not been completed in a PDF form. Required fields are implemented using the /Ff entry in the form field's dictionary (see Table 220 in Section 12.7 (Interactive Forms) of PDF 1.7 (ISO 32000-1). This is normally accomplished using a tool for authoring PDF.
The heuristics used by assistive technology will sometimes use the text label if a programmatically associated label cannot be found. The TU entry (which is the tooltip) of the field dictionary is the programmatically associated label (see Example 3 below and Table 220 in PDF 1.7 (ISO 32000-1)). Therefore, add a tooltip to each field to provide a label that assistive technology can interpret.
The intent of this technique is to show how a descriptive title for a PDF document can be specified for assistive technology by using the /Title entry in the document information dictionary and by setting the DisplayDocTitle flag to True in a viewer preferences dictionary. This is typically accomplished by using a tool for authoring PDF.
The following code fragment illustrates code that is typical for using the /Lang entry in the structure element dictionary. In this case, the /Lang entry applies to the marked-content sequence having an MCID (marked-content identifier) value of 0 within the indicated page's content stream.
582128177f