ole2 container identification

Donald Mennerich

Jan 29, 2014, 12:53:30 PM

to droid...@googlegroups.com

Hello,

I am working on adding some code to an application that does file type identification with DROID/PRONOM. I'm running into a problem with OLE2 containers. Here's a method that I put together to get me the PUID and MIME type for a known .docx file as a test; it's essentially taken verbatim from the ReportPrinter class in the command-line application:

    private void identify(IdentificationRequest request, ContainerSignatureDefinitions defs) throws IOException {
        Ole2ContainerContentIdentifier identifier = new Ole2ContainerContentIdentifier();
        identifier.init(defs, "OLE2");
        Ole2IdentifierEngine engine = new Ole2IdentifierEngine();
        ContainerFileIdentificationRequestFactory requestFactory = new ContainerFileIdentificationRequestFactory();
        engine.setRequestFactory(requestFactory);
        identifier.setIdentifierEngine(engine);
        IdentificationResultCollection resultCollection = new IdentificationResultCollection(request);
        resultCollection = identifier.process(request.getSourceInputStream(), resultCollection);
        List<IdentificationResult> results = resultCollection.getResults();
        for(IdentificationResult result: results){
            System.out.println(result.getPuid() + ": " + result.getMimeType());
        }
    }

When I run this, I get an error saying that I'm calling the wrong part of POI (the OLE2 part, as used by HSSF). The file I'm using is a Word 2013 .docx:

Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
 at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:116)
 at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:153)
 at uk.gov.nationalarchives.droid.container.ole2.Ole2IdentifierEngine.process(Ole2IdentifierEngine.java:59)
 at AbstractContainerContentIdentifier.process(AbstractContainerContentIdentifier.java:118)
 at Ole.identify(Ole.java:56)

I think that `identifier.init(defs, "OLE2");` is the wonky bit of code. Can anyone point me in the right direction on how to initialize the identifier correctly? I'm having trouble finding how it works in the source.

Thanks,

Don

Matt Palmer

Jan 30, 2014, 3:48:43 AM

to droid...@googlegroups.com

Hi Don,

Can't comment on the code itself (hard to debug on a phone!) but I would note that ole2 is the container format for older office documents. Docx files are zip containers.

Before attempting to do container identification, you must be sure to pass the right kind of file to the right kind of container identifier.
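One rough way to make that check is to look at the first few bytes of the file before choosing an identifier. The sketch below is only an illustration using plain JDK classes: the class name `ContainerSniffer` and its `Kind` enum are made up for this example, and it is not how DROID itself routes files to its container identifiers. It relies on the well-known magic numbers for OLE2 compound files (`D0 CF 11 E0 A1 B1 1A E1`) and for ZIP local file headers (`50 4B 03 04`).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of DROID or POI): guesses the container
// type from the leading bytes of a stream.
public class ContainerSniffer {

    public enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 / Compound File Binary header signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK\3\4").
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Reads up to 8 bytes from the stream and compares them with the
    // known signatures. The bytes are consumed, so the caller should
    // open a fresh stream before handing the file to an identifier.
    public static Kind sniff(InputStream in) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                break; // end of stream before 8 bytes were read
            }
            filled += n;
        }
        if (startsWith(head, filled, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, filled, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int len, byte[] magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // First bytes of a typical .docx, which is a ZIP archive.
        byte[] docxStart = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00 };
        System.out.println(sniff(new ByteArrayInputStream(docxStart))); // prints ZIP
    }
}
```

With something like this in front, a .docx would come back as ZIP and never reach the OLE2 identifier, which is the step that throws the OfficeXmlFileException above. An empty or spanned ZIP archive starts with a different signature, so treat UNKNOWN as "not checked" rather than "not a container".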

Regards

Matt

Donald Mennerich

Jan 30, 2014, 1:38:53 PM

to droid...@googlegroups.com

thanks, makes sense now.

Andy Jackson

Feb 5, 2014, 11:57:09 AM

to droid...@googlegroups.com

Don,

In case it's of use to you, I've already packaged up the core DROID identification code as an embeddable JAR. You can find the source here:

https://github.com/openplanets/nanite/blob/master/nanite-core/

and an example of how to use it here:

https://github.com/openplanets/nanite/blob/master/nanite-core/src/test/java/uk/bl/wa/nanite/droid/DroidDetectorTest.java

It conforms to the Apache Tika detector API, and is available via Maven as:

    <dependency>
        <groupId>eu.scape-project.nanite</groupId>
        <artifactId>nanite-core</artifactId>
        <version>1.0.72.2</version>
    </dependency>

(I would be happy to fold this into the DROID project at some point, somehow, if that's of interest.)

Thanks,
Andy Jackson
