Request for comment and testing of new DOC and DOT Word 97-2003 container signatures

ross-spencer

Apr 10, 2015, 1:24:16 AM
to droid...@googlegroups.com

We've received two sets of template files from two different agencies. DROID currently reports an extension mismatch for these files because of the .DOT extension. The signature for DOC 97-2003 matches; however, as the following demonstrates, it could be extended to make it more precise.

Word itself is aware that these files are different from one another, N.B. regardless of extension.

And within the Word specification is information on how to identify the two types of file (note the final bullet point here):

https://msdn.microsoft.com/en-us/library/dd944620(v=office.12).aspx

·         wIdent (2 bytes): An unsigned integer that specifies that this is a Word Binary File. This value MUST be 0xA5EC.

 

·         nFib (2 bytes): An unsigned integer that specifies the version number of the file format used. Superseded by FibRgCswNew.nFibNew if it is present. This value SHOULD<12> be 0x00C1.

·         unused (2 bytes): This value is undefined and MUST be ignored.

·         lid (2 bytes): A LID that specifies the install language of the application that is producing the document. If nFib is 0x00D9 or greater, then any East Asian install lid or any install lid with a base language of Spanish, German or French MUST be recorded as lidAmerican. If the nFib is 0x0101 or greater, then any install lid with a base language of Vietnamese, Thai, or Hindi MUST be recorded as lidAmerican.

·         pnNext (2 bytes): An unsigned integer that specifies the offset in the WordDocument stream of the FIB for the document which contains all the AutoText items. If this value is 0, there are no AutoText items attached. Otherwise the FIB is found at file location pnNext×512. If fGlsy is 1 or fDot is 0, this value MUST be 0. If pnNext is not 0, each FIB MUST share the same values for FibRgFcLcb97.fcPlcBteChpx, FibRgFcLcb97.lcbPlcBteChpx, FibRgFcLcb97.fcPlcBtePapx, FibRgFcLcb97.lcbPlcBtePapx, and FibRgLw97.cbMac.

 

·         A - fDot (1 bit): Specifies whether this is a document template (1).


In the signature files attached to this email, I've extended the current Word signature and added a new one that identifies template (.DOT) files.

As part of this email, I'd like to ask that:

  • The signature be added to PRONOM as part of x-fmt/45.
  • The signature for fmt/40 be extended so as not to cause double identification.
  • Users reading this post test the signature files with DROID to help provide more evidence that they work.
  • Readers comment on the signature and its addition.

Notes about the signature:

Because DROID is incapable of identifying bit-fields (as far as I am aware), I'm relying on the fact that whenever the fDot bit is set, the byte that contains it is an odd number. To find out whether we're looking at a .DOT file, we can therefore simply test for that byte being odd, e.g. (01|03|05|07|09|0B|0D|0F|11|13|…) etc. Conversely, a standard Word 97-2003 document can be identified when that byte is even.
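
The odd/even approach can be sketched in a few lines of Python. This is an illustration only and is not taken from the attached signature files: the function name `alternation` is hypothetical, and it assumes that fDot is the least significant bit of the byte being tested.

```python
def alternation(want_odd: bool) -> str:
    """Build a DROID-style alternation of every byte whose lowest bit matches."""
    values = [b for b in range(256) if (b & 0x01) == int(want_odd)]
    return "(" + "|".join(f"{b:02X}" for b in values) + ")"


dot_pattern = alternation(True)    # lowest bit set: candidate template
doc_pattern = alternation(False)   # lowest bit clear: candidate ordinary document
print(dot_pattern[:30] + "...")
```

Each pattern contains 128 alternatives, which is why a bitmask-style test would be more compact if DROID supports one.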

Please take a look at the attached bare-bones container signature file to understand how it works in more detail.

DOCX and DOTX have their place as first-class citizens in the DROID container file, and there seems to be merit in fleshing out the current .DOT entry with a corresponding signature.

Regards,

Ross

--

Ross Spencer | Digital Preservation Analyst 
Archives New Zealand Te Rua Mahara o te Kawanatanga
Direct Dial: +64 4 894 6015 | Extn: 9348 | 
www.dia.govt.nz

template.dot
non-template.doc
dot-and-doc-template-97-2003-signature-file.xml
test-dot-and-doc-template-container-signature-20150410.xml

Dclipsham

Apr 10, 2015, 5:42:31 AM
to droid...@googlegroups.com
Thanks Ross, this is excellent,

I agree that it is desirable to distinguish between these formats. What would be useful for me would be if further contributors could provide additional sample files, particularly across different versions of MS Word from 1997 to 2003. If anybody is able to, when uploading please specify in the file name the creating software version (and OS, if any Mac users can contribute, else I'll assume native Windows).

Additionally, Ross, I note the fEncrypted bit field in the format description provided:

F - fEncrypted (1 bit): Specifies whether the document is encrypted or obfuscated as specified in Encryption and Obfuscation.

It's the first bit of the following byte. Could this be the key to weeding out encrypted/password-protected files? If so, this would be incredibly useful, and we could use your same methodology of odd/even bytes. Are you in a position to experiment? In the meantime I shall see if I can dig out some legacy versions of MS Word to generate some samples myself.

I am planning for a release later in April if possible, so if additional samples can be provided quickly, then I may be able to fit these in for then.

David

Dclipsham

Apr 10, 2015, 9:33:18 AM
to droid...@googlegroups.com
Initial observations on DOT: based on a random sample of *.dot files held internally (which I probably can't share right now), all produced in Office 2003, the signature holds true as expected. In fact, in every file byte 0x0A has been either 0xF9 or 0xF1, but then our usage is probably fairly standard. I'm going to self-generate some in Office 97 and 2000 too, and I'll be able to share those. It will also give me the opportunity to play with password protection. I'll report back soon.

Further samples still welcome though!

David

Dclipsham

Apr 10, 2015, 11:13:10 AM
to droid...@googlegroups.com
Right, so using Ross' methodology for distinguishing between bit fields, and with reference to the Microsoft documentation (https://msdn.microsoft.com/en-us/library/dd944620(v=office.12).aspx), I seem to be able to distinguish between password-protected and non-password-protected MS Word 97-03 documents.

MS Word 97-03 documents use the Microsoft OLE2 compound document format. Effectively these files are containers holding lots of smaller files that make up the whole of the Word document. The container can be opened with an archive extraction tool such as 7-Zip, which lets you extract the 'WordDocument' file within. Within this file, the byte at position 0x0B deals with encryption/obfuscation, as per the MS FibBase description page. The first bit (ordered LSB) specifies whether the file is encrypted, meaning that if the byte at 0x0B is an 'odd' byte (e.g. 01, 03, 05, 07, 09, 0B, 0D, 0F, 11, 13 etc.), obfuscation/encryption is present. We can then use this to create a PRONOM container signature.
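
As a rough sketch of the checks described so far, the following Python reads the two flag bits from the first bytes of an already-extracted 'WordDocument' stream. The function name `fib_flags` and the sample bytes are hypothetical; extracting the stream from the OLE2 container is assumed to have been done separately.

```python
import struct


def fib_flags(word_document: bytes) -> dict:
    """Decode the flags discussed in this thread from the start of a WordDocument stream."""
    if len(word_document) < 12:
        raise ValueError("need at least the first 12 bytes of the stream")
    # wIdent is a little-endian 16-bit integer, so 0xA5EC is stored as EC A5.
    (w_ident,) = struct.unpack_from("<H", word_document, 0)
    return {
        "is_word_binary": w_ident == 0xA5EC,
        "fDot": bool(word_document[0x0A] & 0x01),        # bit A, lowest bit of byte 0x0A
        "fEncrypted": bool(word_document[0x0B] & 0x01),  # bit F, lowest bit of byte 0x0B
    }


# Synthetic header for illustration only (not taken from a real file).
sample = bytes.fromhex("eca5c100" + "00" * 6 + "f1" + "11")
print(fib_flags(sample))
```

Note that because the format stores integers little-endian, a byte-sequence signature anchored at offset 0 would need to match the bytes EC A5 rather than A5 EC; this is worth confirming against the sample files.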

I need to do a bit of further testing to ensure this is reliable and to ensure that no other process not involving password protection can produce the same outcome, and if anybody can provide further samples that are definitely password protected, then please do so.

For reference the files I have created are attached. The password for each password-protected file is 'password'.

David
WordXP_dotpassworded.dot
Word00_docpassworded.doc
Word00_dot.dot
Word00_dotpassworded.dot
WordXP_doc.doc
WordXP_docpassworded.doc
WordXP_dot.dot
Word97_doc.doc
Word97_docpassworded.doc
Word97_dot.dot
Word97_dotpassworded.dot
Word00_doc.doc

Lehane, Richard

Apr 11, 2015, 1:43:20 AM
to droid...@googlegroups.com

Hi David,
You could make a small refinement to your encrypted signature by only considering the bytes which have bit 4 (mask 0x10, counting from zero at the least significant bit) set to "1" (per the spec: J - fExtChar (1 bit): This value MUST be 1.). This would reduce each of your patterns from 128 choices to 64.

So, encrypted would be:
(0x11|0x13|0x15|0x17|0x19|0x1b|0x1d|0x1f|0x31|0x33|0x35|0x37|0x39|0x3b|0x3d|0x3f|0x51|0x53|0x55|0x57|0x59|0x5b|0x5d|0x5f|0x71|0x73|0x75|0x77|0x79|0x7b|0x7d|0x7f|0x91|0x93|0x95|0x97|0x99|0x9b|0x9d|0x9f|0xb1|0xb3|0xb5|0xb7|0xb9|0xbb|0xbd|0xbf|0xd1|0xd3|0xd5|0xd7|0xd9|0xdb|0xdd|0xdf|0xf1|0xf3|0xf5|0xf7|0xf9|0xfb|0xfd|0xff)

and unencrypted would be:
(0x10|0x12|0x14|0x16|0x18|0x1a|0x1c|0x1e|0x30|0x32|0x34|0x36|0x38|0x3a|0x3c|0x3e|0x50|0x52|0x54|0x56|0x58|0x5a|0x5c|0x5e|0x70|0x72|0x74|0x76|0x78|0x7a|0x7c|0x7e|0x90|0x92|0x94|0x96|0x98|0x9a|0x9c|0x9e|0xb0|0xb2|0xb4|0xb6|0xb8|0xba|0xbc|0xbe|0xd0|0xd2|0xd4|0xd6|0xd8|0xda|0xdc|0xde|0xf0|0xf2|0xf4|0xf6|0xf8|0xfa|0xfc|0xfe)

If you want to play around with this, I generated these sets with: http://play.golang.org/p/y8ViWAkymb
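
For readers who prefer Python, the same two sets can be reproduced with the sketch below. This is not the linked Go program; the constant and function names are hypothetical, and the masks follow the description above (fEncrypted as the lowest bit, fExtChar as mask 0x10).

```python
F_ENCRYPTED = 0x01  # bit F: lowest bit of the byte at offset 0x0B
F_EXT_CHAR = 0x10   # bit J: MUST be 1 according to the spec


def candidates(encrypted: bool) -> list:
    """Byte values with fExtChar set and fEncrypted equal to the requested state."""
    return [
        f"0x{b:02x}"
        for b in range(256)
        if b & F_EXT_CHAR and bool(b & F_ENCRYPTED) == encrypted
    ]


print("(" + "|".join(candidates(True)) + ")")   # encrypted
print("(" + "|".join(candidates(False)) + ")")  # unencrypted
```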

For all of these WordDocument signatures, it might also be worth adding the Word identifier bytes at offset 0 (per the spec: wIdent (2 bytes): An unsigned integer that specifies that this is a Word Binary File. This value MUST be 0xA5EC).

cheers
Richard







ross-spencer

Apr 11, 2015, 2:42:54 AM
to droid...@googlegroups.com
Hi Richard,

Good spot on adding the 'J' bit.

Just to share my own findings/notes re: the identification bytes. 0xA5EC at the beginning of this signature was my very first stop. I couldn't implement this using standard syntax:

'A5EC{8}(01|03|05...)' etc.

I think this *should* be valid syntax, but the failure appears to be related to the bug reported here: https://github.com/digital-preservation/droid/issues/50

However, I didn't try this:

  <InternalSignature ID="1" Specificity="Specific">
  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence>A5EC</Sequence>
    </SubSequence>
  </ByteSequence>
  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="10" SubSeqMinOffset="10">
      <Sequence>(01|03|05|07|09|0B|0D|0F|11|13|15|17|19|1B|1D|1F|21|23|25|27|29|2B|2D|2F|31|33|35|37|39|3B|3D|3F|41|43|45|47|49|4B|4D|4F|51|53|55|57|59|5B|5D|5F|61|63|65|67|69|6B|6D|6F|71|73|75|77|79|7B|7D|7F|81|83|85|87|89|8B|8D|8F|91|93|95|97|99|9B|9D|9F|A1|A3|A5|A7|A9|AB|AD|AF|B1|B3|B5|B7|B9|BB|BD|BF|C1|C3|C5|C7|C9|CB|CD|CF|D1|D3|D5|D7|D9|DB|DD|DF|E1|E3|E5|E7|E9|EB|ED|EF|F1|F3|F5|F7|F9|FB|FD|FF)</Sequence>
    </SubSequence>
  </ByteSequence>
</InternalSignature>

Which might work, and I agree would indeed be an improvement to these signatures. 

Ross

Matt Palmer

Apr 12, 2015, 6:41:09 AM
to droid...@googlegroups.com
Hi all,

DROID 6 should, in fact, be capable of identifying bit-fields, although there has not been a signature which uses this so far.  The byteseek library which DROID uses to process signatures has an "all-bitmask" operator &, and an "any-bitmask" operator ~.

For example, if you wanted to specify that bit 4 must match (but you don't care about the other bits), you could write [&08]. Or if you wanted to specify that a byte must be odd, you could write [&01]. More complex multi-bit masks are possible as well. I guess you could also test for a mask not matching by using the DROID syntax for an inverted set, !: [!&01].

I haven't tested this yet, but it should be easy to modify a container signature to try it.

However:  note that this is pure byteseek syntax.  DROID has already adopted a little bit of byteseek syntax in container signatures: the quotes for strings in container signatures come from byteseek originally.  Whether the National Archives wishes to use byteseek syntax directly or not is a matter for them.  They could, of course, come up with a different syntax and transform it to byteseek before use.  This is already done in a few cases where the syntax doesn't agree: the FragmentRewriter class performs this translation.

Regards,

Matt

ross-spencer

Apr 12, 2015, 6:12:12 PM
to droid...@googlegroups.com
Hi Matt, 

I couldn't quite figure out the logic, but what would the bitmask be for, say, the 8th bit?

In terms of testing it (and again for the benefit of the group), I tried the following:

    <SubSequence Position="1" SubSeqMinOffset="10" SubSeqMaxOffset="10">
        <Sequence>[&01]</Sequence>

But I get a SAX error:

[org.xml.sax.SAXParseException; lineNumber: 34; columnNumber: 50; The entity name must immediately follow the '&' in the entity reference.]
                at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(Unknown Source)
                at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:514)
                at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:215)
                at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:184)
                at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
                at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
                at uk.gov.nationalarchives.droid.container.ContainerSignatureSaxParser.parse(ContainerSignatureSaxParser.java:70)
                ... 29 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 34; columnNumber: 50; The entity name must immediately follow the '&' in the entity reference.
                at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)

Without any other info, I also tried it without square brackets, and I tried the negated syntax as well. Is it possible the byteseek syntax hasn't been fully implemented in DROID?

Ross

Matt Palmer

Apr 12, 2015, 6:31:43 PM
to droid...@googlegroups.com
Hi Ross,

Ah, that's just an XML parsing error. & is a special character in XML, so it must be encoded as &amp; in the signature file. So I guess you'd have to write [&amp;01].
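
The escaping issue is not specific to DROID and can be reproduced with any XML parser. The sketch below uses Python's standard library rather than DROID's own parser; the helper name `sequence_text` is hypothetical.

```python
import xml.etree.ElementTree as ET


def sequence_text(xml_fragment: str) -> str:
    """Parse a single <Sequence> element and return its decoded text."""
    return ET.fromstring(xml_fragment).text


try:
    sequence_text("<Sequence>[&01]</Sequence>")
except ET.ParseError as err:
    # A bare '&' is not well-formed XML, so the parser refuses the fragment.
    print("rejected:", err)

# With the ampersand escaped, the parser hands back the raw expression [&01].
print(sequence_text("<Sequence>[&amp;01]</Sequence>"))
```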

As I recall, DROID simply parses the signatures into fragments, runs the fragments through the FragmentRewriter, which just does some simple syntactic substitutions, then gives it to the byteseek compiler.   So byteseek syntax should be parsed fine, if it makes it through that process.

In terms of the syntax the value is just a hex byte bitmask.  So the syntax for the 8th bit would then be [&amp;80].

0x01 = 00000001 = [&amp;01]
0x08 = 00001000 = [&amp;08]
0x80 = 10000000 = [&amp;80]
0x55 = 01010101 = [&amp;55]

and so on...
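
The table above can be generated mechanically. In this sketch the function name `all_bitmask` is hypothetical, bits are counted from 1 at the least significant end (the convention used in this message), and the escaping uses Python's standard library rather than anything in DROID.

```python
from xml.sax.saxutils import escape


def all_bitmask(bit_number: int) -> str:
    """Return the all-bitmask expression for one bit, counting from 1 at the LSB."""
    if not 1 <= bit_number <= 8:
        raise ValueError("bit_number must be between 1 and 8")
    return f"[&{1 << (bit_number - 1):02X}]"


for n in (1, 4, 8):
    mask = 1 << (n - 1)
    # escape() turns '&' into '&amp;' so the expression can sit inside XML.
    print(f"0x{mask:02X} = {mask:08b} = {escape(all_bitmask(n))}")
```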

cheers,

Matt


Matt Palmer

Apr 12, 2015, 6:39:42 PM
to droid...@googlegroups.com
If that doesn't work, note that for single-bit matches the all-bitmask & and the any-bitmask ~ are equivalent.
  • All-bitmask matches must match *all* the bits in the bitmask.
  • Any-bitmask matches must match *any* of the bits in the bitmask.

Clearly, for a single bit, the two are entirely equivalent.

So you could also write [~80] to match the 8th bit only.
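
The difference between the two operators can be modelled as follows. This is a sketch based on my reading of the description above, not on byteseek's source; the function names are hypothetical.

```python
def all_bits_set(byte: int, mask: int) -> bool:
    """Model of the all-bitmask: every bit in the mask must be set in the byte."""
    return (byte & mask) == mask


def any_bit_set(byte: int, mask: int) -> bool:
    """Model of the any-bitmask: at least one bit in the mask must be set in the byte."""
    return (byte & mask) != 0


# For a single-bit mask the two tests agree on every possible byte value.
single_bit_agrees = all(
    all_bits_set(b, 0x80) == any_bit_set(b, 0x80) for b in range(256)
)
print(single_bit_agrees)

# For a multi-bit mask they can differ: 0x10 has only one of the two bits in 0x11.
print(all_bits_set(0x10, 0x11), any_bit_set(0x10, 0x11))
```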

cheers,

Matt

ross-spencer

Apr 12, 2015, 9:10:45 PM
to droid...@googlegroups.com
Yep, that looks pretty good Matt. 

Attached is my container signature file containing .DOT. David will want to try that with the encrypted signature too. 

Many thanks Matt! 

Ross
bitmask-test-dot-and-doc-template-container-signature-20150410.xml