We've received two sets of template files from two different agencies. DROID currently reports an extension mismatch for these files because of the .DOT extension. The existing DOC 97-2003 signature matches them, but the following shows that it could be extended to make it more precise.
Word itself is aware that the two kinds of file are different from one another, regardless of extension.
And within the Word specification is information on how to identify the two types of file (note the final bullet point here):
https://msdn.microsoft.com/en-us/library/dd944620(v=office.12).aspx
· wIdent (2 bytes): An unsigned integer that specifies that this is a Word Binary File. This value MUST be 0xA5EC.
· nFib (2 bytes): An unsigned integer that specifies the version number of the file format used. Superseded by FibRgCswNew.nFibNew if it is present. This value SHOULD<12> be 0x00C1.
· unused (2 bytes): This value is undefined and MUST be ignored.
· lid (2 bytes): A LID that specifies the install language of the application that is producing the document. If nFib is 0x00D9 or greater, then any East Asian install lid or any install lid with a base language of Spanish, German or French MUST be recorded as lidAmerican. If the nFib is 0x0101 or greater, then any install lid with a base language of Vietnamese, Thai, or Hindi MUST be recorded as lidAmerican.
· pnNext (2 bytes): An unsigned integer that specifies the offset in the WordDocument stream of the FIB for the document which contains all the AutoText items. If this value is 0, there are no AutoText items attached. Otherwise the FIB is found at file location pnNext×512. If fGlsy is 1 or fDot is 0, this value MUST be 0. If pnNext is not 0, each FIB MUST share the same values for FibRgFcLcb97.fcPlcBteChpx, FibRgFcLcb97.lcbPlcBteChpx, FibRgFcLcb97.fcPlcBtePapx, FibRgFcLcb97.lcbPlcBtePapx, and FibRgLw97.cbMac.
· A - fDot (1 bit): Specifies whether this is a document template (1).
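As a rough illustration of the field layout listed above, here is a minimal Python sketch that decodes the first 12 bytes of a WordDocument stream. It assumes the stream has already been extracted from the OLE2 container, that the fields are stored little-endian, and that fDot is the lowest bit of the 16-bit field that follows pnNext. The function name parse_fib_start and the sample bytes are made up for this example.

```python
import struct

WORD_IDENT = 0xA5EC  # wIdent value required by the specification


def parse_fib_start(stream: bytes) -> dict:
    """Decode the first six 16-bit little-endian fields of the stream."""
    w_ident, n_fib, _unused, lid, pn_next, flags = struct.unpack_from("<6H", stream, 0)
    return {
        "wIdent": w_ident,
        "nFib": n_fib,
        "lid": lid,
        "pnNext": pn_next,
        "fDot": flags & 0x0001,  # bit A: 1 means document template
    }


# Synthetic header for illustration only (not taken from a real file).
sample = struct.pack("<6H", WORD_IDENT, 0x00C1, 0, 0x0409, 0, 0x0001)
fields = parse_fib_start(sample)
```

In this synthetic sample fDot is set, so the stream would be treated as a template.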
Using the signature files I've attached to this email, I've extended the current Word signature and added a new one that identifies TMP files.
As part of this email, I'd like to ask:
Notes about the signature:
Because DROID is incapable of identifying bit-fields (as far as I am aware), I'm relying on the fact that whenever the fDot bit is set, the byte that contains it is an odd number. To find out whether we're looking at a .DOT file, we can therefore simply test for this byte being odd, e.g. (01|03|05|07|09|0B|0D|0F|11|13…) etc. The standard Word 97-2003 document can likewise be identified if this byte is even.
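To make the odd/even trick concrete, here is a small Python sketch that builds the alternation of odd byte values and checks the byte at offset 10 directly. It assumes the input is the raw WordDocument stream; the helper names odd_byte_alternation and is_template are hypothetical and not part of DROID.

```python
def odd_byte_alternation() -> str:
    """Build the (01|03|...|FF) alternation covering every odd byte value."""
    return "(" + "|".join(f"{b:02X}" for b in range(1, 256, 2)) + ")"


def is_template(stream: bytes) -> bool:
    """True if the byte at offset 10 is odd, i.e. its lowest bit (fDot) is set."""
    return (stream[10] & 0x01) == 1


alternation = odd_byte_alternation()

# Synthetic streams for illustration: ten padding bytes, then the flag byte.
template_stream = bytes(10) + b"\x01"
document_stream = bytes(10) + b"\x02"
```

The alternation lists 128 values, which is why a proper bitmask would be a more compact way to express the same test.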
Please take a look at the attached bare-bones container signature file to understand how it works in more detail.
DOCX and DOTX have their place as first class citizens in the DROID container file, and there seems to be merit in fleshing out the current .DOT entry with a corresponding signature.
Regards,
Ross
Ross Spencer | Digital Preservation Analyst
Archives New Zealand Te Rua Mahara o te Kawanatanga
Direct Dial: +64 4 894 6015 | Extn: 9348 | www.dia.govt.nz
<InternalSignature ID="1" Specificity="Specific">
<ByteSequence Reference="BOFoffset">
<SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
<!-- wIdent 0xA5EC is stored little-endian, so the bytes on disk are EC A5 -->
<Sequence>ECA5</Sequence>
</SubSequence>
</ByteSequence>
<ByteSequence Reference="BOFoffset">
<SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="10" SubSeqMinOffset="10">
<Sequence>(01|03|05|07|09|0B|0D|0F|11|13|15|17|19|1B|1D|1F|21|23|25|27|29|2B|2D|2F|31|33|35|37|39|3B|3D|3F|41|43|45|47|49|4B|4D|4F|51|53|55|57|59|5B|5D|5F|61|63|65|67|69|6B|6D|6F|71|73|75|77|79|7B|7D|7F|81|83|85|87|89|8B|8D|8F|91|93|95|97|99|9B|9D|9F|A1|A3|A5|A7|A9|AB|AD|AF|B1|B3|B5|B7|B9|BB|BD|BF|C1|C3|C5|C7|C9|CB|CD|CF|D1|D3|D5|D7|D9|DB|DD|DF|E1|E3|E5|E7|E9|EB|ED|EF|F1|F3|F5|F7|F9|FB|FD|FF)</Sequence>
</SubSequence>
</ByteSequence>
</InternalSignature>
But I get a SAX error:
[org.xml.sax.SAXParseException; lineNumber: 34; columnNumber: 50; The entity name must immediately follow the '&' in the entity reference.]
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(Unknown Source)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:514)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:215)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:184)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
at uk.gov.nationalarchives.droid.container.ContainerSignatureSaxParser.parse(ContainerSignatureSaxParser.java:70)
... 29 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 34; columnNumber: 50; The entity name must immediately follow the '&' in the entity reference.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
The ampersand has to be escaped as &amp; in the signature file. So I guess you'd have to write [&amp;01].
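The SAX error quoted above is what an XML parser reports when it meets a bare '&'. The following Python sketch (using only the standard library, not DROID's own parser) shows the unescaped form failing to parse and the escaped form round-tripping back to the bitmask text. The Sequence element is used here purely as a stand-in for a signature file entry.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_sequence = "[&01]"          # the text the signature engine should receive
escaped = escape(raw_sequence)  # the text that must be written into the XML


def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


bare_ok = parses(f"<Sequence>{raw_sequence}</Sequence>")
escaped_ok = parses(f"<Sequence>{escaped}</Sequence>")
round_trip = ET.fromstring(f"<Sequence>{escaped}</Sequence>").text
```

Once parsed, the entity is decoded again, so whatever reads the element sees the original bitmask text rather than the escaped form.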
As I recall, DROID simply parses the signatures into fragments, runs the fragments through the FragmentRewriter, which just does some simple syntactic substitutions, then gives it to the byteseek compiler. So byteseek syntax should be parsed fine, if it makes it through that process.
In terms of the syntax the value is just a hex byte bitmask. So the syntax for the 8th bit would then be [&80].
0x01 = 00000001 = [&01]
0x08 = 00001000 = [&08]
0x80 = 10000000 = [&80]
0x55 = 01010101 = [&55]
and so on...
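For anyone wanting to check the masks above by hand, here is a rough Python sketch. It assumes that a bitmask match means "every bit set in the mask is also set in the byte"; that reading of the syntax is an assumption on my part, and the function name matches_all_bits is hypothetical rather than anything from byteseek.

```python
def matches_all_bits(byte: int, mask: int) -> bool:
    """True if every bit set in mask is also set in byte."""
    return (byte & mask) == mask


def matching_bytes(mask: int) -> list:
    """All byte values (0-255) accepted by the given mask."""
    return [b for b in range(256) if matches_all_bits(b, mask)]


# Under this reading, mask 0x01 accepts exactly the odd byte values.
odd_values = matching_bytes(0x01)

# Number of byte values accepted by each mask from the table above.
counts = {mask: len(matching_bytes(mask)) for mask in (0x01, 0x08, 0x80, 0x55)}
```

Under that assumption, the 0x01 mask accepts the same 128 values as the long alternation of odd bytes.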
cheers,
Matt
Clearly, for single bits, this is entirely equivalent.