Question about square brackets in DROID container identification syntax...


ross-spencer

Aug 24, 2014, 3:18:31 AM
to droid...@googlegroups.com
Hi All,

I've spotted some odd square brackets in the DROID container signature file syntax, and I'm not sure what they are for. Take this signature:

        <ContainerSignature Id="6000" ContainerType="ZIP">
            <Description>Open Document Text 1.0</Description>
            <Files>
                <File>
                    <Path>META-INF/manifest.xml</Path>
                    <BinarySignatures>
                        <InternalSignatureCollection>
                            <InternalSignature ID="319">
                                <ByteSequence Reference="BOFoffset">
                                    <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="1024">
                                        <Sequence>'manifest:media-type="application/vnd.oasis.opendocument.text'</Sequence>
                                    </SubSequence>
                                </ByteSequence>
                            </InternalSignature>
                        </InternalSignatureCollection>
                    </BinarySignatures>
                </File>
                <File>
                    <Path>content.xml</Path>
                    <BinarySignatures>
                        <InternalSignatureCollection>
                            <InternalSignature ID="323">
                                <ByteSequence Reference="BOFoffset">
                                    <SubSequence Position="1" SubSeqMinOffset="0" SubSeqMaxOffset="128">
                                        <Sequence>'office:document-content'</Sequence>
                                    </SubSequence>
                                    <SubSequence Position="2" SubSeqMinOffset="0">
                                        <Sequence>'office:version=' [22 27] '1.0' [22 27]</Sequence>
                                    </SubSequence>
                                </ByteSequence>
                            </InternalSignature>
                        </InternalSignatureCollection>
                    </BinarySignatures>
                </File>
            </Files>
        </ContainerSignature>

The two places where the syntax is used are the [22 27] sets in the second SubSequence for content.xml. What does it denote? It's different from the use of square brackets in the regular syntax.

I believe the other examples of this are all Open Office signatures, and indeed they might all be Open Office Text.

Any insight appreciated. 

Cheers,

Ross

Lehane, Richard

Aug 24, 2014, 7:04:18 AM
to droid...@googlegroups.com
Hi Ross
I think this is an OR pattern. In this context, it matches either a double quote (", byte 0x22) or a single quote mark (', byte 0x27).
cheers
Richard
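
To illustrate that reading, here is a minimal Python sketch that translates the sequence from the signature above into a byte-level regular expression. The function name `sequence_to_regex` is hypothetical, and the sketch assumes only three kinds of token (quoted ASCII strings, bare hex bytes, and space-separated sets of single hex bytes in square brackets). It is not how DROID itself parses or matches signatures.

```python
import re

# One token: a 'quoted string', a [set of hex bytes], or a single hex byte.
TOKEN = re.compile(r"\s*(?:'([^']*)'|\[([0-9A-Fa-f ]+)\]|([0-9A-Fa-f]{2}))")


def sequence_to_regex(sequence):
    """Translate a small subset of the container Sequence syntax to a bytes regex."""
    text = sequence.rstrip()
    parts = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        literal, byte_set, hex_byte = m.groups()
        if literal is not None:
            # Quoted text is matched as its ASCII bytes.
            parts.append(re.escape(literal.encode("ascii")))
        elif byte_set is not None:
            # [22 27] becomes a character class holding each listed byte.
            members = bytes(int(h, 16) for h in byte_set.split())
            escaped = b"".join(re.escape(bytes([b])) for b in members)
            parts.append(b"[" + escaped + b"]")
        else:
            parts.append(re.escape(bytes([int(hex_byte, 16)])))
        pos = m.end()
    return re.compile(b"".join(parts), re.DOTALL)


pattern = sequence_to_regex("'office:version=' [22 27] '1.0' [22 27]")
print(bool(pattern.search(b'<office:document-content office:version="1.0">')))
print(bool(pattern.search(b"<office:document-content office:version='1.0'>")))
```

Because each set is matched independently in this sketch, a mixed pair such as `"1.0'` would also be accepted.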


From: droid...@googlegroups.com [droid...@googlegroups.com] on behalf of ross-spencer [all.along.the....@gmail.com]
Sent: Sunday, 24 August 2014 5:18 PM
To: droid...@googlegroups.com
Subject: Question about square brackets in DROID container identification syntax...

--
You received this message because you are subscribed to the Google Groups "droid-list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to droid-list+...@googlegroups.com.
To post to this group, send email to droid...@googlegroups.com.
Visit this group at http://groups.google.com/group/droid-list.
For more options, visit https://groups.google.com/d/optout.

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________


Ross Spencer

Aug 24, 2014, 7:39:56 AM
to droid...@googlegroups.com
Ah, yes, and XML allows either quote character to delimit a text literal, which is exactly what is being matched in this context! I should have checked my ASCII table first! :)

Thanks Richard.

Ross

Matt Palmer

Aug 25, 2014, 5:24:51 AM
to droid...@googlegroups.com
Hi Ross,

The square brackets just define a set of bytes, any one of which can match at that position. This works like a character class in regular expressions, and is the same in the binary signatures. You can define as many bytes to match as you like, not just two as in your example.

[09 0A 0D 20] matches whitespace: tab, lf, cr and space.

Regards

Matt
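
A byte set of this kind can be modelled very simply. The Python sketch below is only an illustration of the idea; `parse_byte_set` and `matches_at` are hypothetical names, not part of DROID. It parses a bracketed list of hex bytes into a set and tests whether the byte at a given offset belongs to it. The whitespace set used here is written with the bytes for tab (09), line feed (0A), carriage return (0D) and space (20).

```python
def parse_byte_set(text):
    """Parse a set such as '[09 0A 0D 20]' into a frozenset of byte values."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("expected a set in square brackets")
    # bytes.fromhex ignores the spaces between the hex pairs.
    return frozenset(bytes.fromhex(inner[1:-1]))


def matches_at(data, offset, byte_set):
    """Return True if the byte at `offset` in `data` is a member of `byte_set`."""
    return 0 <= offset < len(data) and data[offset] in byte_set


whitespace = parse_byte_set("[09 0A 0D 20]")
quotes = parse_byte_set("[22 27]")

print(matches_at(b"a\tb", 1, whitespace))
print(matches_at(b"version='1.0'", 8, quotes))
```

The set can hold any number of single-byte values; membership is all that is tested, so the order of the bytes inside the brackets does not matter in this sketch.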

Ross Spencer

Aug 26, 2014, 1:53:31 AM
to droid...@googlegroups.com
Hi Matt, 

Thanks for the additional information. Just to clarify, it's not the same as in the binary signatures, because this syntax shouldn't be available to them; at least, it wasn't documented or part of their original functionality. Have a look at the help text in my signature development utility, which uses the original PRONOM reference material: http://exponentialdecay.co.uk/sd/index.htm

--

In PRONOM, an internal signature is composed of one or more byte sequences, each comprising a continuous sequence of hexadecimal byte values and, optionally, regular expressions. A signature byte sequence is modelled by describing its starting position within a bitstream and its value. 

The starting position can be one of two basic types: 

Absolute: The byte sequence starts at a fixed position within the bitstream. The position is described as an offset from either the beginning or the end of the bitstream. The byte sequence can therefore be located by moving to the specified offset, counting from either the beginning of file or end of file position. If counting from the EOF position, the offset is to the final byte in the sequence. 

Variable: The byte sequence can start at any offset within the bitstream. The byte sequence can be located by examining the entire bitstream. 

The value of the byte sequence is defined as a sequence of hexadecimal values, optionally incorporating any of the following regular expressions: 

??: wildcard matching any pair of hexadecimal values (i.e. a single byte).

e.g.: 0x0A FF ?? FE would match 0x0A FF 6C FE or 0x0A FF 11 FE.

*: wildcard matching any number of bytes (0 or more).

e.g.: 0x0A FF * FE would match 0x0A FF 6C FE or 0x0A FF 6C 11 FE.

{n}: wildcard matching n bytes, where n is an integer.

e.g.: 0x1C 20 {2} 4E 12 would match 0x1C 20 FF 15 4E 12.

{m-n}: wildcard matching between m-n bytes inclusive, where m and n are integers or ‘*’.

e.g.: 0x03 {1-2} 4D would match 0x03 3C 4D or 0x03 3C 88 4D. 

e.g.: 0x03 {2-*} 4D would match 0x03 3C 88 4D or 0x03 3C 88 3F 4D.

(a|b): wildcard matching one from a list of values (e.g. a or b), where each value is a hexadecimal byte sequence of arbitrary length containing no wildcards.

e.g.: 0x0E (FF|FE) 17 would match 0x0E FF 17 or 0x0E FE 17.

[a:b]: wildcard matching any sequence of bytes which lies lexicographically between a and b, inclusive (where both a and b are byte sequences of the same length, containing no wildcards, and where a is less than b). The endian-ness of a and b are the same as the endian-ness of the signature as a whole.

e.g. 0xFF [09:0B] FF would match 0xFF 09 FF, 0xFF 0A FF or 0xFF 0B FF.

[!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte sequence containing no wildcards).

e.g. 0xFF [!09] FF would match 0xFF 0A FF, but not 0xFF 09 FF.

[Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID, page 9 of 33]

[!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically between a and b, inclusive (where a and b are both byte sequences of the same length, containing no wildcards, and where a is less than b).

e.g. 0xFF [!01:02] FF would match 0xFF 00 FF and 0xFF 03 FF, but not 0xFF 01 FF or 0xFF 02 FF.

Note: In the examples above, spaces are included between byte values for reasons of clarity, but are omitted in actual byte sequence values. The signature is processed left-to-right if the signature is measured relative to BOF and right-to-left if measured relative to EOF. The endian-ness of the signature is only relevant for sequences inside square brackets. A byte sequence must contain a fixed subsequence of at least one byte between each occurrence of ‘*’, or between the beginning or end of the sequence and an occurrence of ‘*’. Thus, sequences of the following form are not permitted: 

[BOF] (a|b)*… 

…*(a|b) [EOF] 

…*(a|b)*... 

--
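
The examples in the reference text above can be checked with a small Python sketch that maps each documented construct onto a byte-level regular expression. `pronom_to_regex` and `check` are hypothetical names, signatures are written without the leading 0x, and ranges in square brackets are limited to single bytes here. It is an approximation for experimenting with the syntax, not DROID's matching engine.

```python
import re

TOKENS = re.compile(
    r"(?P<any>\?\?)"                                   # ??   one byte
    r"|(?P<star>\*)"                                   # *    zero or more bytes
    r"|\{(?P<lo>\d+)(?:-(?P<hi>\d+|\*))?\}"            # {n}, {m-n}, {m-*}
    r"|\((?P<alts>[0-9A-Fa-f|]+)\)"                    # (a|b)
    r"|\[(?P<neg>!)?(?P<a>[0-9A-Fa-f]+)(?::(?P<b>[0-9A-Fa-f]+))?\]"
    r"|(?P<hex>(?:[0-9A-Fa-f]{2})+)"                   # literal hex bytes
)


def _esc(hex_text):
    """Turn hex digits such as '0AFF' into the regex escapes \\x0a\\xff."""
    return "".join(f"\\x{b:02x}" for b in bytes.fromhex(hex_text))


def pronom_to_regex(signature):
    text = "".join(signature.split())  # spaces are only there for readability
    out, pos = [], 0
    while pos < len(text):
        m = TOKENS.match(text, pos)
        if m is None:
            raise ValueError(f"unsupported syntax: {text[pos:]!r}")
        g = m.groupdict()
        if g["any"]:
            out.append(".")
        elif g["star"]:
            out.append(".*")
        elif g["lo"] is not None:
            lo, hi = g["lo"], g["hi"]
            if hi is None:
                out.append(f".{{{lo}}}")
            elif hi == "*":
                out.append(f".{{{lo},}}")
            else:
                out.append(f".{{{lo},{hi}}}")
        elif g["alts"]:
            out.append("(?:" + "|".join(_esc(a) for a in g["alts"].split("|")) + ")")
        elif g["a"]:
            a, b, neg = g["a"], g["b"], g["neg"]
            if b is None and neg:
                # [!a]: any sequence of the same length other than a itself.
                out.append(f"(?!{_esc(a)}).{{{len(a) // 2}}}")
            elif b is not None and len(a) == 2 and len(b) == 2:
                out.append(f"[{'^' if neg else ''}{_esc(a)}-{_esc(b)}]")
            else:
                raise ValueError("only single-byte ranges are sketched here")
        else:
            out.append(_esc(g["hex"]))
        pos = m.end()
    return re.compile("".join(out).encode("ascii"), re.DOTALL)


def check(signature, hex_data):
    """True if the whole of hex_data matches the signature."""
    return pronom_to_regex(signature).fullmatch(bytes.fromhex(hex_data)) is not None


print(check("0E (FF|FE) 17", "0E FE 17"))
print(check("FF [!01:02] FF", "FF 01 FF"))
```

In this sketch a bare set such as [22 27] is rejected, which mirrors the point that the documented binary syntax only describes ranges and negations inside square brackets.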

Given that it exists for container signatures, it seems like it might be a useful tool. Can it also work on collections of multiple bytes?

For example, would [AABB CCDD EEFF] match 0xAABB or 0xCCDD or 0xEEFF?

Many thanks,

Ross




Matt Palmer

Aug 26, 2014, 4:08:17 AM
to droid...@googlegroups.com

Hi Ross,

Yes, you're right. I had forgotten that the square brackets in binary signatures are limited to ranges of values, not an arbitrary set of values.

It should be easy enough to add one day, since the underlying technology is already there and exposed for container signatures.

Cheers

Matt
