ANNOUNCE: droidSig - a utility to create DROID signature XML from signature regular expressions.

Matt Palmer

Jul 25, 2015, 4:26:01 PM
to droid-list
Hi,

I would like to announce droidSig - simple Java code which allows you to directly create DROID binary signature XML definitions from the DROID signature regular expression syntax.

Usage
Just run droidSig with the expression, and it will output the ByteSequence XML to standard out. You can also specify, as the first parameter, whether the sequence is anchored to the BOF or the EOF, or is a wildcard search from the BOF (VAR). The default is BOF, so you don't have to specify that if that's what you want.

Example

droidSig "{0-1024}3C21(444F4354595045|646F6374797065)20(48544D4C|68746D6C)20(5055424C4943|7075626C6963)20222D2F2F{1-16}2F2F(445444|647464)20{0-64}(48544D4C|68746D6C)20332E32"
<!--
{0-1024}3C21(444F4354595045|646F6374797065)20(48544D4C|68746D6C)20(5055424C4943|7075626C6963)20222D2F2F{1-16}2F2F(445444|647464)20{0-64}(48544D4C|68746D6C)20332E32
-->
<ByteSequence Reference="BOFoffset">
    <SubSequence Position="1" SubSeqMaxOffset="1024" SubSeqMinOffset="0">
        <Sequence>20222D2F2F</Sequence>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="1">5055424C4943</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="1">7075626C6963</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="2">20</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="3">48544D4C</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="3">68746D6C</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="4">20</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="5">444F4354595045</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="5">646F6374797065</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="6">3C21</LeftFragment>
        <RightFragment MaxOffset="16" MinOffset="1" Position="1">2F2F</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="2">445444</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="2">647464</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="3">20</RightFragment>
        <RightFragment MaxOffset="64" MinOffset="0" Position="4">48544D4C</RightFragment>
        <RightFragment MaxOffset="64" MinOffset="0" Position="4">68746D6C</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="5">20332E32</RightFragment>
    </SubSequence>
</ByteSequence>
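
One way to see how the expression maps onto the XML is to go the other way and rebuild the expression from the ByteSequence output. The following is only a rough Python sketch, not part of droidSig: the function names are made up for illustration, it handles a single BOF-anchored SubSequence, and it assumes that fragments sharing a Position are alternatives and that a fragment's MinOffset/MaxOffset describe the gap between it and its neighbour on the side of the main Sequence.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The ByteSequence XML from the example above, copied verbatim.
EXAMPLE_XML = """<ByteSequence Reference="BOFoffset">
    <SubSequence Position="1" SubSeqMaxOffset="1024" SubSeqMinOffset="0">
        <Sequence>20222D2F2F</Sequence>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="1">5055424C4943</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="1">7075626C6963</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="2">20</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="3">48544D4C</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="3">68746D6C</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="4">20</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="5">444F4354595045</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="5">646F6374797065</LeftFragment>
        <LeftFragment MaxOffset="0" MinOffset="0" Position="6">3C21</LeftFragment>
        <RightFragment MaxOffset="16" MinOffset="1" Position="1">2F2F</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="2">445444</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="2">647464</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="3">20</RightFragment>
        <RightFragment MaxOffset="64" MinOffset="0" Position="4">48544D4C</RightFragment>
        <RightFragment MaxOffset="64" MinOffset="0" Position="4">68746D6C</RightFragment>
        <RightFragment MaxOffset="0" MinOffset="0" Position="5">20332E32</RightFragment>
    </SubSequence>
</ByteSequence>"""


def gap(element, min_attr="MinOffset", max_attr="MaxOffset"):
    """Render an offset pair as {n}, {n-m}, or nothing when both are 0."""
    low, high = int(element.get(min_attr)), int(element.get(max_attr))
    if low == high:
        return "" if low == 0 else "{%d}" % low
    return "{%d-%d}" % (low, high)


def group_by_position(fragments):
    """Assumption: fragments with the same Position are alternatives."""
    grouped = defaultdict(list)
    for fragment in fragments:
        grouped[int(fragment.get("Position"))].append(fragment)
    return grouped


def render(fragments):
    """Return (gap, bytes) for one Position; alternatives become (a|b)."""
    values = [fragment.text for fragment in fragments]
    body = values[0] if len(values) == 1 else "(" + "|".join(values) + ")"
    return gap(fragments[0]), body


def rebuild_expression(xml_text):
    sub = ET.fromstring(xml_text).find("SubSequence")
    parts = [gap(sub, "SubSeqMinOffset", "SubSeqMaxOffset")]
    left = group_by_position(sub.findall("LeftFragment"))
    for position in sorted(left, reverse=True):  # highest Position is leftmost
        fragment_gap, body = render(left[position])
        parts += [body, fragment_gap]            # gap sits to the fragment's right
    parts.append(sub.findtext("Sequence"))
    right = group_by_position(sub.findall("RightFragment"))
    for position in sorted(right):               # Position 1 is next to Sequence
        fragment_gap, body = render(right[position])
        parts += [fragment_gap, body]            # gap sits to the fragment's left
    return "".join(parts)


print(rebuild_expression(EXAMPLE_XML))
```

For the example above this prints the same expression that was passed to droidSig. Since every LeftFragment offset in that example is 0, the placement of left-hand gaps is not exercised by it and remains an assumption.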

The Code
The code is available from my GitHub byteseek repository:

https://github.com/nishihatapalmer/byteseek/blob/master/src/main/java/net/byteseek/utils/droidSig.java

Status
It's beta code right now - it seems to work OK, but I have not exhaustively tested it yet. 

Disclaimer
I don't work at the National Archives (TNA) any more, but I was involved in the development of DROID when I worked there. TNA has no responsibility for this code, so don't ask for help in this group if there are problems with it (but see below for contact details).

Finally
Do let me know on my gmail.com account (mattpalms) if you have any problems, or if you just found it useful!  Alternatively, report issues on the byteseek GitHub repository and I'll have a look at them.

Regards,

Matt.




Matt Palmer

Jul 25, 2015, 5:57:28 PM
to droid...@googlegroups.com
Validation

droidSig does not validate that the syntax of the DROID regular expression is correct - it merely gives you well-formed DROID signature XML.

Getting an XML output only means the expression could be split up correctly - it does not guarantee that DROID can process the signature, only that it can read the XML.
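
Because of this, it can be worth running a quick sanity check on an expression before handing it over. The following Python sketch is not part of droidSig and is not a full grammar for the signature syntax: it only recognises the constructs that appear in the example in this thread (hex byte pairs, (a|b) alternatives, and {n} or {n-m} gaps), and the name looks_well_formed is made up for illustration. Anything outside that subset is rejected, even where it may be valid signature syntax.

```python
import re

# One or more hex byte pairs, e.g. 3C21.
HEX = r"(?:[0-9A-Fa-f]{2})+"

# A token is a gap, a group of alternatives, or a run of hex bytes.
TOKEN = re.compile(
    rf"(?P<gap>\{{(?P<low>\d+)(?:-(?P<high>\d+))?\}})"  # {n} or {n-m}
    rf"|\({HEX}(?:\|{HEX})+\)"                           # (AA|BB|...)
    rf"|{HEX}"                                           # plain bytes
)


def looks_well_formed(expression):
    """Return True if the whole expression is made of known tokens."""
    position = 0
    while position < len(expression):
        match = TOKEN.match(expression, position)
        if match is None:
            return False  # unknown construct, stray bracket, or odd hex digit
        if match.group("gap") and match.group("high") is not None:
            if int(match.group("low")) > int(match.group("high")):
                return False  # a gap like {16-1} has min greater than max
        position = match.end()
    return position > 0  # an empty expression is not useful


print(looks_well_formed("{0-1024}3C21(444F4354595045|646F6374797065)20"))
print(looks_well_formed("0-1024}3C21"))  # opening brace missing
```

The first call reports True and the second False. Passing this check says nothing about whether DROID can actually process the resulting signature.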




--
You received this message because you are subscribed to a topic in the Google Groups "droid-list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/droid-list/aRHb0vx8V98/unsubscribe.
To unsubscribe from this group and all its topics, send an email to droid-list+...@googlegroups.com.
To post to this group, send email to droid...@googlegroups.com.
Visit this group at http://groups.google.com/group/droid-list.
For more options, visit https://groups.google.com/d/optout.

Matt Palmer

Jul 26, 2015, 9:25:15 AM
to droid-list, matt...@gmail.com
UPDATE: I've moved the code to a sub-module of byteseek, as it's not really related to byteseek directly.

https://github.com/nishihatapalmer/byteseek/tree/master/droidSig

I'm also going to make it easier to run, as I had reports it didn't build for one person.  I'll probably release a jar and scripts to run it directly as a command line utility.  Most people probably don't want to compile it from scratch...

Regards,

Matt

Shaul Zevin

Jul 28, 2015, 11:00:41 AM
to droid...@googlegroups.com
Hi Matt,

I am the author of another format identification tool - Falstaff.

Falstaff has a module to compute digital signatures (regular expressions) from format samples. It uses sequence aligning and clustering algorithms for the computation.

I wonder if you would be interested in integrating the signature computation module into your utility?

Thanks,
    Shaul Zevin


Matt Palmer

Jul 28, 2015, 11:47:00 AM
to droid...@googlegroups.com

Hi Shaul,

Falstaff looks really interesting. Machine learning is a bit over my head, but I can grok how sequence alignment and clustering could help produce regular expressions automatically from samples.

The problem my utility is trying to solve is that DROID currently can't process the regexes directly. It relies on the PRONOM service to transform them into the XML it wants.

So the utility takes expressions (themselves in a PRONOM-specific regular expression syntax) and gives you the XML that DROID can consume. It's fairly simple really, especially compared to Falstaff.

It might make more sense to generate PRONOM expressions in Falstaff, and use my utility to output the DROID XML from them...

I have a plan to integrate droidSig directly into DROID, letting anyone specify format signatures themselves without needing to go through PRONOM or to use droidSig and paste manually into the signature files...

Please do feel free to make use of droidSig yourself. It is under a BSD license.

Cheers

Matt


Shaul Zevin

Jul 28, 2015, 12:26:42 PM
to droid...@googlegroups.com
Hi Matt,

Thanks for the clarification.

Falstaff uses signatures in Java regular expression syntax. In fact, it can also do simple operations on the regular expressions, like finding superset relations between two regular expressions.

Falstaff imports PRONOM syntax signatures and converts them into regular expressions in Java syntax. I do not think it would be too difficult to do the opposite conversion as well, and then use your utility to create DROID XML.
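
To illustrate the forward direction of such a conversion, here is a rough sketch that turns a PRONOM-style expression into a byte-level regular expression. It is not Falstaff's code: it uses Python's re module rather than Java regex syntax, the function name pronom_to_pattern is made up, it does no validation, and it only covers hex bytes, (a|b) alternatives, and {n} or {n-m} gaps. The sample bytes are an HTML 3.2 doctype chosen to fit the expression from earlier in this thread.

```python
import re


def pronom_to_pattern(expression):
    """Translate a small subset of the signature syntax into a bytes regex."""
    parts = []
    index = 0
    while index < len(expression):
        char = expression[index]
        if char == "{":
            # {n} or {n-m}: that many arbitrary bytes.
            end = expression.index("}", index)
            low, _, high = expression[index + 1:end].partition("-")
            parts.append(".{%s,%s}" % (low, high or low))
            index = end + 1
        elif char == "(":
            parts.append("(?:")  # non-capturing group for alternatives
            index += 1
        elif char in "|)":
            parts.append(char)
            index += 1
        else:
            # Two hex digits become one escaped byte, e.g. 3C -> \x3C.
            parts.append("\\x" + expression[index:index + 2])
            index += 2
    # DOTALL so that a gap can span any byte value, including newlines.
    return re.compile("".join(parts).encode("ascii"), re.DOTALL)


EXPRESSION = (
    "{0-1024}3C21(444F4354595045|646F6374797065)20(48544D4C|68746D6C)20"
    "(5055424C4943|7075626C6963)20222D2F2F{1-16}2F2F(445444|647464)20"
    "{0-64}(48544D4C|68746D6C)20332E32"
)

pattern = pronom_to_pattern(EXPRESSION)
sample = b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n<html>'

# match() anchors at the start of the data, which stands in for BOF here.
print(pattern.match(sample) is not None)
print(pattern.match(b"<!DOCTYPE html>") is not None)
```

The first check prints True and the second False. The opposite conversion, from a general Java regular expression back to PRONOM syntax, would need to reject or approximate anything the signature syntax cannot express, which this sketch does not attempt.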

Do you think the DROID people would be interested in automatic signature generation? If so, who should I talk to?

Cheers,
  Shaul

Matt Palmer

Jul 29, 2015, 4:25:14 PM
to droid-list, shaul...@gmail.com
Hi Shaul,

I'm not sure exactly who handles signature development at the archives now. Your best bet is just to ask to talk to the digital preservation team at the National Archives.

cheers,

Matt

Matt Palmer

Jul 29, 2015, 4:27:52 PM
to droid-list, shaul...@gmail.com
Superset relations between regular expressions sounds interesting.  I've been doing some stuff with regular expression NFAs and so on, and thinking about some multi-pattern optimisations I could make...


cheers,

Matt
