Pronom in Droid


Sobhan Jahangir

Sep 16, 2024, 3:31:58 AM
to droid-list
Hey! I need some help understanding how Droid works with Pronom.

These are the questions I have:

What's the difference between InternalSignatureCollection and FileFormatCollection?

What's the difference between BOF and EOF in InternalSignature?

What do the bytes in Shift and Sequence stand for? Is it the hex code? I see that the numbers in Sequence match the numbers in Shift; the only ones that don't match are the first two numbers in Sequence.

Do you have a link to the code you use to get the PUID out of PRONOM to use in Droid?

Last but not least, is it possible to change a file's format through Droid?

ross-spencer

Sep 18, 2024, 7:50:17 AM
to droid-list
A lot of good questions.


https://ffdev.info is also an implementation of the signature output process and summarizes parts of that information. It is open source and you will find the links to GitHub there.

Most of the utility was written using the original PRONOM specifications that you can find here under documentation: https://www.nationalarchives.gov.uk/aboutapps/pronom/default.htm#documentation

Do you have a clear use case/goal that might help folks guide you more helpfully?

Best,
Ross

Sobhan Jahangir

Sep 18, 2024, 7:56:16 AM
to droid-list
Thanks for the good answer! I want to use Droid's signature files to create a more efficient "Droid" application, written in C#. My main question is whether it is possible to reduce the number of bytes inside InternalSignature. I see now that I did not specifically mention the Droid signature file, but my question is based on those XML files.

Here is the link:  https://www.nationalarchives.gov.uk/aboutapps/pronom/droid-signature-files.htm

Francesca Mackenzie

Sep 19, 2024, 4:07:41 AM
to droid-list
Hello,

Thank you, these are all really interesting questions.

The InternalSignatureCollection is where we store the data that does the internal identification: the internal signatures themselves. However, not all files can be identified by an internal signature; some are identified by extension. The InternalSignatureCollection is linked to the FileFormatCollection, which is where every file format entry in PRONOM is stored. Each file format entry on the actual website also holds additional data that doesn't make it into the XML.
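
As a rough illustration of the link between the two collections, here is a minimal Python sketch (Python for brevity; the same idea carries over to C#). The XML fragment is hand-written for this example and is not copied from a real signature file. The attribute names `ID` and `PUID`, the `InternalSignatureID` child element, and the `example/…` identifiers are all assumptions that should be checked against an actual signature file. Real files may also declare an XML namespace, in which case the tag names below would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative fragment. NOT copied from a real signature file;
# element/attribute names other than those mentioned in the thread are assumed.
SAMPLE = """
<FFSignatureFile>
  <InternalSignatureCollection>
    <InternalSignature ID="1">
      <ByteSequence>
        <SubSequence>
          <Sequence>25504446</Sequence>
        </SubSequence>
      </ByteSequence>
    </InternalSignature>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="10" Name="Format with a signature" PUID="example/1">
      <InternalSignatureID>1</InternalSignatureID>
      <Extension>ex1</Extension>
    </FileFormat>
    <FileFormat ID="11" Name="Format known by extension only" PUID="example/2">
      <Extension>ex2</Extension>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
"""


def puids_by_signature(xml_text):
    """Map each internal signature ID to the PUIDs of the formats that reference it."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for fmt in root.iter("FileFormat"):
        for ref in fmt.findall("InternalSignatureID"):
            mapping.setdefault(ref.text.strip(), []).append(fmt.get("PUID"))
    return mapping


def extension_only_puids(xml_text):
    """List PUIDs of formats that reference no internal signature at all."""
    root = ET.fromstring(xml_text)
    return [
        fmt.get("PUID")
        for fmt in root.iter("FileFormat")
        if fmt.find("InternalSignatureID") is None
    ]
```

In this sketch, a hit on signature `1` resolves to `example/1`, while `example/2` can only ever be reached through its extension.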

BOF and EOF stand for beginning of file and end of file. A BOF sequence therefore searches for a set of bytes from the beginning of the file, while an EOF sequence starts looking from the end of the file.
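
To make the BOF/EOF distinction concrete, here is a small Python sketch of an anchored comparison. It assumes the Sequence text is plain hexadecimal with no wildcard or other special syntax, and that an EOF offset counts bytes back from the end of the file; both are simplifying assumptions, and the function name and its `anchor`/`offset` parameters are hypothetical.

```python
def matches_at_anchor(data: bytes, sequence_hex: str, anchor: str, offset: int = 0) -> bool:
    """Check whether the hex-encoded sequence sits at a fixed offset from BOF or EOF.

    Assumes sequence_hex is plain hexadecimal (e.g. "25504446" for b"%PDF").
    """
    pattern = bytes.fromhex(sequence_hex)
    if anchor == "BOF":
        # Count forwards from the first byte of the file.
        start = offset
    elif anchor == "EOF":
        # Count backwards from the last byte of the file (assumed convention).
        start = len(data) - offset - len(pattern)
    else:
        raise ValueError("anchor must be 'BOF' or 'EOF'")
    if start < 0 or start + len(pattern) > len(data):
        return False  # the file is too short for the sequence to fit here
    return data[start:start + len(pattern)] == pattern


sample = b"%PDF-1.4\n...\n%%EOF"
print(matches_at_anchor(sample, "25504446", "BOF"))    # b"%PDF" at the start
print(matches_at_anchor(sample, "2525454F46", "EOF"))  # b"%%EOF" at the end
```

A signature that allows a range of offsets would need a bounded search instead of a single comparison.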

The best document for understanding the PRONOM XML can be found here: User Requirements (nationalarchives.gov.uk). The internal signature XML is generated automatically by our systems, but it is important to note that the container signature XML is handwritten and created separately.

All of DROID's code is open source and available on GitHub: digital-preservation/droid: DROID (Digital Record and Object Identification) (github.com). There are also other tools besides DROID that use PRONOM's data, and you may want to look at their code too, such as richardlehane/siegfried: signature-based file format identification (github.com). There are a few others as well, some open source and some proprietary.

DROID does not have any conversion features.

In terms of your overall question, I think it is possible to delete some of the bytes within the internal signature. Part of it depends on your own code, though: if you decide some information isn't important, you don't need to use it! I think you could probably remove the shift bytes, though we haven't tested this theory enough to make it a recommendation. I would not remove the fragment tags. However, I would say that you may want to do your own testing when making these decisions. See what you can take out and how it affects the DROID reports and the quality of the answers. I would point you towards the skeleton suite, which is helpful when testing whether signatures still work. exponential-decay/pronom-archive-and-skeleton-test-suite: Release repository for The Skeleton Test Suite. Contains an Archive of PRONOM, and skeleton files for testing DROID from The National Archives, UK. (github.com)

Kind regards,
Francesca



Sobhan Jahangir

Sep 19, 2024, 4:19:01 AM
to droid-list
This answered my questions, thanks. I will do my own research and find out more. But this is a perfect start!

Have a nice day!

Best of regards,
Sobhan J. 

Sobhan Jahangir

Sep 24, 2024, 9:11:59 AM
to droid-list
Francesca - Hey again! I have done some research and have more questions regarding the signature files.

1. If I understand correctly, <DefaultShift/> is there so that if the sequence isn't found at the start, the search jumps 5 bytes before looking again?

2. The first byte in the sequence doesn't have a shift because it is the start of the sequence, right?

3. And last but not least, does the search restart every time a byte is found? I am asking because I am not completely sure what <Shift> does.

Thanks!

Matt Palmer

Sep 25, 2024, 3:35:55 AM
to droid-list
Hi,

The Shift and DefaultShift elements are no longer used by DROID. They were instructions on how to perform what is known as a Boyer-Moore-Horspool search.

This type of search is sublinear, as it does not have to look at every position to determine whether there is a match.

DROID now calculates all its search metadata itself (and doesn't use the same search algorithm it originally did).

So these elements can be ignored. You will have to use some kind of fast search algorithm of your own if you are reimplementing DROID-like scanning.
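
For readers unfamiliar with the idea behind per-byte shifts, here is a minimal textbook Boyer-Moore-Horspool sketch in Python. It is not DROID's code, and the values it computes are not guaranteed to match the Shift and DefaultShift values in the signature files, since the original implementation is described in this thread as not completely standard. The function names are hypothetical.

```python
def build_shift_table(pattern: bytes) -> dict:
    """Per-byte shifts for textbook Horspool: distance from the rightmost
    occurrence of each byte (excluding the final position) to the pattern end."""
    m = len(pattern)
    # Later (more rightward) occurrences overwrite earlier ones, giving the
    # smallest, and therefore safe, shift for each byte value.
    return {byte: m - 1 - i for i, byte in enumerate(pattern[:-1])}


def horspool_find(data: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in data, or -1."""
    m, n = len(pattern), len(data)
    if m == 0:
        return 0
    shifts = build_shift_table(pattern)
    pos = 0
    while pos + m <= n:
        if data[pos:pos + m] == pattern:
            return pos
        # Look only at the byte under the last pattern position to decide how
        # far to jump; bytes absent from the table get the default shift m.
        pos += shifts.get(data[pos + m - 1], m)
    return -1


print(horspool_find(b"xxabcabdxx", b"abd"))  # 5
```

The default shift of `m` for bytes that never occur in the pattern is what lets the search skip positions entirely, which is where the sublinear behaviour comes from.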

Regards,

Matt

Sobhan Jahangir

Sep 25, 2024, 3:42:39 AM
to droid-list
Hey! Thanks, but if I were to use the XML files, how would I do it then? And what do you guys do now?

I have heard about Typesense. Is it possible to implement it with that?

Matt Palmer

Sep 25, 2024, 4:56:46 AM
to droid-list
You just ignore the shift and default shift elements in the XML - they are not really useful anymore, being based on an old implementation (that wasn't implemented in a completely standard way at the time).

You cannot use things like Typesense - what you need here is what is called an "online search". There are two basic kinds of searching: online and offline.
  • Offline searching processes the text / file to be searched first and builds an index for things you might be interested in.  This is a good approach if you want to support multiple queries on the same data repeatedly over a long period of time.  This is like Google.
  • Online searching processes text/files to be searched in real time, where you haven't seen the thing to be searched before.  This is the approach that should be taken if you want to scan arbitrary data for specific signatures, and is the approach DROID takes.  This is useful for things like anti-virus or, indeed, file type identification use cases.
You should look for online search algorithms. Reasonable and well understood ones are Boyer-Moore-Horspool (what original DROID used), or the Hash/Wu-Manber variant that processes the data in larger chunks. The latter is essentially what DROID uses now.
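
To illustrate what "processing the data in larger chunks" can mean, here is a minimal single-pattern sketch in Python of a Horspool-style search that hashes the last two bytes of each window (a q-gram with q = 2) instead of looking at one byte. It follows the general Wu-Manber idea only; it is not DROID's implementation, and the constants `Q` and `TABLE_BITS`, the hash function, and the function names are arbitrary choices made for this example.

```python
Q = 2                      # bytes hashed per step (assumed value for the sketch)
TABLE_BITS = 12
TABLE_SIZE = 1 << TABLE_BITS


def _hash(block: bytes) -> int:
    """Cheap hash of a q-gram into the shift table (arbitrary choice)."""
    h = 0
    for b in block:
        h = (h * 31 + b) & (TABLE_SIZE - 1)
    return h


def build_qgram_shifts(pattern: bytes) -> list:
    m = len(pattern)
    if m < Q:
        raise ValueError("pattern must be at least Q bytes long")
    # A q-gram that never occurs in the pattern allows the largest jump.
    shifts = [m - Q + 1] * TABLE_SIZE
    for i in range(m - Q + 1):
        h = _hash(pattern[i:i + Q])
        # Distance needed to line this q-gram up with the end of the window;
        # keep the minimum so hash collisions can never cause a missed match.
        shifts[h] = min(shifts[h], m - Q - i)
    return shifts


def qgram_find(data: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in data, or -1."""
    m, n = len(pattern), len(data)
    shifts = build_qgram_shifts(pattern)
    end = m  # exclusive end of the current window
    while end <= n:
        shift = shifts[_hash(data[end - Q:end])]
        if shift == 0:
            # The window's last q-gram hashes like the pattern's last q-gram:
            # verify the whole window, then move on by one byte.
            if data[end - m:end] == pattern:
                return end - m
            shift = 1
        end += shift
    return -1
```

Hashing several bytes at once makes it less likely that a window looks like a partial match by accident, so long jumps happen more often than with single-byte shifts.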

Regards,

Matt.

Matt Palmer

Sep 25, 2024, 5:27:30 AM
to droid-list
If you are looking for a very fast online search algorithm, I can recommend using HashChain:


I am the author of this search algorithm - it was published at the Symposium on Experimental Algorithms in Vienna this year, and is generally the fastest sublinear search algorithm around (except for very short patterns, e.g. up to 6 bytes in length or so). You would have to implement it yourself, as there's only a C implementation available currently. Failing that, look for a C# library that implements fast byte searching (not sure what's out there).

Regards,

Matt

ross-spencer

Sep 27, 2024, 8:21:28 AM
to droid-list
Sobhan,

You might be interested in this blog post, which I happened to be working on this week, about making more use of the simplified signature file format: https://exponentialdecay.co.uk/blog/making-droid-work-with-wikidata/

A next step for someone might be to extract the sequences from PRONOM instead and test those against different corpora, including the Skeleton suite, to see if it's all working as expected.

The approach may be useful for you - however - if you're working with a blank slate I might also recommend taking the opportunity to think about a different serialization of signature file using the PRONOM XML files.

Best,
Ross

PS. congratulations Matt! Glad to see your work recognized! :) 

Martin Zvonek

Mar 19, 2025, 7:53:20 AM
to droid-list
Hello everyone.
I would like to follow up on the previous question. Within my organization I have to verify files and get a PUID. The problem is that I am working in C# and we have to verify offline.
My question is: is it enough for me to verify the signatures (Sequence) (see Droid_SignatureFile...xml) at EOF / BOF or within the file? Of course including the offset and RightFragment / LeftFragment. Thank you for your help. Unfortunately I cannot use the Droid / siegfried applications.

Thank you for your help,
Martin Zvonek

On Friday, September 27, 2024 at 14:21:28 UTC+2, ross-spencer wrote:

Ross Spencer

Mar 19, 2025, 9:11:42 AM
to droid...@googlegroups.com
Hi Martin,

It isn't an uncommon pattern to do this. Whether the PRONOM signatures are the best for doing this is up to you based on your testing. You can verify how well they work in your context against the skeleton files in Richard Lehane's builder archives: https://github.com/richardlehane/builder/releases/tag/v120a (more on skeleton files: https://www.ijdc.net/index.php/ijdc/article/view/8.1.120). You might also want to look up other encodings such as FIDO's regular expression patterns which are created from PRONOM expressions to see if they work better in your case.

Other non-archival projects sometimes build collections of magic numbers specifically for their purposes, but if PUIDs are important you will of course want to use PRONOM. 

DROID XML isn't the best format to reverse engineer for your purposes. In the PRONOM XML schema you will find the signature described more plainly under the <InternalSignature> section of the XML tree, e.g. here for fmt/2007: https://www.nationalarchives.gov.uk/PRONOM/fmt/2007.xml

Those XML files are also available in Richard's builder output.

Hope that helps,
All the best,
Ross

