FIDO vs. Siegfried

Lauren Finkel

Jun 4, 2015, 12:35:09 PM
to archiv...@googlegroups.com
Are there any significant differences between the two file format ID tools? I've looked at both websites, but wasn't able to glean much information about how they are similar/different from each other.

I was processing a SIP this morning (using 1.4) and noticed that when I used FIDO I kept getting these same errors:



with this showing up when I clicked the Gear for more info.



I ran the same SIP through again using Siegfried and there was no problem. Even with the FIDO error, I was still able to complete normalization for preservation and access.


Lehane, Richard

Jun 4, 2015, 4:57:02 PM
to archiv...@googlegroups.com
Hi Lauren
[disclaimer - I'm the siegfried developer, so may not be 100% impartial here!]
First off - thanks for trying siegfried!

I think overall most archivematica users will notice little difference between these tools - this is because when running under archivematica both tools are just implementing the FPR interface & they should 99% of the time return pretty much the same results. Both tools do run outside of archivematica as well and you'll notice most differences outside of archivematica: they have different outputs and offer different functionality from the command line.

Differences you might notice running under archivematica are:

FIDO will be faster. This is primarily because, as a python tool, you get a more seamless integration with archivematica (siegfried is written in a different language and at this stage runs as a system call for each file - meaning the executable needs to start up over and over again). FIDO also applies buffer limits when matching and this can also mean a speed-up over siegfried, especially when dealing with very large files. (If speed is important, and you still want to use siegfried, you can use the standalone roy tool that comes with siegfried to make modifications such as buffer limits to your signature file).

Siegfried may sometimes be more accurate. Because siegfried defaults to full file scans (rather than scanning a limited buffer), it may find signatures later in files that are sometimes missed e.g. PDF/A signatures. Siegfried, I think, uses later versions of the PRONOM signature files so may also be more up to date. Siegfried also takes a different approach to container matching (more like the DROID approach) and I've found in the past can be more accurate for container formats like doc/x (though this is something that is being changed in FIDO in recent releases thanks to work of archivematica devs). Finally siegfried applies different rules when deciding whether to report a match: e.g. if the extension matches but the byte signature doesn't, siegfried won't report it (but will give you a descriptive warning message) whereas FIDO will (neither is right or wrong here - they just take a different approach).
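
To make the buffer-limit trade-off concrete, here is a minimal Python sketch. It is not how either tool is implemented: `find_signature`, its `bof_limit` parameter, the toy file, and the `pdfaid:part` marker are all illustrative assumptions, chosen only to show how a marker that sits far from the start of a file can be missed when scanning stops early.

```python
from typing import Optional


def find_signature(data: bytes, signature: bytes,
                   bof_limit: Optional[int] = None) -> int:
    """Return the offset of `signature` in `data`, or -1 if it is not found.

    If `bof_limit` is given, only the first `bof_limit` bytes are searched
    (a stand-in for a buffer limit measured from the beginning of the file).
    """
    window = data if bof_limit is None else data[:bof_limit]
    return window.find(signature)


# A toy "file": a short header, a lot of padding, then a marker near the end.
# The marker is illustrative only; it is not taken from a PRONOM signature.
marker = b"pdfaid:part"
toy_file = b"%PDF-1.4\n" + b"\x00" * 100_000 + marker

# Full scan: the marker is found after the 9-byte header and the padding.
full_scan_offset = find_signature(toy_file, marker)

# Limited scan: the marker lies beyond the first 65,536 bytes, so it is missed.
limited_offset = find_signature(toy_file, marker, bof_limit=65_536)

print(full_scan_offset, limited_offset)
```

The limited scan touches far fewer bytes, which is where the speed-up comes from, at the cost of missing anything beyond the limit.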

I'd say at this stage, unless you are happy for things to run a bit slower, stick with FIDO but if you notice any oddities in your identifications it may be worth checking in with siegfried for a second opinion.

Again, I'm biased, so please read with a grain of salt! I'd love to have others chime in on this too,
cheers
Richard



From: archiv...@googlegroups.com [archiv...@googlegroups.com] on behalf of Lauren Finkel [lauren...@gmail.com]
Sent: Friday, 5 June 2015 2:35 AM
To: archiv...@googlegroups.com
Subject: [archivematica] FIDO vs. Siegfried

--
You received this message because you are subscribed to the Google Groups "archivematica" group.
To unsubscribe from this group and stop receiving emails from it, send an email to archivematic...@googlegroups.com.
To post to this group, send email to archiv...@googlegroups.com.
Visit this group at http://groups.google.com/group/archivematica.
For more options, visit https://groups.google.com/d/optout.

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________


Lauren Finkel

Jun 5, 2015, 11:20:58 AM
to archiv...@googlegroups.com
Hi Richard,

I replied to this yesterday, but it didn't seem to go through. After running many more packages through Archivematica, I found that siegfried works MUCH better than fido. Virtually no errors were returned, which was really nice. Another question for you - what's the difference between using siegfried for file format ID and using the Identify by file extension option? I know siegfried uses the formats listed in PRONOM to identify file types, but does it also rely on the extension?

I'm trying to figure out the best way to go about file format IDs for my work flow for Transfer and Ingest (right now I'm using siegfried for both stages, but depending on your answer, I'm wondering if I should use siegfried for one and the file extension identifier for the other - just to be safe).

Regarding the speed of siegfried, I have no reason to be concerned. I'm not running archivematica off a vm, so everything is running at a normal pace.

Thanks again for your reply, it was a big help. You also have an amazing website.

Best, Lauren

Hutchinson, Tim

Jun 5, 2015, 3:17:43 PM
to archiv...@googlegroups.com

Hi Lauren,

 

Thanks for reporting this initial testing. I’m looking forward to seeing whether Siegfried helps address a couple FIDO issues I’ve run into recently – the container format that Richard mentioned is one of them.

 

Identify by file extension applies Archivematica’s format policy registry, e.g. .doc = Generic Word Document, whereas the PRONOM-based tools should provide a more detailed identification, such as version numbers. I always prefer to use the PRONOM-based tools, even if there are standardized filename extensions (which is certainly not always the case, depending on the vintage of the files). But it would be great to have a combined identification tool, that is, one that uses the file extension only if the PRONOM identification fails. From my earlier testing, I don’t believe FIDO uses file extensions; not sure about Siegfried.

 

Tim

 

Tim Hutchinson
Head, University Archives & Special Collections
University Library, University of Saskatchewan

Tel: (306) 966-6028  Fax: (306) 966-6040

Email: tim.hut...@usask.ca

Web: http://library.usask.ca/archives/


Misty De Meo

Jun 5, 2015, 11:00:02 PM
to archiv...@googlegroups.com
Hi, Tim,

FIDO returns both signature and extension matches. Archivematica currently returns signature matches where FIDO was able to make one, but will use an extension match if no signature match was found. We're not yet certain whether we want to keep this behaviour in the future - I'd be interested in getting the community's feedback on this.
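
The selection rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Archivematica's actual code: the `Match` type, the `choose_match` function, and the example PUID values are hypothetical names and data used only to show "prefer a signature match, otherwise accept an extension match".

```python
from typing import NamedTuple, Optional, Sequence


class Match(NamedTuple):
    puid: str  # PRONOM identifier reported by the tool
    kind: str  # "signature" or "extension"


def choose_match(matches: Sequence[Match]) -> Optional[Match]:
    """Prefer a signature match; fall back to an extension match; else None."""
    for wanted in ("signature", "extension"):
        for match in matches:
            if match.kind == wanted:
                return match
    return None


# Illustrative case: the byte signature only says "ZIP", while the extension
# suggests a more specific format. The signature match wins under this rule.
docx_like = [Match("fmt/412", "extension"), Match("x-fmt/263", "signature")]
chosen_docx = choose_match(docx_like)

# Illustrative case: no signature match at all, so the extension match is used.
extension_only = [Match("fmt/412", "extension")]
chosen_ext = choose_match(extension_only)

print(chosen_docx, chosen_ext, choose_match([]))
```

Note that under this rule a weak but valid signature match always outranks a more specific extension match, which is one reason the behaviour is worth discussing.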

As noted elsewhere in the thread, FIDO currently has fairly primitive container signature support; as a result, FIDO is unable to return signature matches for file formats such as Microsoft Office .docx files. (FIDO currently misidentifies these files as ZIP files, which is technically correct but not useful.) A future version of FIDO is likely to solve this issue. Siegfried supports these files without a problem.

Richard notes that Siegfried is likely to be slower in some cases, but in the testing I've performed so far it outperforms FIDO for most files. However, when Siegfried is slower, due to the full-file scanning Richard mentions, it can be *much* slower. For example, I tested using a 3.3GB MPEG-4 video that neither FIDO nor Siegfried was able to identify; FIDO returned a misleading result in 223 microseconds, while Siegfried took 93.5 seconds to report that it could not identify the file.
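
For anyone who wants to collect similar timings on their own files, a rough harness might look like the Python sketch below. It is an assumption-laden illustration rather than the method used for the figures above: `time_command` is a hypothetical helper, and the stand-in command simply starts a Python interpreter so the sketch runs anywhere. To time the real tools you would substitute something like `["sf", "/path/to/file"]` or `["fido", "/path/to/file"]`, assuming both executables are on your PATH and accept a file path as an argument.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def time_command(command: List[str]) -> Tuple[float, int]:
    """Run `command` once and return (elapsed seconds, exit code).

    The measurement includes process start-up, which matters when a tool
    is launched separately for every file.
    """
    start = time.perf_counter()
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, completed.returncode


# Stand-in command (an assumption, so the sketch is runnable without either tool).
elapsed, returncode = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f}s, exit code {returncode}")
```

Because each call launches a new process, timings gathered this way will be dominated by start-up cost for small files and by scan time for very large ones.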

I would love to see additional performance testing of FIDO and Siegfried within Archivematica - it would be interesting to me to see how often these edge cases are hit.

Hope this helps - please let me know if you have additional questions.
Misty

--
Misty De Meo
Software Developer / Systems Analyst
Artefactual Systems
www.artefactual.com

Hutchinson, Tim

Jun 5, 2015, 11:19:26 PM
to <archivematica@googlegroups.com>
Hi Misty,

Thanks for clarifying that. I may have been remembering a scenario where the extension was defined in the Archivematica FPR but not in PRONOM? I'll have to try to retrace my steps, since I definitely remember cases where FIDO returned no id but file extension should have done the trick.

Tim

Misty De Meo

Jun 5, 2015, 11:30:26 PM
to archiv...@googlegroups.com
Sorry, I was a bit unclear.

There are a few formats in the Archivematica registry which don't have PUIDs, and which can only be identified by file extension. In addition to that, PRONOM tracks file extensions - FIDO will attempt to fall back to suggesting file IDs using a file extension if it was unable to perform a signature match. Since it's still using PRONOM data, it returns PUIDs - it can't match any formats that aren't in PRONOM.

Best,
Misty

Hutchinson, Tim

Jun 5, 2015, 11:35:17 PM
to <archivematica@googlegroups.com>
That's probably what I was thinking of. So my wish list item would be a composite rule that falls back on the Archivematica file extension if it's not in PRONOM.

Tim

Sent from my iPhone

Lehane, Richard

Jun 6, 2015, 1:56:19 AM
to archiv...@googlegroups.com
Misty - re. your MPEG-4: it would be great to do some analysis of this file to find out why the PRONOM sigs aren't matching (there is a page on the siegfried wiki that has instructions on how to use `sf -debug` to help with this kind of analysis... https://github.com/richardlehane/siegfried/wiki/Inspect-and-Debug). Siegfried is a lot faster on big files if it can get a positive match early i.e. it won't necessarily do a full scan but it will quit as soon as it gets the best possible result. This means that the better the PRONOM signature database gets the faster siegfried should perform. The worst cases will always be these big unknown files. But 93 secs isn't too bad if you consider siegfried gulped through 3.3GB in that time! That said, if these bad cases come up a lot (e.g. for an audio-visual archive), you can give siegfried the same constraints that FIDO has. To set limits on how much of a file should be scanned use the roy tool - e.g. `roy build -bof 500000`. I think the cool thing about siegfried is that most of the time you can get by, with pretty decent performance, without doing this kind of thing.

Thanks Lauren, Tim and Misty for this discussion... it is great to get feedback on how siegfried is performing.

cheers
Richard


From: archiv...@googlegroups.com [archiv...@googlegroups.com] on behalf of Hutchinson, Tim [tim.hut...@usask.ca]
Sent: Saturday, 6 June 2015 1:35 PM

Lauren Finkel

Jun 9, 2015, 11:05:56 AM
to archiv...@googlegroups.com
I agree with Tim, having an automatic fallback to the file extension if it can't be found in PRONOM would be a lifesaver. Or even if the drop-down menu could change so that two options can be selected. The frustrating part I find is that when a file format can't be identified I have to start from the very beginning, instead of being given an option to retry the file format ID with another tool. Could that be another workaround?

Lauren

Sarah Romkey

Jun 9, 2015, 5:30:28 PM
to archiv...@googlegroups.com
Hi Lauren,

A method of falling back to another tool when there are identification errors is a great idea - I have added it to the Dashboard section of our wishlist: https://www.archivematica.org/wiki/Development_roadmap:_Archivematica#Dashboard

I don't know if this will be appropriate for your workflow, but you can choose a different tool in the ingest tab for file identification. So, if you had failures with FIDO in transfer, in ingest you could re-do with Siegfried, for example.

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Systems Archivist
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory


Andrew Berger

Jun 10, 2015, 1:34:43 PM
to archiv...@googlegroups.com
Hi all,

While I'm in favor of having a fallback option in case signature matching fails, I would like it to be configurable. In testing files from the 8+3 filename era, I've found extension matching to be misleading. For example, I've seen files with a .MOV extension identified by FIDO as MOV video when context suggests that the files are part of a BBS software package that ran on either CP/M or DOS. I'd rather have files go unidentified than be identified incorrectly, especially if that metadata is going into the AIP and the elasticsearch index.
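
A configurable fallback of the kind discussed in this thread could be sketched as an ordered chain of identifiers, where the extension step is simply left out when it is not wanted. Everything named here is hypothetical (`identify_with_fallback`, `by_extension`, the lookup table, and the stand-in signature identifier); it is a sketch of the idea, not a description of how Archivematica works.

```python
import os
from typing import Callable, Dict, List, Optional

# An identifier takes a file path and returns a format label, or None.
Identifier = Callable[[str], Optional[str]]


def by_extension(table: Dict[str, str]) -> Identifier:
    """Build an identifier that looks up the lower-cased file extension."""
    def identify(path: str) -> Optional[str]:
        extension = os.path.splitext(path)[1].lower()
        return table.get(extension)
    return identify


def identify_with_fallback(path: str,
                           identifiers: List[Identifier]) -> Optional[str]:
    """Try each identifier in order and return the first non-None result."""
    for identify in identifiers:
        result = identify(path)
        if result is not None:
            return result
    return None


def no_signature_match(path: str) -> Optional[str]:
    """Stand-in for a signature-based tool that cannot identify the file."""
    return None


# Illustrative extension table; the label is made up for this example.
extension_table = {".mov": "QuickTime video (by extension)"}

# Strict configuration: signature matching only, so the file stays unidentified.
strict = identify_with_fallback("GAME.MOV", [no_signature_match])

# Lenient configuration: extension lookup is allowed as a last resort.
lenient = identify_with_fallback(
    "GAME.MOV", [no_signature_match, by_extension(extension_table)]
)

print(strict, lenient)
```

With the strict configuration the file is reported as unidentified rather than labelled on the strength of its extension alone; the lenient configuration trades that caution for coverage.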

Andrew

On Tue, Jun 9, 2015 at 8:05 AM, Lauren Finkel <lauren...@gmail.com> wrote:

Genevieve HK

Mar 28, 2016, 1:35:57 PM
to archivematica
Picking up on this discussion due to misidentification issues during our tests....

Is there a way to flag/submit issue reports for certain formats that are misidentified by Siegfried, so that in the future they may be properly ID'd? 

We're having issues with both tools (and extension ID is not a good option for us), but more luck in general with Siegfried. It would be great to help improve its functionality in some way, since there is currently no way to fall back on a different tool for particular formats, or to specify tools to be used for specific formats in Preservation Planning.

-Genevieve

Lehane, Richard

Mar 28, 2016, 7:59:36 PM
to archiv...@googlegroups.com

Hi Genevieve

You can report siegfried issues on its github repo: https://github.com/richardlehane/siegfried/issues . Please attach an example file if possible.

 

The latest version of siegfried is 1.5. That may not be the version running with your archivematica install. If that’s the case, it may be worth checking the output of the “Try Siegfried” demonstrator at http://www.itforarchivists.com/siegfried (drag your file onto the picture of siegfried) before reporting.

 

Cheers

Richard



Sarah Romkey

Mar 30, 2016, 5:58:54 PM
to archiv...@googlegroups.com
Hi Genevieve and all,

In a future version of Archivematica we plan to implement an "unidentified files" report and a step in the micro-services that allows you to stop the process, view the report, and re-run file identification with a different tool on the unidentified files. This work was sponsored by the Universities of Hull and York, through funding provided by Jisc. A first iteration of the development, the un-id'd files report, is targeted for 1.6 or 1.7.

I recognize that unidentified files are something different from misidentified files, but I just wanted to bring your attention to the development. Perhaps future development could address both use cases.

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Archivematica Program Manager
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory

