Extracting files from raw disk images


Shira Peltzman

Nov 30, 2016, 3:52:01 PM
to Digital Curation

Hi all, 


We’re in the process of establishing our Archivematica processing workflow. One of the main file types we’ll be working with is the raw disk image (.img) that we create using our KryoFlux. Ideally we’d like to have Archivematica extract the files from each disk image upon transfer since, at least in most cases, we're more interested in preserving the contents of the disk image than the disk image itself. 

Our dilemma is that while you can configure Archivematica to identify .img files by their extension (this got addressed in Dorothy Waugh's query to the DC Google Group last year), currently none of Archivematica’s existing commands are able to perform file extraction for raw disk images--or from E01 disk images, for that matter.

 

Theoretically, we could solve this problem simply by changing our workflow so that we extracted the files from our disk images before ingesting them into Archivematica. While doing file extraction manually may not be such a huge deal in the short-term, obviously this would both slow down our workflow to some degree and introduce the opportunity for human error, which would be disadvantageous overall. 

 

I'm interested to hear from other folks who are using Archivematica to process raw/E01 disk images: what does your current workflow look like? Do you extract the files before transferring them into Archivematica, or do you process and preserve just the disk images themselves? If you do extract the files, at what point do you do this, and what tool do you use to do so?

  

I'd appreciate any thoughts you guys have to share about the above. 


Thanks in advance! 

 

Shira Peltzman 
Digital Archivist, UCLA Library

John Durno

Nov 30, 2016, 5:20:26 PM
to digital-...@googlegroups.com

Hi Shira,

 

I just wrote a paper on that very thing, published last month in Code4Lib journal:

 

http://journal.code4lib.org/articles/11986

 

You are correct that Archivematica isn’t well suited to extracting files from KryoFlux images. I’ve come up with some semi-automated processes using a variety of tools, depending on which disk formats I’m working on. I describe those processes in a reasonable amount of detail in the paper. Not sure if any of them are relevant to your particular case, but I’d be happy to answer questions if you have any.

 

Best,

John  


Ben Fino-Radin

Dec 1, 2016, 11:40:25 AM
to Digital Curation
Hi Shira,

You raise an important point!

I think though it is important to remind those reading along (of course you yourself are very familiar with this) that Archivematica is a modular system of other existing open source tools, and can be taught new tricks. So, in the spirit of this, let's see if there is a way to "teach" Archivematica how to do what you are looking for.

There is really only one command used by Archivematica for addressing disk images in the "extract packages" microservice: tsk_recover, which is part of The Sleuth Kit. tsk_recover is indeed quite capable of extracting files from raw as well as forensic disk images, and thus so is Archivematica.

There are two caveats to this:
  1. you are limited to the 18 filesystems that tsk_recover supports – this is of course highly relevant to older materials, and is what John's paper is addressing
  2. out of the box, the tsk_recover command used by Archivematica will not be able to handle physical disk images. It will only be able to extract logical images – based on your statement that it cannot extract any kind of disk image, I am inclined to think that this is perhaps the root of your problem

Looking closer at problem #2 – tsk_recover is very much able to extract physical disk images, but the trick is that you need to specify the byte offset. Honestly I find it pretty silly that tsk_recover does not have an option to just extract any/all partitions with known filesystems – especially since The Sleuth Kit does include a tool (mmls) that can display the partition map, show the filesystem type for each partition, and provide the byte offset. I've opened an issue regarding this here.

I have been working (though it's been on the back burner for many months) on a Python script for doing some data munging of E01 disk images, and the final piece of the puzzle is a part that runs mmls, finds the byte offset of the volume, and then uses that when running the tsk_recover command. The script I'm working on is highly highly specific to MoMA's workflows, and actually happens as part of automation-tools, rather than "inside" Archivematica. That being said, once I finish it, I will break out the part I just described as a discrete standalone script, which you could then put on your Archivematica server, and use as a new "extract" command in the FPR – in other words giving you the ability to handle both Logical and Physical image extraction as part of the extraction microservice.
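
For anyone reading along, here is a rough sketch of what that mmls-then-tsk_recover step might look like in Python. It is not the MoMA script, just an illustration under a few assumptions: that mmls and tsk_recover are on the PATH, that the mmls table has the usual Slot/Start/End/Length/Description columns, and that only allocated files are wanted (the -a flag). The names parse_mmls, recover_command and extract_all are made up for this example. As far as I know, mmls reports partition starts in sectors and tsk_recover's -o option also takes a sector offset, so the sketch passes the Start value through unchanged rather than converting it to bytes. If mmls finds no partition table (which I would expect for most floppy images), the sketch falls back to treating the file as a logical image.

```python
import os
import re
import subprocess
from typing import List, NamedTuple


class Partition(NamedTuple):
    slot: str
    start: int        # first sector of the partition, as printed by mmls
    length: int       # partition length in sectors
    description: str


# Matches mmls table rows: index, slot, start, end, length, description.
ROW = re.compile(r"^\s*\d+:\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*\S)\s*$")


def parse_mmls(output: str) -> List[Partition]:
    """Return the real partitions listed in mmls output (hypothetical helper)."""
    partitions = []
    for line in output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines, blank lines
        slot, start, _end, length, description = match.groups()
        # Rows whose slot is "Meta" or "-------" describe partition tables
        # and unallocated space, not volumes, so skip them.
        if not re.fullmatch(r"\d+(:\d+)?", slot):
            continue
        partitions.append(Partition(slot, int(start), int(length), description))
    return partitions


def recover_command(image: str, start_sector: int, out_dir: str) -> List[str]:
    """Build a tsk_recover call for one volume (-a: allocated files only)."""
    return ["tsk_recover", "-a", "-o", str(start_sector), image, out_dir]


def extract_all(image: str, out_root: str) -> None:
    """Run mmls, then tsk_recover once per partition it reports."""
    listing = subprocess.run(["mmls", image], capture_output=True, text=True)
    partitions = parse_mmls(listing.stdout) if listing.returncode == 0 else []
    if not partitions:
        # Assumption: no partition table means a logical image, offset 0.
        partitions = [Partition("none", 0, 0, "logical image")]
    for part in partitions:
        out_dir = os.path.join(out_root, f"partition_{part.start}")
        os.makedirs(out_dir, exist_ok=True)
        subprocess.run(recover_command(image, part.start, out_dir), check=True)
```

Calling extract_all("disk.img", "extracted") would then write each volume's files into its own subdirectory of "extracted". Wrapping something like this in a small command-line script is one way it could be registered as an extraction command in the FPR, though the details of that registration are not covered here.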

So… no immediate solutions for an automated workflow, but perhaps the problem is clearer now?

Best,
Ben

Jess Whyte

Dec 1, 2016, 12:22:07 PM
to Digital Curation
You know that thing where someone posts at the exact same time as you, but their post is way more informative and helpful.... 

Jess Whyte

Dec 1, 2016, 12:22:12 PM
to Digital Curation
Shira - Hi, I thought Archivematica included an extraction service that used tsk_recover (which, depending on the file system, will work with .img images)? I'm not a regular Archivematica user, but I'm curious about this, so please keep me posted.

John - great article. Thanks for posting it, I learned a lot of new tricks and tactics. 
