fstat-manifest.txt

Dan Chudnov

Nov 14, 2012, 1:03:36 PM
to Digital Curation
We're building an inventory tool that lets us track what digital collections we have and where they are, with some record keeping and a friendly UI for raw content access. For the record keeping and UI pieces, we're looking at options for caching filesystem-level information into a central inventory database to support content browsing and summary reporting across collections. There's a typical tension between "copying filesystem information into a database" and "just using the filesystem", not to mention time-space and caching/aging tradeoffs. Ultimately we want to keep things simple. This has us wondering: have you considered serializing file stat info (e.g. ctime, mtime, mode, byte size) in a BagIt tag file?

In our common case, like so many of your systems, piles of content are stored on diverse storage pools; this content will accumulate over time but not change often, at least not intentionally. We want to have a simple inventory UI that allows direct access to stored content, most simply accomplished by redirecting to a web server directory browse mounted on a particular storage pool. We could, though, provide a cleaner UI, maybe without having to hit the remote filesystem right away. For example, we could use a checksum manifest cached in the db to create an initial file browse list, then only go to the remote system for specific files. A little hacky, but maybe a useful tradeoff.
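As a rough sketch of that browse-list idea, assuming the cached manifest text follows the usual BagIt "checksum, whitespace, path" line layout; the function names `parse_manifest` and `browse` are hypothetical and not part of any existing tool:

```python
def parse_manifest(text):
    """Parse BagIt-style manifest text into (path, checksum) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Checksum first, then the path; the path itself may contain spaces.
        checksum, path = line.split(None, 1)
        # Assumption: some checksum tools prefix the path with "*"; drop one if present.
        if path.startswith("*"):
            path = path[1:]
        entries.append((path, checksum))
    return entries


def browse(entries, directory="data"):
    """Return (subdirectories, files) found directly under `directory`."""
    prefix = directory.rstrip("/") + "/"
    subdirs, files = set(), []
    for path, _checksum in entries:
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        if "/" in rest:
            subdirs.add(rest.split("/", 1)[0])
        else:
            files.append(rest)
    return sorted(subdirs), sorted(files)
```

The listing comes entirely from the cached manifest, so the remote storage pool would only be touched when someone asks for a specific file.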

That led us to think - what if we also generated filestat manifests in the bags, and cached those in the db too? That could allow a few things:

- more info available in the browser UI without having to hit remote disk
- efficient post-processing in the db to generate file/copy/collection summary statistics
- the convenience of a modern web framework/orm for storing/querying those stats

The main drawback is that it's just a data snapshot, and provides no guarantee (and perhaps the false appearance) of an accurate picture of what's actually on disk.
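A minimal sketch of the caching and summary side, using SQLite from the Python standard library rather than a web framework's ORM. It assumes tab-delimited lines of ctime, mtime, mode, size, and path; the table layout, the function names, and the `captured_at` column are all hypothetical. Storing `captured_at` with every row is one way to keep the snapshot's age visible in the UI instead of implying the numbers are live:

```python
import sqlite3


def load_fstat(conn, collection, lines, captured_at):
    """Cache delimited fstat lines for one collection, stamped with the snapshot time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fstat ("
        "collection TEXT, path TEXT, ctime INTEGER, mtime INTEGER, "
        "mode TEXT, size INTEGER, captured_at TEXT)"
    )
    rows = []
    for line in lines:
        # Split at most four times so a path containing tabs stays intact.
        ctime, mtime, mode, size, path = line.rstrip("\n").split("\t", 4)
        rows.append(
            (collection, path, int(ctime), int(mtime), mode, int(size), captured_at)
        )
    conn.executemany("INSERT INTO fstat VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def collection_summary(conn):
    """Return (collection, file count, total bytes, latest snapshot) per collection."""
    cur = conn.execute(
        "SELECT collection, COUNT(*), SUM(size), MAX(captured_at) "
        "FROM fstat GROUP BY collection ORDER BY collection"
    )
    return cur.fetchall()
```

Reports built this way describe the bag as of `captured_at`, not the disk as it is now, so a periodic re-check against the storage pool would still be needed.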

To support this we could generate something like "fstat-manifest.txt", with delimited values for ctime, mtime, mode, size, and path, as part of our bagging process, and part of bringing something under inventory control would mean caching this into our db for later processing and UI presentation. Having this serialized in the bag might add value over time apart from our inventory app, too. I can't tell how important that might be, but it seems like an option to consider.
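A minimal sketch of what generating such a file might look like, assuming a bag whose payload lives under `data/`, a tab delimiter, whole-second timestamps, and the field order ctime, mtime, mode, size, path. None of these choices come from the BagIt spec, and the function names are hypothetical:

```python
import os


def fstat_lines(bag_dir, delimiter="\t"):
    """Yield one delimited line per payload file: ctime, mtime, mode, size, path."""
    data_dir = os.path.join(bag_dir, "data")
    for root, dirs, files in os.walk(data_dir):
        dirs.sort()  # sort in place so the walk order is deterministic
        for name in sorted(files):
            full = os.path.join(root, name)
            st = os.stat(full)
            # Record paths relative to the bag root, with forward slashes.
            rel = os.path.relpath(full, bag_dir).replace(os.sep, "/")
            yield delimiter.join([
                str(int(st.st_ctime)),
                str(int(st.st_mtime)),
                oct(st.st_mode & 0o7777),  # permission bits only
                str(st.st_size),
                rel,
            ])


def write_fstat_manifest(bag_dir, name="fstat-manifest.txt"):
    """Write the stat lines to a tag file at the top of the bag and return them."""
    lines = list(fstat_lines(bag_dir))
    with open(os.path.join(bag_dir, name), "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    return lines
```

Putting the path last means a path containing the delimiter can still be recovered by splitting on only the first four delimiters. Note that ctime means inode change time on Unix-like systems but creation time on Windows, so the values are only comparable within one platform.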

Have you done something like this? I recall that LC's CTS captures file type and size info for the UI and for reporting as well, but I don't remember how this is modeled. And I don't seem to remember CTS serializing this info as tag files in the bags themselves.

Thanks for reading, -Dan


p.s. I acknowledge that CTS has corrupted my thinking about this kind of system irrevocably. But at least we can verify that it's corrupted.


John A. Kunze

Nov 14, 2012, 4:09:27 PM
to Digital Curation
Keith Johnson brought up the concept of storing the inode/fstat info
about the time that the BagIt spec was stabilizing. Good for forensics
and more.

He thought the container might be called a "baggie". A comprehensive
approach to filesystem-level info across platforms was challenged by
pretty divergent practices; e.g., consider the rich file-level metadata
found on Mac OS X filesystems. That might be part of a case for just
looking at a subset of the file "metadata" (e.g., ctime, mtime, ...).

-John


--- On Wed, 14 Nov 2012, Dan Chudnov wrote:

> Date: Wed, 14 Nov 2012 10:03:36 -0800
> From: Dan Chudnov <daniel....@gmail.com>
> Reply-To: digital-...@googlegroups.com
> To: Digital Curation <digital-...@googlegroups.com>
> Subject: [digital-curation] fstat-manifest.txt

Littman, Justin

Nov 15, 2012, 7:38:37 AM
to Digital Curation
You might want to look at Digital Forensics XML (http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML).

--Justin