Proposal: Oxford Common Filesystem Layout


Andrew Hankinson

Sep 15, 2017, 6:55:45 AM
to fedora-c...@googlegroups.com
Dear Fedora Community,

During the Fedora / Samvera camp in Oxford (4-8 September) there were a number of conversations about the role of file systems in institutional repositories, and specifically about the impact the storage layer has on institutional repository actions and interactions with other components of a larger institutional digital preservation programme.

This week, with feedback from Neil Jefferies and Andrew Woods, I put together some of these thoughts into a small proposal: the "Oxford Common Filesystem Layout" (OCFL). Similar to the "Portland Common Data Model," my hope is that this can serve to start conversations about the underlying data storage layer in our institutional repositories and arrive at some common understanding of best practices for filesystem storage. Please find attached the first version of this document.

I look forward to the (hopefully) ensuing conversations.

-Andrew Hankinson

---
Bodleian Libraries
University of Oxford
andrew.h...@bodleian.ox.ac.uk

Oxford Common Filesystem Layout.pdf

Scott Prater

Sep 15, 2017, 11:01:03 AM
to fedora-c...@googlegroups.com, Andrew Hankinson
Andrew,

This is an excellent document summarizing storage issues and proposed
principles; thank you very much for putting it together. I have a
number of thoughts about practices concerning digital object storage,
based on some lessons we've learned here over the years; I would like
to discuss what has worked for us, and what we're still wrestling with,
with others working in the same space.

Just to start off a conversation, I'll put out for debate a proposed
general principle: data model and object storage organization should be
completely independent of each other. In my view, a data model is
metadata about how an object is structured, and can be stored like any
other piece of metadata, as bits in any arbitrary location. Fedora 3 +
Akubra respected this distinction fairly well, insofar as how storage
was implemented had almost nothing to do with how your data models were
structured (except for the distinction between datastreams and objects
in the storage implementation, which I think could also be done away with).

Again, thanks for taking the time to organize a conversation around this
topic.

-- Scott
--
Scott Prater
Shared Development Group
General Library System
University of Wisconsin - Madison

Mike Giarlo (Google Groups)

Sep 15, 2017, 8:19:06 PM
to fedora-c...@googlegroups.com, Andrew Hankinson
Indeed, I too will be excited to track this conversation. Thanks for kicking it off, Andrew!

--
Michael J. Giarlo
Technical Manager, Hydra-in-a-Box project
Software Architect, Digital Library Systems & Services
Stanford University Libraries

--
You received this message because you are subscribed to the Google Groups "Fedora Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fedora-communi...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrew Hankinson

Sep 19, 2017, 6:21:39 AM
to fedora-c...@googlegroups.com
Thanks Mike, Scott,

I had only briefly looked at Akubra before.

I would like to pick up on what you mentioned, Scott, about separation between data model and object storage. Could you be more specific about how you think those two components are separable, and why you feel this way? Do you mean that metadata is not stored on disk with the digital objects it describes? Or just that the paths to the digital objects have no recognizable connection to information in the metadata (e.g., the UUID of the digital object is not used to construct a pairtree folder structure leading to the digital objects)?

-Andrew

Scott Prater

Sep 19, 2017, 9:37:14 AM
to fedora-c...@googlegroups.com, Andrew Hankinson
Hi, Andrew --

I meant the latter -- that the paths or URLs to the digital objects
in the storage platform are not in any way determined by the semantic
content of the metadata about the objects. Akubra's default
implementation, for example, created a hash of the ingested file, then
used the first several characters of the hash to create a directory
structure (a pairtree or something similar) whose depth was
user-configurable. One of the benefits of this approach is that if your
data is spread out among several filesystems or storage buckets, they
all fill up evenly, at the same pace. You could also chop the characters
of the hash up in such a way as to have deep directories with few files
at the leaves, or flat directories with lots of files, whichever option
gave you the best performance.
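
To illustrate the hash-based layout described above, here is a minimal
sketch in Python. It is not Akubra's actual code: the function name
`hashed_path`, the parameters `depth` and `width`, and the choice of
SHA-256 are all assumptions made for the example.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(content: bytes, depth: int = 3, width: int = 2) -> PurePosixPath:
    """Derive a storage path from a hash of the content itself.

    `depth` is the number of directory levels and `width` the number of
    hex characters per level (both hypothetical tuning knobs).
    """
    digest = hashlib.sha256(content).hexdigest()
    # Slice the front of the digest into `depth` segments of `width` characters.
    segments = [digest[i * width:(i + 1) * width] for i in range(depth)]
    # The full digest is used as the file name at the leaf.
    return PurePosixPath(*segments, digest)
```

A larger `depth` gives deep directories with few files at the leaves; a
smaller one gives flat directories with many files. Because the hash
output is close to uniformly distributed, the top-level directories
(and whatever volumes they are mapped to) tend to fill at a similar rate.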

One thing to keep in mind also is that underlying storage containers
radically differ in terms of how they are accessed -- the Hitachi
Content Platform, for example, uses a RESTful interface to store
objects, you simply pass it an identifier (or it allocates one), and it
takes care of the rest -- the implementation is entirely opaque in terms
of actual storage, even though you can create virtual hierarchies as URL
paths. Backblaze B2 works much the same way, like most cloud storage
providers. The trend in storage seems to be more in that direction,
where paths and URIs are semantic bits of metadata about an object
structure, to be used or not as you wish, but that have little or
nothing to do with how the bits are organized on a disk.
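
The idea of a flat, opaque store with "virtual" hierarchies can be
sketched with a toy in-memory class. This is not the API of the Hitachi
Content Platform, Backblaze B2, or any real product; the class
`OpaqueObjectStore` and its methods are hypothetical and only show the
shape of the interaction.

```python
import uuid
from typing import Dict, List, Optional


class OpaqueObjectStore:
    """Toy stand-in for an object store with a flat key space."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes, key: Optional[str] = None) -> str:
        # The caller may pass an identifier, or let the store allocate one.
        key = key or uuid.uuid4().hex
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str = "") -> List[str]:
        # A "directory" is only a shared key prefix; nothing is nested in storage.
        return sorted(k for k in self._objects if k.startswith(prefix))
```

A key such as "coll/obj1/file.tif" looks like a path, but the slashes
carry no meaning for the store itself; listing by prefix is what makes
the hierarchy appear.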

One of the attractions to me about not using a tree hierarchy structure
to model your objects is that you then have more flexibility in terms of
organizing your objects as nodes in a graph, with multiple parentage,
mix-in models, etc. The underlying store could use a tree to store the
objects, if it wanted -- that would be determined not by the semantic
structure of the objects, but by platform-dependent considerations based
on performance -- a good place to be. I want my storage platform to
organize the bits as it sees fit to write them and serve them up as
quickly as possible, without any other constraints getting in its way.

-- Scott

Andrew Hankinson

Sep 19, 2017, 10:21:51 AM
to fedora-c...@googlegroups.com
I think this is one thing that I touch on in the proposal where it will be difficult to understand the boundaries: The line between specific technologies and the abstract structures that we use on top of these technologies to organize our content.

On the one hand, there are any number of storage options available, and how they store the underlying content varies significantly on the face of them, including distributed file systems or object stores. On the other hand, when you get down to the metal there still seems to be an assumption in all systems of a 'tree-like' layer. That tree view may be hidden within proprietary layers and spread across multiple physical disks, but (in my limited experience) it is probably still there, somewhere. Do these layers introduce any risks?

One of the fundamental assumptions I have is that, of all the technologies that we have access to, filesystems are the most robust, well-tested, and compatible layers of any stack. The contents of FAT16 and HFS filesystems of 30 years ago are still readable, assuming the underlying hardware still works. I know that the world has moved on since FAT16, but I think there is still significant value in having solid technologies like it underpinning long-term storage systems. The question is whether we can tell, in the present, what these technologies will be, and give guidance to implementers who are looking to make technology choices now.

Part of what I would like to explore and tease out is the impact filesystem choices have on long-term preservation, and what risks are introduced in the process of layering "storage-abstracting" technologies, such as Modeshape, on top of filesystems. Is the Hitachi Content Platform a new kind of filesystem that we can expect to be readable in 30 years, or is it simply wrapping content up in a proprietary layer that sits on top of a "normal" filesystem? Is that OK? Would we be better served by adopting more open standards in our storage systems and structures?

I think if libraries are to have digital preservation as a goal, there should be some sort of guidance on the advantages and tradeoffs of certain storage technologies, independent of any given software system. An analogy might be the equivalent of building a physical building to store archival collections while not adequately understanding the impact the HVAC system has on heat and moisture control. Is it OK to go out and buy the solution offered by the lowest bidder, or the one that is more quiet and aesthetically pleasing? Probably not. So it's probably time that we sat down and tried to understand the impact of our 'digital HVAC' systems, and the design criteria needed to build an adequate one. :)

-Andrew

Donald Brower

Oct 5, 2017, 2:52:39 PM
to fedora-c...@googlegroups.com
Hi Andrew,

Thanks for framing these issues and starting the conversation. I ask myself often whether we should be concerned about the layout of preservation data on a filesystem, and whether the idea that as a last resort I could always create my own export tool at the filesystem level is valid, or an illusion.

-Don







--
Donald Brower, Ph.D.
Digital Infrastructure Lead
Hesburgh Libraries — Digital Library Initiatives and Scholarship

University of Notre Dame
208 Hesburgh Library

Andrew Woods

Oct 20, 2017, 12:07:08 PM
to fedora-community
Hello AndrewH and All,
Thank you for resurfacing this important topic in a concrete, constructive way.

The Fedora community has been putting significant focus on defining the Fedora API Specification [1] as a common abstraction for the services expected from a Fedora repository. Although the Fedora API Specification is the interface through which Fedora clients should interact with the repository, I agree that the Fedora community should play a more direct role in collectively defining expectations of the underlying persistence characteristics and content layout, with an eye towards preservation needs.

As a part of this conversation, however, we should be careful to draw a clear line between how we expect Fedora clients to interact with the repository contents (via the HTTP API) and what characteristics and layout we may expect from the persistence layer.

That said, I would be excited to help forward an initiative related to Fedora's preservation sensibilities, potentially under the auspices of a time-bound "Oxford Common Filesystem Layout" (OCFL) interest group.

If at least three separate institutions can respond to this thread with a +1 indication of commitment to participating in an OCFL interest group, I will send out a Doodle to coordinate the first call.

Regards,
AndrewW

[1] http://fedora.info/spec/

Aaron Birkland

Oct 20, 2017, 3:54:14 PM
to fedora-c...@googlegroups.com

Hi all,

I would be cautious about developing an expectation that a Fedora repository persists content on a filesystem in a particular way, unless we are prepared to define Fedora as such.

That being said, I really like the notion of a standard/specification for persisting repository resources, particularly if it will offer advice on linking between resources that are persisted on the same file system. A significant portion of a Fedora 4 repository is hypertext, and it is essential not to have to possess some piece of a priori knowledge in order to dereference resources that are linked from other resources. Even Fedora 3 suffered from this in its own special way. Managed Datastream IDs in FOXML were opaque, and could really only be understood by the software that created them.
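
One way to avoid a priori knowledge is to store links as paths relative to the linking file, so that resolving a link needs nothing beyond that file's own location. A minimal sketch in Python, assuming POSIX-style paths relative to the storage root; `resolve_link` and the example paths are hypothetical and not part of Fedora or any OCFL draft:

```python
import posixpath


def resolve_link(source_path: str, link: str) -> str:
    """Resolve a relative link found inside the file at `source_path`.

    Both the source path and the result are POSIX-style paths relative
    to the storage root; no repository base URI is consulted.
    """
    base_dir = posixpath.dirname(source_path)
    # Join against the linking file's directory, then collapse any "..".
    return posixpath.normpath(posixpath.join(base_dir, link))
```

For example, a link "../b/content.bin" found in "objects/a/metadata.ttl" resolves to "objects/b/content.bin" without consulting any software-specific lookup table.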

 

In my mind another great application of OCFL is packaging resources in BagIt archives for preservation or transfer. The Fedora import/export tool adopted its own set of conventions for arranging and naming resources [1]. One needs to know the conventions in order to interpret the content exported by it, including the baseURI of the repository the resources were exported from. Likewise, there is another approach that uses an external manifest for distinguishing “domain objects” from file content, and uses a custom URI scheme for unambiguously linking between local resources in a bag [2]; but its scope is narrow. If OCFL can provide a generalized and widely-adopted solution in this space, that would be a big win.
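
For context, the baseline that BagIt itself provides is small: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest. The sketch below writes such a bag with the Python standard library only. It is not the Fedora import/export tool, and the function name `write_minimal_bag`, the SHA-256 manifest, and the version string are assumptions for the example.

```python
import hashlib
from pathlib import Path
from typing import Dict


def write_minimal_bag(bag_dir: Path, payload: Dict[str, bytes]) -> None:
    """Write payload files under data/ plus a declaration and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_name, content in sorted(payload.items()):
        target = data_dir / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        checksum = hashlib.sha256(content).hexdigest()
        # Each manifest line pairs a checksum with a path relative to the bag root.
        manifest_lines.append(f"{checksum}  data/{rel_name}")
    # Version string is an assumption; use the BagIt version you target.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
```

Nothing in this structure says which payload files are metadata, which are content, or how they link to one another; those are the conventions that each tool currently has to invent for itself.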

 

-Aaron

[1] https://git.io/vdbew
[2] https://git.io/vdbeH

 


Scott Prater

Oct 23, 2017, 6:14:59 PM
to fedora-c...@googlegroups.com, Aaron Birkland
I'm with Aaron. Perhaps we can start by calling out some fundamental
principles, such as "Objects stored by Fedora should be retrievable from
the storage layer and immediately interpretable without needing to know
how Fedora internally manages its data". Of course, we'll need to have
some a priori knowledge -- hence an OCFL standard.

(Something like CMIS, but for import/export to/from storage layers?
https://en.wikipedia.org/wiki/Content_Management_Interoperability_Services)

-- Scott



Eoghan Ó Carragáin

Oct 24, 2017, 6:02:44 AM
to fedora-c...@googlegroups.com
Hi,
There is a Research Data Alliance group looking at repository interoperability [0]. A spec for a common BagIt-based export/exchange format is being discussed [1]. It doesn't deal with filesystem layout for production repositories, but there is overlap nonetheless, so it may be of interest.
Eoghan
