Representing archival collections in RDF

Aaron Rubinstein

Aug 24, 2010, 2:56:19 PM

to Archives and the Semantic Web

Thought I'd try and spark some discussion on this list...

At UMass, we're working on simple representations of our collections
in RDF. Our current plans are to add some basic RDFa to our HTML
finding aids. The RDF would look something like this:

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix pvn: <http://purl.org/archival/provenance/0.1#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

<http://example.com/mums589#collection>
rdf:type pvn:Collection;
dcterms:title "W. C. Wheeler Scrapbook";
dcterms:extent "1 vol. (0.5 linear ft.)";
dcterms:abstract "A resident of Haydenville, Mass., during the 1930s, C. H. Wheeler..." ;
dcterms:creator "Wheeler, C. H.";
pvn:heldBy <http://library.umass.edu/spcoll#archive> .

Another possible strategy would be to host the RDF descriptions
separately from the finding aid and link to the static html finding
aid using foaf:page or equivalent. The dcterms:creator could also
link to a resource like this: http://gslis.simmons.edu/archival/5d9920257771ee469c5edc0da779ab17#person.

As approaches to modeling the arrangement of collections become
clearer, that metadata could be added to these descriptions as well.
The advantages here are: 1. Separating essential data about
collections from display information. The commingling of structure
and display metadata in EAD v. 2002 makes it almost impossible to
model in RDF. 2. Providing a basic RDF representation of collections
without trying to recreate EAD, while allowing for modular add-ons as
vocabularies develop.

What do folks out there think of this approach?

Aaron

Jodi Schneider

Aug 25, 2010, 6:13:34 PM

to semantic...@googlegroups.com

LOCAH is using Linked Data for archives, too. They've written about it here:
http://blogs.ukoln.ac.uk/locah/2010/08/18/some-thoughts-on-architecture-and-workflows/

On Tue, Aug 24, 2010 at 7:56 PM, Aaron Rubinstein <rubin...@gmail.com> wrote:
> Thought I'd try and spark some discussion on this list...
>
> At UMass, we're working on simple representations of our collections
> in RDF. Our current plans are to add some basic RDFa to our HTML
> finding aids. The RDF would look something like this:
>
> @prefix dcterms: <http://purl.org/dc/terms/>.
> @prefix foaf: <http://xmlns.com/foaf/0.1/>.
> @prefix pvn: <http://purl.org/archival/provenance/0.1#>.
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
>
> <http://example.com/mums589#collection>
> rdf:type pvn:Collection;
> dcterms:title "W. C. Wheeler Scrapbook";
> dcterms:extent "1 vol. (0.5 linear ft.)";
> dcterms:abstract "A resident of Haydenville, Mass., during the
> 1930s, C. H. Wheeler..." ;
> dcterms:creator "Wheeler, C. H.";
> pvn:heldBy <http://library.umass.edu/spcoll#archive> .

This seems nice, but the location and date should be accessible rather
than just in text strings.
Are there best practices around extent? (i.e., is there a way to encode
it so that '1 vol.' is separate from '.5 linear ft.'?)

>
> Another possible strategy would be to host the RDF descriptions
> separately from the finding aid and link to the static html finding
> aid using foaf:page or equivalent.

Content negotiation, maybe?

> The dcterms:creator could also
> link to a resource like this: http://gslis.simmons.edu/archival/5d9920257771ee469c5edc0da779ab17#person.

Yes, especially if that site could provide info for humans as well as
machines. But fine even so; links will help humans.

>
> As approaches to modeling the arrangement of collections become
> clearer, that metadata could be added to these descriptions as well.
> The advantage here is to: 1. Separate essential data about
> collections from display information. The co-mingling of structure
> and display metadata in EAD v. 2002 makes it almost impossible to
> model in RDF. 2. Provide a basic RDF representation of collections
> without trying to recreate EAD but allowing for modular addons as
> vocabularies develop.
>
> What do folks out there think of this approach?
>
> Aaron

-Jodi

Jeanne Kramer-Smyth

Aug 25, 2010, 6:35:49 PM

to semantic...@googlegroups.com

+1 on breaking out the extent details in some standard way.

Also - would you consider adding dcterms:subject and some way to indicate inclusive or bulk dates?

These two elements added in would give folks much more power in using your data to do interesting things like groupings and visualizations.

Jeanne Kramer-Smyth
http://www.spellboundblog.com

Aaron Rubinstein

Aug 26, 2010, 12:22:03 PM

to semantic...@googlegroups.com

A huge thanks for your feedback, Jodi! My comments are below...

On 8/25/2010 6:13 PM, Jodi Schneider wrote:
> LOCAH is using Linked Data for archives, too. They've written about it here:
> http://blogs.ukoln.ac.uk/locah/2010/08/18/some-thoughts-on-architecture-and-workflows/

I have seen this and am eager to see how they are modeling MODS and EAD
in RDF.

> On Tue, Aug 24, 2010 at 7:56 PM, Aaron Rubinstein<rubin...@gmail.com> wrote:
>> Thought I'd try and spark some discussion on this list...
>>
>> At UMass, we're working on simple representations of our collections
>> in RDF. Our current plans are to add some basic RDFa to our HTML
>> finding aids. The RDF would look something like this:
>>
>> @prefix dcterms:<http://purl.org/dc/terms/>.
>> @prefix foaf:<http://xmlns.com/foaf/0.1/>.
>> @prefix pvn:<http://purl.org/archival/provenance/0.1#>.
>> @prefix rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
>>
>> <http://example.com/mums589#collection>
>> rdf:type pvn:Collection;
>> dcterms:title "W. C. Wheeler Scrapbook";
>> dcterms:extent "1 vol. (0.5 linear ft.)";
>> dcterms:abstract "A resident of Haydenville, Mass., during the
>> 1930s, C. H. Wheeler..." ;
>> dcterms:creator "Wheeler, C. H.";
>> pvn:heldBy<http://library.umass.edu/spcoll#archive> .
>
> This seems nice, but the location and date should be accessible rather
> than just in text strings.

I somehow forgot to add dcterms:created to my example. Still, this
would only appear as a string literal. Could you explain a little more
what you mean by accessibility? If you mean using URIs to represent
dates, I could see the benefit but there is just as much benefit, in my
opinion, having a literal date encoded in a standard format.
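
To make "a literal date encoded in a standard format" concrete, here is a minimal Python sketch that picks an XML Schema datatype from the shape of a date string. The function name `typed_date_literal` is hypothetical, time zones are ignored, and only the lexical shape is checked (not, say, whether the month is between 01 and 12), so treat it as an illustration rather than a validator.

```python
import re

# Lexical shapes for three XML Schema date types (time zones omitted for brevity).
# Order matters only for readability; the patterns are mutually exclusive.
_PATTERNS = [
    ("xsd:date", re.compile(r"^-?\d{4,}-\d{2}-\d{2}$")),
    ("xsd:gYearMonth", re.compile(r"^-?\d{4,}-\d{2}$")),
    ("xsd:gYear", re.compile(r"^-?\d{4,}$")),
]


def typed_date_literal(value):
    """Return a Turtle literal for a year, year-month, or full date string."""
    value = value.strip()
    for datatype, pattern in _PATTERNS:
        if pattern.match(value):
            return f'"{value}"^^{datatype}'
    # Anything else (e.g. "ca. 1930s") stays a plain, escaped string literal
    # rather than being forced into a datatype it does not fit.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

For example, `typed_date_literal("1803")` yields a literal typed as `xsd:gYear`, while a full `YYYY-MM-DD` string is typed as `xsd:date`.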

> Are there best practices around extent? (i.e. is there a way to encode
> it so that '1 vol.' is separate from '.5 linear ft.'

The extent element here is extracted from our EAD, which follows DACS's
rules for formatting. There is no reason why we couldn't use a
combination of integers and attributes to add more granularity to extent
in our EAD but there's been little practical reason to do so and no best
practices for how to do that. In RDF, I'm not sure of the best way to add
more granularity here. In fact, looking again at the dcterms ontology,
a range has appeared for dcterms:extent, which is the class
dcterms:SizeOrDuration. dcterms:SizeOrDuration does not appear as a
domain of any property in the ontology so I'm not entirely sure how
extent should be used.

>> Another possible strategy would be to host the RDF descriptions
>> separately from the finding aid and link to the static html finding
>> aid using foaf:page or equivalent.

> Content negotiation, maybe?

This would certainly be the ideal. For now, I think we're stuck adding
RDFa to the HTML finding aids, though we also have a collections catalog
in Wordpress that could have this RDFa and point to the full finding
aids. We don't have the flexibility to completely rethink how we
deliver collection information so we are trying to work with what we
have, though we are certainly working towards a more elegant system.
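
For reference, the content negotiation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_representation` and the list of supported media types are hypothetical, and a real deployment would more likely rely on the web server's own negotiation support.

```python
def choose_representation(accept_header):
    """Pick 'html' or 'rdf' for a collection URI from an HTTP Accept header."""
    # Media types this hypothetical server can produce, mapped to a representation.
    available = {
        "text/html": "html",
        "application/xhtml+xml": "html",
        "application/rdf+xml": "rdf",
        "text/turtle": "rdf",
    }
    # Default to the human-readable finding aid when nothing else matches.
    best, best_q = "html", 0.0
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # an absent q parameter means full preference
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        # Strict comparison: on a tie, the type listed first wins.
        if media_type in available and q > best_q:
            best, best_q = available[media_type], q
    return best
```

A browser-style header that leads with `text/html` gets the finding aid, while a client asking for `text/turtle` or `application/rdf+xml` gets the RDF description.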

>> The dcterms:creator could also
>> link to a resource like this: http://gslis.simmons.edu/archival/5d9920257771ee469c5edc0da779ab17#person.

> Yes, especially if that site could provide info for humans as well as
> machines. But fine even so; links will help humans.

The link I gave is to a linked data service that connegs to html or rdf.

Mark A. Matienzo

Sep 15, 2010, 12:36:14 AM

to semantic...@googlegroups.com

On Thu, Aug 26, 2010 at 12:22 PM, Aaron Rubinstein
<rubin...@gmail.com> wrote:
>> Are there best practices around extent? (i.e. is there a way to encode
>> it so that '1 vol.' is separate from '.5 linear ft.'
>
> The extent element here is extracted from our EAD, which follows DACS's
> rules for formatting. There is no reason why we couldn't use a combination
> of integers and attributes to add more granularity to extent in our EAD but
> there's been little practical reason to do so and no best practices for how
> to do that. In RDF, I'm not sure the best way to add more granularity here.
> In fact, looking again at the dcterms ontology, a range has appeared for
> dcterms:extent, which is the class dcterms:SizeOrDuration.
> dcterms:SizeOrDuration does not appear as a domain of any property in the
> ontology so I'm not entirely sure how extent should be used.

I'd like to follow up on this issue, and to make a more general point
about the future of EAD. As Aaron and others do this sort of work, it
would be useful for those of us involved in the EAD revision process
to hear where you are finding difficulty. While it might be a stretch
to create a more "RDF-friendly" version of EAD, at least in an initial
revision cycle, there is at least some interest in developing a
stricter subset or profile for data-oriented work.

Regarding the LOCAH folks, I've been trying to get them to join this
list, but I'm not sure if they have. :)

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library

Pete J

Sep 15, 2010, 5:20:57 AM

to semantic...@googlegroups.com

On 15 September 2010 05:36, Mark A. Matienzo <mark.m...@gmail.com> wrote:

> Regarding the LOCAH folks, I've been trying to get them to join this
> list, but I'm not sure if they have. :)

I have and I'm following this... I'm just trying to get something
written up at the moment, and will share it here as soon as I get it
done.

Pete

Pete J

Sep 28, 2010, 8:40:42 AM

to semantic...@googlegroups.com

OK, that took me rather longer to write up than I anticipated, but
I've just put a post on the LOCAH project blog outlining our approach
to modelling the Archives Hub EAD data:

http://blogs.ukoln.ac.uk/locah/2010/09/28/model-a-first-cut/

As I try to emphasise, it's very much "a first cut" at the problem,
and I'm sure we'll find lots of things need changing along the way,
but I hope it's a starting point.

Any thoughts are welcome!

Cheers

Pete
----
Pete Johnston

Tim Sherratt

Oct 4, 2010, 2:27:25 AM

to semantic...@googlegroups.com

All,

Just a belated comment on the dcterms:extent question as I'm thinking
about this myself in trying to model archival repositories for a
redevelopment of http://directory.archivists.org.au

According to my own limited understanding it seems that there are two
approaches in creating a more fine-grained representation of extent.
One would be to define a custom datatype (eg LinearMetre) and use that
to identify the nature of the literal value something like:

dcterms:extent "23.4"^^<http://www.example.org/datatype/LinearMetre>

The second approach would be to create a subproperty of dcterms:extent
(eg linearMetre, or numberOfItems).

These alternatives are described in _Linked Data Modelling Patterns_ -
http://patterns.dataincubator.org/book/custom-datatype.html

From the recommendation there, I think that the subproperty approach
would be preferred, and it does seem sort of neater (to human eyes at
least). But I was wondering if anyone else had been thinking about
this or could point to any such examples in regard to dcterms:extent.
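
To put the two approaches side by side, here is a small Python sketch that emits the Turtle for each. Both namespaces and both function names are hypothetical and exist only for illustration; neither is an existing vocabulary.

```python
# Hypothetical namespaces, used only for illustration.
DATATYPE_NS = "http://www.example.org/datatype/"
PROPERTY_NS = "http://www.example.org/vocab/"


def extent_with_datatype(value, unit):
    """Approach 1: keep dcterms:extent and carry the unit in a custom datatype."""
    return f'dcterms:extent "{value}"^^<{DATATYPE_NS}{unit}>'


def extent_with_subproperty(value, unit):
    """Approach 2: a unit-specific property that is a subproperty of dcterms:extent.

    Returns (declaration, usage): the declaration is a full triple that would
    live in the vocabulary; the usage is the predicate-object pair that would
    appear on each collection description.
    """
    # Property names conventionally start lower-case, e.g. linearMetre.
    prop = f"<{PROPERTY_NS}{unit[0].lower()}{unit[1:]}>"
    declaration = f"{prop} rdfs:subPropertyOf dcterms:extent ."
    usage = f'{prop} "{value}"^^xsd:decimal'
    return declaration, usage
```

With the subproperty approach the value stays a standard `xsd:decimal`, so generic tools can compare or sum it without knowing about a custom datatype, and a reasoner that understands `rdfs:subPropertyOf` can still treat it as a `dcterms:extent`.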

Cheers, Tim

--
Tim Sherratt (t...@discontents.com.au)
Words - http://www.discontents.com.au
Experiments - http://wraggelabs.com
@wragge on Twitter

Aaron Rubinstein

Dec 28, 2010, 2:07:33 PM

to semantic...@googlegroups.com

Finally having a chance to get back to this now that we are redoing
the presentation of our finding aids...

Thanks for all the great feedback on the last data I sent. I was able
to incorporate many of your suggestions, though I'm still struggling
with others.

Here's some sample RDF for our collections:

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix arch: <http://purl.org/archival/vocab/arch#>.
@prefix pvn: <http://purl.org/archival/provenance/0.1#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

<http://example.com/mums312#collection>
    rdf:type arch:Collection;
    dcterms:title "W. E. B. Du Bois Papers";
    dcterms:extent "380 boxes";
    dcterms:extent "165 linear feet";
    dcterms:abstract "Scholar, writer, editor of The Crisis and other journals..." ;
    dcterms:creator "Du Bois, W. E. B. (William Edward Burghardt), 1868-1963";
    arch:inclusiveStart "1803"^^xsd:gYear;
    arch:inclusiveEnd "1999"^^xsd:gYear;
    arch:bulkStart "1877"^^xsd:gYear;
    arch:bulkEnd "1963"^^xsd:gYear;
    dcterms:subject <http://id.loc.gov/authorities/label/African%20Americans--Civil%20rights>;
    dcterms:subject <http://id.loc.gov/authorities/label/African%20Americans--History>;
    pvn:heldBy <http://library.umass.edu/spcoll#archive> .

I've added inclusive and bulk dates by creating subproperties of
dcterms:created, and I've also added subjects. I'm a little
stuck with the extent, however. Breaking out each measurement
certainly helps, but the content is still not easily parsed by a
machine. Tim's suggestion to create custom datatypes or subproperties
of dcterms:extent makes a lot of sense but would mean a change in our
encoding best practices, which currently have us putting the
human-readable form (100 linear feet) in <extent>.
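
One way to keep the human-readable form in <extent> and still get machine-readable values would be to split the string at conversion time. Below is a rough Python sketch of that idea; the function name `parse_extent` and the regular expression are assumptions for illustration, and it only handles simple statements of the form "number unit".

```python
import re

# One quantity followed by a unit phrase, e.g. "380 boxes" or "0.5 linear ft."
# The unit must be at least two characters and end in a letter or a period.
_EXTENT = re.compile(r"(\d+(?:\.\d+)?|\.\d+)\s+([A-Za-z][A-Za-z. ]*[A-Za-z.])")


def parse_extent(statement):
    """Split a simple extent string into a list of (quantity, unit) pairs."""
    return [(float(number), unit.strip())
            for number, unit in _EXTENT.findall(statement)]
```

For instance, "1 vol. (0.5 linear ft.)" comes back as one pair for the volume count and one for the linear feet. Statements that join measurements with words such as "and" would need a more careful pattern.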

A couple other things...

I've revised the vocabulary that supports this data as well as the
UMass archival names service at http://gslis.simmons.edu/archival.
You can find the revised vocabulary here:
http://purl.org/archival/vocab/arch.

At this point, the RDF above will be embedded in the HTML version of
the finding aid as RDFa. The model, then, is that when you request
the URI for the collection http://example.com/coll1#collection, you
get the EAD/HTML representation as well as the RDF representation.

Of course, any and all feedback from the LOD/SW/Archives heads out
there would be very much appreciated.