[Dspace-tech] Nested Metadata

Peter Dietz

Aug 26, 2015, 2:47:59 PM

to dspac...@lists.sourceforge.net

Hi All,

Has anyone stored nested / rich metadata in DSpace?

An example I'm thinking of is for storing richer amounts of metadata for an object. For example:

Author
    first-name: Peter
    last-name: Dietz
    name-as-it-appears: Peter Dietz
    institution: Longsight
    date-of-birth: ...
    ...

Author
    first-name: Sam
    last-name: Ottenhoff
    ...

The Authority Control system of DSpace looks like it approaches this, but the documentation isn't clear, and I'm not sure if it requires that your data values reside in some Library of Congress registry.

The hack-job I have in mind would be to serialize the information to JSON and then store that in a metadata field.

So.

schema.author.serialized = {"first-name": "Peter", "last-name": "Dietz", "name-as-it-appears": "Peter Dietz", "institution": "Longsight", ... }
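A minimal sketch of that serialization round trip, in Python for brevity rather than DSpace's own Java. The field name schema.author.serialized comes from the example above and is not a registered DSpace field; the helper names and the dict standing in for the metadata store are hypothetical.

```python
import json

# Hypothetical field name taken from the example above; not a registered DSpace field.
FIELD = "schema.author.serialized"


def serialize_author(author: dict) -> str:
    """Serialize one author record to a key-sorted JSON string."""
    return json.dumps(author, sort_keys=True, ensure_ascii=False)


def deserialize_author(value: str) -> dict:
    """Parse a stored value back into a dict; raise ValueError if it is not a JSON object."""
    parsed = json.loads(value)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


author = {
    "first-name": "Peter",
    "last-name": "Dietz",
    "name-as-it-appears": "Peter Dietz",
    "institution": "Longsight",
}

# A flat metadata store only ever sees a (field, text) pair.
stored = {FIELD: serialize_author(author)}
restored = deserialize_author(stored[FIELD])
```

The store itself never looks inside the string, so anything that wants to search or display the individual keys has to parse the value first.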

However, I'm tempted to think that DSpace should either have the ability to plug into any registry (hopefully there are registries you can populate and maintain with your own local data), or to extend DSpace's metadata data model to support nested/rich data.

Thoughts?

________________
Peter Dietz
Longsight
www.longsight.com
pe...@longsight.com
p: 740-599-5005 x809

Mark Diggory

Aug 26, 2015, 2:48:00 PM

to Peter Dietz, dspac...@lists.sourceforge.net

Peter,

Just some brief feedback: this sounds very much like a DCMI Encoding Scheme, the goal of which is to express the structure and/or source of the value of a DC metadata field without needing to create extensive nesting.

http://wiki.dublincore.org/index.php/Glossary/Encoding_Scheme

Cheers,

Mark

_______________________________________________
DSpace-tech mailing list
DSpac...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

--

Mark Diggory
2888 Loker Avenue East, Suite 315, Carlsbad, CA. 92010
Esperantolaan 4, Heverlee 3001, Belgium
http://www.atmire.com

emilio lorenzo

Aug 26, 2015, 2:48:01 PM

to dspac...@lists.sourceforge.net

Hi Peter,
The new Solr core for authority indexes supports that kind of JSON structure, although it has to be linked, via an authority key, with a metadata field...

We are using it for storing name variations, author IDs, bios and some other fields.

Best of luck,

Emilio

Andrea Schweer

Aug 26, 2015, 2:48:05 PM

to Peter Dietz, dspac...@lists.sourceforge.net

Hi,

On 30/07/15 08:06, Peter Dietz wrote:
> Has anyone stored nested / rich metadata in DSpace?

In a freshwater quality data repository I've helped develop
(http://lernzdb.its.waikato.ac.nz), we're storing rich metadata in an
XML bitstream. This bitstream is considered the authoritative source for
the metadata. The DSpace item metadata fields are just used as a vehicle
for transporting the metadata into discovery and the DSpace item page.
Some of the information on the item page is pulled straight from the XML
file (the repository uses XMLUI and the XSL just makes a document() call
to read the XML metadata just like it makes a document() call to read
the standard mets.xml file). It works pretty well; you just need to make
sure you keep everything in sync (we use a bunch of curation tasks for
this).

cheers,
Andrea

--
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand

Mark H. Wood

Aug 26, 2015, 2:48:15 PM

to dspac...@lists.sourceforge.net

On Wed, Jul 29, 2015 at 04:06:19PM -0400, Peter Dietz wrote:
> Has anyone stored nested / rich metadata in DSpace?
>
> An example I'm thinking of is for storing richer amounts of metadata for an
> object. For example:
>
> - Author
>   - first-name: Peter
>   - last-name: Dietz
>   - name-as-it-appears: Peter Dietz
>   - institution: Longsight
>   - date-of-birth: ...
>   - ...
> - Author
>   - first-name: Sam
>   - last-name: Ottenhoff
>   - ...
>
> The Authority Control system of DSpace looks like it approaches this, but
> the documentation isn't clear, and I'm not sure if it requires that your
> data values reside in some Library of Congress registry.

You can create other authority providers. (The documentation is
indeed sketchy. The code is in
dspace-api:org.dspace.content.authority. Sadly there is no
package-level documentation to help us understand how the package is
organized.)

> The hack-job I have in mind would be to serialize the information... to
> json... and then store that into a metadata field.
>
> So.
> schema.author.serialized = {first-name: "Peter", last-name: "Dietz",
> "name-as-it-appears" : "Peter Dietz", "institution": "Longsight", ... }
>
> However, I'm tempted to think that DSpace should either have the ability to
> plug into any registry (hopefully there are registries you can populate and
> maintain with your own local data), or to extend DSpace's metadata data
> model to support nested/rich data.

DSpace already has infrastructure sufficient to represent the above.
We just don't define:

somenamespace.person.givenname
somenamespace.person.surname
somenamespace.person.preferred
somenamespace.person.affiliation
somenamespace.person.dob

That part is easy to fix. The hard part is that DSpace treats author
names as immediate strings rather than identifiers for related
"person" objects. Fixing that will take a bit of work. It ties in
with existing and ongoing work to integrate ORCID, too.
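One way to see why flat fields alone are not quite enough, sketched in Python. "somenamespace" is the placeholder schema name used above, not a real registered schema, and the mapping from Peter's key names to those fields is an assumption made for illustration.

```python
from collections import defaultdict

# Assumed mapping from the nested keys to the flat placeholder fields listed above.
FIELDS = {
    "first-name": "somenamespace.person.givenname",
    "last-name": "somenamespace.person.surname",
    "name-as-it-appears": "somenamespace.person.preferred",
    "institution": "somenamespace.person.affiliation",
    "date-of-birth": "somenamespace.person.dob",
}


def flatten(people):
    """Turn a list of person dicts into flat (field, value) pairs, in order."""
    rows = []
    for person in people:
        for key, field in FIELDS.items():
            if key in person:
                rows.append((field, person[key]))
    return rows


def values_by_field(rows):
    """Group values per field, which is all a flat field/value model can offer."""
    grouped = defaultdict(list)
    for field, value in rows:
        grouped[field].append(value)
    return dict(grouped)


people = [
    {"first-name": "Peter", "last-name": "Dietz", "institution": "Longsight"},
    {"first-name": "Sam", "last-name": "Ottenhoff"},
]
grouped = values_by_field(flatten(people))
```

After flattening, the affiliation field holds a single value while the name fields hold two, and nothing records which person the affiliation belongs to. Keeping that grouping is what a related "person" object would provide.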

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

Peter Dietz

Aug 26, 2015, 2:49:11 PM

to dspac...@lists.sourceforge.net

Thanks for all the responses and insight thus far.

My thinking about the problem space of nested metadata is that it is trying to solve the problem of metadata about metadata about an item.

Item/1234 => Author => [Peter Dietz, Developer, Longsight, Brown eyes, ...]

What we've been doing is treating all metadata values as one type, and that's text. We don't care what it is, we don't validate it, we just store the text value of whatever you've given us. One problem we run into is with date fields: code tries to parse the text, and it's not a valid/parseable date. Date => Unknown, Date => 2015/04, Date => 1960's, Date => 08/04/2015. We could/should validate things stored in date-type metadata fields as ISO 8601.
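A rough sketch of what such a check could look like, in Python. Accepting the reduced-precision forms YYYY and YYYY-MM alongside YYYY-MM-DD is an assumption made here; a repository could just as well decide to insist on full dates.

```python
import re
from datetime import date

# Accept the ISO 8601 calendar forms YYYY, YYYY-MM and YYYY-MM-DD.
# Which precisions to allow is an assumption made for this sketch.
_ISO_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_iso8601_date(text: str) -> bool:
    """Return True if text is a well-formed, real calendar date in one of the accepted forms."""
    match = _ISO_DATE.fullmatch(text)
    if match is None:
        return False
    # Missing month or day defaults to 1 so the calendar check below still works.
    year, month, day = (int(part) if part else 1 for part in match.groups())
    try:
        date(year, month, day)  # rejects month 13, February 30, year 0, and so on
    except ValueError:
        return False
    return True


samples = ["Unknown", "2015/04", "1960's", "08/04/2015", "2015-04", "2015-08-04"]
results = {s: is_iso8601_date(s) for s in samples}
```

Of the samples, only 2015-04 and 2015-08-04 pass. Values such as "Unknown" or "1960's" would need a separate free-text field or an agreed convention if they are to be kept at all.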

So, on to storing other types of information. When a metadata field is backed by some authority control system, we store a foreign key / reference, and then Solr stores an encoding of the metadata we fetch from the metadata service provider. In the case of the ORCID integration it can grab:

givenNames, familyName, creditName, otherNames, country, keyword, external_identifier, researcher_url, biography

So we have a form of a schema for storing this "object" inside of a metadata value. Our current metadata system is basically a key/value store: the key is a metadata field (e.g. dc.title), and the value is unspecified, but usually just text. Could we validate that we have a type called _nested_orcid_author, which has to be JSON and only contain the above fields? That looks like an object, an OrcidAuthor object. We'd need a schema to enforce that. But then we're building tables and classes for that field. Maybe some type of key/key/value store would be appropriate?

dc.author => {{ _nested_metadata_object }}

Then a NestedMetadataObject can have keys (metadata_field_id) and values (unspecified text).

So.

NestedMetadataValues nmValues = item.getMetadata("dc.author");

NestedMetadataValue nmValue0 = nmValues[0];

nmValue0.getMetadata("dc.author.firstname") ==> "Peter"
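To make the key/key/value idea concrete, here is a small in-memory sketch in Python. Nothing here is DSpace API: the Item and NestedMetadataValue classes, their method names, and the dc.author.lastname field are hypothetical stand-ins for the pseudocode above.

```python
class NestedMetadataValue:
    """One nested value: a small map from inner field name to unvalidated text."""

    def __init__(self, fields):
        self._fields = dict(fields)

    def get_metadata(self, field):
        """Return the text stored under an inner key, or None if it is absent."""
        return self._fields.get(field)


class Item:
    """Hypothetical item whose metadata maps an outer key to a list of nested values."""

    def __init__(self):
        self._metadata = {}

    def add_metadata(self, field, fields):
        self._metadata.setdefault(field, []).append(NestedMetadataValue(fields))

    def get_metadata(self, field):
        """Return all nested values for an outer key, in insertion order."""
        return list(self._metadata.get(field, []))


item = Item()
item.add_metadata(
    "dc.author",
    {"dc.author.firstname": "Peter", "dc.author.lastname": "Dietz"},
)
item.add_metadata(
    "dc.author",
    {"dc.author.firstname": "Sam", "dc.author.lastname": "Ottenhoff"},
)

nm_values = item.get_metadata("dc.author")
first_name = nm_values[0].get_metadata("dc.author.firstname")
```

Each author keeps its own inner keys together, so the grouping survives, while the inner values remain plain text with no per-type tables or classes.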

Is that the approach to take, or is it best to stick with the authority framework and build some type of MetadataAuthorityProvider for each "rich" / "nested" metadata object? But if I have 10 fields that each need a metadata authority backing store, and there is no Library of Congress metadata service provider for each, do you need to construct your own metadata silos? Could you build a single external metadata service provider system that could be integrated with DSpace and be mapped to 10 different fields? Author (firstname, lastname, institution), Review (# of stars, title, description), Link (link name, url), ScientificClassification (Kingdom, Phylum, Class, Order, Suborder, Family, Genus, Species), ...

For reference, I've stored some items with the value for author serialized as JSON.

https://trydspace.longsight.com/handle/123456789/175

https://trydspace.longsight.com/rest/handle/123456789/175?expand=all

"metadata": [
  {
    "key": "dc.contributor.author",
    "value": "{firstname:Mary Davis, lastname:MacNaughton, role:Editor}",
    "language": ""
  },
  {
    "key": "dc.contributor.author",
    "value": "{firstname:Michael, lastname:Duncan, role:Contributor}",
    "language": ""
  },

http://dspace-rest-client-play.herokuapp.com/item/202

http://dspace-rails.herokuapp.com/item/202

Or is flat metadata really best? Do you really need DSpace to store metadata about metadata (e.g. Author.eye-color), or is storing "Dietz, Peter" sufficient? Or is that just our current limitation?

________________
Peter Dietz
Longsight
www.longsight.com
pe...@longsight.com
p: 740-599-5005 x809

Graham Triggs

Aug 26, 2015, 2:49:13 PM

to Peter Dietz, dspac...@lists.sourceforge.net

Hi Peter,

I think you may not be too far off with that approach.

However, one key thing gets missed: your example has dc.contributor.author capturing rich metadata, which we shouldn't be doing. Dublin Core is meant to be simple, and simple data is what we should be capturing in the "dc" schema.

Now, we can register any other schemas we like, and the fields will store strings opaquely. We can throw any format we want to define into there. For instance, it's feasible to have a "mods" schema, and in it:

mods.titleinfo =
<titleInfo><title>At Gettysburg, or, What a Girl Saw and Heard of the Battle: A True Narrative</title></titleInfo>

The problem then is that there isn't anything built in for handling the display or data entry of that rich field definition, but then that would basically be true of any arbitrary rich structure.
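As a rough illustration of the handling that would have to be added for display, here is a Python sketch that pulls a plain title out of the stored fragment. The mods.titleinfo field is the hypothetical one from the example above, the dict stands in for the metadata store, and the fragment is assumed to carry no XML namespace, exactly as written there.

```python
import xml.etree.ElementTree as ET

# The metadata store only sees an opaque string under the hypothetical mods.titleinfo field.
stored = {
    "mods.titleinfo": (
        "<titleInfo><title>At Gettysburg, or, What a Girl Saw and Heard of the "
        "Battle: A True Narrative</title></titleInfo>"
    )
}


def display_title(value: str) -> str:
    """Extract the text of the <title> child from a stored titleInfo fragment."""
    root = ET.fromstring(value)
    title = root.findtext("title")
    if title is None:
        raise ValueError("no <title> element in titleInfo")
    return title.strip()


title = display_title(stored["mods.titleinfo"])
```

Every rich format stored this way needs its own small reader like this one, plus a matching data-entry form, which is the missing built-in support described above.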

G
