Converting Digital NZ data to Linked Open Data


Conal Tuohy

Mar 21, 2016, 8:04:00 AM
to DigitalNZ
Kia ora tatou!

I'm looking into the idea of converting the DigitalNZ dataset into Linked Open Data (and maybe Linked Data Fragments), by building on top of the Digital NZ APIs.

I've used both the main "Search Records" and "Get Metadata" APIs (for access to metadata about digital objects) and I've also had a little tutu with the "Concepts API". I'd like to build a layer on top of all of them that unifies them into a single space of interlinked data, but there are a few challenges.

So I'd be very grateful for whatever comments, thoughts, and suggestions list-members can provide, whether about the specific issues below, or more generally about how the Digital NZ APIs are being used and connected, and where other developers have found fishhooks and challenges. And I'd also like to know what other developers think of the idea of a Digital NZ LOD service in general, and if you'd like to be involved. I am considering the option of building something to demo at the National Digital Forum at the end of the year, but that's still just an idea.

There are two possible approaches to building such a LOD service:

One way is to harvest all the data from the APIs, convert it into RDF as required, aggregate all the data in a SPARQL store, and publish the SPARQL store as LOD. This last step would be quite straightforward to do with existing tools, because you would have all the data available in a big SPARQL store and could run whatever queries you needed to satisfy requests for Linked Data. But it does involve harvesting and converting a great big dataset as step 1, which is the reason I'd prefer to take an alternative approach:

The alternative is to build a lightweight "on the fly" converter: a proxy service which sits on top of Digital NZ's APIs, and when it receives a request for some Linked Data, makes one or more Digital NZ API requests, processes the results to LOD form, and returns them. I have done this before with the web API of Museum Victoria, as an experiment, and it worked fairly well, though the pattern has its own limitations and disadvantages. For background, here are a few posts on my blog about that experiment:
Using a "proxy" approach rather than an aggregator, the generated LOD is more constrained by the capability of the APIs. Realistically, only information that can be gleaned from the API by making a small number of API calls can be used to generate the LOD responses (whereas, by contrast, the aggregator approach already has ALL the data available locally when it is responding to a request, so it can run arbitrary queries and answer arbitrary questions).

To give an example of that constraint: the API response for a given record provides a list of URIs of related pages on Digital NZ partner (contributor) websites. But it seems not to be possible to go the other way, and locate a particular Digital NZ record from the URI of a related partner page, presumably because the Digital NZ records aren't indexed by those partner URIs.

Similarly, the "Concepts API" response for a particular concept provides identifiers of related records in the main API, but those main API records don't appear to point back to the concept records.

These limitations do pose serious challenges for implementing a LOD interface on top of those APIs using the "proxy" pattern. There are some issues with using the "aggregation" pattern too.

The first problem is just dealing with the scale of harvesting, with a legal complication that the terms and conditions seem to forbid keeping long-lived caches (though the T&Cs are explicitly justified by a desire not to serve stale data, even though there are "last-updated" timestamps that could help to manage those caches without requiring complete reharvests).

Another thing which is a bit of a concern for me is the stability of the Concepts API. It appears to be a kind of beta, and in some ways it's quite unlike the other APIs (for instance, in only supporting JSON-LD). Given that difference, I'm a bit wary that it may be "just a demo" or outside the "main trunk" of development. Is there any information on its development status? A blog post on the Digital NZ site hints at a new (more "linked-data" style) API <http://www.digitalnz.org/blog/posts/introducing-the-digitalnz-concepts-api>. Is there a development roadmap available to the public anywhere? Or an issue tracker or anything like that?

Where I'm going with this idea is that I'd like to build on the wonderful aggregation work that Digital NZ has done, and on the work of the GLAM institutions involved, and to present that data in a format (LOD) that's more easily usable in a generic way, and in a way that interconnects with the wider web of data. I think there's a whole lot of work already done, but that the LOD layer is still missing. I think the generic nature of LOD opens up a number of possibilities for end users, such as generic JavaScript libraries and widgets for LOD that allow the data to be reused in a much larger variety of ways.

Your thoughts?

Cheers!

Con




Stuart A. Yeates

Mar 21, 2016, 2:45:37 PM
to digi...@googlegroups.com
I believe that there is a subset of DigitalNZ sources with metadata available under a more permissive license. Building a tool on only those sources might motivate the subset to become larger.

cheers
stuart

--
...let us be heard from red core to black sky


Tim McNamara

Mar 22, 2016, 12:41:19 AM
to digi...@googlegroups.com

Conal,

I recommend that you focus on the proxy model for now. From memory, the terms of use don't allow caching of responses.

Once the proof of concept is built and functional, you can then work with the DNZ team to go back to content providers and ask for a more liberal licence.

Being part of the LOD cloud would be pretty neat! Good luck.

Conal Tuohy

Mar 22, 2016, 1:52:30 AM
to digi...@googlegroups.com
Thanks for the reminder about the relatively open "commercial use" dataset, Stuart!

The Terms and Conditions for that subset do allow for indefinite caching (whereas the T&Cs for the rest of the data require caching to last no more than 30 days). It seems like only about 2% of records currently have that licence (691,102 records out of the full dataset of 29,748,364), though I take your point that providers might be inclined to liberalize their licensing if they thought there was some value in doing so.

It's made me realize, too, that a Linked Data service based on the full Digital NZ dataset would not in fact be very open: almost all of it would have non-commercial licences which would be a barrier to reuse. I tend to think it would still be worth doing, though.

Conal Tuohy

Mar 22, 2016, 2:26:25 AM
to digi...@googlegroups.com
The legal restrictions on caching are given here:

In general you are not discouraged from caching at all, but you are allowed to cache data for up to 30 days if you have a good technical reason.
http://digitalnz.org/about/terms-of-use/developer-api-terms-of-use#cache

The 2% of records with an open licence are permitted to be cached as you see fit:
http://digitalnz.org/about/terms-of-use/developer-api-terms-of-use#commercial_use_terms

I'm actually inclined NOT to pursue the proxy model (even though it sidesteps the legal issue around caching), because the resulting Linked Data service would be much less useful without bidirectional links. For instance, you could find all web pages about a particular concept, but you couldn't then find all the concepts that relate to that web page; the links between concepts and web pages couldn't be EXPLORED. The same applies to sets; you can list all the items in a set, but you can't find all the sets which include a particular item.

To be able to render these links in both directions would require a preliminary step of harvesting them into a SPARQL graph store. There are 29M records (almost all of them newspaper articles), so to stay within the 30-day caching window you'd need to harvest about a million records a day. You can retrieve 100 records per API call, and the API lets you make a maximum of 10k API calls a day, so it is actually feasible, though only just.
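
For what it's worth, the back-of-the-envelope arithmetic looks like this (the numbers are the ones above):

# Rough feasibility check for a full harvest within the 30-day cache window
total_records = 29_748_364   # size of the full dataset, as above
per_call      = 100          # records returned per API call
calls_per_day = 10_000       # daily limit on API calls

records_per_day = per_call * calls_per_day            # 1,000,000
days_for_full_pass = total_records / records_per_day  # ~29.7
print(f"{records_per_day:,} records/day -> about {days_for_full_pass:.0f} days per full pass")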

Another option would be a hybrid, where only the core "Get Metadata" API is used in real time. The results of those API calls would then be joined with local SPARQL query results, over data harvested from the other APIs (Sets and Concepts).
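
Roughly, the hybrid might look like this; the local endpoint, graph layout and URI scheme below are all made up for the sake of the sketch:

import requests
from SPARQLWrapper import SPARQLWrapper, JSON

LOCAL_ENDPOINT = "http://localhost:3030/digitalnz/query"  # hypothetical local Fuseki store
RECORD_BASE = "http://example.org/digitalnz/record/"      # hypothetical URI scheme for records

def harvested_links(record_id):
    # Incoming links harvested earlier from the Concepts and Sets APIs, looked up locally
    sparql = SPARQLWrapper(LOCAL_ENDPOINT)
    sparql.setQuery(f"SELECT ?linked WHERE {{ ?linked ?p <{RECORD_BASE}{record_id}> }}")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [row["linked"]["value"] for row in rows]

def describe(record_id, api_key):
    # The record itself is fetched in real time from the Get Metadata API...
    live = requests.get(
        f"http://api.digitalnz.org/records/{record_id}.json",
        params={"api_key": api_key},
    ).json().get("record", {})
    # ...and joined with the pre-harvested concept/set links before converting to RDF
    return {"record": live, "linked_resources": harvested_links(record_id)}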

I'm curious whether anyone else has harvested in bulk from the Digital NZ API and can comment on the experience.

I have used it to harvest newspapers; I've successfully harvested the complete Manawatu Herald (the newspaper of my home town, Foxton), which ran to about 80k articles and took about an hour and a half.

Con

Conal Tuohy

Mar 22, 2016, 2:27:10 AM
to digi...@googlegroups.com

On 22 March 2016 at 16:26, Conal Tuohy <conal...@gmail.com> wrote:
In general you are not discouraged from caching at all, but you are allowed to cache data for up to 30 days if you have a good technical reason.

Whoops I meant to say you ARE discouraged from caching at all, but ... etc.

Andy Neale

Mar 22, 2016, 5:07:30 PM
to digi...@googlegroups.com

Hi Conal,

 

Thanks for proposing this as it sounds really interesting.  A couple of thoughts from my side:

 

· I can confirm that the terms and conditions do allow for caching up to 30 days to support localised applications.

· The intent behind the caching is to support discrete localised experiences, as opposed to supporting the complete download of the DigitalNZ-held data. That's not to say that a full download of the corpus could not be valuable, simply that the systems and agreements with partners are not currently designed with that in mind.

· The reason for the caching limit is that this is a real-time system with data changing constantly, and our partners need assurance that the changes they make will flow through (that modifications, additions, and even complete deletion of records will be respected).

· If you are wanting to pursue this, I'd recommend you start with a smaller prototype over a discrete subset of the data, and we'd certainly be keen to see the results.

 

As an aside, we've looked at the SPARQL options previously and came to the conclusion that the metadata quality may not be good enough yet to make this a really valuable service en masse. Certainly our existing infrastructure couldn't cope with heavy SPARQL queries, which is why we haven't gone down that route. The Concepts API is our first foray into the linking of data and there will be more to come on this. It's certainly something we continue to work on.

 

You’re right that the concepts themselves don’t point to related records, but it works the other way: you can query the records API for a matching concept. More documentation is here:

 

http://digitalnz.github.io/supplejack/api_usage/concepts-api.html

 

If there are specific questions on this that we can help with, just give me a bell.

 

Andy

Conal Tuohy

Mar 22, 2016, 10:55:27 PM
to DigitalNZ


On Wednesday, 23 March 2016 07:07:30 UTC+10, Andy Neale wrote:

Hi Conal,

 

Thanks for proposing this as it sounds really interesting.  A couple of thoughts from my side:

 

· I can confirm that the terms and conditions do allow for caching up to 30 days to support localised applications.

· The intent behind the caching is to support discrete localised experiences, as opposed to supporting the complete download of the DigitalNZ-held data. That's not to say that a full download of the corpus could not be valuable, simply that the systems and agreements with partners are not currently designed with that in mind.


Yes I totally agree. The Digital NZ API is optimized for discovery, rather than bulk transfer, though it can be used in that way.

Incidentally, I've seen some very interesting topic modelling, geotagging and other text-mining work done on newspaper corpora; almost always this work has been done by academic researchers with privileged access to the corpus (i.e. not using the regular public access methods). Ben Adams' work (at Auckland Uni) on the Papers Past corpus is (I understand) an exception in that he harvested it using Digital NZ's API, but it's a tedious process for researchers to go through, especially if they are not big on network protocols and more interested in Bayesian statistics ;-). My feeling is that record-based web APIs (which are more about discovery) are not really the most suitable for this, and that data publishers themselves should be stepping up to offer alternative methods for bulk access. In particular, RSS feeds of torrents (as supported by various torrent apps) are a good way to share large datasets, producing high bandwidth overall while minimizing load on the central infrastructure.
 

As an aside, we've looked at the SPARQL options previously and came to the conclusion that the metadata quality may not be good enough yet to make this a really valuable service en masse.


I'm not sure I understand what you mean by that ... could you elaborate?
 

Certainly our existing infrastructure couldn’t cope with heavy SPARQL queries which is why we haven’t gone down that route.


I think if you were going to use SPARQL to publish LOD you would probably want to serialize all the resource descriptions in advance, or at least cache them thoroughly, rather than have your LOD requests fulfilled by on-the-fly SPARQL queries. To me the value of using a SPARQL store is that it provides a mechanism to generate resource descriptions which comprehensively represent their subject, avoiding the situation where record A links to record B but not vice versa (as if the relationship between A and B were a property of A, rather than a joint property).
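
To sketch what I mean (the local endpoint name is made up; the query is just a plain CONSTRUCT over both directions):

from SPARQLWrapper import SPARQLWrapper, RDFXML

def describe_resource(uri, endpoint="http://localhost:3030/digitalnz/query"):
    # Gather both outgoing and incoming triples, so that a link from A to B
    # shows up in B's description as well as in A's
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        CONSTRUCT {{ <{uri}> ?p ?o . ?s ?q <{uri}> . }}
        WHERE {{ {{ <{uri}> ?p ?o }} UNION {{ ?s ?q <{uri}> }} }}
    """)
    sparql.setReturnFormat(RDFXML)
    graph = sparql.query().convert()          # an rdflib Graph
    return graph.serialize(format="turtle")   # serialise ahead of time, or cache aggressively

Each resource description could then be written out (or cached) ahead of time, so that LOD requests never hit the store directly.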

 

The concepts API is our first foray into the linking of data and there will be more to come on this. It’s certainly something we continue to work on.

 

You’re right that the concepts themselves don’t point to related records, but it works the other way, you can query the records API for a matching concept.


That's not what I meant, actually. I understand that the "Concepts API" can be used to access descriptions of concepts, and also (via the "records" part of the Concepts API) to access lists of related records.

But what appears to be missing is a way to go from a record to a concept. Unless I've missed something, the links between records and concepts are only visible from concepts. If I have the identifier of a record, there seems to be no way to find the concepts that apply to that record (apart from downloading ALL the concepts and searching through them). Have I missed something?

cheers

Conal



 

Andy Neale

Mar 23, 2016, 8:32:15 PM
to DigitalNZ

Hi Conal,

 

A couple of quick thoughts:

 

· Yes, I definitely agree that publishing bulk datasets would be better, and that the current APIs were not designed with that in mind. There is a technical component to this, but keep in mind also that DNZ is an aggregation service, so we need to respect the copyright status of the data and the rights that owners provide… even while we advocate for more permissive uses that would allow bulk download.

· If a record has any concepts attached, you can get to that data in the concept_ids field, e.g. http://api.digitalnz.org/records/35111213.json?api_key=your-key

· I think I'll refrain from getting further into the merits and practicality of SPARQL :-) but perhaps we should have that discussion over a drink at the National Digital Forum!

Conal Tuohy

Mar 23, 2016, 10:07:30 PM
to digi...@googlegroups.com

On 24 March 2016 at 10:32, Andy Neale <Andy....@dia.govt.nz> wrote:
· If a record has any concepts attached, you can get to that data in the concept_ids field, e.g. http://api.digitalnz.org/records/35111213.json?api_key=your-key

Aha! I *had* missed something. NB the existence of that field in the API is not documented anywhere. It should be here, at least: http://www.digitalnz.org/developers/api-docs-v3/search-records-api-v3
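
For anyone else who goes looking for it, reading that field is straightforward; the exact JSON shape ("record" wrapping "concept_ids") is an assumption on my part:

import requests

resp = requests.get(
    "http://api.digitalnz.org/records/35111213.json",
    params={"api_key": "your-key"},  # placeholder key
)
resp.raise_for_status()
record = resp.json().get("record", {})
print(record.get("concept_ids", []))  # concept identifiers, usable against the Concepts API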

By the way, is there any functionally analogous method for Sets? i.e. is it possible to find out which public Sets an item belongs to?

Finally, what can be said about the semantics of the relationship between an information resource and its related concepts? From the Ralph Hotere example, I can see that Hotere was an artist associated with a record describing an art exhibition catalogue (http://natlib.govt.nz/records/22362113), which you could argue is a kind of "aboutness" relation (i.e. the catalogue is to some extent *about* the artist), and in the other cases I've seen it seems similar, but I'm not sure if that's a rule. Might there be some other kind of relationship, e.g. "hasAuthor"? How is the relationship generated? I had assumed that names (of authors or subjects) were being matched against an authority file, but in the case of the exhibition catalogue record I don't see Hotere's name appearing there at all, so now I'm thinking that's probably not the case.

I can't check this at the moment, because the API has started giving me an internal server error.

Andy Neale

Mar 28, 2016, 5:14:08 PM
to digi...@googlegroups.com

Hi again!

 

No, we don’t have anything analogous for Sets… but it’s a nice idea!

 

And we haven’t yet taken the step to identify the types of relationships. The basic approach from the beta has been to simply accept the relationships that have been determined by our contributing content partners, i.e. if a National Library item record is connected to a National Library authority file, then that’s what we record. So if you are not seeing an assertion, it means it has not yet been connected by the content partner. We’re just scratching the surface in terms of possible relationships, and it is a very simple start.

 

A.

 


Conal Tuohy

Mar 28, 2016, 7:11:23 PM
to DigitalNZ


On Tuesday, March 29, 2016 at 7:14:08 AM UTC+10, Andy Neale wrote:

Hi again!

 

No, we don’t have anything analogous for Sets… but it’s a nice idea!

 

And we haven’t yet taken the step to identify the types of relationships. The basic approach from the beta has been to simply accept the relationships that have been determined by our contributing content partners, i.e. if a National Library item record is connected to a National Library authority file, then that’s what we record.


OK, thanks. If it's possible in future, it would be good if, for any given content partner, or on a record-by-record basis, you could also record the *type* of the relationship.