Call for Te Papa API focus group


Douglas Campbell

Apr 21, 2016, 11:56:53 PM
to DigitalNZ
Hi all,
 
Adrian Kingston and I are in the early stages of developing a public API for Te Papa's Collections Online. We'll be taking an iterative approach, aiming for an MVP that feeds Collections Online initially, then hopefully opening up wider. We're aiming to have an internal beta ready by mid-2016 and some kind of public version in 2017.
 
The data we expect to make available via the API will come from the Collections Online website - http://collections.tepapa.govt.nz/ :
  • Metadata about Te Papa collection items, and related topics/stories
  • Authority data - names, places...
  • Links to collection media, including licensing details (images, video, etc.)
We'd be interested to hear from devs who might be interested in using this kind of API in the future.

If you'd like to help, here are some initial questions to help our project scoping:


1. Usage - What kinds of things would you use the API to build? e.g. links from your own museum's holdings to related Te Papa items, 'big data' processing of images.

2. Environments - What technical environment(s) might you use? e.g. real-time Javascript calls from a web page, deep analysis using Python on local machine.

3. Data formats - What is your preferred metadata format and why? e.g. JSON, XML, Turtle, RDF/XML, JSON-LD.

4. Any other thoughts/requests?

You can reply to the list or email me directly: douglas....@tepapa.govt.nz

As we progress through the development we may well have more questions and hope you'll be able to help!

Cheers,
Douglas Campbell
Te Papa Digital

Jonathan Hunt

Apr 26, 2016, 12:51:51 AM
to digi...@googlegroups.com
Hi Douglas

It's great to hear of this Te Papa initiative.

I'm not sure what I might build; that might be suggested by getting to know the data and how it fits together or can be matched with other datasets. I'm particularly interested in Person & Place authority data.

I've been using the Auckland Museum API recently; JSON & JSON-LD are easy to work with, though a choice of formats is desirable.

It's really important to have good documentation, preferably with examples of API use and responses. Swagger.io (now https://openapis.org/ ) could be useful here. The AM API docs don't go into much depth on searching, for example, simply referring you to Elasticsearch.

You may be across this already but API versioning should be considered from the start.

Regards
Jonathan

Jonathan Hunt
http://huntdesign.co.nz
+64 21 529 250
PO Box 1062, Christchurch 8140, New Zealand

Douglas Campbell

Apr 26, 2016, 1:26:37 AM
to DigitalNZ
Hi Jonathan,

Great & thanks, you've raised some useful points. Coincidentally I've spent the last week in an email conversation with an API provider due to gaps in their documentation, so I feel your pain :) I'll try to take that on board.

Cheers,
Douglas

Stuart A. Yeates

Apr 26, 2016, 5:12:35 AM
to digi...@googlegroups.com
Re (3), it's really hard to go past a decent CSV file.

cheers
stuart

--
...let us be heard from red core to black sky

Paul Sutherland

Apr 26, 2016, 6:27:59 PM
to digi...@googlegroups.com
I would agree with Stuart - a TSV or CSV would be useful to me!

Douglas Campbell

Apr 26, 2016, 9:59:52 PM
to DigitalNZ
Thanks Stuart and Paul.

Some of our metadata is quite hierarchical, so parts may be lost when flattening it into CSVs. Do you have specific tasks in mind for the CSVs? I can imagine object and/or authority dumps could be used to match against a local collection to provide links to related content, but maybe you have other ideas in mind?

Cheers,
Douglas

Conal Tuohy

Apr 26, 2016, 11:30:50 PM
to digi...@googlegroups.com
Thanks for raising this here, Douglas! It's a really important step to gather opinions and requirements from a developer community before building the API.

I've looked at a whole bunch of GLAM APIs and web services recently, including a few museums: the Auckland Museum, the Powerhouse in Sydney, and Museum Victoria. So I do have some definite opinions.


> 1. Usage - What kinds of things would you use the API to build? e.g. links from your own museum holding to related Te Papa items, 'big data' processing of images.

> 2. Environments - What technical environment(s) might you use? e.g. real-time Javascript calls from a web page, deep analysis using Python on local machine.

One of my goals for this year is to develop some client-side Javascript software for enhancing web pages which contain links to GLAM collections, by looking up related metadata and using that metadata to add to the page's content. That software would be built to use Linked Data rather than a specific institutional API, so (unless your API conforms, at least in part, to some generic Linked Data principles) it would not be directly connected, though it could well be connected indirectly. Not that a LOD API would be a bad thing to build (far from it!) but it may not be the best next step.

I'm also more generally interested in "distant reading" / "distant viewing", i.e. the application of "big data" techniques to the study of culture, and there the requirement is different. Here I've tended to use OAI-PMH or simply HTTP bulk download. One of my gripes with Trove's API and with the DigitalNZ dataset is that they are not suited to bulk data access (and in fact DigitalNZ's is deliberately limited). There are some technical reasons involved, but it's my view that the main obstacles to permitting big data analysis often spring from organisational culture: a sense of proprietorship, and a desire to be a gatekeeper. It can be a sensitive issue to deal with for that reason, and needs to be approached and managed with care. If it's posed too purely "technocratically" it can arouse antagonism.


> 3. Data formats - What is your most preferred metadata format and why? eg. JSON, XML, turtle, RDF/XML, JSON-LD.

I've used all of these formats, and I think they all have their place. To my mind the important thing is to have documented the semantics of the data. It's not enough to publish JSON objects whose values have "human-readable" names, because what's "human-readable" to a museum curator is not necessarily intelligible to everyone else. The same with XML; if the XML has a namespace, and a schema which contains documentation, then that's OK, but otherwise, it's no better than CSV (except that it supports hierarchy). The RDF-based formats Turtle, RDF/XML, and JSON-LD are all designed to be self-descriptive. The URIs that identify the descriptive terms used in RDF can often be dereferenced to provide documentation (in the same way that XML namespaces can), but again, people will sometimes make up their own RDF vocabulary, mint URIs to identify terms in that vocabulary, but then when you access one of those URIs you get a 404, and you realise the publishers never took the time to properly define (or even analyse) the semantics of their data.

I don't think it's essential to adopt some particular standard XML (or other) schema (such as LIDO, VRA, even DC): that can actually be harmful because it can pose a semantic bottleneck. If you have some potentially useful data on hand which can't be easily expressed in terms of your adopted schema then you will be forced to leave it out, or to express it only in some semantically weak way (such as a "note" or some kind of "free text" field which the schema leaves unconstrained and semantically vague). By hiding that kind of problematic data from the API, you'd be reducing the API's value. Better would be to define your own schema based directly on the actual schemas of the existing information systems, and simply document it within an inch of its life. Again, that would provide a platform that could be built on, either by higher levels of your API publishing system, or by external agents. If you are going to publish in some standard XML vocabulary, I would start with publishing what you actually have in its own "native" format ("TePapaXML" or "KeEmuXML" or whatever), and then crosswalk that to a standard schema, or to many standard schemas; this crosswalk could be as lossy as you like because the native data would always be available as an alternative.

> 4. Any other thoughts/requests?

I could trot out a big wish list of what I'd ideally want from Te Papa's API, but I think the most important thing is that you should set goals that are achievable within your budget and time-frame, and that will provide a basis for later work, and then proceed through that list of goals in priority order.

My #1 priority would be a download of the complete dataset. Even though it's not ideal for many purposes, it is something that other people can pick up and run with. So I think that should be your first goal: to publish the data in as complete a form as possible, for bulk download. Simplest is a complete dump of the data in some fairly "raw" form such as CSV. It could be as simple as SQL dumps of individual tables that underlie the CMS, or it could be a denormalized table. The Powerhouse, for instance, offer a TSV-formatted download of object metadata. The DPLA, too, offers downloads of each of their contributors' data in a slightly cruftified JSON-LD. Perhaps to handle hierarchies more conveniently you could dump it in an XML format. One way or another, this is a good first step, and the key to making it useful is to thoroughly document the semantics of the tables, the columns, the XML elements (whether using an XML schema or informally), or whatever.
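To give a sense of how low the barrier is for consumers, here's a minimal sketch of reading such a dump; the file name and column names are invented for illustration, not an actual Te Papa schema:

    # Minimal sketch: consuming a hypothetical bulk TSV dump of object metadata.
    # The file name and column names are invented, not an actual Te Papa schema.
    import csv

    with open("objects.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # each row is a dict keyed by the documented column names
            print(row.get("identifier"), row.get("title"))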

My personal preference for #2 priority would be a publicly accessible OAI-PMH provider. I believe there may well be one feeding DigitalNZ, but if so it's not easy to find. Again, this is a doddle to set up; the important thing, to me, is to be comprehensive and expose all the metadata records, whether object descriptions or authority records.
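For anyone who hasn't used OAI-PMH, the whole client side is basically a ListRecords loop over resumption tokens; a rough sketch in Python (the base URL below is a placeholder, not a real Te Papa service):

    # Rough OAI-PMH harvest sketch: ListRecords plus resumptionToken paging.
    # The base URL below is a placeholder, not a real Te Papa endpoint.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE = "https://example.org/oai"
    OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        url = BASE + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            tree = ET.parse(resp)
        for record in tree.iterfind(".//oai:record", OAI):
            pass  # process or store each metadata record here
        token = tree.find(".//oai:resumptionToken", OAI)
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}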

Last year I wrote a web proxy that transforms Museum Victoria's API (based on a Ke Emu back end, which I think is what Te Papa uses?) into Linked Data; I wrote a few blog posts about it which include some detailed whinges about their API: http://conaltuohy.com/blog/tag/museum/

One thing I would especially like in a search API would be the ability to search using the URI of a human-readable web resource: a web page on Te Papa Collections, or one of their thumbnails or images, and retrieve the full metadata record. In general, I think a search API should allow a search over every field; if it's data worth having, then it's worth indexing and allowing it to function as an access point, and that points directly towards a Linked Data approach.

Ultimately I'd really like to see a LOD service (which to my mind is a kind of API, though some people think of it as something distinct). For simple searching, you can use the "Linked Data Fragments" protocol; this allows you to return Linked Data in response to a query expressed in terms of a triple pattern.

Finally, in terms of process, can I make a plea to have a publicly accessible issue tracker, where bugs and enhancements can be reported and requested and managed, and to have the API source published in a source code repository where bugs and enhancements can be fixed and implemented? I think that your openness to this forum is a great start, and I think keeping the work in the open will be a big help to building and maintaining a community.

Regards

Con

Stuart A. Yeates

Apr 26, 2016, 11:36:38 PM
to digi...@googlegroups.com
In the past I've used CSV + template to create wikipedia biographies (think DNZB).

There's an upcoming editathon at Te Papa around species. Given a CSV with one species per line, I can craft a template that might make creating those articles significantly easier.
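To make the workflow concrete, it's little more than mail-merging each CSV row into a block of wikitext; a sketch, with column names and stub text made up for illustration (not the actual editathon template):

    # Sketch of the CSV + template workflow: one wikitext stub per CSV row.
    # Column names and stub text are made up for illustration only.
    import csv

    STUB = "'''{name}''' (''{binomial}'') is a species represented in Te Papa's collection.<ref>{source_url}</ref>"

    with open("species.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(STUB.format(name=row["common_name"],
                              binomial=row["species"],
                              source_url=row["url"]))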

cheers
stuart

--
...let us be heard from red core to black sky

Conal Tuohy

Apr 27, 2016, 1:13:32 AM
to digi...@googlegroups.com
Roy Fielding's view on API versioning is "don't do it": https://www.infoq.com/articles/roy-fielding-on-versioning

Douglas Campbell

Apr 27, 2016, 1:34:23 AM
to DigitalNZ
Thanks Con, I really appreciate you taking the time to document your thoughts. Obviously you've given us a lot to think about so I'll need some time to digest...

One question if I may, are there one or two places you would point to who are 'doing it right'?

Cheers,
Douglas

Douglas Campbell

Apr 27, 2016, 1:35:46 AM
to DigitalNZ
Thanks Stuart, as you know, design is always easier/better when you have more concrete use cases. :)

Douglas Campbell

Apr 27, 2016, 1:55:13 AM
to DigitalNZ
Hmm, I might need to read Roy Fielding's work a bit more, but it's difficult to see how he thinks it would work without seeing a worked example.

I can imagine some future changes can't be handled gracefully via REST. The 'Web' has traditionally solved this (badly) with "let's just create a different website" and with 404s and visitors eventually give up.

But if you think of an API as a contract, if I vary the contract (and value my API consumers), I will provide some overlapping support. What's more, cultural heritage institutions typically think beyond the 5-10 year horizon, so constantly abandoning something and creating a new unrelated thing isn't really their style.

My preferred approach to versioning is: the current version carries no version identifier, but we keep a set of endpoints with a version id in the URI if you still need to use an older version.

Douglas

Conal Tuohy

Apr 27, 2016, 7:48:59 AM
to digi...@googlegroups.com
Well it depends on what you mean by versioning an API. I think what Fielding is criticising is the idea that you need to have different URIs for V1, V2, etc, but of course you need to be able to revise your API. It's a question of whether you can do that in a backwards compatible way, and generally you can, though API designers often choose not to for their own convenience.

The idea with a REST API is that the bulk of the complexity is in the resource representations (i.e. the API responses). If you want to add new features to the API, then you revise those resource representations to include those features. Because those representations use some structured format like CSV or JSON or XML, you can generally add new features (columns, keys, elements) and not confuse or upset clients which were built for a previous version of the representation.
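Put another way, clients should be "tolerant readers": pick out the fields you know and ignore anything new. A quick sketch, with hypothetical field names:

    # 'Tolerant reader' sketch: read only the keys this client understands and
    # ignore anything the server adds in a later revision. Field names are
    # hypothetical.
    import json

    def parse_item(payload: str) -> dict:
        data = json.loads(payload)
        return {"id": data.get("id"), "title": data.get("title")}

    # A newer representation with an extra key still parses without complaint.
    print(parse_item('{"id": "123", "title": "Kete", "provenance": "..."}'))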

Jonathan Hunt

Apr 27, 2016, 5:40:00 PM
to digi...@googlegroups.com
Which is fair enough if you're going to go full REST with HATEOAS, but almost every "REST" API is actually an HTTP API [1].

I think versioning needs to be considered because it's hard to make a business case for a general client that can do hypermedia. DigitalNZ bakes in a version in the URL (e.g. http://api.digitalnz.org/v3/records.json) but I think it's better to include a version via an Accept header (see the discussion at [3]) so the resource URL is more long-term stable.

[1] http://martinfowler.com/articles/richardsonMaturityModel.html
[2] https://github.com/18F/api-standards
[3] https://github.com/18F/api-standards/issues/5
021 529250
http://open.org.nz/ - Open Data and Open Government in New Zealand
Got an issue for your Council? Try FixMyStreet http://fixmystreet.org.nz


Conal Tuohy

Apr 27, 2016, 10:43:45 PM
to digi...@googlegroups.com
Hey Jonathan!

I think we are mostly (perhaps entirely) in agreement.

On 28 April 2016 at 06:50, Jonathan Hunt <huntde...@gmail.com> wrote:
> Which is fair enough if you're going to go full REST with HATEOAS, but almost every "REST" API is actually an HTTP API [1].

I totally agree. And my advice to Douglas is indeed that he should build a REST API. It does require some extra thought, as compared to an RPC-style API, but the end result is a better API; easier for clients to use. That's why it is to be recommended.
 
> I think versioning needs to be considered because it's hard to make a business case for a general client that can do hypermedia.

I would say that the growth in Linked Data both as a "data cloud" and as a movement has shown:

(1) that such "general clients" are possible (e.g. LODLive, Linked Data Fragments Client, etc.) and
(2) that domain specific clients are also facilitated by the provision of general (and REST-based) protocols such as LD and LDF

 
> DigitalNZ bakes in a version in the URL (e.g. http://api.digitalnz.org/v3/records.json) but I think it's better to include a version via an Accept header (see the discussion at [3]) so the resource URL is more long-term stable.
>
> [1] http://martinfowler.com/articles/richardsonMaturityModel.html
> [2] https://github.com/18F/api-standards
> [3] https://github.com/18F/api-standards/issues/5

I agree totally. That's the same point I made about not versioning URIs but versioning resource representations. The URIs should not change; what should change is the media type of the response (i.e. what the Accept header is specifying).

So you could start off with an API that returns a version-1-type response (e.g. responding with an HTTP Content-Type header of "application/my-api-v1+json"), and then incrementally add to that response, adding new backwardly compatible features, and still specifying that same media type. If and when it became necessary to offer a backwardly incompatible version of the API, clients could still use the same request URIs, but would need to specify "Accept: application/my-api-v2+json" in their request, otherwise they would get the old-style response.

Incidentally, I would add that API keys (if they are really necessary!) should not be specified in the URI either: they should use HTTP authentication headers, which is what those headers were invented for. By placing authentication in URIs, an API is effectively saying that a resource is a different resource if you request it than if I request it, which is (almost always) not actually the case. Pragmatically, it makes it difficult to share URIs in discussion forums like this, too.
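Concretely, both the version and the key end up as request headers rather than URI cruft; a sketch using the Python requests library (the URI, media type and token are placeholders, not a real API):

    # Sketch: API version and credentials carried in HTTP headers, not the URI.
    # The URI, media type and token are placeholders, not a real API.
    import requests

    resp = requests.get(
        "https://example.org/api/object/123",
        headers={
            # ask for the v2 representation; omit this to get the old response
            "Accept": "application/my-api-v2+json",
            # key in an auth header keeps the URI clean and shareable
            "Authorization": "Bearer MY-API-KEY",
        },
    )
    resp.raise_for_status()
    print(resp.headers["Content-Type"])
    print(resp.json())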
 

Paul Sutherland

Apr 27, 2016, 11:32:42 PM
to digi...@googlegroups.com
Currently our WW100 collaborative site http://canterbury100.org.nz/ has the ability to import metadata in CSV, including a remote image source and object record, e.g. http://canterbury100.org.nz/explore/objects/nona-hildyard
So right now I would like to be able to get object metadata in CSV for material in collections such as Te Papa's. Our site cannot handle other formats and I do not have a toolkit to convert data via APIs...

Douglas Campbell

Apr 27, 2016, 11:53:28 PM
to DigitalNZ
Nice. Thanks Paul. 

I'm just wondering if you have thought about how to ensure CSV files have the correct data and field names you need? Would you negotiate with a contributor or would you expect to do some manual manipulation of the data?

Conal Tuohy

Apr 27, 2016, 11:57:58 PM
to digi...@googlegroups.com

On 27 April 2016 at 15:34, Douglas Campbell <douglas....@tepapa.govt.nz> wrote:
> One question if I may, are there one or two places you would point to who are 'doing it right'?

My favourite is the British Museum, whose APIs are standards-based (Linked Data, and SPARQL).

Whereas most of the "REST APIs" out there are custom-built applications, built in Ruby or PHP or Python or something, with SQL or a generic NoSQL document store like MongoDB. Those tools aren't necessarily aligned with web standards (some are more than others), so it's all too easy to build something that doesn't really work with the architecture of the WWW. Not that it can't be done; it's just that if your tools don't enforce all the necessary constraints, then you need to be more conscientious about following web architecture yourself.

If I were offering constructive criticism to the APIs I've mostly looked at, the main things would be:
  • Responses should be self-descriptive (using Content-Type and Accept headers, and self-descriptive formats such as namespaced XML). The response formats should be well documented. I wrote a blog post about that last year, after doing some work with Zotero's web API: http://conaltuohy.com/blog/zotero-web-api-data-format/
  • Responses should contain hyperlinks for API navigation. For instance, a search response should contain hyperlinks to the resources found. That's what you get if you use a website's search form, and an API should be just the same. Some APIs return simple identifiers like "204934509" or "items/23409", which you have to concatenate onto some other "endpoint" URI to get access to the resource. It's not necessary to return absolute URIs: if the URIs returned in a search result are relative to the URI of the request itself, then that's great; in some client environments you can simply treat that identifier as a URI and read data from it (resolving it is sketched below). Ideally, the response schema should make it explicit that the data type of those identifiers is "URI".
  • URIs should identify resources and not contain extraneous cruft, such as authentication tokens and data formats, which should instead be handled with HTTP headers.
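On the relative-URI point, resolving a relative identifier against the request URI is a one-liner in most environments; e.g. in Python (both URLs below are hypothetical):

    # Sketch: resolving a relative identifier from a search response against
    # the URI of the request itself. Both URLs are hypothetical.
    from urllib.parse import urljoin

    request_uri = "https://example.org/api/search?q=kete"
    relative_link = "items/23409"  # as it might appear in a search response

    print(urljoin(request_uri, relative_link))
    # -> https://example.org/api/items/23409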

Douglas Campbell

Apr 28, 2016, 12:19:23 AM
to DigitalNZ
Thanks Conal and Jonathan.

Versioning and keys are certainly on our radar. Thanks for the versioning links Jonathan, very useful. Recently I saw an API that accepts keys in the URI during development, but expects them to move into the header for production.

I tried to get as close as possible to HATEOAS in a REST API I built recently, but it is difficult when there is no consensus on what best practice is. I'm aware that there is a lot of heated debate around these topics... which makes me wonder if Conal was actually trolling :)

Yes Conal, I also like the British Museum API. Thanks for your best-practice suggestions.

Paul Sutherland

Apr 28, 2016, 12:20:53 AM
to digi...@googlegroups.com
I would expect to do some manual manipulation - so far, renaming columns and adding a URL stem to the filename to get the image or point to the resource...

Douglas Campbell

Apr 28, 2016, 1:40:53 AM
to DigitalNZ
> I'm also more generally interested in "distant reading" / "distant viewing" i.e. the application of "big data" techniques to the study of culture, and there the requirement is different. Here I've tended to use OAI-PMH or simply HTTP bulk download.

Hey Con, are you able to give some specific examples of the types of big data culture study you have done, or might do?

Conal Tuohy

Apr 28, 2016, 2:06:44 AM
to digi...@googlegroups.com
What I was thinking of there (with "big data") were analytical techniques that are often computationally intensive and that require access to bulk data (rather than piecemeal access to small subsets).

Network analysis is one area, where you are considering the dataset as a graph, and determining connectedness or disjointness (i.e. is every node connected to every other node, or do they form disjoint networks?), centrality (e.g. for every pair of nodes in a connected graph there is a set of shortest paths; if a given node is one of the intermediate nodes on a large number of these shortest paths, then it is highly "central"), and so on. This kind of analysis is often used with social networks but it can be applied to networks in general.
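A minimal sketch of that kind of analysis, using networkx and an invented edge list:

    # Minimal network analysis sketch: connected components and betweenness
    # centrality over an invented edge list.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("E", "F")])

    # disjoint sub-networks, if any
    print(list(nx.connected_components(G)))

    # nodes that sit on many shortest paths score as highly "central"
    print(nx.betweenness_centrality(G))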

Another is topic modelling, where you are looking at the words that make up your documents (NB words could be features of various kinds, and "documents" should be read broadly as "information resources"), and try to identify a small set of clusters of "key" words which taken together constitute a "topic" which is considered to be latent in the corpus. I have used the LDA algorithm (very popular in the DH community) for this, to analyse newspaper articles, but I haven't published anything on it (yet). The idea is that documents are each treated as if they were a "bag of words" drawn randomly from some probability distribution; the algorithm produces a much smaller set (i.e. much fewer than the number of documents in the corpus) of probability distributions, where the probability distribution of each document approximates some weighted sum of the topic distributions. For example, a given document might be 10% on this topic, 20% on that topic, and 70% on a third topic. If you read the list of most probable words in each topic, you can see what the topics are about, and hence you can get some high-level understanding of the corpus, not unlike having a bunch of subject cataloguers read it and collaboratively produce and assign some subject headings.
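For the record, a bare-bones LDA run with gensim looks something like this (a toy corpus and a tiny topic count, purely to show the shape of the workflow):

    # Bare-bones LDA sketch with gensim: toy corpus, tiny topic count.
    from gensim import corpora, models

    docs = [
        ["war", "soldier", "battalion", "trench"],
        ["museum", "collection", "object", "curator"],
        ["war", "museum", "memorial", "collection"],
    ]  # each document already reduced to a "bag of words"

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

    print(lda.print_topics())  # most probable words per topic
    print(lda[corpus[0]])      # topic mixture for the first document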


Stuart A. Yeates

Apr 28, 2016, 4:07:14 AM
to digi...@googlegroups.com
I've spent much of the last three days at a software carpentry http://software-carpentry.org/ event at VUW. CSV was the only data format that got talked about, and I believe it's the way to go for engaging with researchers and academics.

cheers
stuart

--
...let us be heard from red core to black sky

Conal Tuohy

Apr 28, 2016, 4:47:01 AM
to digi...@googlegroups.com
On 27 April 2016 at 15:55, Douglas Campbell <douglas....@tepapa.govt.nz> wrote:
> Hmm, I might need to read Roy Fielding's work a bit more,

Let me recommend this blog post of his on the subject: http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
 
> but it's difficult to see how he thinks it would work without seeing a worked example.
>
> I can imagine some future changes can't be handled gracefully via REST. The 'Web' has traditionally solved this (badly) with "let's just create a different website" and with 404s and visitors eventually give up.

Could you give an example of the kind of thing that you think couldn't work?

Douglas Campbell

Apr 28, 2016, 5:59:39 PM
to DigitalNZ
Thanks Conal. The tools for this type of analysis keep getting better, so I expect we'll see growing demand for it.

In your case, it sounds like you are doing more exploratory analysis to see if anything interesting pops up rather than having a particular end-goal in mind?

Douglas Campbell

Apr 28, 2016, 6:03:40 PM
to DigitalNZ
Thanks Stuart, that further validates CSV as a strong requirement for an important group of prospective API consumers.

Conal Tuohy

Apr 28, 2016, 6:18:15 PM
to digi...@googlegroups.com

Hi Douglas

Yes, I was mostly trying out the tech as a learning exercise. But I also think it's in the nature of those techniques that they are exploratory and serendipitous. I did have something in mind, which was exploring the differences between the colonial wars in NZ and Australia through the prism of newspaper coverage. Unfortunately Trove's API is a serious bottleneck - it's much, much slower and less reliable than DigitalNZ's, and of course the Aussie newspaper corpus is larger. :-(

BTW here's another "big data" project using images that's perhaps more relevant: http://ryanfb.github.io/etc/2015/11/03/finding_near-matches_in_the_rijksmuseum_with_pastec.html

Douglas Campbell

Apr 28, 2016, 6:19:57 PM
to DigitalNZ
>> I can imagine some future changes can't be handled gracefully via REST. The 'Web' has traditionally solved this (badly) with "let's just create a different website" and with 404s and visitors eventually give up.

> Could you give an example of the kind of thing that you think couldn't work?
 
I can't recall an exact example, but I'm thinking maybe of when your data modelling evolves and reveals that what you previously lumped together as a single resource should really have part of it separated off as a separate resource. I'm not sure how it is possible to do this 'gracefully' without versioning, since the main resource now returns half the data (possibly records and/or fields are missing), which may break some existing apps.

Fielding's post seems to recommend abandoning it and making two new resources, which is just messy. That is still versioning, except it is done via documentation - "This resource is the older version that combines resources X and Y; you should now use X or Y instead".

Conal Tuohy

Apr 28, 2016, 6:34:05 PM
to digi...@googlegroups.com

It's worth noting that if you do go down the SPARQL/RDF route (i.e. deploying a SPARQL Query server as your API provider) then you will get CSV and TSV for free: the SPARQL Query protocol offers these two along with a JSON- and an XML-based format for tabular results.

https://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/#query-success
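For example, the same query can be fetched as CSV purely by content negotiation; a sketch (the endpoint URL and query are illustrative, not a real service):

    # Sketch: asking a SPARQL endpoint for CSV results via content negotiation.
    # The endpoint URL and query are illustrative, not a real service.
    import requests

    query = (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        "SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 10"
    )
    resp = requests.get(
        "https://example.org/sparql",
        params={"query": query},
        headers={"Accept": "text/csv"},  # or text/tab-separated-values
    )
    print(resp.text)  # plain CSV, ready for a spreadsheet or csv.DictReader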

Conal Tuohy

Apr 28, 2016, 6:39:07 PM
to digi...@googlegroups.com


On 29 Apr 2016 08:19, "Douglas Campbell" <douglas....@tepapa.govt.nz> wrote:
>>>
>>> I can imagine some future changes can't be handled gracefully via REST. The 'Web' has traditionally solved this (badly) with "let's just create a different website" and with 404s and visitors eventually give up.
>>
>>
>> Could you give an example of the kind of thing that you think couldn't work? 
>
>  
> I can't recall an exact example, but I'm thinking maybe when your data modelling evolves and reveals that what you previously lumped together as a single resource should really have part separated off as a separate resource. I'm not sure how it is possible to do this 'gracefully' without versioning since the main resource now returns half the data (possibly records and/or fields are missing), which may break some existing apps.

Versioning the response type.

New clients would request the resource with an Accept header of "application/my-api-fine-grained+json" to receive the fine-grained resource representation. Without that header you'd get the old coarse-grained one. The URI remains the same, because "cool URIs don't change".

>
> Fielding's post seem to recommend abandoning it and making two new resources, which is just messy.  That is still versioning except it is done via documentation - "This resource is the older version that combines resources X and Y, you should now use X or Y instead".
>

Glen Barnes

Apr 28, 2016, 8:35:04 PM
to DigitalNZ
CSV and TSV are not 'free' if you have to use SPARQL/RDF to query in the first place. A massive learning curve means the API just won't be used by most people.

By all means look at SPARQL down the road but this should be the last thing to implement IMHO.

Cheers
Glen  

Douglas Campbell

Apr 28, 2016, 8:54:07 PM
to DigitalNZ

Thank you for the time you have invested in answering our questions. The discussion has helped our scoping of Te Papa's collections API considerably. Some of the points may seem obvious but it really helps to get validation, and a couple of unexpected perspectives also popped up.


Here is a summary of the discussion.


Use cases

  • Javascript widget to place on webpages, e.g. one that finds GLAM Linked Data content related to the current webpage and inserts it on that page for display
  • Feed into a human cut&paste process, e.g. species descriptions for a Wikipedia editathon (CSV is a useful format for this as it is quick and easy to manipulate and is being taught as a method to researchers and academics)
  • Import metadata of war-related images into a war images website (CSV is a common baseline import format on many systems)
  • Distant reading/viewing (i.e. big data analysis rather than 'close reading'), e.g. exploring the differences between the colonial wars in NZ and Australia based on newspaper coverage - this is often just exploratory analysis to uncover patterns (e.g. using network analysis or topic modelling) so bulk downloads are needed, e.g. CSVs or OAI-PMH access. Licensing that restricts bulk downloading and storage is sometimes a barrier

Schema and Formats

  • Documentation, documentation, documentation!!
    • E.g. field names might use jargon
    • Ideally self-documenting, e.g. can find documentation by dereferencing an XML namespace or RDF URI
  • Don't be limited by a schema; better to add your own fields than omit data or shoehorn it in to comply with a schema. You can always provide crosswalks to other schemas

Other points

  • Can query by human website URIs, e.g. put in a Collections Online webpage URL or thumbnail URL and find its details
  • Index every field - if it's worth being there it is worth being searchable
  • Public issue tracker
  • API source in a public repository so bugs can be fixed
  • Keep URIs clean and shareable, e.g. think about how to deal with versioning, API keys, etc.
  • Reduce uptake barriers, e.g. responses are self-descriptive, responses contain URIs to related API resources, documentation is easy to find (e.g. by dereferencing a URI)

Thanks to DigitalNZ for letting us kick off this discussion. We can continue offline via email.

Cheers,
Douglas

Conal Tuohy

Apr 28, 2016, 10:02:28 PM
to digi...@googlegroups.com

Not necessarily; you can use a SPARQL store as a back end for "canned" queries and not require front-end devs to know SPARQL, but still get the benefits of SPARQL's power, such as content negotiation (conneg).

Stuart A. Yeates

Apr 29, 2016, 3:11:22 AM
to digi...@googlegroups.com
I know it's rather late in the conversation to be raising this, but I'd give my left arm to be able to browse the Te Papa Māori content by iwi and hapu. 

I'm guessing that that would involve resolution of two separate political issues which are out of scope for the current project, however. 

cheers
stuart

--
...let us be heard from red core to black sky
