I have finally published the code I wrote to enable OAI-PMH harvesting of Papers Past articles via Digital NZ. Thanks to everyone on the list who helped me out with useful information!
The source code (Java and XSLT) is here:
https://github.com/Conal-Tuohy/Retailer

I also wrote a blog post about it here:

http://conaltuohy.com/blog/how-to-download-bulk-newspaper-articles-from-papers-past/

It was an interesting experience. Overall I think it worked very well, but there were a few issues.
The main "pain point", for me, was that the Digital NZ API doesn't give you a way to search by a last-updated date, which is a necessary feature for OAI-PMH. And I think this feature is crucial, actually, since it would enable large-scale distribution and intelligent caching. Without it, you've got no real way to check whether a bunch of resources you've obtained from Digital NZ are still up to date or have gone stale.
I was able to work around this for Papers Past because the newspaper collection is only added to; existing records are not edited. This means you can treat "syndication-date" as meaning "last updated" (though not, in general, for other Digital NZ collections). NB, the API doesn't provide a convenient way to search by syndication-date either, but it was at least possible by sorting results in syndication-date order and performing a binary search (making repeated API calls to narrow the date range down). Hence in a result set of size n, you can find a record for a given date in O(log n) API calls.
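To illustrate the idea, here is a minimal Java sketch of that binary search. It is not the actual Retailer code: the `recordAt` method is a hypothetical stand-in for an HTTP request that fetches the single record at a given offset in a result set sorted by syndication-date (in the simulation it just reads from a pre-sorted list, and counts how many "API calls" were made).

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class SyndicationDateSearch {

    // Counts simulated API calls, so we can see the O(log n) behaviour.
    static int apiCalls = 0;

    // Stand-in for the Digital NZ API: fetch the syndication-date of the
    // record at `offset` in a result set sorted by syndication-date.
    // In a real harvester this would be one HTTP request (page size 1,
    // sorted by syndication-date); here it is a list lookup.
    static LocalDate recordAt(List<LocalDate> sortedResults, int offset) {
        apiCalls++;
        return sortedResults.get(offset);
    }

    // Binary search for the offset of the first record whose
    // syndication-date is on or after `target`; makes O(log n) calls.
    static int firstOnOrAfter(List<LocalDate> sortedResults, LocalDate target) {
        int lo = 0;
        int hi = sortedResults.size();
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (recordAt(sortedResults, mid).isBefore(target)) {
                lo = mid + 1; // everything up to mid is too early
            } else {
                hi = mid;     // mid is a candidate; look earlier
            }
        }
        return lo; // equals size() if no record is on or after target
    }

    public static void main(String[] args) {
        // Simulate 28 daily records, sorted by syndication-date.
        List<LocalDate> results = new ArrayList<>();
        for (int day = 1; day <= 28; day++) {
            results.add(LocalDate.of(2012, 3, day));
        }
        int offset = firstOnOrAfter(results, LocalDate.of(2012, 3, 15));
        System.out.println("offset=" + offset + " calls=" + apiCalls);
        // → offset=14 calls=4 (28 records, so at most ~log2(28) ≈ 5 calls)
    }
}
```

With the offset of the first record on or after a given date in hand, everything from that offset onward can be harvested as the "changed since" set that OAI-PMH's `from` argument asks for.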