How to harvest newspaper articles from Papers Past


Conal Tuohy

Sep 15, 2014, 2:05:12 AM
to digi...@googlegroups.com
I have finally published the code I wrote to enable OAI-PMH harvesting of Papers Past articles via Digital NZ. Thanks to everyone on the list who helped me out with useful information!

The source code (Java and XSLT) is here: https://github.com/Conal-Tuohy/Retailer

I also wrote a blog post about it here: http://conaltuohy.com/blog/how-to-download-bulk-newspaper-articles-from-papers-past/

It was an interesting experience. Overall I think it worked very well, but there were a few issues.

The main "pain point" for me was that the Digital NZ API doesn't give you a way to search by last-updated date, which is a necessary feature for OAI-PMH. I think this feature is crucial, actually, since it would enable large-scale distribution and intelligent caching. Without it, you have no real way to check whether a bunch of resources you've obtained from Digital NZ are up to date or stale.

I was able to work around this for Papers Past because the newspaper collection is added to, but existing records are not edited. This means you can treat "syndication-date" as meaning "last updated" (though not, in general, for other Digital NZ collections). NB: the API doesn't provide a convenient way to search by syndication-date either, but it was at least possible by sorting results in syndication-date order and performing a binary search (making repeated API calls to narrow the date range down). Hence in a result set of size n, you can find the record for a given date in O(log n) API calls.
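
To illustrate the binary search, here is a minimal sketch in Python. It is not the Retailer code (which is Java and XSLT), and it makes no real API calls. The names are assumptions for illustration: `fetch_date(i)` is a hypothetical callable standing in for one API request that returns the syndication date of the i-th record when the results are sorted by that date, oldest first.

```python
from datetime import date
from typing import Callable


def first_index_on_or_after(
    fetch_date: Callable[[int], date],
    total: int,
    target: date,
) -> int:
    """Return the lowest index whose syndication date is >= target.

    fetch_date(i) is a hypothetical stand-in for one API call returning the
    syndication date of record i in a result set sorted by that date,
    oldest first. Returns total if every record is older than target.
    """
    lo, hi = 0, total
    while lo < hi:
        mid = (lo + hi) // 2
        if fetch_date(mid) < target:
            lo = mid + 1  # record mid is too old, look in the upper half
        else:
            hi = mid      # record mid qualifies, look for an earlier one
    return lo


# Simulated result set (not real data): 1000 records, ten per day, sorted.
first_day = date(2014, 1, 1).toordinal()
records = [date.fromordinal(first_day + i // 10) for i in range(1000)]
calls = []  # records each simulated "API call" so they can be counted


def fake_fetch(i: int) -> date:
    calls.append(i)
    return records[i]


start = first_index_on_or_after(fake_fetch, len(records), date(2014, 2, 1))
```

Here `start` is the offset of the first record syndicated on or after the target date, so a harvester could page through results from that offset onward. `len(calls)` shows how many lookups the search needed, which grows with log n rather than n.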



Tim McNamara

Sep 15, 2014, 7:41:36 PM
to digi...@googlegroups.com
Great work Conal, thanks for the write up.

Tim McNamara
@timClicks

