Searching by last updated date


Conal Tuohy

Aug 21, 2014, 8:55:53 AM
to digi...@googlegroups.com
I have been working on some software to support historical researchers using text-mining techniques on newspaper articles. I have written an OAI-PMH front-end (an "OAI-PMH Provider") for Trove Australia's newspaper collection, to allow for download, or "harvesting", of search results, in bulk (i.e. hundreds or thousands of newspaper articles), and I'm working on one now for the National Library of NZ's Papers Past, using the Digital NZ API.

To support OAI-PMH properly it's necessary to query by the syndication_date. The OAI-PMH protocol provides for querying records updated between two dates (with a granularity of either days or seconds). Typically, an OAI-PMH harvester will harvest from its provider overnight, requesting just those records which are new or updated since the previous night's harvest.
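
For illustration, a date-bounded harvest request of the kind described above can be sketched as follows. The `verb`, `metadataPrefix`, `from` and `until` parameters come from the OAI-PMH protocol itself; the base URL and the `list_records_url` helper are assumptions made up for this example and are not the actual provider's address or code.

```python
from urllib.parse import urlencode


def list_records_url(base_url, from_date, until_date, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for records changed within a date range.

    Dates use day granularity (YYYY-MM-DD); a provider that supports seconds
    granularity would accept YYYY-MM-DDThh:mm:ssZ instead.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,
        "until": until_date,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical base URL, for illustration only.
url = list_records_url("https://example.org/oai", "2014-08-20", "2014-08-21")
print(url)
```

A nightly harvester would typically set `from` to the date of its previous successful harvest.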

I don't wish to use the old API, as I understand it is going away very soon, though I've seen examples showing that previous versions of the Digital NZ API allowed searching by last_updated_date or by syndication_date. That doesn't appear to be possible in the v3 API, where the only date-range searching is that provided by the year, decade, and century facets. Is that correct?

Is there a facility I've missed in the v3 API? Or plans to add such a feature?

Otherwise I think it is doable with a bit of extra querying of the API (a binary search), since it is possible to sort the results by syndication_date.
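
A minimal sketch of that binary search, assuming the results are sorted ascending by syndication_date and that the record at any given position in the result set can be fetched with one request. `date_at` is a hypothetical stand-in for that API call, not a real DigitalNZ function; here it is backed by an in-memory list so the example runs on its own.

```python
from datetime import date
from typing import Callable


def first_index_on_or_after(
    date_at: Callable[[int], date], total: int, target: date
) -> int:
    """Return the position of the first result dated on or after `target`.

    `date_at(i)` must return the syndication date of the i-th result in a
    result set sorted ascending by that date. Returns `total` if every
    result is older than `target`. Uses O(log total) calls to `date_at`.
    """
    lo, hi = 0, total
    while lo < hi:
        mid = (lo + hi) // 2
        if date_at(mid) < target:
            lo = mid + 1  # everything up to and including mid is too old
        else:
            hi = mid  # mid qualifies, but an earlier result might too
    return lo


# Stand-in for the API: a small, already sorted list of syndication dates.
sample_dates = [
    date(2014, 1, 5),
    date(2014, 1, 5),
    date(2014, 3, 10),
    date(2014, 8, 20),
    date(2014, 8, 21),
]

start = first_index_on_or_after(
    sample_dates.__getitem__, len(sample_dates), date(2014, 3, 1)
)
print(start)
```

Running the search twice, once for each end of the requested range, gives the slice of the sorted result set to page through.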

Any thoughts, anyone?

Conal

PS anyone interested to try out the OAI-PMH provider when it is ready?



Andy Neale

Aug 21, 2014, 5:24:27 PM
to <digitalnz@googlegroups.com>
Hi Conal,

I'm not sure we can support what you are trying to do, sorry. I'll dig around though. We have a feature in the works to expose the delta on records for bulk download purposes, but we don't have the funding for that at the moment.

The syndication date is currently tagged when something is first harvested, not when it is updated, so I'm not sure that will work either. Will double check.

Andy

Chris McDowall

Aug 21, 2014, 6:03:27 PM
to digi...@googlegroups.com

Gordon Paynter

Aug 22, 2014, 7:09:38 PM
to digi...@googlegroups.com
Hi Chris, Conal:

This again!

By way of a history lesson, syndication date was originally designed to mean "last updated date" (specifically to support "what's changed" and OAI-PMH) but at some point this was changed to function as "record created date" as Chris describes (and as is described in the current "v1 & v2" documentation).

As the designer I find this mildly annoying but I have to concede that when it stopped working for a few months nobody seemed to notice but me, and I can see uses for the "record created date" approach also (especially since the bulk of DigitalNZ material does not tend to get updated). 

In the strictest sense, I don't see how the DigitalNZ API will support OAI-PMH in its current (v2 or v3) form unless you harvest every DigitalNZ record regularly.

But I don't think that you need to worry because historic newspapers are very rarely changed after being made available online. They also tend to be made available online in very large batches, which I assume will have the same or similar syndication date in DigitalNZ. So you might harvest nothing for months, then suddenly get 100,000 new records overnight.

Gordon
 

Chris McDowall

Aug 22, 2014, 7:48:29 PM
to digi...@googlegroups.com
Hi Gordon,

Note: I don't work for DigitalNZ anymore but I still use the API.

If memory serves, all DNZ records store a last_updated timestamp in the database. It would be great to get that returned and sortable through the API.

Chris 