NDLTD OAI harvesting: badResumptionToken

Dter

May 10, 2021, 9:30:42 AM
to ETD

Hello,

I am unable to retrieve the required metadata from NDLTD Union via OAI-PMH. The request http://union.ndltd.org/OAI-PMH/?verb=ListRecords&metadataPrefix=oai_dc&from=2016-01-04&until=2016-01-04 yields many results and features a resumptionToken at the end of the XML file:

<resumptionToken completeListSize="4314" cursor="0">2016-01-04T04:08:50Z!2016-01-04!!oai_dc!1564!4314!oai:union.ndltd.org:TW/092CSMU0012012</resumptionToken>

I am trying to use the resumptionToken to get the next portion of data. I encode its value according to the documentation (http://www.openarchives.org/OAI/openarchivesprotocol.html#SpecialCharacters) and then submit the following request:

http://union.ndltd.org/OAI-PMH/?verb=ListRecords&resumptionToken=2016-01-04T04%253A08%253A50Z!2016-01-04!!oai_dc!1564!4314!oai%253Aunion.ndltd.org%253ATW%252F092CSMU0012012

As a result, I get badResumptionToken in response.

I have also tried without escaping, yet to no avail.

Would you be so kind as to help me make this request work?
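
For reference, the token in the request above contains %253A and %252F, which is what results when an already-encoded %3A or %2F is percent-encoded a second time. Whether that happened in the actual request or only when the message was posted is not clear from the thread, and the unescaped token was rejected as well, so this is a side note rather than the cause of the error. A minimal Python sketch of encoding the token exactly once might look like this; the helper name resumption_url is hypothetical, and leaving "!" unescaped is an assumption made to match the URLs in this thread.

```python
from urllib.parse import quote, unquote, urlencode

BASE_URL = "http://union.ndltd.org/OAI-PMH/"


def resumption_url(token: str) -> str:
    """Build a ListRecords follow-up URL, percent-encoding the token once.

    Hypothetical helper for illustration; not part of any NDLTD tooling.
    """
    # quote() turns ":" into %3A and "/" into %2F a single time.
    # safe="!" leaves the "!" separators as they appear in the thread.
    query = urlencode(
        {"verb": "ListRecords", "resumptionToken": token},
        safe="!",
        quote_via=quote,
    )
    return f"{BASE_URL}?{query}"


token = (
    "2016-01-04T04:08:50Z!2016-01-04!!oai_dc!1564!4314!"
    "oai:union.ndltd.org:TW/092CSMU0012012"
)
url = resumption_url(token)

# Decoding the query value once should give back the original token.
decoded = unquote(url.split("resumptionToken=", 1)[1])
```

A single-encoded URL contains no "%25" sequences, and decoding it once returns the token exactly as the server issued it.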

Hussein Suleman

May 10, 2021, 9:55:34 AM
to e...@ndltd.org
Hi

There is clearly something wrong in the code on our end: it is
creating mixed-resolution dates (for efficiency reasons) rather than
consistent high-resolution dates.

I will look at a fix for this when I have some time.

For now, I have a quick workaround that does seem to work. Instead of
using low-resolution dates in the initial request, use the
second-resolution version, such as:

http://union.ndltd.org/OAI-PMH/?verb=ListRecords&metadataPrefix=oai_dc&from=2016-01-04T00:00:00Z&until=2016-01-04T23:59:59Z

This will then return resumptionTokens such as:
2016-01-04T04:08:50Z!2016-01-04T23:59:59Z!!oai_dc!1564!4314!oai:union.ndltd.org:TW/092CSMU0012012
and it seems those are handled properly by the server.

Regards,
Hussein Suleman

--
Hussein Suleman, PhD
Head of Department and Professor
Department of Computer Science, School of IT, University of Cape Town
www.husseinsspace.com
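
The workaround described in the message above (second-resolution from/until values instead of day-resolution ones) could be scripted roughly as follows. This is a sketch only: the function name day_range_url is hypothetical, and it assumes one whole UTC day per request, as in the example URL.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://union.ndltd.org/OAI-PMH/"


def day_range_url(day: date, metadata_prefix: str = "oai_dc") -> str:
    """Return a ListRecords URL covering one UTC day at second resolution.

    Hypothetical helper illustrating the workaround from the thread.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # Second-resolution bounds instead of plain YYYY-MM-DD dates.
        "from": f"{day.isoformat()}T00:00:00Z",
        "until": f"{day.isoformat()}T23:59:59Z",
    }
    # safe=":" keeps the timestamps unescaped, matching the URL in the thread.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


url = day_range_url(date(2016, 1, 4))
```

For 2016-01-04 this reproduces the request URL given in the message above.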

Dter

May 10, 2021, 10:06:15 AM
to ETD, hussein
Thanks for the prompt reply. This subsequent request (http://union.ndltd.org/OAI-PMH/?verb=ListRecords&resumptionToken=2016-01-04T04:08:50Z!2016-01-04T23:59:59Z!!oai_dc!1564!4314!oai:union.ndltd.org:TW/092CSMU0012012) does work but returns only 1 result. It should yield significantly more: there are over 4 thousand records altogether (completeListSize="4314"), and only 1 thousand of them were covered in the first request.

Am I doing something incorrectly this time?

Hussein Suleman

May 10, 2021, 3:11:43 PM
to Dter, ETD
Hi

I noticed that myself as well.

I tried expanding the date range (e.g., until=2016-01-05) and the next
page of results turns out to have only that one record for 2016-01-04,
before it starts to list records from the next day. So I suspect that
the record counters in the resumptionToken are simply not correct.

I doubt anyone actually uses those record counters, as nobody has
noticed this before :)

Regards,
Hussein Suleman

--
Hussein Suleman, PhD
Head of Department and Professor
Department of Computer Science, School of IT, University of Cape Town
www.husseinsspace.com
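
One way to check the suspicion above is to compare the completeListSize attribute against the number of records actually present in a page. The sketch below parses a single OAI-PMH response with Python's standard library; the function name summarize_page and the sample XML are made up for illustration and are not real NDLTD output.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def summarize_page(xml_text: str) -> dict:
    """Count the records in one ListRecords page and read the token counters."""
    root = ET.fromstring(xml_text)
    records = root.findall(".//oai:ListRecords/oai:record", OAI_NS)
    token_el = root.find(".//oai:resumptionToken", OAI_NS)

    declared_total = None
    next_token = None
    if token_el is not None:
        size = token_el.get("completeListSize")  # optional attribute
        declared_total = int(size) if size is not None else None
        next_token = (token_el.text or "").strip() or None

    return {
        "records": len(records),
        "declared_total": declared_total,
        "next_token": next_token,
    }


# Made-up sample response, shaped like a ListRecords page.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <resumptionToken completeListSize="4314" cursor="0">example-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

summary = summarize_page(SAMPLE)
```

Summing the "records" value over every page of a harvest and comparing it with "declared_total" would show whether the counters or the paging are at fault.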

Dter

May 11, 2021, 6:41:25 AM
to ETD, hussein, Dter
I believe that something is not working correctly (and that the counters are correct).

I have done some A/B testing. I have tried harvesting and counting all items (except deleted records). Here are the results:

1. resumptionTokens ignored (i.e. scraping only the first XML page of the response): ~1.16 million records retrieved.
2. resumptionTokens followed (i.e. scraping all subsequent XML pages of the response): ~1.20 million records retrieved.

The difference is marginal. At the same time, according to the NDLTD Union Archive, there should be ~6.20 million valid records in the database. So around 5 million records are lost somewhere in the course of harvesting, and the only reasonable assumption I can make is that the resumptionTokens are not working correctly (this would also explain the discrepancy between the counters and the actual number of records retrieved through the resumptionTokens).

I would be really grateful if you could provide further assistance with this issue.
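
For anyone wanting to reproduce the comparison above, a harvesting loop that follows resumptionTokens and counts live and deleted records separately could look roughly like this. It is a sketch under assumptions, not the script used for the figures above: the names fetch_page and count_records are hypothetical, and the demo runs against two made-up pages rather than the live server.

```python
import xml.etree.ElementTree as ET
from typing import Callable
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://union.ndltd.org/OAI-PMH/"


def fetch_page(params: dict) -> str:
    """Fetch one OAI-PMH response over HTTP (not exercised in the demo)."""
    with urlopen(f"{BASE_URL}?{urlencode(params)}") as response:
        return response.read().decode("utf-8")


def count_records(first_params: dict,
                  fetch: Callable[[dict], str] = fetch_page,
                  max_pages: int = 100_000) -> dict:
    """Follow resumptionTokens until none is returned, counting records."""
    params = dict(first_params)
    live = deleted = pages = 0
    while pages < max_pages:
        root = ET.fromstring(fetch(params))
        pages += 1
        for header in root.iter(f"{OAI}header"):
            if header.get("status") == "deleted":
                deleted += 1
            else:
                live += 1
        token_el = root.find(f".//{OAI}resumptionToken")
        token = (token_el.text or "").strip() if token_el is not None else ""
        if not token:
            break  # a missing or empty token marks the last page
        # resumptionToken is an exclusive argument: send it with verb only.
        params = {"verb": "ListRecords", "resumptionToken": token}
    return {"pages": pages, "live": live, "deleted": deleted}


# Made-up two-page response set standing in for the server.
PAGES = {
    None: '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
          '<record><header><identifier>a</identifier></header></record>'
          '<record><header status="deleted"><identifier>b</identifier></header></record>'
          '<resumptionToken>page-2</resumptionToken></ListRecords></OAI-PMH>',
    "page-2": '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
              '<record><header><identifier>c</identifier></header></record>'
              '<resumptionToken/></ListRecords></OAI-PMH>',
}


def fake_fetch(params: dict) -> str:
    """Return the canned page matching the requested token."""
    return PAGES[params.get("resumptionToken")]


result = count_records(
    {"verb": "ListRecords", "metadataPrefix": "oai_dc"}, fetch=fake_fetch
)
```

Passing the fetch function as a parameter keeps the counting logic testable offline; against the real endpoint, the default fetch_page would be used instead.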
