OAI harvesting with Trove (National Library of Australia)


Erin Mollenhauer

Jun 15, 2017, 1:38:05 AM
to AtoM Users
Hi everyone,

We have self-hosted AtoM, version 2.3.0.

We are working with Trove Australia to allow them to harvest our records - the OAI plugin is activated.

There are two issues:
1. The resumption tokens are causing the same 100 records to be harvested repeatedly.

2. When Trove tries to harvest authority records, we encounter this (message from Trove support):
With the people and organisation records, our harvester is having trouble connecting to the browse pages in order to find and download the eac records.
 
Basically, what is happening is that I’m trying to connect to the browse page - http://atom.library.moore.edu.au/index.php/actor/browse?sort=alphabetic&page=1, extract the links to the individual people, then generate the links to the EAC records. Because the browse pages paginate numerically I can move through each page quite easily.
 
Unfortunately, the harvester is very, very fussy with the XML/HTML that it looks at, and it has decided that the line:
<div>
  <label>
    <input name="repos" type="radio" value checked="checked" data-placeholder="Search">
    Global search
  </label>
</div>
 
Is invalid XML – with ‘value’ being an attribute that has not been defined.
 
Unfortunately this is also preventing the harvester from correctly paginating through in our other options for harvesting.

We are about to upgrade to 2.3.1.

Has anyone encountered these issues and can you shed any light?
Thank you!
Erin Mollenhauer
Monograph Acquisitions and Special Collections Librarian – Donald Robinson Library, Moore College
Phone: (02) 9577 9891
Address: 1 King Street, Newtown NSW 2042 Australia | Web: www.moore.edu.au |
CRICOS Provider Code: 00682B

Dan Gillean

Jun 15, 2017, 11:49:49 AM
to ICA-AtoM Users
Hi Erin,

Regarding the first issue, you have found a known bug in the 2.3.0 release. See:

We have merged the fix to our stable/2.3.x branch, but the bug might have been identified and fixed AFTER the public 2.3.1 release - so it is possible the fix is only in our code repository, and not in the 2.3.1 tarball available in the Downloads section of our website.

If you are upgrading to 2.3.1 and want this fix, you have 2 options. The first is to follow Option 2 in our installation instructions and install from the stable/2.3.x branch of our GitHub code repository.

The second option is to patch the code yourself - if you look at the related commit from the fix, it is a one-line change to fix the resumption token:

The second issue is a bit trickier. The AtoM OAI repository module has not currently been set up to expose authority records - just archival descriptions. The fact that it is paginating at all through HTML search/browse pages is a bit surprising to me! You could always try altering the code locally to see if you can make it validate so Trove's harvester doesn't choke on that input? Ultimately, adding full support for exposing EAC-CPF XML authority records via the OAI repository module will require development to implement properly.
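To illustrate why a strict XML parser rejects the markup quoted earlier in this thread while a browser accepts it, here is a minimal Python sketch using only the standard library. It is not Trove's harvester or AtoM code: `FRAGMENT`, `FIXED`, and the helper names are made up for illustration, and whether the `FIXED` form would actually satisfy Trove's harvester is an assumption.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup as quoted from the browse page: `value` has no ="..." part
# and the <input> tag is never closed.
FRAGMENT = (
    '<div><label><input name="repos" type="radio" value '
    'checked="checked" data-placeholder="Search"> Global search</label></div>'
)

# Hypothetical local fix: give `value` an explicit (empty) value and
# self-close the tag so the fragment is well-formed XML.
FIXED = (
    '<div><label><input name="repos" type="radio" value="" '
    'checked="checked" data-placeholder="Search"/> Global search</label></div>'
)


def is_well_formed_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class InputAttrs(HTMLParser):
    """Collect the attributes of every <input> tag with a lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.inputs.append(dict(attrs))


def html_input_attrs(text):
    """Parse text as HTML and return a list of attribute dicts for <input> tags."""
    parser = InputAttrs()
    parser.feed(text)
    parser.close()
    return parser.inputs


if __name__ == "__main__":
    print(is_well_formed_xml(FRAGMENT))   # strict XML parse of the original
    print(is_well_formed_xml(FIXED))      # strict XML parse of the adjusted form
    print(html_input_attrs(FRAGMENT))     # lenient HTML parse of the original
```

The bare `value` attribute is acceptable in HTML but not in well-formed XML, and the unclosed `<input>` tag would trip a strict XML parser too, so a local template change would need to address both.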

As a final note, you might find the following development coming in the 2.4 release interesting: our OAI repository module will be able to expose EAD XML (rather than just simple Dublin Core) for harvesting records. To ensure that the page does not time out while trying to generate the EAD on demand, there is also an option to pre-generate and cache all EAD and DC XML. When the setting is turned on, any descriptions that are edited or newly created will automatically trigger a job to update the cached XML, so it can be exposed on request. See:

Regards,


Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.
Visit this group at https://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/a16cb146-e173-4e5c-a466-67ef9af159ca%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Erin Mollenhauer

Jun 16, 2017, 1:10:04 AM
to AtoM Users
Hi Dan,

Thanks for your reply. We have completed the upgrade and the resumption token issue is fixed - all of our description records have been harvested.
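For anyone wanting to confirm that a harvest advances rather than looping over the same batch, a rough Python sketch of an OAI-PMH client loop follows. It is not AtoM's fix or Trove's harvester. `fetch` is a hypothetical callable that returns the XML text of a `ListRecords` response (the first page when passed `None`, otherwise the page for the given resumption token), and `harvest_identifiers` and `max_pages` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_identifiers(fetch, max_pages=1000):
    """Follow resumption tokens and return every record identifier in order.

    Raises RuntimeError if a page contains only identifiers that were
    already seen, which is the symptom of a token that does not advance.
    """
    seen = set()
    ordered = []
    token = None
    for _ in range(max_pages):
        root = ET.fromstring(fetch(token))
        page_ids = [
            el.text
            for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        ]
        new_ids = [i for i in page_ids if i not in seen]
        if page_ids and not new_ids:
            raise RuntimeError("page repeated: resumption token is not advancing")
        seen.update(new_ids)
        ordered.extend(new_ids)

        # An absent or empty resumptionToken element means the list is complete.
        token_el = root.find(".//oai:resumptionToken", OAI_NS)
        has_token = token_el is not None and token_el.text
        token = token_el.text.strip() if has_token else None
        if not token:
            return ordered
    raise RuntimeError("gave up after max_pages pages")
```

A real `fetch` would issue HTTP requests with the standard OAI-PMH parameters: `verb=ListRecords` plus `metadataPrefix` on the first call, and `verb=ListRecords` plus `resumptionToken` on later calls. The base URL of the repository is left out here because it depends on the installation.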

We have decided not to pursue harvesting the authority records for now. This is what the Trove technician told me:

That the Trove Harvester does have the ability to paginate through HTML pages due to the way that it is set up (as it has the ability to take a list of links and follow each of them to find records), it isn’t related to the OAI-PMH feed at all.  I was experimenting with our other methods of harvesting to see if I could get it working. As it doesn’t appear simple I will leave it as it is.


Regards,
Erin

Dan Gillean

Jun 16, 2017, 12:47:33 PM
to ICA-AtoM Users
Good to know, Erin - thanks for posting an update to the thread! I'm glad to hear that upgrading solved the issue you were encountering.

If in the future your institution might be interested in sponsoring development to add support for exposing authority records via OAI-PMH, please feel free to contact me off-list, and Artefactual would be happy to provide you with an estimate. You can read more about how we maintain and develop AtoM here. If you have developers who are interested in doing the work themselves, check out our Developer resources, feel free to have them post thoughts and questions here in the forum, and consider sending us a pull request for inclusion in a future public release!

Best,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
