Problems generating and caching XML for all archival descriptions


Jenny Mitcham

Jun 5, 2018, 8:55:42 AM
to ica-ato...@googlegroups.com
Hi,

I am currently trying to use the script here (https://www.accesstomemory.org/en/docs/2.4/admin-manual/maintenance/cli-tools/#generate-and-cache-xml-for-all-archival-descriptions) to generate EAD for all the archival descriptions we have prior to testing out the new EAD harvesting functionality in AtoM 2.4.

I'm just doing this on our test instance of AtoM at the moment (which has more or less the same data as is in our production version).

I'm trying to create EAD for 33218 records, so this obviously takes a bit of processing power. The operation appears to have worked well, though I did need to dive in a couple of times and restart it (using the 'skip' option provided to start it off at the right place each time). When the operation appeared to grind to a halt each time, this had knock-on effects on our test AtoM web interface and also led to a 504 gateway timeout error at one point.

I'm now thinking about how we run this on our production version of AtoM with minimal impact on our users. It would be great if the command line task included the option to set start and end values (rather than just a start value when using the skip option). Then we could get it to create the EAD in blocks of, for example, 5000 at a time. I think this could then work without upsetting users of the live catalogue.
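The blocks-of-5000 idea could be sketched as a small wrapper around the task's existing --skip option (the total and batch size here are just illustrative figures from this thread, and the actual export call is left commented out since paths vary per site):

```shell
#!/bin/bash
# Walk through the records in fixed-size batches, computing the --skip
# offset for each batch. Only the offsets are printed here; uncomment
# the php line to run the real export for each batch.
TOTAL=33218   # approximate number of descriptions to export
BATCH=5000    # records per batch
skip=0
while [ "$skip" -lt "$TOTAL" ]
do
  echo "batch starting at --skip=$skip"
  # php symfony cache:xml-representations --skip="$skip"
  skip=$((skip + BATCH))
done
```

Note that this only helps if the task itself can also be told to stop after BATCH records, which is exactly the start/end (or limit) option being asked for here.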

I'd welcome any ideas and suggestions.

Many thanks,
Jen



--
Jenny Mitcham
Digital Archivist
Borthwick Institute for Archives
University of York
Heslington
York
YO10 5DD

Telephone: 01904 321170

Borthwick Institute website: http://www.york.ac.uk/borthwick/
Digital archiving blog: http://digital-archiving.blogspot.co.uk/
Twitter: @Jenny_Mitcham
Skype: jenny_mitcham

sbr...@artefactual.com

Jun 5, 2018, 6:55:04 PM
to AtoM Users
Hi Jenny

I've been looking into this today - it sounds like you are experiencing an out-of-memory issue when exporting the EAD records. I'd agree with you that adding a limit to the number of records that are exported is a possible workaround, but the effectiveness of this workaround will vary depending on the memory size of each hierarchy in question.


Steve

Vicky Phillips

Jun 7, 2018, 4:58:21 AM
to AtoM Users
Hi Jenny and Steve,

I had similar issues when trying to generate the PDF finding aids for our site. Creighton Barrett kindly shared some scripts which they'd developed to split up the process. The generateListOfObjects.php script will give you a list of all the top-level descriptions which have been published. I'm wondering if genFindingAidsFromList.php could be adapted to use the job that generates EAD instead of arGenerateFindingAidJob?

Vicky Phillips
Digital Standards Manager
National Library of Wales

David at Artefactual

Jun 7, 2018, 7:28:57 PM
to AtoM Users
Hi all,

Vicky, thanks for linking Creighton's post about the PDF finding aid helper scripts that Margaret Vail wrote. In its current state the cache:xml-representations script doesn't accept a "--slug" parameter (or similar) that would allow generating a cached EAD representation of a single fonds or collection; it tries to generate all of the EAD XML documents. But I think it would be a fairly simple change to add a "--slug" parameter to allow generating and caching a single EAD document, which would make it possible to use a script like genFindingAidsFromList.php in this context.

Jenny, I wanted to expand on Steve's answer a bit. AtoM is bad at managing memory for scripts that run for a long time, like the cache:xml-representations script. AtoM's memory problems are due to a problem in the old version of the Propel object-relational mapping (ORM) database library we are using. When Propel loads a record (like an archival description) into memory, it doesn't release the memory after the record is no longer needed. This usually isn't a problem for short-running scripts (like displaying a single archival description), but it means that when a long-running script loads a lot of records, the memory footprint of AtoM continually grows until the script completes, or is killed because it is using too much memory. We've tried in the past to get Propel to free up memory when a database record is no longer needed (goes "out of scope" in programming speak), but without much success. Because we use Propel throughout AtoM, it would be a very big job to switch to a different database library at this point.

I think your idea of adding an "--end" parameter to the script is a good workaround for the memory problems. Another, very similar option is to add a "--limit" parameter instead, e.g. limit the script to generating 1000 records rather than specifying the index of the last record to generate.

When Steve says the "effectiveness ... will vary depending on the memory size of the hierarchy in question", I think he means that the script can run out of memory and die before generating the specified number of records (e.g. it dies after 500 EAD docs instead of the specified 1000). In Steve's testing I think the cache:xml-representations script was dying after generating 40 EAD documents. The --skip parameter can still be used to resume the script, but it is hard to automate resuming when the script could die after 300 or 500 or 999 EAD documents are generated. This problem could be alleviated by storing the index of the last EAD document generated somewhere outside the script (e.g. in a text file), or by using a script like the one that Vicky shared that generates the EAD documents one at a time instead of in a batch.
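The idea of keeping the last-exported index outside the script could look something like this (a minimal sketch: the state file name and batch size are made up, and the real export call is left as a commented placeholder):

```shell
#!/bin/bash
# Persist the --skip offset in a small state file so an automated
# wrapper can resume a died export from roughly where it stopped.
STATE_FILE=./xml-export-offset
BATCH=1000

# Read the last saved offset, defaulting to 0 on the first run
skip=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
echo "resuming export at --skip=$skip"

# Run the export for this batch, e.g.:
# php symfony cache:xml-representations --skip="$skip"

# Record the next offset so a later run carries on from here
echo $((skip + BATCH)) > "$STATE_FILE"
```

A cron job or loop could then rerun this until the offset passes the total number of records; ideally the saved offset would be derived from the task's actual output rather than assumed, since the task may die partway through a batch.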

I hope that's helpful.

Best regards,
David

--

David Juhasz
Director, AtoM Technical Services
Artefactual Systems Inc.
www.artefactual.com

Jenny Mitcham

Jun 27, 2018, 5:17:59 AM
to ica-ato...@googlegroups.com
Hi,

I just wanted to update you on this query I raised last month.

We managed to generate the XML for all of our 34,000+ AtoM records yesterday on our production site.

To get around the problems described we temporarily doubled the memory that was available to AtoM.

This seemed to do the job - it took 11 hours to generate and cache all the XML but it didn't crash and it didn't impact on the user experience for people who may have been browsing our catalogue so I'm happy with that.
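For anyone wanting to try the same without resizing the whole server: assuming the binding limit is PHP's memory_limit setting (the doubling here may instead have been done at the VM level), the extra headroom could be granted to just this one task run with PHP's standard -d command-line flag. A sketch:

```shell
# Override memory_limit for this single CLI invocation only; the
# server-wide PHP configuration is left untouched.
php -d memory_limit=8G symfony cache:xml-representations
```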

We are now in a position to test the EAD harvesting.

All the best,
Jen

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-ato...@googlegroups.com.
Visit this group at https://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/c51b59a0-0681-4ea9-a40e-c4cd47556df0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Gillean

Jun 27, 2018, 10:58:33 AM
to ICA-AtoM Users
Hi Jen, 

Thanks for sharing this! Are you willing to share how much memory was allocated for this, so we have a useful reference point for others?

Regards, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


Jenny Mitcham

Jun 27, 2018, 11:02:46 AM
to ica-ato...@googlegroups.com
Yes sure.
Our AtoM site had 4GB but we upped it to 8GB for the purposes of this exercise...and will put it back down again tomorrow!
Cheers,
Jen


Dan Gillean

Jun 27, 2018, 11:09:48 AM
to ICA-AtoM Users
Thanks Jen!

So it appears that 8GB of memory allowed cached EAD and DC XML to be generated for 564 top-level descriptions (34,000+ total records) in about 11 hours. Good to know :)

Good luck with the next steps! 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


John Hewson

Aug 2, 2018, 4:18:00 AM
to AtoM Users
I wish I'd read this thread before we tried the same. Trying to export 118 fonds with c. 48,000 descriptions on a 4GB test server, we were getting killed at about 38,000, and it was taking Elasticsearch down with it, so we didn't fancy that in production. Rather than increase memory, we came to the same conclusions as above and batched the process. (Jenny, I gather you asked Kathryn about this, so here are the details, in case they help anyone.)

We hacked lib/QubitInformationObjectXmlCache.class.php, in the function exportAll, after the line
$exporting++;
we added
if (isset($options['skip']) && $exporting>1000) {exit("next");}

Then we called the task using an incrementing script:

#!/bin/bash
echo "Exporting"
skip=0
while [ "$(cd /path/to/your/atom && php symfony cache:xml-representations --skip="$skip" | tail -1)" == "next" ]
do
  (( skip+=1000 ))
  echo "$skip"
done
echo "Done"

This hugely reduced the memory requirement and time to run. Archiveshub (keen as mustard) have already made a successful harvest from the result. In our case, batches of 1000 worked nicely, but this could be reduced by orders of magnitude if necessary.

Jenny Mitcham

Aug 2, 2018, 4:44:42 AM
to ica-ato...@googlegroups.com
Thanks for sharing your workaround John.
Jen
