Notes on bulk generating PDF finding aids

Creighton Barrett

Feb 6, 2017, 3:25:53 PM
to ica-ato...@googlegroups.com
Hi everyone,

There have been a few posts recently about the XSLT templates for PDF finding aids, bulk generating finding aids, etc. (see here and here and here).

We are almost finished working through this and our talented developer, Margaret Vail, wrote some notes to share on the list. What we found was that some of our finding aids are too big and require some manual intervention. It was too hard to figure out where the custom script discussed in this thread was dying, so we split up the process by first running a script to produce a list of top-level descriptions and then feeding that list into a second script that runs the arGenerateFindingAid job on each top-level description.

We identified two finding aids that require individual CLI tasks for exporting EAD, transforming the EAD, and running Apache FOP. Not sure whether there are ways we could optimize the server to eliminate that problem, but it works for now. 

Anyway, here goes. Hope this is helpful to anyone else working on their PDFs!


AtoM - Bulk Generate PDF Finding Aids


Process

  1. Edit generateListOfObjects.php and genFindingAidsFromList.php so that both use the temporary filename you want to work with.
  2. Get the key that represents your AtoM instance so the script can work with QubitJob, and replace the placeholder key in genFindingAidsFromList.php with your own. I found this key by dumping data into a log file when creating a finding aid from the browser.
  3. Run generateListOfObjects.php. This script generates a list of all published top-level descriptions. Command: php symfony tools:run generateListOfObjects.php
  4. AtoM recommends deleting the contents of your <atom_home_dir>/downloads folder prior to generating new finding aids.
  5. Run genFindingAidsFromList.php. This script takes the list of top-level descriptions generated in step 3 and queues them in Gearman, which is AtoM's job processor. Dividing the work into two scripts allows you to remove completed finding aids from the list and continue with the remaining finding aids after an error or interruption. Command: php symfony tools:run genFindingAidsFromList.php

Notes: Your gearman worker may quit during this process. Restart the worker and it will automatically resume.
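
Removing completed finding aids from the list (the point of step 5) can be scripted. What follows is only a rough sketch and not part of the original scripts: the function name remaining_entries and the file done_ids.txt (one completed object ID per line, which you would have to compile yourself) are hypothetical. The only thing taken from the scripts is the list format, "<id>,<description>".

```shell
#!/bin/sh
# Print the entries of the list file whose object ID (the first
# comma-separated field) does NOT appear in the file of completed IDs.
# Usage: remaining_entries <list_file> <done_ids_file>
remaining_entries() {
    list_file=$1
    done_file=$2
    # While reading the first file (completed IDs), remember each ID.
    # While reading the second file (the list), print only unseen IDs.
    # Comparing FILENAME rather than NR == FNR keeps this correct even
    # when the completed-IDs file is empty.
    awk -F',' 'FILENAME == ARGV[1] { seen[$1] = 1; next } !($1 in seen)' \
        "$done_file" "$list_file"
}
```

For example, `remaining_entries processFindingAids.txt done_ids.txt > remaining.txt` writes the unfinished entries to a new file, which can then replace processFindingAids.txt before running genFindingAidsFromList.php again.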


Code

generateListOfObjects.php

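<?php

// Select every top-level description (direct children of the root object, id 1).
$c = new Criteria;
$c->add(QubitInformationObject::PARENT_ID, 1);

$file = fopen("processFindingAids.txt", "w") or die("Unable to open file!");

foreach (QubitInformationObject::get($c) as $io)
{
        // Skip drafts; only published descriptions go into the list.
        if ($io->getPublicationStatus()->statusId == QubitTerm::PUBLICATION_STATUS_DRAFT_ID)
                continue;

        // One line per description: "<id>,<description for the job>".
        $txt = $io->id . ",Generating finding aid for: " . $io->getTitle(array("cultureFallback" => true)) . "\n";
        fwrite($file, $txt);
}

fclose($file);

?>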


genFindingAidsFromList.php

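<?php

$myfile = fopen("processFindingAids.txt", "r") or die("Unable to open file!");

// Test the result of fgets() instead of feof(), so the empty read at the
// end of the file does not queue a job with no object ID.
while (($line = fgets($myfile)) !== false) {

        // Split on the first comma only, so titles containing commas stay intact.
        $var = explode(",", $line, 2);

        // Skip blank or malformed lines.
        if (count($var) < 2)
                continue;

        print trim($var[1]) . "\n";

        $params = array(
                'objectId' => $var[0],
                'description' => trim($var[1])
        );

        // Replace 'AtoM instance key' with the key for your own instance.
        $job = QubitJob::runJob('arGenerateFindingAidJob', $params, 'AtoM instance key');

        // Wait 30 seconds before queuing the next job.
        sleep(30);
}

fclose($myfile);

?>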


AtoM - Generate Large PDF Finding Aids from Command Line

This is useful if your PDF is too large to be generated using symfony and gearman.


Process

STEP 1:

<path_to_your_php>php -d memory_limit=-1 -d error_reporting="E_ALL" symfony export:bulk --single-slug=<top-level description slug> --public <location_to_save file> 2>&1


Example:

<path_to_your_php>php -d memory_limit=-1 -d error_reporting="E_ALL" symfony export:bulk --single-slug=dalhousie-university-reference-collection --public /temp 2>&1


STEP 2:

java -jar '<path_to_atom>/lib/task/pdf/saxon9he.jar' -s:'<full_path_to_file_including_name>' -xsl:'<path_to_atom>/lib/task/pdf/ead-pdf-<template_type>.xsl' -o:'<full_path_to_new_file_including_name>' 2>&1


Example:

java -jar '/appl/www/html/atom-2.3.0/lib/task/pdf/saxon9he.jar' -s:'/temp/ead_0000049278_dalhousie-university-reference-collection.xml' -xsl:'/appl/www/html/atom-2.3.0/lib/task/pdf/ead-pdf-full-details.xsl' -o:'/temp/ead_0000049278_dalhousie-university-reference-collection.fo' 2>&1


STEP 3:

fop -r -q -fo '<full_path_to_fo_file_including_name>' -pdf '<full_path_to_new_file_including_name>.pdf' 2>&1


Example:

fop -r -q -fo '/temp/ead_0000049278_dalhousie-university-reference-collection.fo' -pdf '/temp/dalhousie-university-reference-collection.pdf' 2>&1
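
Steps 2 and 3 can be chained for a single EAD file once Step 1 has produced it. This is a minimal sketch under some assumptions, not something we have run in this form: the function names (fo_path_for, pdf_path_for, ead_to_pdf) and the ATOM_DIR and TEMPLATE variables are made up for illustration, and the output files are simply named after the EAD file. Only the java and fop invocations come from the steps above.

```shell
#!/bin/sh
# Hypothetical defaults; change these to match your installation.
ATOM_DIR=${ATOM_DIR:-/appl/www/html/atom-2.3.0}
TEMPLATE=${TEMPLATE:-full-details}

# Derive the .fo and .pdf paths from the EAD path by swapping the extension.
fo_path_for()  { printf '%s\n' "${1%.xml}.fo"; }
pdf_path_for() { printf '%s\n' "${1%.xml}.pdf"; }

# Run Step 2 (Saxon: EAD to XSL-FO) and Step 3 (FOP: XSL-FO to PDF).
# Stops at the first failing step and reports which one failed.
# Usage: ead_to_pdf <full_path_to_ead_file>
ead_to_pdf() {
    ead=$1
    fo=$(fo_path_for "$ead")
    pdf=$(pdf_path_for "$ead")

    java -jar "$ATOM_DIR/lib/task/pdf/saxon9he.jar" \
        -s:"$ead" \
        -xsl:"$ATOM_DIR/lib/task/pdf/ead-pdf-$TEMPLATE.xsl" \
        -o:"$fo" || { echo "Saxon transform failed for $ead" >&2; return 1; }

    fop -r -q -fo "$fo" -pdf "$pdf" \
        || { echo "FOP failed for $fo" >&2; return 1; }

    echo "Wrote $pdf"
}
```

For example, `ead_to_pdf /temp/ead_0000049278_dalhousie-university-reference-collection.xml` would leave the .fo and .pdf files next to the EAD file. Keeping the two steps separate inside the function makes it easier to see which one dies on a very large finding aid.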

 




Dan Gillean

Feb 8, 2017, 11:52:00 AM
to ICA-AtoM Users
Creighton,

Thanks so much for sharing this!!! I'll try to add a link to these instructions on our wiki soon, in the Community Resources section :)

Cheers,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.
Visit this group at https://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/CAHueW_WYY9u9up6DiLQcf2cSqigqAAd%3DxPVsQmdY3HQAfPv2wg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Vicky Phillips

Apr 6, 2017, 6:07:25 PM
to AtoM Users
Yes, thanks for sharing this, Creighton. I'm trying to give this a go at the moment. I've managed to generate the list but I'm struggling to find the key for our installation in order to do the next part. Can anyone tell me which file I can look at in order to get the key, please?
Thanks
Vicky

Creighton Barrett

Apr 6, 2017, 8:43:31 PM
to ica-ato...@googlegroups.com
My pleasure, Vicky! Our developer Margaret Vail worked all of this out, so I might have to confirm with her, but I think she "found" the key by manually generating a finding aid via the browser and then checking the logs. Our key was a 32-character string. Have you tried that? Maybe someone knows of an easier method?

Remember that it is a good idea to clear all of the previously generated finding aids before running the scripts.

For anyone else following along here, there is one caveat to the steps to generate PDFs for large finding aids via the command line: EAD exported via command will include elements that are marked as hidden in the visible elements module. If you use the visible elements module to mask certain elements, then you must remove these elements from EAD exported via command before processing the EAD with your XSLT.

I'm not sure if there is some way to mask certain elements when EAD is exported via command, but you can easily remove the elements from the EAD in Oxygen and then continue on to the other steps.

Cheers,
Creighton

Dan Gillean

Apr 7, 2017, 11:09:38 AM
to ICA-AtoM Users
Hi Creighton,

Quick thought in response to the following:

For anyone else following along here, there is one caveat to the steps to generate PDFs for large finding aids via the command line: EAD exported via command will include elements that are marked as hidden in the visible elements module. If you use the visible elements module to mask certain elements, then you must remove these elements from EAD exported via command before processing the EAD with your XSLT.

The command-line task for bulk XML export includes the --public option - this is what the finding aid generation normally uses when excluding Physical storage info and drafts during generation. I have not looked closely at all steps you have shared, but if a user's concern is only about Drafts and physical storage info, it seems like you might be able to just include the --public option in Step 1 above, and then avoid having to manually remove these elements.

Users following along at home and considering this: Otherwise, Creighton is correct, and this applies everywhere in the application (e.g. the export option in the user interface). Visible elements will hide some fields from display, but currently ONLY physical storage information is removed from the exports as well based on Visible elements settings!

Thanks again for sharing this info!

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

Creighton Barrett

Apr 7, 2017, 11:44:16 AM
to ica-ato...@googlegroups.com
Thanks, Dan, that is great information. We're bogged down at the moment but will try to do some tests and incorporate that into our docs for future reference.

sbr...@artefactual.com

Apr 7, 2017, 7:18:58 PM
to ica-ato...@googlegroups.com
Hi Creighton

I was looking at genFindingAidsFromList.php, and in particular the key that is used - this can be simplified so that the key does not need to be found and passed in.  

The key is used to ensure that the gearman worker process associates a given job with the correct AtoM instance which is important when multiple AtoM instances are hosted on a single server.  The key is actually a hash of the strings for SiteTitle + Site Base Url + Site Root Dir and is normally calculated on the fly from AtoM Settings when a job is submitted via the Web Interface.

We can simplify genFindingAidsFromList.php so that it has the context it requires for QubitJob::runJob to generate this key automatically. My changes are the settings line at the top and the runJob() call without the key:

<?php

  // Grab the settings for use here
  sfConfig::add(QubitSetting::getSettingsArray());

  $myfile = fopen("processFindingAids.txt", "r") or die("Unable to open file!");

  // Test the result of fgets() instead of feof(), so the empty read at the
  // end of the file does not queue a job with no object ID.
  while (($line = fgets($myfile)) !== false)
  {
    // Split on the first comma only, so titles containing commas stay intact.
    $var = explode(",", $line, 2);

    // Skip blank or malformed lines.
    if (count($var) < 2)
      continue;

    print trim($var[1]) . "\n";

    $params = array(
      'objectId' => $var[0],
      'description' => trim($var[1]));

    // Now we can drop the key from the call to runJob() as the QubitJob
    // class will be able to correctly generate the key on the fly from
    // information passed in the settings array.
    $job = QubitJob::runJob('arGenerateFindingAidJob', $params);

    sleep(30);
  }

  fclose($myfile);
?>

Hope this helps!

Steve

 

Creighton Barrett

Apr 10, 2017, 9:14:36 AM
to ica-ato...@googlegroups.com
Fantastic, thanks so much Steve! I'll pass this along here so we can update our docs. Great stuff.

On 7 April 2017 at 20:18, <sbr...@artefactual.com> wrote:
<?php

  // Grab the settings for use here
  $context = sfContext::getInstance();
  sfConfig::add(QubitSetting::getSettingsArray());  // Important!
Vicky Phillips

Apr 25, 2017, 11:44:39 AM
to AtoM Users
Hi,
I've successfully generated a list of records and generated PDF Finding Aids from this. I'm currently working out how to copy the PDFs across from our test AtoM instance to our Production AtoM instance and getting the Download button to display in our Production instance. Currently if there's no PDF at all there's no Finding Aid section being displayed. When I copy the PDF from our test instance to Production instance I now get a Finding Aid section with Status: Unknown. Although this message suggests "No finding aid has previously been generated for this description" the system seems to recognise that something has happened as it's gone from no Finding Aid Section to with Finding Aid section and a status.  Any suggestions as to what I'm missing here? Just to note I'm planning on generating them on our test instance so as to not put any unnecessary stress on the Production site whilst generating all of the Finding Aids.

I'm also trying the second process documented by Creighton in order to process the larger archives. I've managed to do Step 1 in the process but I'm now stuck on Step 2. I've run the following command whilst in the atom directory but got a Connection timed out message.

 java -jar lib/task/pdf/saxon9he.jar -s:EADExport/st-davids-diocesan-records-5.xml -xsl:lib/task/pdf/ead-pdf-inventory-summary.xsl -o:PDF_FO/st-davids-diocesan-records-5.fo 2>&1
Error
  I/O error reported by XML parser processing
  file:/var/www/atom-2.3.0/EADExport/st-davids-diocesan-records-5.xml: Connection timed out
  (Connection timed out)
Transformation failed: Run-time errors were reported

I thought that this may have something to do with the large size of the archive (it took 2.5 hours to export the EAD in Step 1). So I then tried the same command on a single record archival description, again it failed but with No route to host (Host unreachable)

 java -jar lib/task/pdf/saxon9he.jar -s:EADExport/berriew-1982.xml -xsl:lib/task/pdf/ead-pdf-inventory-summary.xsl -o:PDF_FO/berriew-1982.fo 2>&1
Error
  I/O error reported by XML parser processing
  file:/var/www/atom-2.3.0/EADExport/berriew-1982.xml: No route to host (Host unreachable)
Transformation failed: Run-time errors were reported

Again does anyone have any suggestions as to what is causing these errors?

Thanks in advance for any help.

Vicky


Creighton Barrett

Apr 26, 2017, 11:22:54 AM
to ica-ato...@googlegroups.com
Hi Vicky,

Just wanted to send a quick note to say I am out of the office this week and unable to check with our team about this until sometime next week. But I can say that our work on the PDFs has stalled a bit because of other things and we haven't published our PDF finding aids on our live site. So we haven't encountered this problem yet! But I think our plans were to do this as part of a regular synchronization between our staff site and our public site. What is your normal routine for sending updates to your live site?

One thing to mention: if you generate the PDFs on your test site, you want to make sure the base URL in the PDF reflects the URL of your live site.

All for now,

Creighton

Vicky Phillips

May 3, 2017, 7:14:22 AM
to AtoM Users
Thanks Creighton. Just to let you know I decided to try and put our largest archive through the first process and it looks to have worked fine. I'm just getting the archivists to take a look at it to ensure it's all there. But it looks like we may not need to use the second process suggested.

As testing has gone well on our test instance of AtoM, last night I started generating 3000 records on our live site (it's live with the archivists but not live with the public at the moment), with the aim that these would have completed by this morning before the archivists started using the system. Unfortunately the live system ran out of memory when trying to extract the EAD. I had the following message:

Generating finding aid for: Aston Hall Estate Records

Exporting EAD has failed.

ERROR(EAD-EXPORT): PHP Fatal error: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 23221523 bytes) in /var/www/atom-2.3.0/lib/task/export/exportBulkTask.class.php on line 115


We've increased the memory and a couple have gone through but the rest remain in the queue. How can I get rid of these from the queue? I didn't want to be generating Finding Aids in the day while the archivists are working on the site otherwise anything that they do that requires the job scheduler will be added to the end of my long queue of PDF generation tasks.


We're still looking into how we can produce these on our test instance and copy them across to the live site. I was hoping that it would be a simple copy across from one server to another but it doesn't look that way at the moment. We'd be grateful for any suggestions as to how to go about this.

Thanks,

Vicky
