Major improvements in offline caching of EAD and DC XML

Dan Field

Mar 27, 2019, 5:11:14 AM
to AtoM Users
As you may be aware from previous threads on this group from my co-worker Vicky Philips, we at the National Library of Wales have been having problems with on-demand generation of EAD and DC from the web interface bogging down our web server. We decided instead to cache the entire AtoM system's EAD and DC offline, but when we tried this with arCacheDescriptionXmlTask we found that the entire process would take around six months to complete!

Thanks to comments on here about chunking the export by adding a --skip flag to that task, we were able to improve this, but given the size of our archive (~18,000 top-level archival collections, ~800,000 slugs) we still ran into serious memory issues. We put most of this down to the ORM being inefficient: at one point we logged around 30,000 SQL SELECT queries for a single EAD generation request.

The chunking got me thinking, and we decided to modify arCacheDescriptionXmlTask to take a --slug parameter. We then generated a list of all of the slugs in AtoM with a simple query:

SELECT slug FROM slug, information_object
WHERE information_object.id = slug.object_id
INTO OUTFILE '/tmp/all_slugs.txt';

(This could be wrapped up in a simple PHP script, but we didn't need to at the time.)
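Whatever the list size, a quick sanity pass before a multi-hour run is cheap insurance: drop blank lines and duplicates so no slug is cached twice. A small sketch with invented sample slugs (the sample path and slug values are hypothetical stand-ins for the real /tmp/all_slugs.txt):

```shell
# Invented sample list, standing in for the real /tmp/all_slugs.txt.
printf 'fonds-a\n\nfonds-b\nfonds-a\n' > /tmp/all_slugs.sample.txt

# Strip blank lines and duplicate slugs before feeding the list onward.
grep -v '^$' /tmp/all_slugs.sample.txt | sort -u > /tmp/all_slugs.clean.txt
wc -l < /tmp/all_slugs.clean.txt   # two unique slugs remain
```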

Then, using GNU parallel on a VM with a few (24) CPU cores, we were able to process this input list in parallel:

cat /tmp/all_slugs.txt | parallel "php symfony cache:xml-representations --slug={}"
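Before committing to a long run, it can help to print the per-slug commands rather than execute them. A dry-run sketch, using echo and xargs as a stand-in where GNU parallel isn't installed (the sample slugs are invented):

```shell
# Build a tiny sample slug list (invented slugs, stand-ins for real ones).
printf 'fonds-a\nfonds-b\nfonds-c\n' > /tmp/sample_slugs.txt

# Print (via echo), rather than run, one cache command per slug.
# -P 4 caps concurrency at four workers; -I{} substitutes each input line.
xargs -P 4 -I{} echo php symfony cache:xml-representations --slug={} \
  < /tmp/sample_slugs.txt
```

Dropping the echo turns the dry run into the real thing; GNU parallel's -j option plays the same concurrency-capping role as xargs -P.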

This enabled us to process all 800,000 slugs in around 48 hours! Obviously not everyone will have that many CPU cores available, and some will have many more, but even on a dual- or quad-core system this should bring massive improvements. It seems the original arCacheDescriptionXmlTask was filling memory by iterating over the entire AtoM archive in a single process, keeping much of it in memory; chunking down to the single-slug level removes that overhead. Even with the inefficient ORM, MariaDB was never really stressed by 24 parallel processes asking for data, and the cost of bootstrapping PHP and Symfony for each slug didn't make much of an impression either.
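As a back-of-the-envelope check on those figures, 800,000 slugs completed in 48 hours across 24 workers implies roughly five seconds of one worker's time per slug:

```shell
# Rough throughput implied by the figures above (integer arithmetic).
slugs=800000; hours=48; workers=24
per_worker=$(( slugs / workers ))                     # ~33,333 slugs each
secs_per_slug=$(( hours * 3600 * workers / slugs ))   # ~5 s of worker time
echo "$per_worker slugs per worker, ~$secs_per_slug s per slug"
```

At that per-slug cost, even a quad-core machine would finish in roughly a week rather than months.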

I'll try to put a PR in on GitHub when I get a chance, but note that we are still on 2.3.1. Glancing at the code (unmodified in two years), I suspect this will work fine on the 2.4 or 2.5 releases too, but for now, if anyone wants our modified class, here it is. It should live at lib/task/arCacheDescriptionXmlTask.class.php within your AtoM project:


class arCacheDescriptionXmlTask extends arBaseTask
{
  protected function configure()
  {
    $this->addOptions(array(
      new sfCommandOption('application', null, sfCommandOption::PARAMETER_OPTIONAL, 'The application name', 'qubit'),
      new sfCommandOption('env', null, sfCommandOption::PARAMETER_REQUIRED, 'The environment', 'cli'),
      new sfCommandOption('connection', null, sfCommandOption::PARAMETER_REQUIRED, 'The connection name', 'propel'),
      new sfCommandOption('skip', null, sfCommandOption::PARAMETER_OPTIONAL, 'Number of information objects to skip', 0),
      new sfCommandOption('slug', null, sfCommandOption::PARAMETER_OPTIONAL, 'Slug of a single resource to cache', null)
    ));

    $this->namespace = 'cache';
    $this->name = 'xml-representations';

    $this->briefDescription = 'Render all descriptions as XML and cache the results as files';
    $this->detailedDescription = <<<EOF
Render all descriptions as XML and cache the results as files
EOF;
  }

  public function execute($arguments = array(), $options = array())
  {
    parent::execute($arguments, $options);
    if ($options['slug']) {
      $this->export($options['slug']);
    } else {
      $this->exportAll($options);
    }
  }

  private function exportAll($options)
  {
    $logger = new sfCommandLogger(new sfEventDispatcher);
    $logger->log('Caching XML representations of information objects...');

    $cache = new QubitInformationObjectXmlCache(array('logger' => $logger));
    $cache->exportAll(array('skip' => $options['skip']));

    $logger->log('Done.');
  }

  private function export($slug)
  {
    $obj = QubitObject::getBySlug($slug);

    // Fail loudly rather than caching nothing if the slug doesn't resolve
    if (null === $obj)
    {
      throw new sfException("No resource found for slug \"{$slug}\"");
    }

    $logger = new sfCommandLogger(new sfEventDispatcher);
    $logger->log("Caching XML representation of resource {$slug}");

    $cache = new QubitInformationObjectXmlCache(array('logger' => $logger));
    $cache->export($obj);

    $logger->log('Done.');
  }
}


Dan Gillean

Mar 27, 2019, 10:37:42 AM
to ICA-AtoM Users
Hi Dan, 

Wow! Thanks for sharing this! I've asked one of our developers to take an initial look at this, but we'd love to get a pull request for it! If you do create one, please make sure you create the PR against our latest development branch (qa/2.5.x).

We would also ask that you complete and submit a Contributor's agreement if you submit a PR - a link to the form, where to send it, and more information on why we ask this can be found here: 
Thanks again for sharing what you've done! 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

