FW: [NDSA-ALL] Workflow capacity and "temporary" storage

Tallman, Nathan

Jun 16, 2020, 11:54:10 AM

to community

Greetings APTrusters,

Passing along this excellent white paper on APTrust architectures for scaling that Andrew Diamond wrote in response to a question from the digital preservation community on NDSA-ALL. I’m forwarding not only to share the excellent descriptions of APTrust experiences, but also as a demonstration of value that APTrust contributes to the community. This white paper is detailed, thorough, and surely useful in many contexts, not just digital preservation. Thank you, Andrew!

Nathan

From: NDSA-ALL <NDSA...@LISTS.CLIR.ORG> on behalf of andrew diamond <andrew....@APTRUST.ORG>
Reply-To: andrew diamond <andrew....@APTRUST.ORG>
Date: Monday, June 15, 2020 at 4:06 PM
To: NDSA-ALL <NDSA...@LISTS.CLIR.ORG>
Subject: Re: [NDSA-ALL] Workflow capacity and "temporary" storage

Hi Dan,

Our response is fairly lengthy, so I put it in this document:

https://docs.google.com/document/d/1c8x-BeA7K13JvCI34Nxk3E_Ha9iP84EXcBYiAdAJVSA/edit?usp=sharing

If you want to talk further about staging, preservation, or other issues, feel free to contact me or Bradley Daigle at APTrust.

Andrew Diamond

Lead Developer, APTrust

On Mon, Jun 15, 2020 at 11:18 AM Amy Kirchhoff <Amy.Ki...@ithaka.org> wrote:

Hi Dan ~

Your experience rings true (maybe in life, not just in preservation – we all manage to fill up our humongous staging areas).

We have implemented a few things at Portico over time. The biggest change was one we made several years ago, when we moved to what we call “straight-to-ingest” or S2I. Portico preserves electronic publications – journals, books, digitized collections, and even more new-fangled, database-like content. We are a not-for-profit with limited resources, and we were running into efficiency and cost-effectiveness issues. We can afford a certain amount of capacity to handle “problem content”, and the amount of content falling into this category was greater than our capacity. This problem content would remain in our processing space until we could address it, and we were simply not able to get to it all in a timely fashion. Our processing space was not designed for secure, long-term preservation; it was designed for transactional activities. During a project to identify efficiencies, we realized that much of the content we were identifying as “problem” was just the reality – sometimes journal articles are published with missing images! If that’s the way they were published, then it is perfectly appropriate for them to be preserved that way. In addition, there are whole categories of this content where, despite the problems, we had at least one complete rendition of the article or book. For example, if we are sent both the XML of an article and a PDF, even if we are missing an image file, we have the PDF – which is a perfectly acceptable and complete rendition.

Enter S2I – which lets us move some of this problematic content into the Archive without needing to resolve the problems before preserving it. We now grade content as we preserve it: ‘A’ means we believe we have everything needed and know of no problems in the content; ‘B’ means there are some problems, but we have at least one complete rendition of the item; and so on. We may eventually choose to preserve items graded ‘C’, ‘D’, and ‘F’, but at the moment we do not; they remain in our processing area until we address the problems. Along with grading content, we also implemented some significant improvements to our error tracking. We now have a new file, which is part of every archival unit and indicates the details of any problems we have with the content.
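To make the grading rule concrete, here is a minimal Python sketch. It is an illustration only, not Portico's code: the `ArchivalUnit` class, its field names, and both function names are hypothetical, and since the message does not say what separates ‘C’, ‘D’, and ‘F’, the sketch simply returns `None` for anything below ‘B’.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ArchivalUnit:
    """Hypothetical stand-in for one submitted article or book."""
    unit_id: str
    # Free-text problem notes, e.g. "missing image: fig2.tif"
    problems: List[str] = field(default_factory=list)
    # Renditions known to be complete, e.g. ["PDF"]
    complete_renditions: List[str] = field(default_factory=list)


def grade(unit: ArchivalUnit) -> Optional[str]:
    """Return 'A' or 'B' as described in the message, or None for anything lower.

    The criteria for 'C', 'D', and 'F' are not given, so this sketch
    does not attempt to tell them apart.
    """
    if not unit.problems:
        return "A"  # no known problems
    if unit.complete_renditions:
        return "B"  # problems, but at least one complete rendition
    return None


def ready_for_archive(unit: ArchivalUnit) -> bool:
    """Only 'A' and 'B' items move to the archive; the rest stay in processing."""
    return grade(unit) in {"A", "B"}
```

Under these assumptions, an article with a missing image but a complete PDF would grade ‘B’ and leave the processing area, with its list of problems travelling along in the archival unit.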

We released the changes to support S2I and are rolling it out scenario by scenario. For example, we knew that one of our biggest problems was missing images, so we started there. Another big category is XML that does not validate against the publisher’s DTD (if we can extract the title, authors, DOI, and a few other items from it with regular expressions, and we have a good PDF of the article, we’ll go ahead and preserve the content). We have tens of additional scenarios we could roll out. Content in each of the implemented scenarios moves through our processing space and is preserved in the archive with a record of its grade and errors. This gets it out of our processing space and into the archive, which is a much more appropriate, long-term location. With the detailed error message tracking, we now have the information we need to identify content with specific problems and pull it out of the archive for reprocessing, as appropriate. We can also write rules to alert us to anomalies – for example, say publisher A has a historic pattern of 10% of its content being marked B because of missing images. If we see that pattern changing, say it rises to 50%, we can get an alert, take a look, and prioritize a deep review of that publisher.
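The anomaly rule in that last example could look something like the following Python sketch. The function names and the `tolerance` threshold are assumptions made for illustration; the message does not describe how Portico's alerting is actually implemented.

```python
from typing import Iterable


def grade_b_share(grades: Iterable[str]) -> float:
    """Fraction of a publisher's recent items that were graded 'B'."""
    grades = list(grades)
    if not grades:
        return 0.0
    return sum(1 for g in grades if g == "B") / len(grades)


def should_alert(recent_grades: Iterable[str],
                 historic_share: float,
                 tolerance: float = 0.15) -> bool:
    """Flag a publisher whose 'B' share drifts away from its historic pattern.

    `tolerance` is an assumed threshold, not a value from the message.
    """
    return abs(grade_b_share(recent_grades) - historic_share) > tolerance


# Publisher A historically has about 10% of its content graded 'B'.
steady = ["A"] * 9 + ["B"] * 1   # 10% 'B', matches the historic pattern
shifted = ["A"] * 5 + ["B"] * 5  # 50% 'B', well outside the pattern
```

With these inputs, `should_alert(steady, 0.10)` stays quiet and `should_alert(shifted, 0.10)` fires, which is the cue to prioritize a closer look at that publisher.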

Note that not all problems are publisher-introduced; sometimes the problem is ours. For example, we code our XML transformations very conservatively. We do not write a rule for every element in the DTD, just those we have seen in action. Thus, if during processing we encounter an element that we do not have a rule for, that is also a ‘problem’.
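A rough Python sketch of that conservative check, using the standard library's `xml.etree.ElementTree`: the element names in `KNOWN_ELEMENTS` and the function name are made up for the example and do not come from any real publisher DTD or from Portico's transformations.

```python
import xml.etree.ElementTree as ET
from typing import List, Set

# Hypothetical set of elements for which a transformation rule exists.
KNOWN_ELEMENTS: Set[str] = {"article", "title", "author", "p"}


def unhandled_elements(xml_text: str, known: Set[str] = KNOWN_ELEMENTS) -> List[str]:
    """Return the sorted names of elements that have no transformation rule.

    A non-empty result would be recorded as a 'problem' on our side,
    not as a fault in the publisher's content.
    """
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter() if el.tag not in known})


sample = "<article><title>T</title><sidebar><p>note</p></sidebar></article>"
print(unhandled_elements(sample))
```

Here the `sidebar` element would be reported, since no rule has been written for it yet.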

We are not pushing all content into the archive willy-nilly, but S2I has greatly improved our ability to get content into the archive and out of our processing system, and to track any problems we have with the content.

You can read more about our S2I project here:

·         http://doi.org/10.17605/OSF.IO/VW7RJ

·         https://www.dpconline.org/blog/idpd/taming-the-pre-ingest-processing-monster

Another tactic we have used for staging areas (as opposed to processing areas) is that when we have identified large amounts of content that need to stick around in our staging area for a good amount of time, we will sometimes off-load them into Glacier. For example, we have one large publisher that sent us their content multiple times. It was going to take us a long while to confirm it had all made it into the Archive, and we did not want to delete it until we had that confirmation – this was prime content to move out of our staging area and into Glacier.

I’m happy to answer questions (or find the right person on staff to answer questions!).

~ Amy (Portico, Archive Service Product Manager)

From: The NDSA organization list <NDSA...@LISTS.CLIR.ORG> On Behalf Of Noonan, Dan
Sent: Friday, June 12, 2020 2:48 PM
To: NDSA...@LISTS.CLIR.ORG
Subject: [NDSA-ALL] Workflow capacity and "temporary" storage

Hi All: This is a query that I think is valuable for our larger digital preservation community, so please reply to the whole list.

A few years back we commissioned a new shared drive to be the staging area for content, where processing and metadata creation happens prior to ingest into our Digital Collections (Fedora/Hyrax) system. We originally expected it to be about 5-10 TB of fluid space, with things coming in, being processed and ingested, and local copies being disposed of to free up space. Unfortunately, it has ballooned to 30+ TB. Some of that comes from a pause we had on ingest leading up to the Hyrax upgrade, while also being without a metadata librarian at the time. Part of it comes from a significant amount of AV digitization that we had the funding to do, but not the human resources to push through the process post-digitization.

So the series of questions I have been tasked with is to find out:

How do more mature digital libraries manage temporary storage for processing?
Do they also have a mismatch between their ambitions and their capacity?
If not, what processes have they developed to keep them in sync?
What prioritization metrics do you use?
Are these adaptable processes for other institutions?

Please let me know your thoughts – Thanks – Dan

Daniel W. Noonan

Associate Professor

Digital Preservation Librarian

University Libraries | Digital Programs

320A 18th Avenue Library | 175 West 18th Avenue, Columbus, OH 43210
614.247.2425 Office
noon...@osu.edu go.osu.edu/noonan @DannyNoonan1962

http://orcid.org/0000-0002-7021-4106

Pronouns: he/him/his ~ Honorific: Mr.

Buckeyes consider the environment before printing.

Campus Campaign Fund: 483229 Rare Books and Manuscripts fund for LGBTQ

########################################################################

to manage your NDSA-ALL subscription, visit ndsa.org/ndsa-all


Andrew Diamond

Jun 16, 2020, 12:09:52 PM

to Tallman, Nathan, community

Thanks, Nathan. I think many people have to work through similar issues. The paper just touches on common problems and potential solutions. I hope it's useful.

Andrew Diamond

Lead Developer, APTrust

--
You received this message because you are subscribed to the Google Groups "community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to community+...@aptrust.org.
To view this discussion on the web visit https://groups.google.com/a/aptrust.org/d/msgid/community/98A2FE60-EB71-4509-8062-B0DCCDEFBD97%40psu.edu.
