Archivematica Data Flow


Matthew Gettemy

Sep 25, 2015, 9:01:11 AM
to archivematica
Hi,
I am running Archivematica 1.4.1.

I guess it depends on how the microservices are set up, but I was wondering whether Archivematica keeps multiple copies of a given transfer's data on disk at the same time. While a transfer is being processed, the space used seems to be greater than the size of the data in the transfer. Is there a way to find out how much extra space is used, and are there two copies of the data? I am also wondering whether there is any documentation describing the data flow through the /var/archivematica file structure.


Thanks!
Matthew

Sarah Romkey

Sep 30, 2015, 11:48:48 AM
to archiv...@googlegroups.com
Hello Matthew,

You are correct, Archivematica requires more processing space than the transfer itself, for a couple of reasons. Depending on what you're running and the number of files in your transfer, some of the services can be resource intensive. For example, many logs are created when you run Examine Contents, and for transfers with many files, the transfer METS file can be very large once it is created. Unless you hit a bug, Archivematica should only create duplicate copies of digital objects when you move objects into the pipeline, into or out of transfer backlog, and into archival storage. Some micro-services, such as Normalization and AIP packaging, also require extra space.

I think the closest thing we have to documentation on this is our Micro-services wiki page (https://wiki.archivematica.org/Micro-services). If you expand each micro-service you get a bit more detail on what's going on. Any place where it says "Move to processing directory" or similar, Archivematica is moving, not copying, the content.

Hope that helps clarify things a bit.

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Systems Archivist
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory




Matthew Gettemy

Oct 1, 2015, 3:39:10 PM
to archivematica
Hey Sarah, 
That is very helpful, thank you! The issue came up because we allotted a partition for /var/archivematica that was only slightly larger than a transfer we were testing. We are also in the process of trying to determine how to size the partitions to handle our expected future load. More testing required!

- Matthew

Justin Simpson

Oct 1, 2015, 4:18:44 PM
to archiv...@googlegroups.com
One heuristic that has been useful is to take the largest transfer you would want to process at once and allocate 3 to 4 times as much space for your processing location (i.e., /var/archivematica/sharedDirectory). This allows normalization for access and normalization for preservation to both be run, as well as the Examine Contents microservice.

If you process materials and leave them in the pipeline (for example, waiting at an 'upload DIP' question or some other user prompt), and then continue with other transfers, then you need extra space. If you always process one transfer at a time, and run it all the way through to either transfer backlog or AIP storage, then you don't need to allocate extra space to hold the transfers/SIPs that are not yet complete. But it is a trade-off.

The amount of space required for normalization depends on the files being normalized. Turning TIFFs into JPEGs for access doesn't require nearly as much space as turning a .mov file into an .mkv for preservation. Some compressed video formats can require 7 to 20 times as much disk space, in my experience (i.e., a 10 GB .mov could require up to 200 GB of disk space during normalization).
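To make the arithmetic concrete, here is a minimal Python sketch of the two rules of thumb above. The function names, parameters, and defaults are hypothetical and are not part of Archivematica; the multipliers are simply the figures quoted in this message, so treat the results as rough planning numbers and not as guarantees.

```python
def estimate_processing_space_gb(largest_transfer_gb, multiplier=4.0,
                                 held_transfers_gb=0.0):
    """Rough size for the processing location, in GB.

    multiplier: the "3 to 4 times" rule of thumb (assumed default: 4).
    held_transfers_gb: space for transfers/SIPs left waiting in the
    pipeline while other transfers run (assumed to simply add up).
    """
    if largest_transfer_gb < 0 or held_transfers_gb < 0:
        raise ValueError("sizes must be non-negative")
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    return largest_transfer_gb * multiplier + held_transfers_gb


def video_normalization_range_gb(source_gb, low=7.0, high=20.0):
    """Low and high disk-space estimates for normalizing compressed video,
    using the "7 to 20 times" range quoted above."""
    return source_gb * low, source_gb * high


if __name__ == "__main__":
    # A 100 GB transfer processed on its own, with the 4x multiplier.
    print(estimate_processing_space_gb(100))
    # A 10 GB .mov: somewhere between 70 GB and 200 GB during normalization.
    print(video_normalization_range_gb(10))
```

If video normalization is part of the workflow, the larger of the two estimates is probably the safer one to plan around.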

There is also a set of directories where files can accumulate in an Archivematica pipeline. Starting with the 1.4.0 release, the administration tab has an option to view the contents of these directories and clear them out with the click of a button. This includes things like the DIP upload directory, and the failed and rejected directories where Archivematica puts SIPs for which processing has terminated unsuccessfully.

 


Justin Simpson
Director of Archivematica Technical Services
www.artefactual.com
604-527-2056


Matthew Gettemy

Oct 9, 2015, 9:27:13 AM
to archivematica
Hey Justin,
Thank you for the detail! We are mostly expecting to process large batches of text files. While these are being processed, the consumed partition space peaks at 2-3x the size of the transfer.
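For anyone who wants to take a similar measurement, below is a rough Python sketch of one way to do it: sample the used space on the partition while a transfer runs, then compare the peak with the transfer size. The helper names and the sampling interval are hypothetical, and this is not necessarily how the figures above were obtained. It uses only `shutil.disk_usage` from the standard library.

```python
import shutil
import time


def used_bytes(path):
    """Bytes currently used on the partition that holds `path`."""
    return shutil.disk_usage(path).used


def sample_usage(path, duration_s, interval_s=5.0):
    """Collect used-space samples for `duration_s` seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(used_bytes(path))
        time.sleep(interval_s)
    return samples


def peak_usage_ratio(samples, baseline_bytes, transfer_bytes):
    """Peak extra space used, as a multiple of the transfer size."""
    if transfer_bytes <= 0:
        raise ValueError("transfer_bytes must be positive")
    return (max(samples) - baseline_bytes) / transfer_bytes


# Example usage (path taken from this thread; run while a transfer is processing):
#   baseline = used_bytes("/var/archivematica")
#   samples = sample_usage("/var/archivematica", duration_s=3600)
#   print(peak_usage_ratio(samples, baseline, transfer_bytes=50 * 1024**3))
```

The baseline should be taken before the transfer starts, and the result is only meaningful if nothing else is writing to the same partition during the run.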


- Matthew