Archivematica scalability on very large objects


Mathieu Giannecchini

unread,
May 17, 2016, 9:05:30 AM5/17/16
to archiv...@googlegroups.com
Dear all,

We are facing some scalability issues in our tests with
Archivematica (1.4) and very large objects, and we would like to share
our questions and reflections.
By very large objects we mean a cinema Digital Source Master asset with
134,000 files and a total size of 1.3 TB (and that is far from being our
biggest asset).
A default Archivematica pipeline cannot build an AIP from this kind of
asset.

Our pipeline hardware configuration is 16 CPU cores, 32 GB RAM and 55 TB of disk.

We have identified three main scalability issues:
1. Multiple copies on the pipeline during ingest
2. A lot of jobs are launched once per file (one instance each) instead
of once for the whole asset
3. HTTP timeouts during API communication between Archivematica and the
Storage Service

1. Multiple copies on the pipeline during ingest
This is the part we understand the least: we are not yet sure how many
copies AM performs, how often they happen, how configurable they are,
and so on.

- Is there any documentation for finding which jobs perform a "copy" and
which ones perform a "move"?
- Is there any way to customize the workflow to perform a move instead
of a copy?

2. A lot of "Tasks" are configured with a "TaskType" of "for each file".
These scale very badly in our case. A typical example is
"removeUnneededFiles". The purpose of this task is to remove two
files/directories named "Thumbs.db" and "Icon".
This seems simple, and should be, but the way it works is that 134,000
instances of Python are launched, each with one (and only one) filename
as its argument.
This takes half a day, just to remove a few files that we know for sure
will never be present in our transfers.

A solution would be to change the TaskType to "one instance" and to
write an improved script that handles the directory as a whole. This
would make sense beyond just our use case.

- Is there any reason it was done this way (for each file)?
- Do you agree with the proposed solution? Any other ideas on how to
fix it?
- How would you change the TaskType? Is there any way other than
modifying the database directly?


3. HTTP timeouts during API communication between Archivematica and the
Storage Service. Most transfers take hours, if not days, and as the
management protocol is based on HTTP, we very quickly hit timeouts.

We use nginx as the frontend, and we can easily raise the timeout, but
that is mostly a workaround.
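For reference, the knobs involved are the proxy timeouts in the nginx
server block fronting the Storage Service; something along these lines
(the upstream name and the values are illustrative, not recommendations):

```nginx
location / {
    proxy_pass            http://storage-service;  # hypothetical upstream
    proxy_connect_timeout 300s;
    proxy_send_timeout    4h;  # how long nginx keeps sending to the backend
    proxy_read_timeout    4h;  # how long nginx waits for the backend to answer
}
```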


Is anyone else facing HTTP timeouts between Archivematica and the
Storage Service on big assets?
Do you have any proposal for handling this problem?

Many thanks for any feedback.

Kind regards,

Mathieu & Thomas

--
Mathieu Giannecchini | IT project manager
E-mail : mgiann...@ymagis.com
YMAGIS Group

Thomas Capricelli

unread,
May 19, 2016, 3:43:49 PM5/19/16
to archiv...@googlegroups.com


Any follow-up on this topic? Does nobody have experience using
Archivematica on big files or big directories?


best regards,
Thomas
Thomas Capricelli <capri...@sylphide-consulting.com>
http://www.sylphide-consulting.com

Sarah Romkey

unread,
May 19, 2016, 5:01:14 PM5/19/16
to archiv...@googlegroups.com
Certainly we know of some Archivematica community users who process at a large scale, so perhaps some of them will have the chance to contribute to this thread. For our part at Artefactual, we have our heads down on the 1.5 release at the moment and so may not have time to answer forum questions for the next while.

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Archivematica Program Manager
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory





--
You received this message because you are subscribed to the Google Groups "archivematica" group.
To unsubscribe from this group and stop receiving emails from it, send an email to archivematic...@googlegroups.com.
To post to this group, send email to archiv...@googlegroups.com.
Visit this group at https://groups.google.com/group/archivematica.
For more options, visit https://groups.google.com/d/optout.

Kayleigh Roos

unread,
May 20, 2016, 9:31:56 AM5/20/16
to archivematica, capri...@sylphide-consulting.com
Hi

I'm a digital curator and don't have as much technical knowledge of the issues you've described above, but I can confirm that we are experiencing issues with ingesting large files and collections, even with photographic collections of less than a TB. The larger the collection, the more issues we have.

We've only just started using Archivematica and are only testing the software at the moment, so I'll definitely keep the issues you've mentioned in mind and report back here if we find any useful solutions.

Kayleigh
UCT Libraries, South Africa

Dawson, Leilani

unread,
May 20, 2016, 10:00:33 AM5/20/16
to archiv...@googlegroups.com, capri...@sylphide-consulting.com

Hi, another non-techie here.

Right now we're working with a 4-core machine with 24 GB of memory and a 3 TB drive (of which Archivematica has access to a little over 1 TB).

Our test ingests (mostly geospatial datasets with associated imagery and CAD design/architecture files) have run up to around 35,000 files. That said, we haven't been successful with any ingests containing more than around 5,000 or 6,000 files.

File size is also an issue, especially on the storage end, since our AIPs run 2-10x larger than our SIPs and our DIPs around 6x larger than our SIPs. So far, splitting ingests into groups of 5,000 files or fewer and then recombining them by linking them together in an AIC after processing has been our main workaround.
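For what it's worth, the splitting step lends itself to scripting. A
minimal sketch (a hypothetical helper, not an Archivematica feature)
that partitions a directory tree into transfer-sized lists of files:

```python
import os

def chunk_file_list(root, chunk_size=5000):
    """Group all files under `root` into lists of at most `chunk_size`
    paths, so each list can be staged as its own smaller transfer."""
    chunks, current = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            current.append(os.path.join(dirpath, name))
            if len(current) == chunk_size:
                chunks.append(current)
                current = []
    if current:  # leftover files that didn't fill a full chunk
        chunks.append(current)
    return chunks
```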

-Leilani.

Wildlife Conservation Society Library & Archives


Andrew Berger

unread,
May 20, 2016, 12:58:11 PM5/20/16
to archiv...@googlegroups.com
Hi,

For large (in terms of disk space used) AIPs, we ran into the same issue with the default nginx timeout and now have it set to 4 hours, which has worked fine given the speed of our internal network. Our largest AIPs run into the 350 GB range but usually contain only a small number of files (mostly video). We didn't change any of Archivematica's default settings other than the nginx timeout.

For large (in terms of file count) AIPs, we've run into a number of problems in testing, and so far haven't ingested any packages with more than 1,000 files in production. I've tested packages of up to about 20,000 files (about 2 GB or less in total) on a VM with fairly limited power (8 GB RAM, 4 cores). To successfully ingest packages with 5,000+ files on that machine I had to raise some MySQL (number of connections allowed) and Elasticsearch (memory) settings.
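For anyone else tuning a small VM, the two families of settings I mean
look roughly like this; paths and values are illustrative for an
Ubuntu-style install of that era, not exact prescriptions:

```ini
# /etc/mysql/my.cnf -- raise the connection cap; the stock default (151)
# can be exhausted when many client tasks hit the database in parallel.
[mysqld]
max_connections = 500
```

```sh
# /etc/default/elasticsearch -- give the Elasticsearch JVM more heap
# (ES_HEAP_SIZE is the 1.x-era mechanism).
ES_HEAP_SIZE=4g
```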

I've also wondered why Archivematica carries out some tasks on a per-file basis rather than per transfer. For example, in testing I've turned off some options in the hope of speeding things up, but selecting "do not identify file formats"[1] on a SIP with thousands of files results in thousands of tasks each skipping file format identification, instead of skipping it for the whole transfer in one step.

[1] Our collection has many unusual and old formats that are either not identified or misidentified by file format tools. Under the circumstances, and unless I'm able to find time to document their signatures, I'd rather have no file format ID than the wrong file format ID in the AIP.

Andrew

