bulk ingest

Joseph

Jun 29, 2011, 1:08:46 PM
to archivematica
I've installed the virtual machine appliance of Archivematica and
tested ingesting a few items. One question I have concerns bulk
ingest. At times, manually ingesting one SIP at a time would be
acceptable, but at other times it would be completely impractical. I
don't see how you might accomplish a bulk ingest at present. It would
necessarily have to streamline or skip some of the micro-services
procedures. Would that defeat the purpose, I wonder?

Thank you all for any feedback
Joseph

Evelyn McLellan

Jun 29, 2011, 2:39:23 PM
to archivematica
Hi Joseph,

Although by default Archivematica includes a number of steps at which
the user reviews and approves a SIP, the system can be configured to
skip those steps and simply move the SIP through the micro-services.
See the thread at
http://groups.google.com/group/archivematica/browse_thread/thread/1a6c39835827348a/c1857761e363d9d5?lnk=gst&q=approval#c1857761e363d9d5
for more information. I don't think that removing some or even all of
the approval steps would be problematic if you were confident that the
SIPs were properly formed, contained preservable/easily normalized
formats, included all necessary transfer documentation etc. prior to
ingest. This might be the case if you had a large number of similar
SIPs, for example.

I hope I understood your question correctly and that this is helpful.
If not, please let me know.

Evelyn McLellan
Archivematica Community Manager

Peter Van Garderen

Jun 29, 2011, 2:41:23 PM
to archiv...@googlegroups.com
This is another use case for SIP-specific config files which would tell
Archivematica to skip certain micro-services, steps, etc.

Cheers, --peter

Joseph

Jun 30, 2011, 9:18:42 AM
to archivematica
Thank you, Evelyn. This is very helpful, and I will follow through
with a test of changing the mcpModulesConfig file as you explained to
Brian.

I am thinking of two scenarios in particular for which I am looking
for a practical digital preservation solution. In the first, I would
need to ingest several thousand text documents in PDF/A format. In the
second, I need a solution for faculty research data.

The former couldn't possibly be done manually, one item at a time,
though it does fit your criteria of having a large number of similar
SIPs. Although we don't use CONTENTdm, I would also think that a plugin
for that program would need to support the easy import of multiple
items simultaneously.

For faculty research data, I could see how the manual review and
approval steps would be useful and manageable.

Thank you again for the feedback,
Joseph

Evelyn McLellan

Jun 30, 2011, 1:02:58 PM
to archivematica
Joseph, is there any reason to run your PDF/A files through one by
one? Archivematica doesn't restrict the number of objects in a SIP.
What is the intellectual relationship between the files? Do they all
belong to the same fonds/series/collection, for example, or can they
be logically divided into such groupings?

Also, I think you have another workflow requirement - you'll want to
skip the normalization workflow, or you'll want to edit the database
to ignore the PDF/A files during normalization, since they're already
in a valid preservation and access format. By default, Archivematica
converts incoming PDFs to PDF/A, and since the file extension is the
same it won't be able to tell that your files are already PDF/A. Note
that this is something we're planning to change in release 0.7.2. If
you want to edit the normalization path in the database please let us
know and we can help. Please also note that in 0.7.2 Archivematica
will allow the user to edit config files in the SIP which would change
the workflow as the SIP is processed, so various workflow decisions
can be implemented on a SIP by SIP basis.

Evelyn

Joseph

Jul 6, 2011, 1:57:14 PM
to archivematica
Evelyn, I've been thinking about this, and although the files are of
the same series/collection, they each contain descriptive metadata. I
can't see how thousands of documents each with a metadata file could
be managed properly in a single SIP.

Thanks,
Joseph

Evelyn McLellan

Jul 6, 2011, 7:28:08 PM
to archivematica
OK, I understand your use case a little better now. When you say each
object "contains" descriptive metadata, do you mean that the metadata
are embedded in the object, or that there is a separate metadata file
(for example, a separate xml file) for each object? If the latter,
what kind of system is generating the metadata?

Evelyn

Joseph

Jul 10, 2011, 2:56:06 PM
to archivematica
The metadata for each object is held separately. We are using an
Access database designed in-house to collect MODS descriptive metadata
which we are then exporting as Dublin Core XML for ingest into Digital
Commons. Because Digital Commons is not a viable preservation solution
I am looking for a local system to manage preservation copies of
files. In the case of the document PDFs, I am thinking that an AIP
should consist of a PDF/A file, XML MODS and Dublin Core files, and a
PREMIS preservation metadata file.

Joseph

Evelyn McLellan

Jul 12, 2011, 1:42:56 PM
to archivematica
Are there fields in the DC or MODS files that point to the PDF file?
For example, does the PDF file have a unique identifier that is
captured in the DC or MODS metadata? Or maybe the filename of the
metadata file relates to the filename of the PDF? In Archivematica
0.7.2 we are planning to be able to accommodate bulk ingest of objects
and their metadata from TRIM, and the TRIM metadata file will have
pointers to the objects being ingested. We will program Archivematica
to act on those pointers by linking the objects to their metadata in
the AIP METS file. We're doing this precisely because of the use case
you describe - linking multiple objects in the SIP to their own
individual metadata that are ingested with them.

Right now you could put your metadata files in the metadata directory
of the SIP, but Archivematica wouldn't notice them or act on them in
any way, and if you had multiple objects and multiple metadata files
you probably wouldn't have an easy way to tell which metadata
described which object. Even in 0.7.2 you would likely need some
integration work done to link the object to its metadata.

This is basically the state of things and a description of where we're
headed with Archivematica development this year. Ingesting SIPs from a
wide variety of systems with different metadata formats is definitely
something we want Archivematica to be able to do, but it will be an
ongoing process, and, depending on your particular requirements, there
may be a certain amount of integration work required to get
Archivematica to do exactly what you need it to do.

I hope this is helpful.

Evelyn

Joseph

Jul 12, 2011, 9:04:23 PM
to archivematica
Thank you, Evelyn. This is very helpful indeed. We are using Access
for data entry and also for scanning, through a driver we found that
can be controlled from Access. The scan files generated are named
using the unique identifier from the description record along with a
short prefix. That file name is then automatically added to a field in
the item record, so, yes, a pointer does exist in the item record.
Currently we are generating DC XML files, but I would also like to
generate a MODS XML file, though that's a bit more complicated.
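
The naming convention described above could be used to pair each scan
with its metadata file. The following is only a rough sketch: it
assumes scans are named "<prefix><identifier>.pdf" and metadata files
"<identifier>.xml", and the function name and both file-name patterns
are hypothetical, not taken from the actual Access export or from
Archivematica.

```python
from pathlib import PurePath
from typing import Dict, Iterable, List, Tuple


def pair_scans_with_metadata(
    scan_names: Iterable[str],
    metadata_names: Iterable[str],
    prefix: str,
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Match scan files to metadata files through the shared identifier.

    Assumes scans are named "<prefix><identifier>.pdf" and metadata files
    "<identifier>.xml"; both conventions are illustrative guesses.
    """
    # Index the metadata files by identifier (file name without extension).
    metadata_by_id = {PurePath(name).stem: name for name in metadata_names}

    pairs: Dict[str, Tuple[str, str]] = {}
    unmatched: List[str] = []
    for scan in scan_names:
        stem = PurePath(scan).stem
        # Strip the short prefix to recover the record identifier.
        identifier = stem[len(prefix):] if stem.startswith(prefix) else None
        if identifier is not None and identifier in metadata_by_id:
            pairs[identifier] = (scan, metadata_by_id[identifier])
        else:
            # Keep a list of scans with no metadata so they can be reviewed.
            unmatched.append(scan)
    return pairs, unmatched
```

Reporting the unmatched scans separately would make it easier to catch
records whose file-name field was never filled in before anything is
staged for ingest.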

I'm pleased to hear of your plans for 0.7.2. That should definitely
address a significant need. Still, we are digitizing an archival
collection consisting almost entirely of documents, with an
anticipated total of over 10,000 items, and I'm not certain how I
would break that up into manageable SIPs. There seems to be a question
of scale here, and this approach simply may not be workable at that
scale.

I was contemplating how multiple items, each in its own SIP folder,
could be staged in the receiveSIP directory and moved through the
ingest process by scripts that loop over each folder (eliminating some
micro-services). You would lose some benefits of the process but
retain the most important ones, such as the auto-generated METS and
BagIt files. But it seems as though your approach is to think in terms
of an entire collection or series preserved as a single AIP. I'm still
having a hard time wrapping my head around that concept.
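
As a rough illustration of the looping idea above, the sketch below
copies one folder per item into a staging directory. Everything in it
is an assumption: the function name and both paths are hypothetical,
the real location of the receiveSIP directory depends on the
installation, and the sketch does not skip or trigger any
micro-services itself.

```python
import shutil
from pathlib import Path
from typing import List


def stage_sips(source_root: Path, watched_dir: Path) -> List[str]:
    """Copy each item folder under source_root into watched_dir as its own SIP.

    Both directories are hypothetical; the real receiveSIP path depends on
    the Archivematica installation.
    """
    staged: List[str] = []
    # Process folders in name order so repeated runs behave predictably.
    for item_dir in sorted(p for p in source_root.iterdir() if p.is_dir()):
        target = watched_dir / item_dir.name
        if target.exists():
            # Skip folders that were already staged on an earlier run.
            continue
        shutil.copytree(item_dir, target)
        staged.append(item_dir.name)
    return staged
```

Skipping folders that already exist in the target means the script can
be re-run after an interruption without staging the same item twice.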

Thank you again for the comments. This is very interesting and
helpful.
Joseph

Evelyn McLellan

Jul 13, 2011, 1:30:03 PM
to archivematica
I'm glad you're finding this useful. It's certainly an interesting
conversation from my point of view. I would point out, though, that
you need not put an entire collection or series into a SIP. You can
break down your collections into SIPs in whatever way makes sense to
you, and make the SIPs whatever size you want. Archivematica is
designed to be agnostic about SIP size and content. To facilitate SIP
arrangement, we are adding a new set of "transfer" functionalities to
0.7.2, which will allow you to physically and intellectually arrange
your objects, capture metadata at the SIP and/or object level, and
preserve information about the original organization of the objects if
that is important.
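
To make the point about SIP size concrete, here is a minimal sketch of
dividing a list of items into SIP-sized groups. The function name and
the batch size are purely illustrative; nothing here is part of
Archivematica, and a real division would more likely follow
intellectual groupings than a fixed count.

```python
from typing import Iterator, List, Sequence


def batch_items(items: Sequence[str], sip_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most sip_size items, one group per SIP."""
    if sip_size < 1:
        raise ValueError("sip_size must be at least 1")
    for start in range(0, len(items), sip_size):
        yield list(items[start:start + sip_size])


# Illustrative only: 10,000 item names split into groups of 500.
item_names = [f"item_{n:05d}" for n in range(10000)]
batches = list(batch_items(item_names, 500))
print(len(batches))  # 20 groups of 500 items each
```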

Evelyn