Bagit Transfers


Stephen Klein

Nov 6, 2018, 10:39:35 AM
to archivematica
I am creating unzipped bags using:

And learned that the bag will fail at transfer if I follow:
and modify the bag to:
Place digital objects in an /objects folder
Make sure the checksum files are checksum.md5, checksum.sha1, or checksum.sha256
Place checksum files in the /metadata folder

In all previous attempts I was adjusting the directories and subdirectories to reflect those needs.

So I merely uploaded the unmodified bag created from:
and the transfer did not fail.

However, the AIP METS File does not pick up the metadata from:
/transfer-source/bag/metadata/metadata.csv

and after ingest and AIP creation my metadata.csv disappears from:
\bagxxx\data\objects\metadata\transfers\Jill_Demo-5b831181-8adc-4bc0-9635-e3e5af5da01b\


Note: I was able to successfully transfer, ingest, and write metadata to the AIP METS when the transfer package is not bagged.

Can someone either supply documentation or step me through the use of a bagit based package with Archivematica?



Sara Allain

Nov 6, 2018, 12:15:37 PM
to archiv...@googlegroups.com
Hi Stephen!

There have been a couple of conversations recently on the list about structuring bags for transfer into Archivematica, which have resulted in this issue being filed: https://github.com/archivematica/Issues/issues/209. I'm currently working on updating the documentation around bags and how to structure digital objects for transfer. I am expecting to finish that change and make it available in the coming days.

Re: checksum files in the bags. When you create a bag, a checksum file is generated with the name "manifest-md5.txt" (replace md5 with sha1 or sha256, if you're using a different algorithm). You shouldn't change the name of this file. The instructions in https://www.archivematica.org/en/docs/archivematica-1.7/user-manual/transfer/transfer/#transfer-checksums only apply when you are hand-crafting checksum files, NOT when you are creating the checksum files through a bagging process. I will clarify this in the docs.
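To make the manifest naming concrete, here is a minimal sketch, using only the Python standard library, of how a payload manifest named manifest-<algorithm>.txt could be read and checked. The function name verify_manifest is made up for illustration, and this is not how Archivematica or any bagging tool actually implements verification.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, algorithm="md5"):
    """Return the payload paths whose checksum does not match the manifest."""
    bag = Path(bag_dir)
    # The manifest name encodes the algorithm: manifest-md5.txt, manifest-sha256.txt, ...
    manifest = bag / f"manifest-{algorithm}.txt"
    mismatches = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            # A listed file that is missing counts as a mismatch in this sketch.
            mismatches.append(rel_path)
            continue
        digest = hashlib.new(algorithm)
        digest.update(target.read_bytes())
        if digest.hexdigest() != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

An empty list means every file named in the manifest was found with a matching checksum. Renaming the manifest file would break this kind of lookup, which is one way to see why the name should be left alone.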

When you create a bag, the golden rule is that you don't mess with the structure. The BagIt specification requires files (like the manifests) to be located at certain paths. Archivematica interprets the BagIt specification and also expects those files to be located at certain paths. The one exception to this (because there's always an exception) is if you are adding descriptive or rights metadata. When adding descriptive or rights metadata, my preferred method is this:
  1. Make a bag out of the digital objects that you'd like to preserve (I used Bagger, but BagIt.py works too). At this point, the bag SHOULD NOT contain the objects you'd like to include in the metadata directory (i.e. metadata.csv, submission documentation, etc.)
  2. Save the bag to your desktop.
  3. In the top level of the bag, add a metadata directory. Place any metadata (like a metadata.csv) into this directory. Archivematica will ignore this directory when verifying the bag.
When you transfer a bag created through the above method into Archivematica, using the Unzipped bag transfer type, everything should work out fine. Note that there's an alternate method listed in the GitHub issue as well.
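Step 3 of the method above (adding the metadata directory once the bag already exists) could be scripted roughly like this. It is only a sketch: the function name add_metadata_to_bag is hypothetical, and it assumes the bag was already created with Bagger or BagIt.py and that a metadata.csv exists somewhere outside the bag.

```python
import shutil
from pathlib import Path


def add_metadata_to_bag(bag_dir, metadata_csv):
    """Copy metadata.csv into a top-level metadata/ directory of an existing bag."""
    bag = Path(bag_dir)
    # Every bag has a bagit.txt declaration at its top level.
    if not (bag / "bagit.txt").is_file():
        raise ValueError(f"{bag} does not look like a bag (no bagit.txt)")
    metadata_dir = bag / "metadata"
    metadata_dir.mkdir(exist_ok=True)
    # The manifests are left untouched: the metadata is added after bagging.
    target = metadata_dir / "metadata.csv"
    shutil.copyfile(metadata_csv, target)
    return target
```

The check for bagit.txt is just a guard against pointing the function at the wrong folder. Nothing inside data/ or the manifests is modified.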

Hope this helps! I will update this thread once the improved documentation is made public.

Regards,
Sara

--

Sara Allain, MI
Systems Archivist

I recognize and acknowledge that the land on which I work is the traditional and unceded territory of the QayQayt First Nation.



Stephen Klein

Nov 6, 2018, 12:46:06 PM
to archivematica
Thank you, Sara! Kelly Stewart just sent me some of this too.

So if the metadata.csv file is ignored, why even include it? And how do I bulk-write metadata to the AIP's METS?

I do not know what this means:

the bag SHOULD NOT contain the objects you'd like to include in the metadata directory


Sara Allain

Nov 6, 2018, 12:57:01 PM
to archiv...@googlegroups.com
Hi Stephen,

Archivematica only ignores the metadata directory when verifying the bag - i.e. the presence of this directory will not cause a verification failure at Job: Verify bag, and restructure for compliance. Otherwise, the metadata directory is used by Archivematica like normal.

By "the bag SHOULD NOT contain the objects you'd like to include in the metadata directory", I meant that when you are creating your bag, you should only include your digital objects (i.e. the things that you are trying to preserve), and not the metadata directory and its contents. The metadata directory can be added AFTER the bag is created. So if you have a bag with a structure like this:

bagTransfer/
├── bag-info.txt
├── bagit.txt
├── data/
│   ├── beihai.tif
│   ├── bird.mp3
│   ├── ocr-image.png
│   └── View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg
├── manifest-md5.txt
└── tagmanifest-md5.txt

You would then manually add the metadata directory and metadata.csv so that the bag looked like this:

bagTransfer/
├── bag-info.txt
├── bagit.txt
├── data/
│   ├── beihai.tif
│   ├── bird.mp3
│   ├── ocr-image.png
│   └── View_from_lookout_over_Queenstown_towards_the_Remarkables_in_spring.jpg
├── manifest-md5.txt
├── metadata/
│   └── metadata.csv
└── tagmanifest-md5.txt

Hope this helps!

Sara


Stephen Klein

Nov 6, 2018, 1:43:14 PM
to archivematica
Thanks, Sara!

It worked:

So, I ran bagit with just the digital assets.

Then I copied the metadata directory over.

Then I edited the metadata to use data/ rather than objects/ in the filename field.

 

Note: I tried this many times with the metadata directory there at the time of bagging and then editing the file and it seemed not to work, so your suggestion might be the only working method.
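The filename edit described in this message (data/ instead of objects/) could be scripted along these lines. This is a sketch under two assumptions: that the path is in the first column of metadata.csv, and that only the leading objects/ prefix needs to change. The function name retarget_filenames is hypothetical.

```python
import csv
import io


def retarget_filenames(csv_text, old_prefix="objects/", new_prefix="data/"):
    """Rewrite the first column of a metadata.csv so paths start with new_prefix."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(rows):
        # Row 0 is the header; the first column of every other row holds the path.
        if index > 0 and row and row[0].startswith(old_prefix):
            row[0] = new_prefix + row[0][len(old_prefix):]
        writer.writerow(row)
    return out.getvalue()
```

Working on the CSV text rather than on a file keeps the sketch easy to try out; reading and writing the actual metadata.csv is left to the caller.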

 

Thanks for helping me.

Andrew Berger

Nov 6, 2018, 5:08:44 PM
to archivematica

Hi,


It seems like more than one method of sending bags to Archivematica with metadata is supported, as I have been using a different transfer package structure. The structure I use looks like this:


sampletransfer/
    bag-info.txt
    bagit.txt
    manifest-md5.txt
    tagmanifest-md5.txt
    data/
        metadata/
            metadata.csv
        objects/
            [stuff to be ingested]


I have been creating this structure by creating a standard transfer with a metadata and objects folder like so

sampletransfer/
    metadata/
        metadata.csv
    objects/
        [stuff to be ingested]

and then turning that into a bag with bagit-python. I've just gotten a test instance of 1.7.2 running and Archivematica is generating PREMIS with the metadata from the metadata.csv. In production we're still a couple versions behind, so I needed to confirm that this functionality hasn't changed.
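For anyone wanting to try this structure, the two steps might look roughly like the sketch below. The helper names are hypothetical. The bagging step assumes the third-party bagit-python package is installed and that its make_bag function accepts a checksums list, as recent releases do; check this against the version you have.

```python
from pathlib import Path


def build_standard_transfer(root, metadata_csv_text):
    """Create root/metadata/metadata.csv and an empty root/objects/ directory."""
    root = Path(root)
    (root / "metadata").mkdir(parents=True, exist_ok=True)
    (root / "objects").mkdir(exist_ok=True)
    (root / "metadata" / "metadata.csv").write_text(metadata_csv_text, encoding="utf-8")
    return root


def bag_in_place(root):
    """Turn the transfer into a bag; everything under root moves into data/."""
    # Assumption: bagit-python is installed (pip install bagit).
    import bagit

    # Bagging in place moves metadata/ and objects/ under data/ and writes
    # bagit.txt, bag-info.txt and the md5 manifests at the top level.
    return bagit.make_bag(str(root), checksums=["md5"])
```

The files to be ingested would be copied into objects/ between the two calls. Because the metadata directory is present when the bag is made, it ends up inside data/ and is covered by the payload manifest.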

This makes me wonder if anyone else has worked out other ways of sending bags to Archivematica. I should note that I've zipped up the package structure listed above and that works as well for the "zipped bag" option.

Andrew

Sara Allain

Nov 6, 2018, 5:20:23 PM
to archiv...@googlegroups.com
Hi Andrew,

Yep, there's more than one way to make a bag, for sure! Joe Carrano mentioned this as well, on this issue. What I like about the metadata-after method is that you know with certainty what the paths need to be within the metadata CSV. If you add the metadata first, depending on how your package is structured, you might need to predict the filename paths in the metadata CSV - though I'm wondering if, in the structure you posted, having objects/filename.jpg is enough? I'd love to see how your metadata file is set up!

Regards,
Sara


Andrew Berger

Nov 6, 2018, 6:22:09 PM
to archivematica
Hi Sara,

Yes, I probably should have elaborated on the full ingest package structure we use. As a museum, so much of what we're ingesting is item-level cataloged - even if the "item" is a folder - so we don't have complex metadata.csv structures where metadata is assigned on a per file basis. I tested some more complex structures years ago, but would need to re-test with bags to see what happens. We're essentially just storing and retrieving AIPs, not sending DIPs to other systems or processing the AIP METS for other uses, so a simple package structure has served our needs.

Our standard ingest folder structure is what I described above, with one addition: there's always a subfolder in the objects folder, and the files to be ingested are in that subfolder. This is to establish some consistency between single-file ingests and multi-file ingests. Everything gets the same structure. The subfolder is always named for the item identifier in our collections management system, and this becomes the "AIP name" in Archivematica terminology.

A typical transfer package for us will look like this (102622396 is the cataloging system identifier):

102622396
├── bag-info.txt
├── bagit.txt
├── data
│   ├── metadata
│   │   └── metadata.csv
│   └── objects
│       └── 102622396
│           └── 102622396.tif
├── manifest-md5.txt
└── tagmanifest-md5.txt

The metadata.csv for that package looks like this:

parts,dc.identifier,dc.title,dc.date,dc.identifier
objects/102622396,102622396,ENIAC on-a-chip design team,1995 ca.,X7413.2015

Technically, it's the folder that gets the identifier in the AIP METS, so depending on how you plan to use the METS, this might not be what you want. I'm pretty sure this structure can be extended, but as I say I haven't tested it any further. As we get into more complex SIPs, we're probably going to have to take another look at this.

At the moment, because the package structure is predictable, creating the metadata.csv is also predictable, and I have a script that takes a cataloging identifier and essentially generates the structure from that, pulling in the appropriate metadata and the objects as part of that process. It's an interactive script, not fully automated, but the metadata.csv part of it is automated.
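The script itself is not shown in this thread, but the automated metadata.csv part could look something like the following sketch. The function name and its parameters (including calling the last value an accession number) are guesses for illustration; only the header and row layout come from the example above.

```python
import csv
import io


def make_metadata_csv(item_id, title, date, accession):
    """Build a one-row metadata.csv that describes the objects/<item_id> folder."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header copied from the example above; dc.identifier appears twice.
    writer.writerow(["parts", "dc.identifier", "dc.title", "dc.date", "dc.identifier"])
    writer.writerow([f"objects/{item_id}", item_id, title, date, accession])
    return out.getvalue()
```

Because the folder under objects/ is always named after the catalog identifier, the parts value can be derived from the identifier alone, which is what makes the file predictable.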

I suppose I should also note that I'll be at ArchivematicaCamp next week and would love to know how other people are structuring their SIPs/AIPs.

Andrew

Stephen Klein

Nov 7, 2018, 9:13:40 AM
to archivematica
I am uploading multiple items from multiple authors (Smith and Jones) and creating many sub-directories, so wondering how my Filename field should be structured:

Should it be:

data/smith/1.pdf
data/smith/1.wmv
data/jones/1.pdf
data/jones/2.mpg

or:

data/1.pdf
data/1.wmv
data/2.pdf
data/2.mpg



Also, as the listing below shows, bagit creates a new directory (tmp8jrnhx_p) when there are sub-directories. Why?


data/tmp8jrnhx_p/smith/1.pdf
data/tmp8jrnhx_p/smith/1.wmv
data/tmp8jrnhx_p/jones/1.pdf
data/tmp8jrnhx_p/jones/2.mpg



Also, how does Archivematica handle spaces and diacritics in file names?




 Directory of C:\Users\sklein\Desktop\etds_bag
11/07/2018  08:38 AM               164 bag-info.txt
11/07/2018  08:38 AM                55 bagit.txt
11/07/2018  08:38 AM    <DIR>          data
11/07/2018  08:38 AM             5,923 manifest-sha256.txt
11/07/2018  08:38 AM             8,739 manifest-sha512.txt
11/07/2018  08:46 AM    <DIR>          metadata
11/07/2018  08:38 AM               323 tagmanifest-sha256.txt
11/07/2018  08:38 AM               579 tagmanifest-sha512.txt


 Directory of C:\Users\sklein\Desktop\etds_bag\data
11/07/2018  08:38 AM    <DIR>          .
11/07/2018  08:38 AM    <DIR>          ..
11/07/2018  08:38 AM    <DIR>          tmp8jrnhx_p
              

 Directory of C:\Users\sklein\Desktop\etds_bag\data\tmp8jrnhx_p
11/07/2018  08:29 AM    <DIR>          Smith
11/07/2018  08:34 AM    <DIR>          Jones
    
 Directory of C:\Users\sklein\Desktop\etds_bag\data\tmp8jrnhx_p\Smith
11/07/2018  08:26 AM            48,340 Sheet1.csv
04/01/2017  05:11 PM            34,413 Goodings.xlsx
11/07/2018  08:26 AM           743,659 Caro.pdf
04/01/2017  05:05 PM            18,395 Agriculture.png
04/01/2017  05:01 PM            16,904 FemaleAges.png
04/01/2017  05:05 PM            22,036 Housing.png
04/01/2017  05:00 PM            17,435 MaleAges.png



 Directory of C:\Users\sklein\Desktop\etds_bag\data\tmp8jrnhx_p\Smith\SmithCapstoneFiles (NEW)
11/07/2018  08:29 AM    <DIR>          SmithCapstoneFiles


 Directory of C:\Users\sklein\Desktop\etds_bag\data\tmp8jrnhx_p\Smith\SmithCapstoneFiles (NEW)\SmithCapstoneFiles
11/07/2018  08:29 AM    <DIR>          Database
04/20/2017  06:37 PM            34,588 PROJECT MANIFEST.pdf
11/07/2018  08:29 AM    <DIR>          Visualizations


 Directory of C:\Users\sklein\Desktop\etds_bag\data\tmp8jrnhx_p\Smith\SmithCapstoneFiles (NEW)\SmithCapstoneFiles\Database
04/01/2017  05:11 PM            48,340 1860 Database (Final) - Sheet1.csv
04/01/2017  05:11 PM            34,413 1860 Database (Final).xlsx
      

 Directory of C:\Users\sklein\Desktop\etds_bag\data\tmp8jrnhx_p\Smith\SmithCapstoneFiles (NEW)\SmithCapstoneFiles\Visualizations
04/01/2017  05:05 PM            18,395 Enslaved.png
04/01/2017  05:01 PM            16,904 EnslavedAges.png
04/01/2017  05:05 PM            22,036 ousing.png

 
 
 Directory of C:\Users\sklein\Desktop\etds_bag\data\tmp8jrnhx_p\Jones
11/07/2018  08:33 AM         5,536,901 Narrow.wmv
11/07/2018  08:34 AM        33,495,193 ResT.wmv
11/07/2018  08:31 AM         2,973,498 Storytelling.pdf
11/07/2018  08:33 AM        35,902,581 The Sun, The Moon, And ST.wmv
11/07/2018  08:34 AM        24,471,121 NYC.wmv
11/07/2018  08:33 AM        28,326,437 record.wmv

Sara Allain

Nov 7, 2018, 4:17:08 PM
to archiv...@googlegroups.com
Hi Stephen,

Answers below!

I am uploading multiple items from multiple authors (Smith and Jones) and creating many sub-directories, so wondering how my Filename field should be structured.

The filename field of the metadata.csv should reflect the subdirectories as well as the filename. If you have a file called 1.pdf but it's inside a nested directory like data/Smith/MyBook/Draft1/Chapter1, the filename field should contain the whole path - "data/Smith/MyBook/Draft1/Chapter1/1.pdf". If you just put 1.pdf or data/1.pdf, Archivematica cannot locate the file.
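One way to avoid typing these paths by hand is to list them from the bag itself. The sketch below walks the data/ directory of a bag and returns each file's path relative to the bag root, with forward slashes, so the values can be pasted into the filename column. The function name payload_filenames is hypothetical.

```python
from pathlib import Path


def payload_filenames(bag_dir):
    """Return every file under data/ as a path relative to the bag root."""
    bag = Path(bag_dir)
    return sorted(
        # as_posix() gives forward slashes even when run on Windows.
        path.relative_to(bag).as_posix()
        for path in (bag / "data").rglob("*")
        if path.is_file()
    )
```

Any intermediate directory that the bagging tool created would show up in these paths too, so the list reflects what is actually on disk.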

Also, as the listing below shows, bagit creates a new directory (tmp8jrnhx_p) when there are sub-directories. Why?

I don't know why BagIt is doing this. It could be a problem with BagIt or it could be something in your configuration. I use Bagger to create bags and it doesn't seem to do this - even with a deeply nested structure, I don't get any unexpected directories.

Also, how does Archivematica handle spaces and diacritics in file names?

Archivematica's name cleanup processes are not well documented, but Ross Spencer has written up a good summary here that I'm hoping to incorporate into the documentation soon. Essentially, because Archivematica packages so many different tools, we have to work around the limits of those tools - and one of the major limits is that some tools only accept a truncated list of ASCII values as valid filename inputs. Therefore, the list of valid filename characters is: -_.()abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

To reduce the chance that spaces, diacritics, or any other character not included in the above list will cause a problem (like a tool malfunction), Archivematica runs a microservice on the Transfer tab called "Clean up names". This microservice does the following:

1. Looks at the name of every file and directory and replaces non-ASCII characters (anything not on the list above) with the nearest ASCII equivalent. If no equivalent is found, an underscore is used. For example, @file becomes _file, and Česká republika becomes Ceska_republika.

2. Adds a PREMIS event to the METS with the eventType of "name cleanup" and an event detail note like this: Original name="%transferDirectory%objects/Česká republika"; cleaned up name="%transferDirectory%objects/Ceska_republika"

3. Creates a filenameCleanup.log that describes the changes, which is also stored in the AIP.
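To preview what a cleaned-up name might look like before transfer, the rule in step 1 can be approximated in a few lines. This is only an approximation built from the description above, not Archivematica's actual code: it uses Unicode decomposition to find the "nearest ASCII equivalent", which is an assumption about how that lookup behaves.

```python
import unicodedata

# The list of valid filename characters quoted above.
ALLOWED = set("-_.()abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def clean_name(name):
    """Approximate the clean-up: transliterate to ASCII, then underscore the rest."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    cleaned = []
    for char in decomposed:
        if unicodedata.combining(char):
            continue  # drop the accent itself, keep the base letter
        cleaned.append(char if char in ALLOWED else "_")
    return "".join(cleaned)
```

With this sketch, clean_name("@file") returns "_file" and clean_name("Česká republika") returns "Ceska_republika", matching the examples in step 1. Results for other characters may differ from what Archivematica produces.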

Regards,
Sara


Stephen Klein

Nov 7, 2018, 4:30:07 PM
to archivematica
These answers are great, thank you.

Bagit always creates that other directory when there are subdirectories, but I just ran a test on my bag with the peculiar directory and it seems to work. Here are some of the fields in the AIP METS suggesting that it understands the bag:
   <premis:originalName>%transferDirectory%data/tmp8jrnhx_p/Kinsey/Caro.pdf</premis:originalName>

-and- 

<filepath toolname="OIS File Information" toolversion="0.2" status="SINGLE_RESULT">/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/etds_bag-4c5fb2cc-6387-4f83-83f8-aabdc56b51de/objects/tmp8jrnhx_p/Kinsey/Caro.pdf</filepath>

Correct?


So, although there is a sanitizer to clean up filenames, is it suggested that we try to have clean filenames even before transfer and ingest?

Stephen Klein

Nov 7, 2018, 4:33:49 PM
to archivematica
And despite my metadata.csv being incorrectly cataloged:
Filename:
data/Caro.pdf

Archivematica understood its actual location and inserted:
tmp8jrnhx_p/Kinsey/Caro.pdf
into the    <premis:originalName>

Sara Allain

Nov 7, 2018, 4:38:53 PM
to archiv...@googlegroups.com
Yeah, it doesn't look like Archivematica's having any issue with that weird extra directory - it's interpreting everything just fine.

On whether or not to have ASCII-only file and directory names prior to ingest, I think it's up to you. Archivematica creates the record of the filename change, which can be a useful record *especially* if the way that Archivematica changes diacritics is changing the meaning of the name (see Elvia Arroyo-Ramírez's article Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files for examples of this - i.e. "campaña" meaning "campaign" vs "campana" meaning "bell"). However, if it is a matter of removing spaces from the filenames, it might be faster to do this outside of Archivematica - and having a record of this change might not matter quite so much.
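If removing spaces outside of Archivematica is the route taken, a small script can do it and still return a record of what changed. The sketch below is one possible approach, not a recommended tool: the function name remove_spaces is hypothetical, it only touches spaces (not diacritics), and it refuses to overwrite an existing name.

```python
import os


def remove_spaces(root):
    """Rename files and directories under root, replacing spaces with underscores."""
    renamed = []
    # Walk bottom-up so children are renamed before their parent directories.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if " " not in name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, name.replace(" ", "_"))
            if os.path.exists(new):
                # Stop rather than silently clobber a file with the target name.
                raise FileExistsError(f"cannot rename {old}: {new} already exists")
            os.rename(old, new)
            renamed.append((old, new))
    return renamed
```

The returned list of (old, new) pairs can be saved alongside the transfer if a record of the original names is wanted. Running this before bagging matters, because renaming files inside an existing bag would invalidate its manifests.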

Regards,
Sara


Stephen Klein

Nov 7, 2018, 4:42:46 PM
to archivematica
And it is strange that, although my metadata did not include that weird extra directory, Archivematica understood it.

So it sounds like some of the essential pre-processing steps should be:
Sanitize filenames (remove spaces)
Bag

Any other steps?

Thanks!

Sara Allain

Nov 7, 2018, 4:46:43 PM
to archiv...@googlegroups.com
It's totally dependent on your workflow! Those things are essential for you, but I wouldn't say that they're essential for everyone. From what I know about your workflow, those sound like great steps for preparing your material for Archivematica and I can't think of any others that I would recommend.

-Sara

