RESTful Bag Server


Chris Adams

Feb 22, 2011, 10:03:49 PM
to digital-...@googlegroups.com
We've been working on a protocol for serving bagged content RESTfully
since a small meeting in December. Now that the spec and example are
reasonably well fleshed out I wanted to open this up for wider
feedback:

https://github.com/acdha/restful-bag-server/blob/master/RESTful%20Bag%20Store.rst

The main goal here is to provide a simple path to content replication
without tackling larger problems such as extended metadata, specific
storage or versioning strategies, etc. and staying as close as
possible to common web practices. We're using the project's issue
tracker to collect problems and remaining work; use-cases or
substantial proposed changes are solicited in the form of pull
requests - for example we're currently tracking one proposal for
handling content versioning in a branch[1] so we can cleanly maintain
the specification and example.

My next steps are implementing a test client and a simple reference
server - if anyone's interested in collaborating, both of those will
also be done on github and there's been talk of taking the reference
server in more interesting directions (e.g. storing each bag in a Git
repository and/or adding cloud storage backends).

Chris

1. https://github.com/acdha/restful-bag-server/blob/versioned-example/RESTful%20Bag%20Store.rst

Michael J. Giarlo

Feb 23, 2011, 10:33:22 AM
to digital-...@googlegroups.com
On 02/22/2011 10:03 PM, Chris Adams wrote:
>
> The main goal here is to provide a simple path to content replication
> without tackling larger problems such as extended metadata, specific
> storage or versioning strategies, etc. and staying as close as
> possible to common web practices. We're using the project's issue
> tracker to collect problems and remaining work; use-cases or
> substantial proposed changes are solicited in the form of pull
> requests - for example we're currently tracking one proposal for
> handling content versioning in a branch[1] so we can cleanly maintain
> the specification and example.
>

Glad you wrote this up, Chris. I would just add one thing: the "we"
Chris references is an ad hoc, open group of folks from all over. So if
you deal in bags and care about replication, you're welcome to join.

Matt Schultz from the Educopia Institute has been running our monthly
calls and keeping momentum going -- if you're interested in
participating in this effort, I'd wager Matt
(matt.s...@metaarchive.org) would love to hear from you & add you to
the list of folks who receive email on this effort.

Our next call will be on Friday, 3/25 at 3pm ET (1-270-400-2000, 282929#).

-Mike

matt.s...@metaarchive.org

Mar 23, 2011, 2:07:19 PM
to Digital Curation
Hi Everybody,

Below is a loose agenda for the next all-groups meeting on the
development of a RESTful Bag Server
(https://github.com/acdha/restful-bag-server) scheduled for Friday,
March 25th at 3pm ET/2pm CT/12pm PT.
Call-in Information is: 1-270-400-2000, 282929#

Tentative Agenda (feel free to suggest changes):

1. Brief review of updated use cases - Archivematica
2. Continue discussions on open issues with spec
   * Small file transfers - use of keep-alive? tgz/zip?
   * Version handling - use of courtesy URL?
   * Validation history - URI or metadata?
   * Handling manifest changes - final match on PUT? acceptable manifests?
   * Other?
3. Update on testing suite and reference server
4. Identifying practical next steps
5. Other?

Looking forward to talking with those who can join.

All best,

--
Matt Schultz
Collaborative Services Librarian
Educopia Institute, MetaArchive Cooperative
http://www.metaarchive.org
matt.s...@metaarchive.org
616-566-3204

matt.s...@metaarchive.org

Apr 27, 2011, 7:51:31 AM
to Digital Curation
Hi Everybody,

The next all-groups meeting scheduled for Friday, April 29th at 3-4pm
ET has been re-scheduled for Friday, May 13th at 3-4pm ET. Call-in
info is: 1-270-400-2000, 282929#.

This call is open for new individuals and groups to attend to discuss
on-going development of a specification for managing and tracking the
availability of Bag-defined data for the purposes of replication with
a view toward preservation. See the Github site for the current spec
definition: https://github.com/acdha/restful-bag-server.

There will be a request to add agenda items one week before the call.
The final agenda will be posted a few days prior. Due to the limited
time on the call to address open issues and progress toward
development, please make every effort to review the Github site prior
to the call, add an issue, or respond to the request for added agenda
items.

Notes from our previous meeting are below. Look forward to catching up
in a couple of weeks.

RESTful-Bag-Server Meeting 3
Minutes
03/25/2011

Attendees
1. Chris Adams (LoC)
2. Mike Burek (Chronopolis)
3. Mike Giarlo (Penn State)
4. John Kunze (CDL)
5. Matt Schultz (Educopia)
6. Mike Smorul (Chronopolis)
7. Don Sutton (Chronopolis)
8. Peter Van Garderen (Artefactual)
9. Charles Blair (University of Chicago)

Minutes

1. Reviewed New Use Case
(https://github.com/acdha/restful-bag-server/tree/master/Use%20Cases)
from Artefactual

o Peter Van Garderen clarified his written use case
https://github.com/acdha/restful-bag-server/blob/master/Use%20Cases/Archivematica.rst
o Simple submission/dissemination internal repository exchange
(SIP/AIP/DIP transformations) of Bag-based data
o Peter and Chris Adams commented that the current spec is very well
targeted towards this type of use case

2. Discussed Open Issues

A. Small file transfers – Chris suggested tabling this issue because
the range of potential solutions could probably be handled through
some added nomenclature on good http citizenship.

o Server should support things like keep-alive, pipelining, etc.
o May want to consider embedding some links in this section to
educate on the principles of RESTful Architecture

B. Version handling – Chris confirmed that the best approach to handle
the awareness of most current version of an uploaded bag would be to
dedicate a symbolic link location (/version/latest).

o Chris asked if there were any strong objections to creating a new
branch to bake in the versioning proposal that was approved on the
02/18 call – no objections. See here: https://github.com/acdha/restful-bag-server/tree/versioned-example
o Folks are encouraged to look this over before merging
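
[Illustration] The /version/latest idea in item B could be sketched roughly as below. This is only a guess at the shape of the lookup, not what the versioning branch specifies: the function name resolve_version, the use of integer version numbers, and the LookupError behaviour are all assumptions made for the example.

```python
def resolve_version(stored_versions, requested):
    """Map a version path segment to a concrete stored version.

    stored_versions: iterable of integer version numbers (an assumption;
        the minutes do not say how versions are numbered).
    requested: the path segment after /version/, e.g. "latest" or "3".
    """
    versions = sorted(stored_versions)
    if not versions:
        raise LookupError("bag has no committed versions")
    if requested == "latest":
        # "latest" behaves like a symbolic link to the newest version.
        return versions[-1]
    number = int(requested)
    if number not in versions:
        raise LookupError(f"unknown version: {requested}")
    return number
```

A client that always wants the newest copy would ask for "latest"; a client replicating a specific version would pass the number it was told about.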

C. Validation history – very briefly discussed the implementation of a
resource for exposing details of last full bag validity check
(pass/fail). Mike Smorul had a question about making this available in
the metadata.

o No clear determination on this was recorded on the call – may need
to revisit briefly on 05/13

D. Handling manifest changes – Chris and Mike Smorul suggested making
the spec flexible enough to allow people to query for supported
manifests and to re-upload files as needed.

o Agreed also to accept any additional manifests but require at least
one supported format and report a 409 conflict if the listed files
don't match the standard md5 or sha-256.
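
[Illustration] A minimal sketch of the manifest rule agreed in item D, assuming manifests arrive as text keyed by algorithm name and payload files as bytes. Only the 409 comes from the minutes; the function names, the 400 for "no supported manifest" and the 201 on success are assumptions for the example.

```python
import hashlib

# Algorithms treated as "supported" here, following the md5 / sha-256
# wording in the minutes. Unknown manifests are accepted but not checked.
SUPPORTED = {"md5": hashlib.md5, "sha256": hashlib.sha256}


def parse_manifest(text):
    """Parse BagIt-style manifest lines of the form '<checksum> <path>'."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, path = line.split(None, 1)
        entries[path] = checksum.lower()
    return entries


def check_upload(manifests, payload):
    """Return an HTTP-style status code for a proposed bag upload.

    manifests: dict mapping algorithm name -> manifest text
    payload:   dict mapping file path -> file contents as bytes
    """
    supported = [name for name in manifests if name in SUPPORTED]
    if not supported:
        return 400  # assumption: at least one supported manifest is required
    for name in supported:
        expected = parse_manifest(manifests[name])
        if set(expected) != set(payload):
            return 409  # listed files and uploaded files differ
        for path, checksum in expected.items():
            if SUPPORTED[name](payload[path]).hexdigest() != checksum:
                return 409  # checksum mismatch
    return 201  # assumption: success status is not fixed by the minutes
```

Additional manifests in formats the server does not know are simply ignored, which matches the "accept any additional manifests" wording above.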

3. Update on Testing Suite & Reference Server

o Chris got started on a Python-based testing suite that conforms to
the spec: https://github.com/acdha/restful-bag-server/blob/test-suite/tests.py
o Chris indicated that he wanted to work on the reporting
o Was planning on starting with a simple read-only server that would
be a good proof of concept for folks developing custom clients for
their environments that would want to interface
o The reference server would eventually be driven by Python and
should be flexible and extensible

4. Discussed Practical Next Steps

o Chris invited folks to feel free to issue a pull request on the
Github site if they are interested in lending him a hand:
https://github.com/acdha/restful-bag-server
o Matt inquired about promoting this work and the most reasonable
time frames/venues - Consensus was that June-August might be most
appropriate once tests of the spec have revealed themselves - Mike
Giarlo mentioned Open Repositories and Curate Camp as potential venues
o Questions about licensing of the spec arose – Chris from Library
of Congress would have to inquire (perhaps BSD or GPL)
o Agreed to check-in prior to the next call on May 13th on progress
toward test suite implementation & reporting and implementation of the
reference server

matt.s...@metaarchive.org

May 12, 2011, 4:17:57 PM
to Digital Curation
Hi Everybody,

Below is a starter agenda for the next all-groups meeting on the
development of a RESTful Bag Server
(https://github.com/acdha/restful-bag-server) scheduled for Friday,
May 13th at 3pm ET/2pm CT/12pm PT.
Call-in Information is: 1-270-400-2000, 282929#.

This will be a great call for groups or individuals who have not yet
participated but are interested in this set of work to drop in and say
hi.

Starter Agenda (feel free to suggest additional items on the call):

1. Brief welcome and catch-up for new callers
2. Overview of a Java Bag Server - Mike Smorul
3. Scheduling future calls

Looking forward to talking with those who can join.

All best,

Matt Schultz
Collaborative Services Librarian
Educopia Institute, MetaArchive Cooperative
http://www.metaarchive.org
matt.s...@metaarchive.org
616-566-3204

Chris Adams

May 13, 2011, 2:37:16 PM
to digital-...@googlegroups.com
Unfortunately it looks like I'm going to be talking to my mortgage rep
(buying a place) a bit after 3pm. I'll dial in if she's delayed but it
looks like that's the best time.

Key notes from me:

* limited progress on test suite
* I intend to go with the proposed versioning scheme so that branch of
the spec will be merged in soon
* Mike proposed relaxing the upload ordering constraints to only
require the manifest and files be complete by the commit. Feedback
from potential implementors welcome on this point or related.

Chris

Sent from my iPhone

matt.s...@metaarchive.org

Jun 15, 2011, 4:11:27 PM
to Digital Curation
Hi Everybody,

A big thanks to Mike Smorul for tapping me on the shoulder to inquire
about this month's scheduled call to discuss the RESTful Bag Server
(https://github.com/acdha/restful-bag-server) development.

This month's call is slated for next week Friday, June 24th at 3pm ET.
Call-in info is 1-270-400-2000, 282929#.

Consider this an open call for agenda items. There are a couple of
relevant threads that have started (see below) since the previous call
that was held on Friday, May 13th. We can definitely follow up on
those. Notes from the May 13th call will be available shortly -
apologies for the delay.

Notes on implementing Restful Bag Server spec:
http://groups.google.com/group/digital-curation/browse_thread/thread/0e8dfdb8c5a3e2aa#
Bag Transfer Tools: http://groups.google.com/group/digital-curation/browse_thread/thread/60c070bdee101ef3#

I'll shoot out a tentative agenda by COB next week Wednesday, June
22nd once we've had a chance to hear from folks. Look forward to
catching up.

All best,

Matt Schultz
Collaborative Services Librarian
Educopia Institute, MetaArchive Cooperative
http://www.metaarchive.org
matt.s...@metaarchive.org
616-566-3204

Brenda C. B. Rocco

Jun 16, 2011, 7:01:52 AM
to digital-...@googlegroups.com
Hi Everybody,

Good morning!

What do you think about also discussing the preservation of e-mails, covering both repositories and the management of these messages?

Cordially

Brenda Rocco


I would suggest another theme: the management of e-mails as records
--
"If small things get to you, it is because you need to be bigger than all of this!!"

Chris Prom

Jun 16, 2011, 10:36:58 AM
to digital-...@googlegroups.com
I would be very interested in participating in a discussion of this topic, either via the googlegroup or call in.  

I am beginning a project to survey potential technical approaches to this topic, which will be published as a Tech Watch Report by the Digital Preservation Coalition (UK). I am currently reviewing literature and will be posting notes on my blog as I go along. Information about my project is here:

http://e-records.chrisprom.com/?p=1983

If anyone is currently undertaking an email preservation project, or contemplating one, I would appreciate the chance to touch base as soon as possible, either via this list or off list, so that information about your project can help shape my report, and vice versa.

Thanks,

Chris
____

Chris Prom
Assistant University Archivist
University of Illinois at Urbana-Champaign
chris...@gmail.com


Walker Sampson

Jun 16, 2011, 11:54:54 AM
to digital-...@googlegroups.com
Count me in as one interested in discussing this subject.

Chris, I will be on a records management project looking at email preservation and retention for state government agencies into a state archive. I am not sure how well a records management context applies to your interest, but I would be happy to share info with your project where I can.

Best,
Walker

--
Electronic Records Manager
Mississippi Department of Archives and History
http://wsampson.wordpress.com

Walker Sampson

Jun 16, 2011, 12:44:00 PM
to Chris Prom, digital-...@googlegroups.com
Sure, it's the Mississippi Department of Archives and History, and the project begins next month.
Walker

On Thu, Jun 16, 2011 at 11:02 AM, Chris Prom <chris...@gmail.com> wrote:
Thanks Walker!  Do you mind sharing either with me or the list which state archives you work for?

Thanks,

Chris
____

Brenda C. B. Rocco

Jun 16, 2011, 12:59:51 PM
to digital-...@googlegroups.com, Chris Prom
Walker and Chris,

I'm from Brazil's National Archive and am currently working on the management of e-mails. I'm searching for literature and best practices.

I think it's a good debate, covering both how to manage e-mails and how to preserve them, since a lot of information is recorded in this medium. Much of it is official information that produces official documents.

Our project on the management of e-mails is just beginning.


Best,
Brenda


slabrams

Jun 16, 2011, 2:00:30 PM
to Digital Curation
Hi Chris:

Harvard has been working on an email archiving pilot project for a
year or so, but I don't know if they have published any results.

http://hul.harvard.edu/ois/digpres/projects.html

You may want to contact Wendy Gogel and/or Andrea Goethals for more
information.

--sla

On Jun 16, 7:36 am, Chris Prom <chris.p...@gmail.com> wrote:
> I would be very interested in participating in a discussion of this topic, either via the googlegroup or call in.  
>
> I am beginning a project to survey potential technical approaches to this topic, which will be published as a Tech Watch Report by the Digital Preservation Coalition (UK). I am currently reviewing literature and will be posting notes on my blog as I go along. Information about my project is here:
>
> http://blog.beagrie.com/2011/05/23/dpc-and-charles-beagrie-limited-to...
> http://e-records.chrisprom.com/?p=1983

Chris Prom

Jun 16, 2011, 2:52:19 PM
to digital-...@googlegroups.com
Thanks.  FYI, there is a report out from iPRES last year, here is my zotero entry for it:

Reshaping the Repository: The Challenge of Email Archiving

Type: Conference Paper
Authors: Andrea Goethals, Wendy Gogel
Abstract: Because of the historical value of email in the late 20th and 21st centuries, Harvard University Libraries began planning for an email archiving project in early 2007. A working group comprised of University archivists, curators, records managers, librarians and technologists studied the problem and recommended the undertaking of a pilot email archiving project at the University Library. This two-year pilot would implement a system for ingest, processing, preservation, and eventual end user delivery of email, in anticipation of it becoming an ongoing central service at the University after the pilot. This paper describes some of the unexpected challenges encountered during the pilot project and how they were addressed by design decisions. Key challenges included the requirement to design the system so that it could handle other types of born digital content in the future, and the effect of archiving email with sensitive data to Harvard’s preservation repository, the Digital Repository Service (DRS).
Date: 2010
Proceedings Title: 7th International Conference on Preservation of Digital Objects (iPRES2010)
Place: Vienna, Austria
URL: http://www.ifs.tuwien.ac.at/dp/ipres2010/schedule.html
Accessed: Wed Jun 1 19:00:00 2011
Date Added: Thu Jun 2 13:05:45 2011
Modified: Thu Jun 2 13:10:34 2011

Thanks,

Chris
____

Chris Prom

Simon Spero

Jun 16, 2011, 6:44:45 PM
to digital-...@googlegroups.com, chris...@gmail.com, Christopher Lee
The most recent Sedona Conference guidelines specifically related to email policy are from 2007 (see the attached BibTeX entry for Allman:2007).

Thoughts - 

Ingest

There are several ways to handle ingest of email. One useful way is to provide an archival email server supporting a standard protocol (e.g. IMAP); content that the recipient considers archival can be dragged and dropped into that server using regular email clients. This form of ingestion is supported by some content/records management systems (an OSS example is Alfresco). Regular IMAP servers (an OSS example is Cyrus imapd) store messages in unmangled, ingest-friendly form. Access controls can be set to allow messages to be added, but not altered or deleted. Cyrus uses one file per message; bagging seems called for.
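
[Illustration] To make "bagging seems called for" concrete, here is a rough sketch that copies a directory of one-file-per-message mail into a minimal BagIt-style layout (bagit.txt, a data/ directory and a SHA-256 manifest). The function name bag_messages and the directory arguments are hypothetical, the declared BagIt version is an assumption, and a real deployment would more likely use an existing BagIt library than this hand-rolled version.

```python
import hashlib
import shutil
from pathlib import Path


def bag_messages(message_dir, bag_dir):
    """Copy every file in message_dir into a minimal BagIt-style bag."""
    source = Path(message_dir)
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True)

    manifest_lines = []
    for message in sorted(p for p in source.iterdir() if p.is_file()):
        target = data / message.name
        shutil.copyfile(message, target)
        # Checksum the copy, so the manifest describes what is in the bag.
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{message.name}\n")

    # Bag declaration; the version number here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    return bag
```

Because each message is its own payload file, a later fixity check can flag a single damaged message without touching the rest of the mailbox.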

Proprietary storage formats may take more work to process, especially if header and body information is separated.  

One interesting approach to ingest would be to set up an archival relay server in front of the operational servers; such a server would route incoming email into the ingest process. Such a server would be mostly transparent to users, and could be operated under the auspices of Archives and Records management business units, allowing for cleaner separation of duties (SoD) from IT departments.

Note that this layer must be interposed even for messages between internal users, especially for systems such as Exchange that allow senders to remotely delete messages from recipients' mailboxes.

Formats

In general, if it weren't for those darn attachments (giant asterisk), email would be one of the simplest formats for preservation. The reason is that in order to be useful, email messages had to be capable of passing through a variety of intermediate systems, each determined to wreak as much havoc on the content as possible, so mutated formats were strongly selected against. The fact that MIME attachments are still usually encapsulated using base64 is one example of this evolutionary history.

Ah, but those attachments... 

Even though any real content management system will automatically separate out any attached files and process them using whatever mechanisms are available for dealing with files of that type, if those files are proprietary and undocumented, or just not supported, they remain unhelpful buckets of bits (though at least they might have a useful mime type). 

Versioning

One problem that comes to light with emailed attachments is if email is used for collaborative development of content; if this content is not stored in a version controlled system, envelope metadata (timestamps, message-ids, and forwarded bodies) may be the only way to sequence and assign responsibility for various changes. In addition, a great deal of storage may be wasted on 20K copies of  the PDF of the annual report  that could be better wasted on more online remote replicas. 

A similar problem occurs when email refers to external content (e.g. via links).  Unless the content can be captured at the time of sending (or receiving depending on context), essential parts of the context ( = meaning) of the communication may be lost. Attachmemento?

Authenticity 

Because much of the metadata in an email envelope is generated or relied on as part of the message transport process, it is generally reliable-unless-tampered-with. Some of this metadata can  still be forged (take a look at  the full headers in  your spam folder and see if you can spot where false headers were injected).  
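
[Illustration] The spam-folder exercise can be done programmatically. Each relay prepends its own Received header, so the topmost entries were added by the servers closest to the recipient and anything below the first trusted hop could have been supplied by the sender. The sketch below only lists the chain with Python's standard email package; the function name received_chain is hypothetical, and deciding which hops are trustworthy is left to the reader.

```python
from email import message_from_string


def received_chain(raw_message):
    """Return the Received header values, newest hop first.

    Folded (multi-line) headers are collapsed onto a single line so each
    hop is easy to read and compare.
    """
    message = message_from_string(raw_message)
    hops = message.get_all("Received", [])
    return [" ".join(hop.split()) for hop in hops]
```

Reading the list from the top down and stopping at the last server the archive controls gives a rough boundary between metadata that was generated in transit and metadata that merely arrived with the message.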

In general, email messages are not inherently more or less trustworthy than paper ones. Email can be self authenticating if signed using a public key; however, this is relatively uncommon outside DoD, the IC and associated entities.   

One unsolved cryptographic  problem that has especial significance for long term preservation and archival uses is the current lack of protocols providing Perfect Forward Secrecy (PFS). With  PFS, if a key is compromised at some point, it cannot be used to attack messages sent before the compromise occurs.  Although this property is relatively easy to provide for online communications, it is much harder to provide for digital signatures.  

In addition, many hash algorithms (esp. MD5, and probably SHA-1) are considered broken, so messages signed using those older algorithms need to be migrated.  It's almost as if  preservation were an active continuing process.

Simon

@article{Allman:2007,
Editor = {Allman, Thomas Y.},
Group = {Digital Preservation},
Journal = {The Sedona Conference\textregistered\  Journal},
Pages = {239--250},
Title = {{The Sedona Conference\textregistered\ commentary on email management: Guidelines for the  selection  of retention policy}},
Volume = {8},
Year = {2007}
}

Peter Van Garderen

Jun 17, 2011, 9:52:48 AM
to digital-...@googlegroups.com
Hi all,

After some evaluation we've settled on using MBOX as the preservation format in Archivematica. See http://archivematica.org/wiki/index.php?title=Email

In the February 0.7-alpha release we convert PST to MBOX. In the 0.7.1-alpha release due out next week we also identify attachments and convert them to their designated preservation and access copy formats. The access format for individual email messages is HTML. We should have a screencast up for the 0.7.1 release next week which demonstrates this functionality. We'll post a link to this group.

However, this is all still very much an initial attempt with gaps. See http://archivematica.org/wiki/index.php?title=Email_preservation

We are counting on the Archivematica early implementers to test this functionality and get back to us with critiques and further suggestions. This thread is also very helpful.

Cheers,

--peter

Peter Van Garderen
Archivematica Project Manager
--

Brenda C. B. Rocco

Jun 17, 2011, 10:01:55 AM
to digital-...@googlegroups.com
Hi all,

My concern is with e-mails that are archival documents: how their integrity and authenticity should be maintained, and how they should be managed over the long term.
That is in addition to the technological issues, of course!


 Brenda Rocco

Chris Prom

Jun 17, 2011, 10:21:57 AM
to digital-...@googlegroups.com
Thanks Peter, the cites are useful. 

Have you come to any conclusion as to which parser you will use? The biggest issue I see is that, aside from Aid4Mail, none of the programs are able to deal with a diversity of formats on the input side to get it into mbox. There seems to be a reasonable amount of support for pst, but not so much beyond that. Aid4Mail is nice because the latest version includes a scripting language, so you can export in any format you want. But it is proprietary and Windows only, so obviously that rules it out for Archivematica.

Best,

Chris
____

Chris Prom
chris...@gmail.com


Priscilla Caplan

Jun 17, 2011, 10:35:04 AM
to digital-...@googlegroups.com
I'm also doing an "environmental scan" on the state of email archiving and preservation for the state university libraries of Florida.  I had a long discussion with the programmer at Harvard who is implementing their EASI project.  They've settled on MBOX as their normalized format, which they then convert to the CERP XML schema for preservation.  He said he did an extensive analysis of conversion tools and settled on emailchemy.  It converts the largest number of formats reliably, and does the best job of preserving metadata in the conversion, including folder hierarchy and header tags.

p

Priscilla Caplan

Jun 17, 2011, 10:55:30 AM
to Chris Prom, digital-...@googlegroups.com
Actually, there's quite a bit more; Harvard is worth talking to. Once they convert to MBOX they parse using Mime4j, which does a good job of decoding the attachments, and index them in Solr. They have also done a lot of work on security, because mail contents can be so sensitive.
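
[Illustration] For anyone who wants to try the "parse, then decode the attachments" step without Java, the sketch below does the equivalent with Python's standard email package. It is not Harvard's pipeline and does not use Mime4j; the function names and the sample message are made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def list_attachments(raw_message):
    """Return (filename, content_type, decoded_bytes) for each attachment."""
    message = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in message.iter_attachments():
        found.append(
            (
                part.get_filename(),
                part.get_content_type(),
                # decode=True undoes the base64 / quoted-printable encoding
                part.get_payload(decode=True),
            )
        )
    return found


def build_sample():
    """Build a small hypothetical message with one attachment, as bytes."""
    message = EmailMessage()
    message["From"] = "sender@example.org"
    message["Subject"] = "Annual report"
    message.set_content("See attached.")
    message.add_attachment(
        b"%PDF-1.4 demo",
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return message.as_bytes()
```

The decoded bytes and the declared content type are what would then be handed to format identification and to the indexer.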

p

On 6/17/2011 10:41 AM, Chris Prom wrote:
Thanks Priscilla,  that is helpful.  I had planned to contact Harvard, but it sounds like you provided a good summary of their project.

Brenda C. B. Rocco

Jun 17, 2011, 12:29:59 PM
to digital-...@googlegroups.com, Chris Prom
Hi,
Does anyone know the Berkeley mailbox format?

Regards,
 Brenda Rocco

Simon Spero

Jun 17, 2011, 2:56:17 PM
to digital-...@googlegroups.com, Chris Prom
On Fri, Jun 17, 2011 at 12:29 PM, Brenda C. B. Rocco <brenda...@gmail.com> wrote:
Hi,
Does anyone know the Berkeley mailbox format?

That's what's being referred to as mbox format, with the From+Space separators.

Many Unix mailbox formats are supported by the UW c-client library, which was used to build Pine.  Supported formats are described here:

http://www.washington.edu/imap/documentation/formats.txt.html
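
To make the "From " separator concrete, here is a small sketch using Python's standard `mailbox` module, which reads this format. The sample text, the addresses, and the `read_subjects` helper are invented for illustration.

```python
import mailbox
import os
import tempfile

# Two messages in mbox form: each begins with a "From " line (the word
# From followed by a space), then the usual headers and body.
SAMPLE_MBOX = (
    "From alice@example.org Fri Jun 17 12:00:00 2011\n"
    "From: alice@example.org\n"
    "Subject: first message\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From bob@example.org Fri Jun 17 12:05:00 2011\n"
    "From: bob@example.org\n"
    "Subject: second message\n"
    "\n"
    "Body of the second message.\n"
)


def read_subjects(mbox_text: str) -> list[str]:
    """Write the text to a temporary file and list each message's Subject."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "w", newline="\n") as handle:
            handle.write(mbox_text)
        box = mailbox.mbox(path, create=False)
        try:
            return [message["Subject"] for message in box]
        finally:
            box.close()


subjects = read_subjects(SAMPLE_MBOX)
```

Note that the "From " separator line is distinct from the "From:" header inside each message; only the former marks a message boundary.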


matt.s...@metaarchive.org

Jun 22, 2011, 3:15:29 PM6/22/11
to Digital Curation
Hi Everybody,

Below is a starter agenda for the next all-groups meeting on the
development of a RESTful Bag Server
(https://github.com/acdha/restful-bag-server) scheduled for Friday,
June 24th at 3pm ET/2pm CT/12pm PT.
Call-in Information is: 1-270-400-2000, 282929#.

Feel free to suggest additional items before or on the call:

1. Update from Chris Adams (merges, changes, test suite, ref server,
etc.)
2. Justin Littman's Play Framework implementation (open issues) - see:
http://groups.google.com/group/digital-curation/browse_thread/thread/0e8dfdb8c5a3e2aa#
3. Mike Smorul's Java Bag Server (use cases) - see:
http://groups.google.com/group/digital-curation/browse_thread/thread/60c070bdee101ef3#
4. Mike Giarlo's F2F at CURATECamp 2011 (plans) - see:
http://groups.google.com/group/digital-curation/browse_thread/thread/95808ad5076b4738
5. Next steps?

Looking forward to talking with those who can join.

All best,

Matt Schultz
Collaborative Services Librarian
Educopia Institute, MetaArchive Cooperative
http://www.metaarchive.org
matt.s...@metaarchive.org
616-566-3204


On Jun 15, 4:11 pm, "matt.schu...@metaarchive.org"
<matt.schu...@metaarchive.org> wrote:
> Hi Everybody,
>
> A big thanks to Mike Smorul for tapping me on the shoulder to inquire
> about this month's scheduled call to discuss the RESTful Bag Server
> (https://github.com/acdha/restful-bag-server) development.
>
> This month's call is slated for next week Friday, June 24th at 3pm ET.
> Call-in info is 1-270-400-2000, 282929#.
>
> Consider this an open call for agenda items. There are a couple of
> relevant threads that have started (see below) since the previous call
> that was held on Friday, May 13th. We can definitely follow up on
> those. Notes from the May 13th call will be available shortly -
> apologies for the delay.
>
> Notes on implementing Restful Bag Server spec: http://groups.google.com/group/digital-curation/browse_thread/thread/...
> Bag Transfer Tools: http://groups.google.com/group/digital-curation/browse_thread/thread/...
>
> I'll shoot out a tentative agenda by COB next week Wednesday, June
> 22nd once we've had a chance to hear from folks. Look forward to
> catching up.
>
> All best,
>
> Matt Schultz
> Collaborative Services Librarian
> ...

matt.s...@metaarchive.org

Jun 23, 2011, 12:58:15 PM6/23/11
to Digital Curation
Hi Everybody,

Notes from our previous all groups RESTful Bag Server Meeting held on
Friday, May 13th are below. Apologies once again for the delay on
these. For those who were on the call, if I missed anything please
feel free to add to these notes.

Look forward to catching up tomorrow (Friday, 06/24) at 3pm ET,
1-270-400-2000, 282929#.

RESTful-Bag-Server Meeting 4
Minutes
05/13/2011

Attendees

1. Mike Burek (Chronopolis)
2. Esme Cowles (UCSD Libraries)
3. Declan Fleming (UCSD Libraries)
4. Mike Giarlo (Penn State)
5. Matt Schultz (Educopia)
6. Mike Smorul (Chronopolis)
7. Don Sutton (Chronopolis)

Minutes

The May call was brief and intended primarily as an opportunity for
new and interested groups or individuals to catch up on the
developments thus far. The call also provided an opportunity to check
in on Mike Smorul’s work to develop a useful tool that would layer
with the spec.

1. Brief Welcome & Catch-Up for New Callers

a. We were newly joined by Declan Fleming & Esme Cowles from the UCSD
Libraries
b. Matt provided a brief recap of the development work since last
December – goal has been to produce a spec for managing and tracking
the replication of BagIt-based data within and between preservation
repositories
c. Declan & Esme explained that they were interested in the
development work as collaborators with Chronopolis and perhaps for
other UCSD applications
d. Matt invited them to feel free to thoroughly review the github site
(https://github.com/acdha/restful-bag-server) and add a use case or
issue as desired

2. Overview of Java Bag Server

a. Mike Smorul gave a brief overview of his Java Bag Server – a set of
Java libraries and tools that provides a straightforward endpoint
directory setup for both creating and pushing/pulling bags
b. Mike mentioned that this will likely have direct applications for
Chronopolis, but it can also be embedded in other local applications,
and he would encourage other groups on this project to play with it
and provide feedback (now available here:
http://adaptvm01.umiacs.umd.edu:8080/jenkins/job/Chronopolis%20Ingestion%20tool/;
source code available here: https://subversion.umiacs.umd.edu/ingestion/trunk/)
• Mike Giarlo (Penn State) mentioned that he had spent some brief time
with it prior to the call and, though he couldn’t comment on the Java
framework or Penn’s implementation, he thought it would be useful
c. Mike Smorul would especially appreciate Chris Adams’s feedback in
terms of its integration with the proposed spec
d. Matt inquired about the timeline for its dissemination, and Mike
requested time to do some debugging and think about licensing (now
available here: http://adaptvm01.umiacs.umd.edu:8080/jenkins/job/Chronopolis%20Ingestion%20tool/;
source code available here: https://subversion.umiacs.umd.edu/ingestion/trunk/)

3. Additional Items Covered
a. Chris Adams was unable to attend but requested feedback from the
group on Mike Smorul’s proposal to relax the upload ordering
constraints to only require the manifest and files be “committed” upon
a final PUT – until then changes should probably be allowed (see Issue
#20: https://github.com/acdha/restful-bag-server/issues/20)
• Mike Smorul expanded on this discussion in regards to flexibility of
DELETE (see Issue #22: https://github.com/acdha/restful-bag-server/issues/22)
• Mike Smorul documented a proposed implementation (see Issue #23:
https://github.com/acdha/restful-bag-server/issues/23)
b. Ed Summers mentioned on the call that the BagIt spec has its own
notion of “commit” in terms of final validation of a manifest – should
revisit – may or may not have bearing
• Followed up after the call, based on Mike Smorul's comments, by posting an
Issue to the github site (see Issue #24: https://github.com/acdha/restful-bag-server/issues/24)
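
As one way to picture the relaxed ordering proposal in item 3a, here is a minimal in-memory sketch in Python. It is not the spec or anyone's implementation: the `StagedBag` class, its method names, and the choice of SHA-256 checksums are assumptions made for illustration. Files and the manifest can be stored, replaced, or deleted in any order, and nothing is validated until an explicit commit.

```python
import hashlib


class StagedBag:
    """Hypothetical bag whose contents stay mutable until commit()."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}
        self.manifest: dict[str, str] = {}  # path -> expected SHA-256 hex digest
        self.committed = False

    def _require_open(self) -> None:
        if self.committed:
            raise RuntimeError("bag is committed; no further changes allowed")

    def put_file(self, path: str, data: bytes) -> None:
        """Store or overwrite a payload file (stands in for an HTTP PUT)."""
        self._require_open()
        self.files[path] = data

    def delete_file(self, path: str) -> None:
        """Remove a staged file (stands in for an HTTP DELETE)."""
        self._require_open()
        self.files.pop(path, None)

    def put_manifest(self, manifest: dict[str, str]) -> None:
        """Store or replace the manifest; allowed before or after the files."""
        self._require_open()
        self.manifest = dict(manifest)

    def commit(self) -> None:
        """Validate files against the manifest, then freeze the bag."""
        self._require_open()
        if set(self.files) != set(self.manifest):
            raise ValueError("manifest and uploaded files do not match")
        for path, expected in self.manifest.items():
            actual = hashlib.sha256(self.files[path]).hexdigest()
            if actual != expected:
                raise ValueError(f"checksum mismatch for {path}")
        self.committed = True


# The manifest arrives first, a file is replaced, and a stray file is
# removed, all before the final commit.
bag = StagedBag()
good = b"hello bag\n"
bag.put_manifest({"data/hello.txt": hashlib.sha256(good).hexdigest()})
bag.put_file("data/hello.txt", b"draft")
bag.put_file("data/hello.txt", good)
bag.put_file("data/stray.txt", b"oops")
bag.delete_file("data/stray.txt")
bag.commit()
```

Deferring all checks to the commit step is what lets a client upload in whatever order is convenient; the cost is that errors surface late rather than at the offending request.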

4. Scheduling Future Calls
a. Matt checked in with the group on the monthly meeting schedule –
whether we needed to change dates/times or frequency of the calls?
• Agreed that for now the last Friday of each month at 3pm ET was
still working out
b. June call was scheduled for Friday, 06/24 at 3pm ET
(1-270-400-2000, 282929#)

All best,

Matt Schultz
Collaborative Services Librarian
Educopia Institute, MetaArchive Cooperative
http://www.metaarchive.org
matt.s...@metaarchive.org
616-566-3204

Ed Summers

Jun 24, 2011, 10:45:37 AM6/24/11
to digital-...@googlegroups.com
On Thu, Jun 23, 2011 at 12:58 PM, matt.s...@metaarchive.org
<matt.s...@metaarchive.org> wrote:
> 1.      Mike Burek (Chronopolis)
> 2.      Esme Cowles (UCSD Libraries)
> 3.      Declan Fleming (UCSD Libraries)
> 4.      Mike Giarlo (Penn State)
> 5.      Matt Schultz (Educopia)
> 6.      Mike Smorul (Chronopolis)
> 7.      Don Sutton (Chronopolis)

I was there too :-)

//Ed

Matt Schultz

Jun 24, 2011, 10:50:26 AM6/24/11
to digital-...@googlegroups.com
Hi Ed,

I know - huge apologies! I realized after I sent it out that I did not add you to the list. Even after I explicitly cited you in the notes from our conversation.

Big doh! on my part. Won't happen again...I hope : O

Ed Summers

Jun 24, 2011, 11:06:28 AM6/24/11
to digital-...@googlegroups.com
On Fri, Jun 24, 2011 at 10:50 AM, Matt Schultz
<matt.s...@metaarchive.org> wrote:
> Big doh! on my part. Won't happen again...I hope : O

No worries. That's what I get for arriving late :-)

//Ed

matt.s...@metaarchive.org

Jun 27, 2011, 4:11:43 PM6/27/11
to Digital Curation
Hi Everybody,

Notes from our previous all groups RESTful Bag Server Meeting held on
Friday, June 24th are below. For those who were on the call, if I
missed anything please feel free to add to these notes.

RESTful-Bag-Server Meeting 5
Minutes
06/24/2011

Attendees

1. Chris Adams (Library of Congress)
2. Mike Giarlo (Penn State)
3. Ben O’Steen (Oxford University)
4. Matt Schultz (Educopia)
5. Ed Summers (Library of Congress)

Minutes

1. Update from Chris Adams
a. Chris Adams (Library of Congress) mentioned that his schedule for
July was clearing up to allow him to begin closing out open Issues on
the github site. See here: https://github.com/acdha/restful-bag-server/issues
b. Chris also hopes to merge the versioning proposal and main branch
of the spec, and finish the test suite and reference server
implementation over the month of July

2. Justin Littman's Play Framework implementation
a. Justin Littman (Library of Congress) was not able to be on this
month’s call to provide further details
b. Discussion related to Justin’s initial implementation can be found
at: http://groups.google.com/group/digital-curation/browse_thread/thread/0e8dfdb8c5a3e2aa?pli=1

3. Mike Smorul's Java Bag Server
a. Mike Smorul (Chronopolis) was not able to be on this month’s call
to provide further details
b. Mike’s development implementation is available under a BSD license
for download and testing. See here for information and a link to the
software: http://groups.google.com/group/digital-curation/browse_thread/thread/60c070bdee101ef3#

4. Mike Giarlo's F2F at CURATECamp 2011
a. Mike Giarlo (Penn State) suggested that we check in on the July
call to get a final tally of attendees to CURATEcamp and discuss some
practical meetings for teaming up on work related to the spec

5. Additional Items
a. Ben O’Steen (Oxford University) suggested a use case and will
update the group on the next call

6. Next Steps
a. Matt is scheduling a mid-July catch-up call for the original group
that met in D.C. last December to check in on the overall direction of
the spec
b. The regular July call is scheduled for Friday, 07/29 at 3pm ET
(1-270-400-2000, 282929#)

All best,

Matt Schultz
Collaborative Services Librarian
Educopia Institute, MetaArchive Cooperative
http://www.metaarchive.org
matt.s...@metaarchive.org
616-566-3204

wmgogel

Jul 21, 2011, 3:55:34 PM7/21/11
to Digital Curation
Hi, All:

Thanks for citing our work (a direct link to the paper is here:
http://www.ifs.tuwien.ac.at/dp/ipres2010/papers/goethals-08.pdf ).
Since there have been a number of changes to our approach and
implementation, we'd like to give an informal update (remotely) to
interested parties in late summer/early fall. We'll post an invite to
this group when we have more details.

In the meantime, we are happy to answer questions:

Andrea Goethals is technical manager, one of the architects and is
working on the content model for long term preservation
(andrea_...@harvard.edu)
Chris Vicary is the primary software engineer and architect that
Priscilla mentions below (chris_...@harvard.edu)
And I (wendy...@harvard.edu) am the product manager (coordinating
functional requirements, UI design, Harvard testing partners, etc.)

- Wendy

On Jun 16, 2:52 pm, Chris Prom <chris.p...@gmail.com> wrote:
> Thanks.  FYI, there is a report out from iPRES last year, here is my zotero entry for it:
>
> Reshaping the Repository: The Challenge of Email Archiving
>
> Type    Conference Paper
> Author  Andrea Goethals
> Author  Wendy Gogel
> Abstract        Because of the historical value of email in the late 20th and 21st centuries, Harvard University Libraries began planning for an email archiving project in early 2007. A working group comprised of University archivists, curators, records managers, librarians and technologists studied the problem and recommended the undertaking of a pilot email archiving project at the University Library. This two-year pilot would implement a system for ingest, processing, preservation, and eventual end user delivery of email, in anticipation of it becoming an ongoing central service at the University after the pilot. This paper describes some of the unexpected challenges encountered during the pilot project and how they were addressed by design decisions. Key challenges included the requirement to design the system so that it could handle other types of born digital content in the future, and the effect of archiving email with sensitive data to Harvard’s preservation repository, the Digital Repository Service (DRS).
> Date    2010
> Proceedings Title       7th International Conference on Preservation of Digital Objects (iPRES2010)
> Place   Vienna, Austria
> URL    http://www.ifs.tuwien.ac.at/dp/ipres2010/schedule.html
> Accessed        Wed Jun 1 19:00:00 2011
> Date Added      Thu Jun 2 13:05:45 2011
> Modified        Thu Jun 2 13:10:34 2011
>
> Thanks,
>
> Chris
> ____
>
> Chris Prom
> chris.p...@gmail.com
>
> On Jun 16, 2011, at 1:00 PM, slabrams wrote:
>
> Hi Chris:
>
> Harvard has been working on an email archiving pilot project for a
> year or so, but I don't know if they have published any results.
>
> http://hul.harvard.edu/ois/digpres/projects.html
>
> You may want to contact Wendy Gogel and/or Andrea Goethals for more
> information.
>
> --sla
>
> On Jun 16, 7:36 am, Chris Prom <chris.p...@gmail.com> wrote:
>
> > I would be very interested in participating in a discussion of this topic, either via the googlegroup or call in.  
>
> > I am beginning a project to survey potential technical approaches to this topic, which will be published as a Tech Watch Report by the Digital Preservation Coalition (UK).  I am currently reviewing literature and will be posting notes on my blog as I go along.  Information about my project is here:
>
> > http://blog.beagrie.com/2011/05/23/dpc-and-charles-beagrie-limited-to...
>