bagit checksum algorithms

John Scancella

Jan 24, 2017, 3:44:09 PM
to Digital Curation
Hello all,

I work on bagit-java and I was wondering which checksum algorithms everyone uses with BagIt. How many people would want to be able to use algorithms like SHA-3 that don't come standard with the JVM?

Thanks
John Scancella

Bertram Lyons

Jan 24, 2017, 3:54:52 PM
to digital-...@googlegroups.com
John --

We use MD5 and SHA-256.

- Bert

______________________________________
 
Bertram Lyons, CA
AVPreserve
634 W. Main St., Ste 202
Madison, Wisconsin 53703
 
office: 202-430-4457

https://www.avpreserve.com 
Facebook.com/AVPreserve
twitter.com/AVPreserve

--
You received this message because you are subscribed to the Google Groups "Digital Curation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digital-curation+unsubscribe@googlegroups.com.
To post to this group, send email to digital-curation@googlegroups.com.
Visit this group at https://groups.google.com/group/digital-curation.
For more options, visit https://groups.google.com/d/optout.

Henk Vanstappen

Jan 25, 2017, 11:02:31 AM
to Digital Curation
Hi John,

I've never used anything but MD5, as I've had no reason to use any other.
But I'm curious to know about use cases that require other checksums.

Henk

John Scancella

Jan 25, 2017, 12:00:50 PM
to Digital Curation
Hi Henk,

The main use case would be for hash collisions. MD5 produces a 128-bit digest, so by the birthday problem you would need on the order of 2^64 files before a collision between two of them becomes likely. SHA-256 produces a 256-bit digest (as its name implies), which raises that threshold to roughly 2^128 files; doubling the digest length squares the number of files needed, so an accidental collision becomes astronomically less likely. Here at the Library of Congress we have an unimaginable number of files, so from a curation point of view it makes sense for us to use as many bits as possible to ensure we don't get a collision.
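
As a rough illustration of the birthday arithmetic, the sketch below uses the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n files and a b-bit digest, assuming digests behave like uniformly random values. The function names and the one-trillion-file collection size are assumptions made up for this example, not figures from anyone in this thread.

```python
import math


def collision_probability(num_files: int, digest_bits: int) -> float:
    """Approximate chance that at least two of num_files random digests collide."""
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    exponent = -(num_files * (num_files - 1)) / float(2 ** (digest_bits + 1))
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


def files_for_even_odds(digest_bits: int) -> float:
    """Number of random digests needed for roughly a 50% collision chance."""
    return math.sqrt(2 * math.log(2)) * 2 ** (digest_bits / 2)


if __name__ == "__main__":
    n = 10 ** 12  # hypothetical collection size: one trillion files
    for name, bits in (("MD5", 128), ("SHA-256", 256)):
        print(name, collision_probability(n, bits), files_for_even_odds(bits))
```

This only models accidental collisions between honest files. Deliberately constructed MD5 collisions are a separate matter and are far easier to produce than the birthday bound suggests.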

There is also the case that, since you have to read the entire file anyway, you can feed the same bytes to several checksum algorithms in a single pass, in case a weakness is discovered in one or more of the current algorithms in the future.
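
A minimal sketch of that single-pass idea, using Python's standard hashlib rather than bagit-java: each chunk read from disk is fed to every hash object before the next chunk is read. The function name, the default algorithm list, and the chunk size are assumptions for illustration.

```python
import hashlib


def multi_digest(path, algorithms=("md5", "sha256", "sha512"), chunk_size=1024 * 1024):
    """Read the file once and return a dict of algorithm name -> hex digest."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

For large files the disk read usually dominates, so adding a second or third algorithm tends to cost much less than a second pass over the data would.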

Simon Spero

Jan 25, 2017, 12:41:14 PM
to digital-...@googlegroups.com
On Jan 24, 2017 3:44 PM, "John Scancella" <blacksmi...@gmail.com> wrote:
Hello all,

I work on bagit-java and I was wondering which checksum algorithms everyone uses with BagIt. How many people would want to be able to use algorithms like SHA-3 that don't come standard with the JVM?

SHA-3 is not seeing wide adoption, as SHA-2 has proven to be more resilient than was feared, and the focus of the next generation of cryptographic suites will be stronger protection against quantum-computing-based attacks.

For security use, SHA-384 should be used for all classified materials, but SHA-256 is adequate for non-adversarial applications; SHA-512 may be better for archival time frames.

If multiple hash algorithms are used, then they must be used in parallel, not in series, and it must not be possible to remove one set of hashes without detection. This currently requires agreements outside of bagit. 

Simon

John Scancella

Jan 25, 2017, 12:46:45 PM
to Digital Curation
Hi Simon,

I was indeed referring to using them in parallel, i.e. manifest-md5.txt as well as manifest-sha256.txt, etc.
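
To make the parallel layout concrete, here is a hedged Python sketch that writes one manifest per algorithm for the files under a bag's data/ directory. The function name is hypothetical, the line layout (digest, whitespace, path relative to the bag root) is a simplified reading of the BagIt manifest format, and it ignores details such as path encoding and tag manifests.

```python
import hashlib
from pathlib import Path


def write_parallel_manifests(bag_dir, algorithms=("md5", "sha256")):
    """Write manifest-<algorithm>.txt for each algorithm; return the paths written."""
    bag = Path(bag_dir)
    payload = sorted(p for p in (bag / "data").rglob("*") if p.is_file())
    lines = {name: [] for name in algorithms}
    for path in payload:
        # Reading whole files is fine for a sketch; stream large files instead.
        content = path.read_bytes()
        rel = path.relative_to(bag).as_posix()
        for name in algorithms:
            digest = hashlib.new(name, content).hexdigest()
            lines[name].append(f"{digest}  {rel}\n")
    written = []
    for name in algorithms:
        manifest = bag / f"manifest-{name}.txt"
        manifest.write_text("".join(lines[name]), encoding="utf-8")
        written.append(manifest)
    return written
```

Each payload file is read once and contributes one line to every manifest, so the manifests always list the same set of files.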

Geoff Froh

Jan 25, 2017, 1:08:08 PM
to digital-...@googlegroups.com
Hi John,

On Wed, Jan 25, 2017 at 9:00 AM, John Scancella <blacksmi...@gmail.com> wrote:

There is also the case that, since you have to read the entire file anyway, you can feed the same bytes to several checksum algorithms in a single pass, in case a weakness is discovered in one or more of the current algorithms in the future.

Exactly the reason we use both MD5 and SHA256. They also seem the most commonly deployed in the software in our toolchain. 

Thanks for your work on this,

Geoff

--
Geoff Froh
Deputy Director

Densho
1416 S. Jackson St.
Seattle, WA 98144
US
(skype) geoff.froh

Shira Peltzman

Jan 25, 2017, 1:08:10 PM
to Digital Curation
We use MD5 and SHA-256 as well, and it would be great to see support for the latter.

Shira

Simon Spero

Jan 25, 2017, 1:08:25 PM
to digital-...@googlegroups.com
On Jan 25, 2017 12:46 PM, "John Scancella" <blacksmi...@gmail.com> wrote:
Hi Simon,

I was indeed referring to using them in parallel, i.e. manifest-md5.txt as well as manifest-sha256.txt, etc.

I figured :-) ; I was thinking about how to ensure that the bag is sent with the right hashes (it also makes digital signatures a bit annoying :) 

Simon

Andrew Berger

Jan 25, 2017, 2:01:55 PM
to Digital Curation
We use MD5 and SHA-512 (from within Archivematica -- this is what goes with our AIPs).

Andrew

Chris Adams

Jan 25, 2017, 3:13:43 PM
to digital-...@googlegroups.com
On Wed, Jan 25, 2017 at 12:00 PM, John Scancella <blacksmi...@gmail.com> wrote:
The main use case would be for hash collisions. MD5 produces a 128-bit digest, so by the birthday problem you would need on the order of 2^64 files before a collision between two of them becomes likely. SHA-256 produces a 256-bit digest (as its name implies), which raises that threshold to roughly 2^128 files; doubling the digest length squares the number of files needed, so an accidental collision becomes astronomically less likely. Here at the Library of Congress we have an unimaginable number of files, so from a curation point of view it makes sense for us to use as many bits as possible to ensure we don't get a collision.

Beyond needing to archive things like security researchers' work, one reason for doing this is that the tools we use are built on standard crypto libraries, so algorithm availability is driven by the larger security community. That affects performance (recent processors often have dedicated SHA-2 support) but, more importantly, it also means that algorithms now considered insecure may not be available at all.

One of the challenges we encountered while working on https://github.com/LibraryOfCongress/bagger-js was that the WebCrypto API and asmcrypto.js don't implement MD5 at all (browser SHA-2 performance is actually surprisingly good). That isn't a deal-breaker for creating new bags, but it means that anyone trying to validate existing bags in a JavaScript-based environment either needs to add extra complexity or be ready to do some sort of in-place upgrade that adds manifests using newer algorithms.
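
One way such an in-place upgrade might look, sketched in Python rather than in bagger-js: verify every payload file against the existing manifest first, and only then write a manifest for the newer algorithm. The function name and error handling are assumptions for illustration, and a real tool would likely also want to update any tag manifests.

```python
import hashlib
from pathlib import Path


def upgrade_manifest(bag_dir, old="md5", new="sha256"):
    """Check files against manifest-<old>.txt, then write manifest-<new>.txt."""
    bag = Path(bag_dir)
    old_manifest = bag / f"manifest-{old}.txt"
    new_lines = []
    for line in old_manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line is "<digest><whitespace><path relative to the bag root>".
        expected, rel = line.split(None, 1)
        content = (bag / rel).read_bytes()
        if hashlib.new(old, content).hexdigest() != expected.lower():
            raise ValueError(f"{old} mismatch for {rel}")
        new_lines.append(f"{hashlib.new(new, content).hexdigest()}  {rel}\n")
    # Nothing is written unless every file passed the old check.
    target = bag / f"manifest-{new}.txt"
    target.write_text("".join(new_lines), encoding="utf-8")
    return target
```

Because the new digest is computed from the same bytes that were just verified, the new manifest inherits whatever assurance the old one gave at the moment of the upgrade.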

Chris 

Geoff Froh

Jan 25, 2017, 4:58:50 PM
to digital-...@googlegroups.com
On Wed, Jan 25, 2017 at 12:13 PM, Chris Adams <ch...@improbable.org> wrote:

On Wed, Jan 25, 2017 at 12:00 PM, John Scancella <blacksmi...@gmail.com> wrote:
The main use case would be for hash collisions. MD5 produces a 128-bit digest, so by the birthday problem you would need on the order of 2^64 files before a collision between two of them becomes likely. SHA-256 produces a 256-bit digest (as its name implies), which raises that threshold to roughly 2^128 files; doubling the digest length squares the number of files needed, so an accidental collision becomes astronomically less likely. Here at the Library of Congress we have an unimaginable number of files, so from a curation point of view it makes sense for us to use as many bits as possible to ensure we don't get a collision.

Beyond needing to archive things like security researchers' work, one reason for doing this is that the tools we use are built on standard crypto libraries, so algorithm availability is driven by the larger security community. That affects performance (recent processors often have dedicated SHA-2 support) but, more importantly, it also means that algorithms now considered insecure may not be available at all.

Probably a dumb question, but are we discussing using the algorithms for the purposes of hashing files in the bag (i.e., bit auditing) or for encrypting the bag itself? Those seem like two fairly different use cases.

Chris Adams

Jan 25, 2017, 5:07:56 PM
to digital-...@googlegroups.com
On Wed, Jan 25, 2017 at 4:02 PM, Geoff Froh <geoff...@densho.org> wrote:
On Wed, Jan 25, 2017 at 12:13 PM, Chris Adams <ch...@improbable.org> wrote:
Beyond needing to archive things like security researchers' work, one reason for doing this is that the tools we use are built on standard crypto libraries, so algorithm availability is driven by the larger security community. That affects performance (recent processors often have dedicated SHA-2 support) but, more importantly, it also means that algorithms now considered insecure may not be available at all.

Probably a dumb question, but are we discussing using the algorithms for the purposes of hashing files in the bag (i.e., bit auditing) or for encrypting the bag itself? Those seem like two fairly different use cases.


Just hashing – the catch is that in practice this almost always means using a general-purpose crypto library like OpenSSL so we're affected by decisions made to protect against active attackers even though most of us are only concerned with preventing accidental bit-level corruption.
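
A small Python sketch of what that dependence looks like in practice: the set of usable algorithms is whatever the underlying library build allows, so a tool can probe for it instead of assuming. The function name and the default list of algorithm names are assumptions for this example.

```python
import hashlib


def usable_algorithms(wanted=("md5", "sha256", "sha512", "sha3_256")):
    """Return the subset of wanted algorithm names this runtime can construct."""
    usable = []
    for name in wanted:
        try:
            hashlib.new(name)
        except ValueError:
            # Raised for unknown names and for algorithms the build has disabled.
            continue
        usable.append(name)
    return usable
```

On a build that disables MD5 for policy reasons, "md5" would simply be missing from the result, which is the situation where validating older bags needs a fallback.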

Chris 

Geoff Froh

Jan 25, 2017, 5:57:45 PM
to digital-...@googlegroups.com
Hi Chris,

Makes sense! Guess we might be rolling our own for all those legacy CRC32 hashes in the coming years...

Thanks,

Geoff 

Simon Spero

Jan 25, 2017, 6:10:59 PM
to digital-...@googlegroups.com
On Jan 25, 2017 3:13 PM, "Chris Adams" <ch...@improbable.org> wrote:

One of the challenges we encountered while working on https://github.com/LibraryOfCongress/bagger-js was that the WebCrypto API and asmcrypto.js don't implement MD5 at all (browser SHA-2 performance is actually surprisingly good). That isn't a deal-breaker for creating new bags, but it means that anyone trying to validate existing bags in a JavaScript-based environment either needs to add extra complexity or be ready to do some sort of in-place upgrade that adds manifests using newer algorithms.

There are a load of old MD5 implementations in JavaScript, e.g. http://www.myersdaily.org/joseph/javascript/md5-text.html

Not including MD5 in a crypto suite is a feature :)

Simon Spero

Jan 25, 2017, 6:16:22 PM
to digital-...@googlegroups.com
On Jan 25, 2017 5:57 PM, "Geoff Froh" <geoff...@densho.org> wrote:
Hi Chris,

Makes sense! Guess we might be rolling our own for all those legacy CRC32 hashes in the coming years...

Or, if you have an Intel processor with SSE 4.2, use the crc32 instruction?
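
One caveat worth keeping in mind: the SSE 4.2 crc32 instruction computes CRC-32C (the Castagnoli polynomial), which gives different values from the CRC-32 used by zip, gzip and PNG. If the legacy values came from tools of that kind, a software implementation such as the one in zlib is probably what is needed. The sketch below assumes exactly that; the function name and chunk size are made up for illustration.

```python
import zlib


def file_crc32(path, chunk_size=1024 * 1024):
    """Return the zlib-style CRC-32 of a file as 8 lowercase hex digits."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Passing the running value lets the checksum span multiple chunks.
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"
```

CRC-32 is only an error-detection code, so it helps with spotting accidental corruption but offers no protection against deliberate tampering.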