Digesting streams: pipes-cryptohash

100 views
Skip to first unread message

Nicolas Trangez

unread,
Jul 2, 2014, 5:49:39 PM7/2/14
to haskel...@googlegroups.com
All,

Recently I was looking for a way to calculate the SHA256 hash of a
stream of ByteStrings, but not only calculate this hash: also pass on
the stream so it could be stored to a file (use-case: receiving data
from a socket, which needs to be saved to a file, named after a hash of
the content, then returning this 'name' to the client).

I couldn't find any such library of-the-shelve in the Pipes ecosystem,
so I tried to implement something myself (with some help from IRC), and
just now pushed a first version of the code to GitHub.

So, consider this a 'request for feedback/review' :-)

The hashing itself is based on the 'cryptohash' library.

There are 2 interfaces:
- One where a (strict or lazy) ByteString Producer can be wrapped,
resulting in a new Producer which will yield the exact same values as
the original one, but tupling the result of the original Producer with a
digest. E.g.

hash :: (HashAlgorithm a, Monad m)
=> Producer ByteString (Producer ByteString m) r
-> Producer ByteString m (r, Digest a)

- One which acts as a real 'Pipe' which passes along every (strict or
lazy) ByteString it receives, and updates a hashing Context kept in a
State layer of the monad stack. E.g.

hashPipe :: (HashAlgorithm a, MonadState (Context a) m)
=> Pipe ByteString ByteString m r


These have some variants: working over streams of strict or lazy
bytestrings, requiring type inference to determine the hash type, or an
explicit value, or allowing to pass in an existing context (e.g.
generated by another stream before).

Code @ https://github.com/NicolasT/pipes-cryptohash/

Please let me know if you think the API could be enhanced, I'm doing
something utterly wrong,...

Thanks!

Nicolas

Gabriel Gonzalez

unread,
Jul 2, 2014, 6:31:49 PM7/2/14
to haskel...@googlegroups.com, nic...@incubaid.com
I think the interface I would recommend would be:

hash :: (HashAlgorithm a, Monad m) => Producer ByteString m r ->
Producer ByteString m (Digest, r)

I noticed your version has two `Producer` layers. What's the role of
the second `Producer`?

Nicolas Trangez

unread,
Jul 2, 2014, 6:59:49 PM7/2/14
to haskel...@googlegroups.com
On Wed, 2014-07-02 at 15:31 -0700, Gabriel Gonzalez wrote:
> I think the interface I would recommend would be:
>
> hash :: (HashAlgorithm a, Monad m) => Producer ByteString m r ->
> Producer ByteString m (Digest, r)
>
> I noticed your version has two `Producer` layers. What's the role of
> the second `Producer`?

Good question :-) I expected to be able to write the function according
to the type you point out above, but when I wrote the thing, GHC
insisted the base `Producer` was required, and I tend to trust GHC when
it tells me such things.

Now, since you made the same remark, and I tend to trust you as well
(;-)), I went back to the code and figured out all I need is a `lift`...

So, first API change:

hash :: (HashAlgorithm a, Monad m)
=> Producer ByteString m r
-> Producer ByteString m (r, Digest a)

Much better!

Now I'm wondering whether I'm overlooking any function in the Pipes
library which does what my internal function does:

hashInternal :: Monad m
=> (a -> b -> a)
-> a
-> Producer b m r
-> Producer b m (r, a)
hashInternal f = loop
where
loop !ctx p =
lift (next p) >>=
either
(\r -> return (r, ctx))
(\(b, p') -> yield b >> loop (f ctx b) p')

This seems like a fairly reasonable/common thing to do, no (modulo the
strictness, could be pushed into `f`)?

Thanks,

Nicolas

Gabriel Gonzalez

unread,
Jul 2, 2014, 8:39:18 PM7/2/14
to haskel...@googlegroups.com, nic...@incubaid.com
This can be solved really elegantly by adding this function to
`Pipes.Prelude`:

-- I need a better name for this
andFold :: Monad m => (x -> a -> x) -> x -> (x -> b) -> Producer a
m r -> Producer a m (r, b)

Then you can make your code much more reusable by writing it as a `Fold`
from `foldl`:

hash :: HashAlgorithm a => Fold ByteString (Digest a)

Then you would apply `hash` to a `Producer` using:

purely andFold hash :: HashAlgorithm a => Producer ByteString m r
-> Producer ByteSTring m (r, Digest a)

The benefit of factoring out the `hash` logic into a `Fold` is that:

* You can then use it to fold other containers, like lists or vectors:

Control.Foldl.fold hash :: Foldable f => f ByteString -> Digest a

`conduit` also supports folding using the `Fold` type in
`conduit-extras` so your solution then generalizes to conduits, too.

* You can apply other folds at the same time using the `Applicative`
instance for `Fold`:

purely andFold ((,) <$> hash <*> Control.Foldl.ByteString.length)
:: Producer ByteString m r -> Producer ByteString m (r, (Digest
a, Int64))

The above code computes the hash and the length in bytes simultaneously
in a single pass over the data

* People can use your `hash` `Fold` without a `pipes` dependency. I
design all my libraries to minimize coupling like this.

Nicolas Trangez

unread,
Jul 3, 2014, 5:55:38 AM7/3/14
to Gabriel Gonzalez, haskel...@googlegroups.com
On Wed, 2014-07-02 at 17:39 -0700, Gabriel Gonzalez wrote:
> This can be solved really elegantly by adding this function to
> `Pipes.Prelude`:
>
> -- I need a better name for this
> andFold :: Monad m => (x -> a -> x) -> x -> (x -> b) -> Producer a
> m r -> Producer a m (r, b)

Yes please!

>
> Then you can make your code much more reusable by writing it as a `Fold`
> from `foldl`:
>
> hash :: HashAlgorithm a => Fold ByteString (Digest a)
>
> Then you would apply `hash` to a `Producer` using:
>
> purely andFold hash :: HashAlgorithm a => Producer ByteString m r
> -> Producer ByteSTring m (r, Digest a)
>
> The benefit of factoring out the `hash` logic into a `Fold` is that:
>
> * You can then use it to fold other containers, like lists or vectors:
>
> Control.Foldl.fold hash :: Foldable f => f ByteString -> Digest a
>
> `conduit` also supports folding using the `Fold` type in
> `conduit-extras` so your solution then generalizes to conduits, too.
>
> * You can apply other folds at the same time using the `Applicative`
> instance for `Fold`:
>
> purely andFold ((,) <$> hash <*> Control.Foldl.ByteString.length)
> :: Producer ByteString m r -> Producer ByteString m (r, (Digest
> a, Int64))
>
> The above code computes the hash and the length in bytes simultaneously
> in a single pass over the data
>
> * People can use your `hash` `Fold` without a `pipes` dependency. I
> design all my libraries to minimize coupling like this.

I'm familiar with the `foldl` library (even created a sort-of port to
OCaml), and it's really nice! If the `andFold` would be in `pipes`, I'd
certainly change my code to use a `Fold` instead. Might even try to
convince to get such `Fold` in `cryptohash` upstream, so my package
becomes obsolete.

Nicolas

Gabriel Gonzalez

unread,
Jul 3, 2014, 10:17:18 AM7/3/14
to Nicolas Trangez, haskel...@googlegroups.com
Alright, then I'll add `andFold` to `pipes` (but possibly with a
different name).

Gabriel Gonzalez

unread,
Jul 3, 2014, 10:33:55 AM7/3/14
to Nicolas Trangez, haskel...@googlegroups.com
I also forgot to mention another benefit of making it a `Fold`: you can
use it with `Pipes.Parse.foldAll`. For example, to compute the hash of
the first 100 bytes of a stream, you would write:

zoom (splitAt 100) (purely foldAll hash) :: HashAlgorithm a =>
Parser ByteString (Digest a)

Nicolas Trangez

unread,
Jul 3, 2014, 11:56:52 AM7/3/14
to Gabriel Gonzalez, haskel...@googlegroups.com
On Thu, 2014-07-03 at 07:17 -0700, Gabriel Gonzalez wrote:
> Alright, then I'll add `andFold` to `pipes` (but possibly with a
> different name).

FWIW, I called it `catWithFold` for now.

Nicolas

Gabriel Gonzalez

unread,
Sep 2, 2014, 7:12:41 PM9/2/14
to Nicolas Trangez, haskel...@googlegroups.com
I ended up adding this (and the monadic equivalent) as `fold'` and
`foldM'`. It's not up on Hackage yet, but it will be soon.
Reply all
Reply to author
Forward
0 new messages