Leftovers and data loss

Showing 1-8 of 8 messages
Leftovers and data loss Michael Snoyman 6/4/12 10:00 PM

I want to explore a bit the connection between leftovers and dataloss. To make this concrete, I've put some code below that uses the current development branch of conduit. I'm aware of three kinds of data loss that can occur, demonstrated by the three examples.

1. Returning leftovers without consuming input will cause data loss of the leftovers themselves in conduit 0.4. I believe this is inherent when you model leftovers via a `Maybe`. The switch to the `Leftover` constructor in 0.5 gets rid of this problem. In retrospect, I believe using a list of leftovers instead of Maybe would also solve this. (Chris: this might work for your layered Pipe approach as well, I'm not sure.)
2. The second kind of data loss occurs specifically because of a lack of leftovers. The best way to demonstrate this is via chunked data, such as a ByteString or Text. In the example below, we try to consume 4 bytes, but the first chunk only has 3 bytes. Therefore, the first call to `take` will consume the first two chunks, and the return part of the second chunk as leftovers. Without leftover support, the remainder of the second chunk would be entirely lost. Given that a very large amount of streaming code involves these kinds of chunked datatypes, this is a very serious concern.
3. This kind of data loss in inherent to streaming. Whenever you have a stream modifier (e.g., Enumeratee, Conduit) which produces more output values than it consumes, those extra values can be lost. In the example below, the -3 generated by the call to concatMap is lost. However, this has nothing to do with leftovers: I believe the equivalent code in pipes or pipes-core would have the exact same data loss.

So my contention is: a library lacking leftovers support is missing support for the most common streaming use cases, and adding leftovers to the library does not add any additional sources for data loss.

Also, if we go with the presumption that leftover is needed in most streaming code, having two separate data types (one with leftover support, one without) will lead to a lot of difficulty in getting components to integrate, and therefore either ugly user code or lots of library duplication.

Michael

{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Binary as CB
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy as L

main = do
let print' f = f >>= print

-- The equivalent would cause data loss in conduit 0.4. However, this
-- usage is still not recommended, if only because the meaning is not yet
-- well understood.
putStrLn "Example 1"
print' $ CL.sourceList [1..10] $$ do
CL.drop 3
leftover 3
leftover 2
leftover 1
CL.consume
-- output: [1,2,3,4,5,6,7,8,9,10]

-- Without leftover handling, the following example cannot be implemented.
-- We would have data loss from consuming the beginning of the second
-- chunk.
putStrLn "Example 2"
print' $ CL.sourceList ["foo", "bar", "baz"] $$ do
x <- CB.take 4
y <- CB.take 4
z <- CB.take 4
let toStrict = S.concat . L.toChunks
return $ map toStrict [x, y, z]
-- output: ["foob","arba","z"]

-- Discarding downstream leftovers. The result would be the same if there
-- was no leftover support built into the library. In other words,
-- leftovers do not cause the problem here.
putStrLn "Example 3"
print' $ CL.sourceList [1..10] $$ do
x <- CL.concatMap (\i -> [i, negate i]) =$ CL.take 5
y <- CL.consume
return (x, y)
-- output: ([1,-1,2,-2,3],[4,5,6,7,8,9,10])

Re: Leftovers and data loss Aristid Breitkreuz 6/5/12 12:56 AM

Hi Michael,

This leaves me a bit stumped. I have to admit to not having followed conduit development very closely recently, but I would like to get up to speed again, as it seems like these are some interesting issues.

Could you please forward some relevant discussions to this list, and/or write a short summary of what changed and why and what is planned? (I have looked at the conduit 0.4 code and noticed the new Pipe type, but it seems like some context might be helpful.)

Aristid

--
You received this message because you are subscribed to the Google Groups "streaming-haskell" group.
To post to this group, send email to streamin...@googlegroups.com.
To unsubscribe from this group, send email to streaming-hask...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/streaming-haskell?hl=en.
Re: Leftovers and data loss Paolo Capriotti 6/5/12 2:48 AM
On Tue, Jun 5, 2012 at 6:00 AM, Michael Snoyman <mic...@snoyman.com> wrote:
> I want to explore a bit the connection between leftovers and dataloss. To
> make this concrete, I've put some code below that uses the current
> development branch of conduit. I'm aware of three kinds of data loss that
> can occur, demonstrated by the three examples.
>
> 1. Returning leftovers without consuming input will cause data loss of the
> leftovers themselves in conduit 0.4. I believe this is inherent when you
> model leftovers via a `Maybe`. The switch to the `Leftover` constructor in
> 0.5 gets rid of this problem. In retrospect, I believe using a list of
> leftovers instead of Maybe would also solve this. (Chris: this might work
> for your layered Pipe approach as well, I'm not sure.)

Failure of the identity law could still, to some extent, be considered a form
of data loss. Normally, you have

    CL.map id =$= s   ===   s

but if `s = leftover 1`, then the equality does not hold, and you end up losing
data in something like:

    (CL.map id =$= s) >> await

wrt to

    s >> await

I know conduit never claims to satisfy the category laws, but I find it hard to
justify why something like `CL.map id` should behave as an identity sometimes,
but now always.

> 2. The second kind of data loss occurs specifically because of a lack of
> leftovers. The best way to demonstrate this is via chunked data, such as a
> ByteString or Text. In the example below, we try to consume 4 bytes, but the
> first chunk only has 3 bytes. Therefore, the first call to `take` will
> consume the first two chunks, and the return part of the second chunk as
> leftovers. Without leftover support, the remainder of the second chunk would
> be entirely lost. Given that a very large amount of streaming code involves
> these kinds of chunked datatypes, this is a very serious concern.

I've addressed this a few times. I believe there is a false dilemma here. The
choice is not between supporting leftovers or not, as there are ways to layer
them on top of a solution without them.

The choice is between a library with direct support for leftovers, that can
however be used incorrectly producing surprising results (and violating the
category laws), and one that requires some explicit intervention on the part of
the user (in the form of newtype unwrapping and explicit conversion) so that
they can only assemble pipes in ways that actually make sense.

I'm really skeptical of the fact that handling leftovers is so important in
practice to justify having them in the core even at the cost of breaking
fundamental properties. Can you show me some examples of real-world
conduit-based code that makes heavy use of this feature?

> 3. This kind of data loss in inherent to streaming. Whenever you have a
> stream modifier (e.g., Enumeratee, Conduit) which produces more output
> values than it consumes, those extra values can be lost. In the example
> below, the -3 generated by the call to concatMap is lost. However, this has
> nothing to do with leftovers: I believe the equivalent code in pipes or
> pipes-core would have the exact same data loss.

So you think there is a problem here? I think this example behaves exactly as
expected.  If you want to preserve that -3 value, you should write the pipe in
a different way. With pipes-core, it would be:

import qualified Control.Pipe.Combinators as C

      C.fromList [1..10]
  >+> C.pipeList (\i -> [i, negate i])
  >+> ((,) <$> replicate 5 await <*> C.consume)

> So my contention is: a library lacking leftovers support is missing support
> for the most common streaming use cases, and adding leftovers to the library
> does not add any additional sources for data loss.
>
> Also, if we go with the presumption that leftover is needed in most
> streaming code, having two separate data types (one with leftover support,
> one without) will lead to a lot of difficulty in getting components to
> integrate, and therefore either ugly user code or lots of library
> duplication.

The difficulty you mention is inherent in the problem (as the above example
with the identity pipe shows), and not caused by the separate abstraction.

When you expose a pipe that deals with leftovers as part of your API, you still
have to think about how that would fit inside a generic pipeline. Can it be
composed with other pipes? Can it be used inside a `do` block? What happens to
leftovers produced by it when you compose it?

The separation between leftover-capable pipes and "normal" pipes gives you the
ability to provide answers for those questions directly in the types.

Using pipes-core terminology, if the exposed pipe is a `PutbackPipe`, then you
can use it in a `do` block with other `PutbackPipe`s, but you cannot compose it
directly.

If you want to compose it, you first have to convert it to a normal
pipe (using `runPutback`), which will cause any additional leftovers to be
discarded.

The explicit extra step is a good thing, IMHO, because it forces you to think
about what should happen to the leftovers, instead of just swallowing them
implicitly.

If, on the other hand, the library exposes a regular `Pipe`, then it is
understood that it deals with leftovers internally, so you are free to compose
it as you wish.

BR,
Paolo
Re: Leftovers and data loss Michael Snoyman 6/5/12 5:42 AM
Hi Aristid,

I'm in the middle of writing up a blog post on the motivation behind
the changes we're discussing right now for conduit 0.5. I probably
won't have a chance to finish it in the near future (especially since
the ideas are likely to change based on discussions here), but I've
put up what I've got at:

https://gist.github.com/2874732

Obvious caveats to non-proofread stuff applies ;).

Michael
Re: Leftovers and data loss Michael Snoyman 6/5/12 5:57 AM
On Tue, Jun 5, 2012 at 12:48 PM, Paolo Capriotti <p.cap...@gmail.com> wrote:
> On Tue, Jun 5, 2012 at 6:00 AM, Michael Snoyman <mic...@snoyman.com> wrote:
>> I want to explore a bit the connection between leftovers and dataloss. To
>> make this concrete, I've put some code below that uses the current
>> development branch of conduit. I'm aware of three kinds of data loss that
>> can occur, demonstrated by the three examples.
>>
>> 1. Returning leftovers without consuming input will cause data loss of the
>> leftovers themselves in conduit 0.4. I believe this is inherent when you
>> model leftovers via a `Maybe`. The switch to the `Leftover` constructor in
>> 0.5 gets rid of this problem. In retrospect, I believe using a list of
>> leftovers instead of Maybe would also solve this. (Chris: this might work
>> for your layered Pipe approach as well, I'm not sure.)
>
> Failure of the identity law could still, to some extent, be considered a form
> of data loss. Normally, you have
>
>    CL.map id =$= s   ===   s
>
> but if `s = leftover 1`, then the equality does not hold, and you end up losing
> data in something like:
>
>    (CL.map id =$= s) >> await
>
> wrt to
>
>    s >> await
>
> I know conduit never claims to satisfy the category laws, but I find it hard to
> justify why something like `CL.map id` should behave as an identity sometimes,
> but now always.

That's a fair point. I would argue that this isn't data loss, as we're
losing data that couldn't be captured without leftovers. But I agree
that it's surprising. I have an idea on this, I'll explain at the end
of this email.

>> 2. The second kind of data loss occurs specifically because of a lack of
>> leftovers. The best way to demonstrate this is via chunked data, such as a
>> ByteString or Text. In the example below, we try to consume 4 bytes, but the
>> first chunk only has 3 bytes. Therefore, the first call to `take` will
>> consume the first two chunks, and the return part of the second chunk as
>> leftovers. Without leftover support, the remainder of the second chunk would
>> be entirely lost. Given that a very large amount of streaming code involves
>> these kinds of chunked datatypes, this is a very serious concern.
>
> I've addressed this a few times. I believe there is a false dilemma here. The
> choice is not between supporting leftovers or not, as there are ways to layer
> them on top of a solution without them.
>
> The choice is between a library with direct support for leftovers, that can
> however be used incorrectly producing surprising results (and violating the
> category laws), and one that requires some explicit intervention on the part of
> the user (in the form of newtype unwrapping and explicit conversion) so that
> they can only assemble pipes in ways that actually make sense.
>
> I'm really skeptical of the fact that handling leftovers is so important in
> practice to justify having them in the core even at the cost of breaking
> fundamental properties. Can you show me some examples of real-world
> conduit-based code that makes heavy use of this feature?

Basically any code I've written using either conduit or enumerator
makes heavy usage of leftovers. I think warp and http-conduit are two
very strong examples of the usage. More to the point:
connect-and-resume requires some sort of leftover support. Perhaps it
can be implemented as a layer on top, I'm not certain. But in my
experience, having to manually wrap/unwrap all over the place would
make the library too difficult to use.

>> 3. This kind of data loss in inherent to streaming. Whenever you have a
>> stream modifier (e.g., Enumeratee, Conduit) which produces more output
>> values than it consumes, those extra values can be lost. In the example
>> below, the -3 generated by the call to concatMap is lost. However, this has
>> nothing to do with leftovers: I believe the equivalent code in pipes or
>> pipes-core would have the exact same data loss.
>
> So you think there is a problem here? I think this example behaves exactly as
> expected.  If you want to preserve that -3 value, you should write the pipe in
> a different way. With pipes-core, it would be:
>
> import qualified Control.Pipe.Combinators as C
>
>      C.fromList [1..10]
>  >+> C.pipeList (\i -> [i, negate i])
>  >+> ((,) <$> replicate 5 await <*> C.consume)

No, I don't think there's a problem, I believe this is the proper
behavior of any streaming library. As an aside, your sample behaves
very differently than mine, as yours will consume *all* of the
following values from the stream, not just the -3, and therefore alter
the next pipe in the monadic binding. I don't think it's worth trying
to find a way to recover lost data in these kinds of circumstances, as
I can't think of a case where the data would have any real meaning.
OK, here's an idea. I'm not actually advocating for it (yet), but it
does hold a certain appeal. I've tested it in conduit and pushed it as
a branch[1]: all code compiles, and all tests pass, which is very
encouraging.

Why not have an extra type parameter for leftovers? The modified type
looks like:

    data Pipe l i o u m r =
        HaveOutput (Pipe l i o u m r) (m ()) o
      | NeedInput (i -> Pipe l i o u m r) (u -> Pipe l i o u m r)
      | Done r
      | PipeM (m (Pipe l i o u m r))
      | Leftover (Pipe l i o u m r) l

Unless I've messed up, all of the monad laws will still follow from my
previous proofs, as the actual definitions of return and >>= haven't
changed.

Now we have a pipe composition function with type:

    pipe :: Monad m => Pipe Void a b r0 m r1 -> Pipe Void b c r1 m r2
-> Pipe Void a c r0 m r2

This would be equivalent to simply dropping the Leftover constructor
entirely. But it gives us something much more convenient than layering
leftovers on top. Connect-and-resume can still work, we just need to
unify the leftover and input types:

    pipeResume :: Monad m
               => ResumablePipe a a b r0 m r1
               -> Pipe b b c r1 m r2
               -> Pipe a a c r0 m (ResumablePipe a a b r0 m r1, r2)

Functions that don't call `leftover` at all will have a free parameter
for leftover, and therefore can be passed directly to `pipe`.
Functions which call `leftover` will have leftover equal to input. In
order to compose them, we would need:

    injectLeftovers :: Monad m => Pipe i i o u m r -> Pipe l i o u m r

And the really cool part: we can make this completely
backwards-compatible with existing conduit code. I defined Source,
Sink, and Conduit as:

    type Source m i = Pipe Void Void i () m ()
    type Sink i m r = Pipe i i Void () m r
    type Conduit i m o = Pipe i i o () m ()

And then just have the existing operators use injectLeftovers, e.g.:

    ($$) :: Monad m => Source m a -> Sink a m b -> m b
    src $$ sink = runPipe $ injectLeftovers src `pipe` injectLeftovers sink

Like I said, I'm not convinced yet that this is a good idea, but it's
certainly interesting. Any thoughts?

Michael

[1] https://github.com/snoyberg/conduit/tree/leftover-type-param
Re: Leftovers and data loss Paolo Capriotti 6/5/12 6:17 AM
Ah! That looks like a great idea! I'm going to think about it a little bit
more, but it's certainly promising. Your pipe with leftovers is basically the
same thing as my `PutbackPipe`, but it allows an implicit conversion `Pipe` ->
`PutbackPipe` (given by unification of the first two type parameters). Very
nice.

I'm a little bit worried by the fact that `Source` is automatically
"leftover-ready" in the types, so there's nothing enforcing a call to
`injectLeftovers` there. I take this as another sign that the leftmost input
type shouldn't be `Void` (but I guess that's a debate for another time :)).

BR,
Paolo
Re: Leftovers and data loss Michael Snoyman 6/5/12 6:26 AM
On Tue, Jun 5, 2012 at 4:17 PM, Paolo Capriotti <p.cap...@gmail.com> wrote:
>
> Ah! That looks like a great idea! I'm going to think about it a little bit
> more, but it's certainly promising. Your pipe with leftovers is basically the
> same thing as my `PutbackPipe`, but it allows an implicit conversion `Pipe` ->
> `PutbackPipe` (given by unification of the first two type parameters). Very
> nice.
>
> I'm a little bit worried by the fact that `Source` is automatically
> "leftover-ready" in the types, so there's nothing enforcing a call to
> `injectLeftovers` there. I take this as another sign that the leftmost input
> type shouldn't be `Void` (but I guess that's a debate for another time :)).

Actually, the main reason for the setup of `Source` in this way is for
backwards compatibility. Like I said, I'm not sold on *anything* here,
but it's encouraging that we could simultaneously have a Category and
minimal cost of upgrade for existing conduit code. I think going
forward it would be more recommended to explicitly call
`injectLeftovers`, but I'm not certain.

Regarding the leftmost input question of Void vs (): I've never been
really sold one way or another, but I think I'm leaning back towards
() right now. It's certainly an interesting debate to have another
time.

Michael
Re: Leftovers and data loss Michael Snoyman 6/5/12 10:06 PM
The new docs in the Haddocks might be a bit better actually:

http://www.snoyman.com/haddocks/conduit-0.5.0a/Data-Conduit.html

Please let me know if you have any questions/suggestions.

Michael