I want to explore the connection between leftovers and data loss a bit. To make this concrete, I've put some code below that uses the current development branch of conduit. I'm aware of three kinds of data loss that can occur, demonstrated by the three examples.
1. Returning leftovers without consuming input will cause data loss of the leftovers themselves in conduit 0.4. I believe this is inherent when you model leftovers via a `Maybe`. The switch to the `Leftover` constructor in 0.5 gets rid of this problem. In retrospect, I believe using a list of leftovers instead of Maybe would also solve this. (Chris: this might work for your layered Pipe approach as well, I'm not sure.)
2. The second kind of data loss occurs specifically because of a lack of leftovers. The best way to demonstrate this is via chunked data, such as a ByteString or Text. In the example below, we try to consume 4 bytes, but the first chunk only has 3 bytes. Therefore, the first call to `take` will consume the first two chunks, and then return the unused part of the second chunk as leftovers. Without leftover support, the remainder of the second chunk would be entirely lost. Given that a very large amount of streaming code involves these kinds of chunked datatypes, this is a very serious concern.
3. This kind of data loss is inherent to streaming. Whenever you have a stream modifier (e.g., Enumeratee, Conduit) which produces more output values than it consumes, those extra values can be lost. In the example below, the -3 generated by the call to concatMap is lost. However, this has nothing to do with leftovers: I believe the equivalent code in pipes or pipes-core would have the exact same data loss.
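Point 2 can also be sketched without depending on any particular conduit version. The following is only a toy model, assuming a stream is a plain list of `String` chunks; `takeWithLeftover`, `takeWithoutLeftover` and `threeTakes` are hypothetical names made up for illustration and are not part of conduit.

```haskell
-- Toy model only: a "stream" is a list of chunks, not a conduit Source.

-- | Take up to n characters. The unused tail of a partly consumed chunk
-- is pushed back onto the front of the stream as a leftover.
takeWithLeftover :: Int -> [String] -> (String, [String])
takeWithLeftover n chunks
  | n <= 0 = ("", chunks)
takeWithLeftover _ [] = ("", [])
takeWithLeftover n (c:cs)
  | length c <= n =
      let (rest, remaining) = takeWithLeftover (n - length c) cs
      in (c ++ rest, remaining)
  | otherwise =
      let (used, unused) = splitAt n c
      in (used, unused : cs)      -- leftover goes back on the stream

-- | Same, but with no way to return leftovers: the unused tail of a
-- partly consumed chunk is simply dropped.
takeWithoutLeftover :: Int -> [String] -> (String, [String])
takeWithoutLeftover n chunks
  | n <= 0 = ("", chunks)
takeWithoutLeftover _ [] = ("", [])
takeWithoutLeftover n (c:cs)
  | length c <= n =
      let (rest, remaining) = takeWithoutLeftover (n - length c) cs
      in (c ++ rest, remaining)
  | otherwise = (take n c, cs)    -- the rest of c is lost here

-- | Run a take function three times with n = 4, threading the stream.
threeTakes :: (Int -> [String] -> (String, [String])) -> [String] -> [String]
threeTakes tk s0 =
  let (x, s1) = tk 4 s0
      (y, s2) = tk 4 s1
      (z, _)  = tk 4 s2
  in [x, y, z]

main :: IO ()
main = do
  print (threeTakes takeWithLeftover    ["foo", "bar", "baz"])
  -- prints ["foob","arba","z"]
  print (threeTakes takeWithoutLeftover ["foo", "bar", "baz"])
  -- prints ["foob","baz",""]
```

In the second run the "ar" from the second chunk never reaches any consumer, which is the loss described in point 2.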
So my contention is: a library lacking leftovers support is missing support for the most common streaming use cases, and adding leftovers to the library does not add any additional sources of data loss.
Also, if we go with the presumption that leftover is needed in most streaming code, having two separate data types (one with leftover support, one without) will lead to a lot of difficulty in getting components to integrate, and therefore either ugly user code or lots of library duplication.
Michael
{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Binary as CB
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy as L

main = do
    let print' f = f >>= print

    -- The equivalent would cause data loss in conduit 0.4. However, this
    -- usage is still not recommended, if only because the meaning is not yet
    -- well understood.
    putStrLn "Example 1"
    print' $ CL.sourceList [1..10] $$ do
        CL.drop 3
        leftover 3
        leftover 2
        leftover 1
        CL.consume
    -- output: [1,2,3,4,5,6,7,8,9,10]

    -- Without leftover handling, the following example cannot be implemented.
    -- We would have data loss from consuming the beginning of the second
    -- chunk.
    putStrLn "Example 2"
    print' $ CL.sourceList ["foo", "bar", "baz"] $$ do
        x <- CB.take 4
        y <- CB.take 4
        z <- CB.take 4
        let toStrict = S.concat . L.toChunks
        return $ map toStrict [x, y, z]
    -- output: ["foob","arba","z"]

    -- Discarding downstream leftovers. The result would be the same if there
    -- was no leftover support built into the library. In other words,
    -- leftovers do not cause the problem here.
    putStrLn "Example 3"
    print' $ CL.sourceList [1..10] $$ do
        x <- CL.concatMap (\i -> [i, negate i]) =$ CL.take 5
        y <- CL.consume
        return (x, y)
    -- output: ([1,-1,2,-2,3],[4,5,6,7,8,9,10])
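
The loss in Example 3 can be reproduced with no streaming library at all, which is one way to see that leftovers are not involved. This is a rough sketch under the assumption that the modifier pulls one input at a time and that whatever it has produced beyond what downstream asked for is thrown away; `takeThrough` is a hypothetical helper, not a conduit function.

```haskell
-- Toy model only: inputs are a plain list, not a conduit Source.

-- | Feed inputs one at a time through f until n outputs are collected.
-- Returns (outputs collected, outputs produced but discarded,
-- inputs that were never touched).
takeThrough :: Int -> (a -> [b]) -> [a] -> ([b], [b], [a])
takeThrough n f = go n
  where
    go k xs | k <= 0 = ([], [], xs)
    go _ []          = ([], [], [])
    go k (x:xs) =
      let outs          = f x
          (kept, extra) = splitAt k outs
      in if length kept < k
           -- This input did not yield enough; pull the next one.
           then let (more, dropped, rest) = go (k - length kept) xs
                in (kept ++ more, dropped, rest)
           -- Downstream is satisfied; 'extra' has nowhere to go.
           else (kept, extra, xs)

main :: IO ()
main = print (takeThrough 5 (\i -> [i, negate i]) [1 .. 10 :: Int])
-- prints ([1,-1,2,-2,3],[-3],[4,5,6,7,8,9,10])
```

The middle component is the -3 that goes missing in Example 3: it has already been produced from the input 3, so there is no upstream value left to hand back.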
Hi Michael,
This leaves me a bit stumped. I have to admit to not having followed conduit development very closely recently, but I would like to get up to speed again, as it seems like these are some interesting issues.
Could you please forward some relevant discussions to this list, and/or write a short summary of what changed and why and what is planned? (I have looked at the conduit 0.4 code and noticed the new Pipe type, but it seems like some context might be helpful.)
Aristid
--
You received this message because you are subscribed to the Google Groups "streaming-haskell" group.