import Data.ByteString (ByteString)
import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Iteratee.ZLib
import System.Environment (getArgs)

main = do
  args <- getArgs
  let fname = args !! 0
  let blockSize = read $ args !! 1
  fileDriver (leak blockSize) fname >>= print

-- Decompress the gzip stream and hand it to chunkedRead.
leak :: Int -> Iteratee ByteString IO ()
leak blockSize = joinIM $ enumInflate GZip defaultDecompressParams chunkedRead
  where
    -- Consume up to blockSize bytes and return their count as a String.
    consChunk :: Iteratee ByteString IO String
    consChunk = (joinI $ I.take blockSize I.length) >>= return . show
    -- Turn the byte stream into a stream of counts and print it.
    chunkedRead :: Iteratee ByteString IO ()
    chunkedRead = joinI $ convStream consChunk printLines
The first argument is the file name (/var/log/messages.1.gz will do).
The second is the size of the blocks in which the input is consumed. With small blocks (10 bytes) it leaks very fast; with larger blocks (~10000) it runs almost without leaking.
So, is this a bug in my code, or should iteratee-compress behave differently?
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Thanks for looking into this.
> From: Maciej Piechotka <uzytk...@gmail.com>
> After looking into the problem (or rather at your code), I believe it has
> nothing to do with iteratee-compress. I get similar behaviour
> and results when I replace "joinIM $ enumInflate GZip
> defaultDecompressParams chunkedRead" with chunkedRead. (The memory use is
> smaller, but that is due to decompression, not the iteratee's fault.)
>
This is due to "printLines". Whether it's a bug depends on what the correct
behavior of "printLines" should be.
"printLines" currently only prints lines that are terminated by an EOL
(either "\n" or "\r\n"). This means that it needs to hold on to the entire
stream received until it finds EOL, and then prints the stream, or drops it
if it reaches EOF first. In your case, the stream generated by "convStream
consChunk printLines" is just a stream of numbers without any EOL, where the
length is dependent on the specified block size. This causes the space
leak.
If I change the behavior of "printLines" to print lines that aren't
terminated by EOL, the leak could be fixed. Whether that behavior is more
useful than the present, I don't know. Alternatively, if you insert some
newlines into your stream this could be improved as well.
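To illustrate the retention described above, here is a minimal sketch in plain Haskell. It does not use the iteratee API; LineState, feed and pendingSize are hypothetical names made up for this toy model of a printer that only releases text once it sees "\n". The chunks stand in for the output of consChunk with a block size of 10.

```haskell
import Data.List (foldl')

-- | Toy model of a line printer (hypothetical, not the iteratee API):
-- lines completed so far (newest first) and text buffered since the
-- last newline.
data LineState = LineState { completed :: [String], pending :: String }

-- | Feed one chunk. Text is released only when a '\n' is seen;
-- everything else stays in the pending buffer.
feed :: LineState -> String -> LineState
feed (LineState done buf) chunk =
  case break (== '\n') chunk of
    (pre, [])       -> LineState done (buf ++ pre)
    (pre, _ : rest) -> feed (LineState ((buf ++ pre) : done) "") rest

-- | Number of characters still buffered after feeding all chunks.
pendingSize :: [String] -> Int
pendingSize = length . pending . foldl' feed (LineState [] "")

main :: IO ()
main = do
  -- 1000 chunks of "10", as consChunk would produce for blockSize 10.
  let counts = map show (replicate 1000 (10 :: Int))
  print (pendingSize counts)                  -- 2000: nothing is released
  print (pendingSize (map (++ "\n") counts))  -- 0: each chunk ends a line
```

In this model the buffer grows linearly with the input when no newline ever arrives, and stays empty once every element carries its own "\n", which is the effect of inserting newlines into the stream.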
As a result of investigating this, I realized that
Data.Iteratee.ListLike.break can be very inefficient in cases where the
predicate is not satisfied relatively early. I should actually provide an
enumeratee interface for it. So thanks very much for (indirectly)
suggesting that.
Cheers,
John L
Actually, I can give you the full source code; it also uses attoparsec-iteratee. It leaks with iteratee-compress and works fine without it.
The whole idea: read a ByteString from access.log, convert it to a stream of data objects holding usernames and bytes downloaded, and then feed this stream into an iteratee that collects all the data into one big Map ByteString Integer.
I'm not familiar with iteratee-compress, but you could be getting hit by
Map's laziness. Instead of a map, could you use something like hashmap[1],
bytestring-trie[2], or Johan's new containers library[3]?
Also, I've recently posted a minor update to iteratee which includes an
enumeratee version of break and an alternative to printLines that doesn't
retain data, which you may find useful.
Cheers,
John
[1] http://hackage.haskell.org/package/hashmap
[2] http://hackage.haskell.org/package/bytestring-trie
[3] http://hackage.haskell.org/package/unordered-containers
I've definitely hit lazy Map update problems more than once in my Haskell career. Something I was expecting to run in constant space was spiraling out of control. The profiling tools that ship with GHC were outstanding in helping pinpoint the issue, as was Real World Haskell's chapter on the topic.
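For reference, a minimal sketch of a strict per-user tally, assuming the containers package's Data.Map.Strict is available. The Hit type and the tally function are hypothetical names for illustration, and String keys are used instead of ByteString to keep the example dependency-free. This is one way to avoid building up thunks, not necessarily what the original program needs.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- | Hypothetical record: (user name, bytes downloaded).
type Hit = (String, Integer)

-- | Sum the bytes per user. foldl' forces the map at every step, and
-- Data.Map.Strict.insertWith forces each combined value, so no chain of
-- unevaluated (+) thunks accumulates behind a key.
tally :: [Hit] -> M.Map String Integer
tally = foldl' step M.empty
  where
    step m (user, bytes) = M.insertWith (+) user bytes m

main :: IO ()
main = print (M.toList (tally [("alice", 10), ("bob", 5), ("alice", 7)]))
-- prints [("alice",17),("bob",5)]
```

With the lazy Data.Map API the same fold would store a growing expression such as ((10 + 7) + ...) per key until the total is finally demanded, which is the kind of growth heap profiling tends to reveal.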
If you believe that there is a leak, please do so. However, I can't
imagine where it could occur.
Regards