[Haskell-cafe] Conduit+GHC high memory use for simple Sink

34 views
Skip to first unread message

Bryan Vicknair

unread,
Aug 27, 2014, 2:20:06 PM8/27/14
to haskel...@haskell.org
Hello Cafe,

First I'd like to thank Michael Snoyman and Gabriel Gonzalez for the work
they've done on the conduit and pipes stream processing libraries, and all the
accompanying tutorial content. I've been having fun converting a text
processing app from C to Haskell.

I'm seeing unexpectedly high memory usage in a stream-processing program that
uses the conduit library.

I've created a example that demonstrates the problem. The program accepts gzip
files as arguments, and for each one, classifies each line as either Even or
Odd depending on the length, then outputs some result depending on the Sink
used. For each gzip file:

action :: GzipFilePath -> IO ()
action (GzipFilePath filePath) = do
result <- runResourceT $ CB.sourceFile filePath
$$ Zlib.ungzip
=$ CB.lines
=$ token
=$ sink2
putStrLn $ show result

The problem is the following Sink, which counts how many even/odd Tokens are
seen:

type SinkState = (Integer, Integer)

sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
sink2 state@(!evenCount, !oddCount) = do
maybeToken <- await
case maybeToken of
Nothing -> return state
(Just Even) -> sink2 (evenCount + 1, oddCount )
(Just Odd ) -> sink2 (evenCount , oddCount + 1)

When I give this program a few gzip files, it uses hundreds of megabytes of
resident memory. When I give the same files as input, but use the following
simple Sink, it only uses about 8Mb of resident memory:

sink1 :: MonadIO m => Sink Token m ()
sink1 = awaitForever (liftIO . putStrLn . show)

At first I thought that sink2 performed so poorly because the addition thunks
were being placed onto the heap until the end, so I added some bang patterns to
make it strict. That didn't help however.

I've done profiling, but I'm unable to figure out exactly what is being added
to the heap in sink2 but not sink1, or what is being garbage collected in
sink1, but not sink2.


The full source is here:
https://bitbucket.org/bryanvick/conduit-mem/src/HEAD/hsrc/bin/mem.hs

Or you can clone the repo, which contains a cabal file for easy building:

git clone g...@bitbucket.org:bryanvick/conduit-mem.git
cd comduit-mem
cabal sandbox init
cabal install --only-dependencies
cabal build mem
./dist/build/mem/mem [GIVE SOME GZIP FILES HERE]

You can change which sink is used in the 'action' function to see the different
memory usage.
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Michael Snoyman

unread,
Aug 27, 2014, 4:48:43 PM8/27/14
to Bryan Vicknair, Haskell Cafe
Wow, talk about timing! What you've run into here is expensive monadic bindings. As it turns out, this is exactly what my blog post from last week[1] covered. You have three options to fix this:

1. Just upgrade to conduit 1.2.0, which I released a few hours ago, and uses the codensity transform to avoid the problem. (I just tested your code; you get constant memory usage under conduit 1.2.0, seemingly without any code change necessary.)
2. Instead of writing your `sink2` as you have, express it in terms of Data.Conduit.List.fold, which associates the right way. This looks like:

    fold add (0, 0)
      where
        add (!x, !y) Even = (x + 1, y)
        add (!x, !y) Odd = (x, y + 1)
3. Allow conduit 1.1's rewrite rules to emulate the same behavior and bypass the expensive monadic bind. This can be done by replacing your current `await` with "await followed by bind", e.g.:


sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
sink2 state@(!evenCount, !oddCount) = do
  await >>= maybe (return state) go
 where
  go Even = sink2 (evenCount + 1, oddCount    )
  go Odd  = sink2 (evenCount    , oddCount + 1)

I'd definitely recommend (1). I'd *also* recommend using (2), as the built in functions will often times be more efficient than something hand-rolled, especially now that stream fusion is being added[2]. With conduit 1.2, step (3) *will* be a bit more efficient still (it avoids create an extra Maybe value), but not in a significant way.

Michael

Roman Cheplyaka

unread,
Aug 27, 2014, 5:55:36 PM8/27/14
to Michael Snoyman, Haskell Cafe
* Michael Snoyman <mic...@snoyman.com> [2014-08-27 23:48:06+0300]
> > The problem is the following Sink, which counts how many even/odd Tokens
> > are
> > seen:
> >
> > type SinkState = (Integer, Integer)
> >
> > sink2 :: (Monad m) => SinkState -> Sink Token m SinkState
> > sink2 state@(!evenCount, !oddCount) = do
> > maybeToken <- await
> > case maybeToken of
> > Nothing -> return state
> > (Just Even) -> sink2 (evenCount + 1, oddCount )
> > (Just Odd ) -> sink2 (evenCount , oddCount + 1)
>
> Wow, talk about timing! What you've run into here is expensive monadic
> bindings. As it turns out, this is exactly what my blog post from last
> week[1] covered. You have three options to fix this:
>
> 1. Just upgrade to conduit 1.2.0, which I released a few hours ago, and
> uses the codensity transform to avoid the problem. (I just tested your
> code; you get constant memory usage under conduit 1.2.0, seemingly without
> any code change necessary.)

Interesting. From looking at sink2, it seems that it produces a good,
right-associated bind tree. Am I missing something?

And what occupies the memory in this case?

Roman
signature.asc

Dan Burton

unread,
Aug 27, 2014, 6:39:37 PM8/27/14
to Roman Cheplyaka, Haskell Cafe
Michael, I don't see how your code sample for (3) is any different to the compiler than Roman's original sink2.

I also don't see how the original sink2 creates a bad bind tree. I presume that the reason "fold" works is due to the streaming optimization rule, and not due to its implementation, which looks almost identical to (3).

I worry about using fold in this case, which is only strict up to WHNF, and therefore wouldn't necessarily force the integers in the tuples; instead it would create tons of integer thunks, wouldn't it? Roman's hand-coded sink2 avoids this issue so I presume that's not what is causing his memory woes.

-- Dan Burton


Michael Snoyman

unread,
Aug 28, 2014, 12:01:24 AM8/28/14
to Dan Burton, Haskell Cafe
Firstly, I think you mean Bryan's code, right? Secondly, I wrote this email five minutes before going to bed, so please excuse any inaccuracies or incoherence in it.

(3) is different to the original because it is form as `await >>= maybe`. There's a rewrite rule for this (providing the conduit 1.1 version):

{-# RULES "await >>= maybe" forall x y. await >>= maybe x y = ConduitM (NeedInput (unConduitM . y) (unConduitM . const x)) #-}

This gives two advantages: it avoids a costly (at least in conduit 1.1) monadic bind, and avoids creating a Maybe value that we're going to immediately throw away. This is the general idea of church encoding data structures, which is why it should give *some* speedup, but nothing significant.

Regarding fold: it *is* identical to (3). I don't think the rewrite rules are playing a big part here. You're right about needing to be careful about fold due to WHNF. However, Bryan's code already used bang patterns to force evaluation of both values, so that isn't a problem. In a real-world application, I'd *also* recommend creating a strict pair datatype to avoid accidentally leaking thunks.

But looking at the code again with fresher eyes than last night: I really don't understand why it had such abysmal performance. I'll look into this a bit more, looks like it should be interesting.

Bryan Vicknair

unread,
Aug 28, 2014, 1:53:39 AM8/28/14
to Michael Snoyman, haskel...@haskell.org
Thanks for the interesting blog posts Michael. I updated the example project
[1] to use conduit 1.2. Unfortunately, on my machine [2], my original sink2
still uses about 500Mb of memory when processing 4 gzip files of about 5Mb
each, while sink1 only uses about 8Mb. I added sink3, which does the same as
sink2 but uses fold from Conduit.List as you recommended, and that seems to
work, using about 8Mb.

Looking at the code for sink2 vs sink3, I don't understand what would be
occupying so much memory in sink2 even in the case of expensive monadic
binding, or exclusion from stream fusion. I'm curious if sink2 adds thunks to
the heap that sink3 doesn't, or if the GC is failing to clean up heap objects
in sink2 that is cleans up in sink3. I'm new at memory profiling, but the
chart I get with '+RTS -h' or '+RTS -hr' basically just tells me that the
action function is expensive.

In the real project that inspired this example I'm going to do some cleanup,
replacing manual recursion with higher-level functions from Conduit.List, as
that seems like an all around good idea.


Bryan Vicknair

[1] https://bitbucket.org/bryanvick/conduit-mem
[2] GHC 7.8.3, Arch Linux 3.16.1 kernel x86-64


On Thu, Aug 28, 2014 at 07:00:41AM +0300, Michael Snoyman wrote:
<snip>

Michael Snoyman

unread,
Aug 28, 2014, 1:57:07 AM8/28/14
to Bryan Vicknair, Haskell Cafe
I actually just got to an interesting result: sink2 is a red herring. Consider the following program:

import Control.Monad.IO.Class ( liftIO )
import Data.Conduit.Internal (ConduitM (..), Pipe (..), (>+>), runPipe, awaitForever)

main :: IO ()
main = runPipe $
    (HaveOutput (Done ()) (return ()) ()) >+>
    awaitForever (\_ -> liftIO $ lengthM 0 [1..10000000 :: Int] >>= print)

lengthM :: Monad m => Int -> [a] -> m Int
lengthM cnt [] = return cnt
lengthM cnt (_:xs) =
    cnt' `seq` lengthM cnt' xs
  where
    cnt' = cnt + 1


On my machine, it takes 375MB of memory. What appears to be the cause is that GHC is keeping the entire representation of `lengthM` in memory, which is clearly a pessimization. I still need to research this further, but I thought you'd want to see these results now. (Plus, maybe someone else has some other ideas.)

In case anyone wants, the core for this code is available at:

http://lpaste.net/110125

Michael

Michael Snoyman

unread,
Aug 28, 2014, 2:32:13 AM8/28/14
to Bryan Vicknair, Haskell Cafe
Well, I got it down to a case that depends on only base, and uses its own local implementation of a minimal conduit:

http://lpaste.net/110126

But I'm not certain if this is still reproducing the original issue, or if the list getting lifted out here is a different issue.

Michael Snoyman

unread,
Aug 28, 2014, 2:38:18 AM8/28/14
to Bryan Vicknair, Haskell Cafe
Can you provide the output from +RTS -s, as well as the heap profile for -hy?

I believe I'm now also reproducing the memory leak on conduit 1.2, so it must have been a mistake in my testing last night when I thought 1.2 fixed it.


On Thu, Aug 28, 2014 at 8:58 AM, Bryan Vicknair <brya...@gmail.com> wrote:

Simon Peyton Jones

unread,
Aug 28, 2014, 4:38:06 AM8/28/14
to Michael Snoyman, Bryan Vicknair, Haskell Cafe

GHC is keeping the entire representation of `lengthM` in memory

 

Do you mean that?  lengthM is a function; its representation is just code.

 

Perhaps you mean that GHC is keeping the entire list [1..1000000] in memory?  Now that certainly makes sense… after all, doing so saves allocating (I# 4), (I# 5) etc for each call of the function passed to awaitForever.  Granted, it’s probably a bad idea in this case.

 

If that is your issue (still to be confirmed) the relevant ticket is https://ghc.haskell.org/trac/ghc/ticket/7206; could you add your example to that ticket, as further evidence that something should be done?

 

See also comment:9 in the ticket, which I have just added.

 

Simon

Michael Snoyman

unread,
Aug 28, 2014, 4:50:23 AM8/28/14
to Simon Peyton Jones, Haskell Cafe
On Thu, Aug 28, 2014 at 11:37 AM, Simon Peyton Jones <sim...@microsoft.com> wrote:

GHC is keeping the entire representation of `lengthM` in memory

 

Do you mean that?  lengthM is a function; its representation is just code.

 


At the time I wrote it, I did. What I was seeing in the earlier profiling was that a large number of conduit constructors were being kept in memory, and I initially thought something similar was happening with lengthM. It *does* in fact seem like the memory problems with this later example are simply the list being kept in memory. And in fact, there's a far simpler version of this that demonstrates the problem:

main :: IO ()
main = printLen >> printLen

printLen :: IO ()
printLen = lengthM 0 [1..40000000 :: Int] >>= print


lengthM :: Monad m => Int -> [a] -> m Int
lengthM cnt [] = return cnt
lengthM cnt (_:xs) =
    cnt' `seq` lengthM cnt' xs
  where
    cnt' = cnt + 1

I'll add that as a comment to #7206.

This still doesn't answer what's going on in the original code. I'm concerned that the issue may be the same, but I'm not seeing anything in the core yet that's jumping out at me as being the problem. I'll try to look at the code again with fresher eyes later today.

Michael

Michael Snoyman

unread,
Aug 28, 2014, 9:44:24 AM8/28/14
to Simon Peyton Jones, Haskell Cafe



On Thu, Aug 28, 2014 at 11:49 AM, Michael Snoyman <mic...@snoyman.com> wrote:



On Thu, Aug 28, 2014 at 11:37 AM, Simon Peyton Jones <sim...@microsoft.com> wrote:

GHC is keeping the entire representation of `lengthM` in memory

 

Do you mean that?  lengthM is a function; its representation is just code.

 


At the time I wrote it, I did. What I was seeing in the earlier profiling was that a large number of conduit constructors were being kept in memory, and I initially thought something similar was happening with lengthM. It *does* in fact seem like the memory problems with this later example are simply the list being kept in memory. And in fact, there's a far simpler version of this that demonstrates the problem:

main :: IO ()
main = printLen >> printLen

printLen :: IO ()
printLen = lengthM 0 [1..40000000 :: Int] >>= print


lengthM :: Monad m => Int -> [a] -> m Int
lengthM cnt [] = return cnt
lengthM cnt (_:xs) =
    cnt' `seq` lengthM cnt' xs
  where
    cnt' = cnt + 1

I'll add that as a comment to #7206.

This still doesn't answer what's going on in the original code. I'm concerned that the issue may be the same, but I'm not seeing anything in the core yet that's jumping out at me as being the problem. I'll try to look at the code again with fresher eyes later today.


Alright, I've opened up a GHC issue about this:

https://ghc.haskell.org/trac/ghc/ticket/9520

I'm going to continue trying to knock this down to a simpler test case, but it seems that it's sufficient to call `action` twice to make the memory usage high.

Michael

Bryan Vicknair

unread,
Aug 28, 2014, 2:28:29 PM8/28/14
to Michael Snoyman, haskel...@haskell.org
Michael,

When I was first digging into this bug, I also ran into the case where doing an
action twice would trigger a large increase in memory usage. Also strange was
that doing 'sequence_ [action]' was causing the problem for me, but 'do action'
was not.

Strangely enough though, on my machine there is no memory difference between
running 'action' once or twice in your 1st example [1] in comment 4 of bug
#9520. Whether action is done once or twice, the maximum resident memory as
reported by /usr/bin/time -v is about 1.1Mb. I get roughly the same memory
usage from the second code example in that same comment.

Attached are the .prof and .hp files from running the 'mem' binary using sink2
on my machine.

Here is the output from the +RTS -s switch:

2,191,403,328 bytes allocated in the heap
4,269,946,560 bytes copied during GC
528,829,096 bytes maximum residency (21 sample(s))
21,830,752 bytes maximum slop
1070 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 3826 colls, 0 par 0.37s 0.42s 0.0001s 0.0032s
Gen 1 21 colls, 0 par 4.02s 14.98s 0.7131s 7.6060s

INIT time 0.00s ( 0.00s elapsed)
MUT time 0.90s ( 6.22s elapsed)
GC time 2.74s ( 11.04s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 1.65s ( 4.35s elapsed)
EXIT time 0.04s ( 0.05s elapsed)
Total time 5.33s ( 17.31s elapsed)

%GC time 51.4% (63.8% elapsed)

Alloc rate 2,432,664,887 bytes per MUT second

Productivity 17.6% of total user, 5.4% of total elapsed


Bryan Vicknair

[1] https://ghc.haskell.org/trac/ghc/ticket/9520#comment:4
mem.hp
mem.prof

Michael Snoyman

unread,
Aug 28, 2014, 4:10:35 PM8/28/14
to Bryan Vicknair, Haskell Cafe
I just added a comment onto that issue. I forgot to mention that that memory problem only occurs with optimizations turned on (-O or -O2). Can you test it out with one of those flags and let me know what happens?

Your heap profile looks pretty similar to what I've been seeing as well, thanks for providing it.

Bryan Vicknair

unread,
Aug 28, 2014, 7:06:47 PM8/28/14
to Michael Snoyman, Haskell Cafe
On Thu, Aug 28, 2014 at 11:09:55PM +0300, Michael Snoyman wrote:
> I just added a comment onto that issue. I forgot to mention that that
> memory problem only occurs with optimizations turned on (-O or -O2). Can
> you test it out with one of those flags and let me know what happens?

Wow, quite a large difference appears when using -O. 4Mb when the action is
run only once vs 789Mb when it is run twice. What's interesting is that the
bytes allocated in the heap seems to grow by a reasonable amount when action is
run twice, but the total resident memory explodes.

The results when action is run only once:


1,440,041,040 bytes allocated in the heap
465,368 bytes copied during GC
35,992 bytes maximum residency (2 sample(s))
21,352 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 2756 colls, 0 par 0.01s 0.01s 0.0000s 0.0006s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s

INIT time 0.00s ( 0.00s elapsed)
MUT time 0.19s ( 0.19s elapsed)
GC time 0.01s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.20s ( 0.20s elapsed)

%GC time 6.0% (5.3% elapsed)

Alloc rate 7,563,673,522 bytes per MUT second

Productivity 93.6% of total user, 93.5% of total elapsed

Command being timed: "./foo +RTS -s"
User time (seconds): 0.16
System time (seconds): 0.03
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.20
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4024
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 338
Voluntary context switches: 1
Involuntary context switches: 21
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0


The results when action is run twice:


2,080,041,040 bytes allocated in the heap
1,346,503,136 bytes copied during GC
389,736,000 bytes maximum residency (11 sample(s))
8,871,312 bytes maximum slop
768 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 3968 colls, 0 par 0.28s 0.28s 0.0001s 0.0009s
Gen 1 11 colls, 0 par 0.57s 0.57s 0.0519s 0.2668s

INIT time 0.00s ( 0.00s elapsed)
MUT time 0.32s ( 0.32s elapsed)
GC time 0.85s ( 0.85s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.20s ( 1.21s elapsed)

%GC time 71.0% (70.7% elapsed)

Alloc rate 6,553,280,639 bytes per MUT second

Productivity 29.0% of total user, 28.8% of total elapsed

Command being timed: "./foo +RTS -s"
User time (seconds): 0.91
System time (seconds): 0.28
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.20
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 789432
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 196760
Voluntary context switches: 1
Involuntary context switches: 47
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Reply all
Reply to author
Forward
0 new messages