[Haskell-cafe] How to write fast for loops

Niklas Hambüchen

unread,

Apr 26, 2014, 5:40:50 PM4/26/14

to haskell-cafe

As you probably now, `forM_ [1..n]` is incredibly slow in Haskell due to
lack of list fusion [1]; a manually written loop is 10x faster.

How do you people work around this?

I keep defining myself something like

loop :: (Monad m) => Int -> (Int -> m ()) -> m ()
loop bex f = go 0
where
go !n | n == bex = return ()
| otherwise = f n >> go (n+1)

Is there a function for this somewhere already?
Or do you have another way to deal with this problem?

Thanks!

[1]: https://ghc.haskell.org/trac/ghc/ticket/8763
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

John Lato

unread,

Apr 26, 2014, 8:13:52 PM4/26/14

to Niklas Hambüchen, haskell-cafe

I usually use a manually written loop, but you can use Data.Vector for this and it should fuse.

John L.

Niklas Hambüchen

unread,

Apr 26, 2014, 9:16:06 PM4/26/14

to John Lato, haskell-cafe

Hey John,

do you mean something along the lines of `V.forM_ (fromList [1..n])`?
I guess GHC wouldn't mash away the list in this case and I'd have to
use the new OverloadedLists with `V.forM_ [1..n]` (indeed that looks
quite sweet).
However, would that actually be fast? Judging from how list syntax is
desugared, I guess that would use `V.enumFromTo` then, whose docs claim
"WARNING: This operation can be very inefficient. If at all possible,
use enumFromN instead".

Does anybody know wonder if "vector literals" desugar enumFromN, or
what "can be very inefficient" means?

Niklas

On Sun 27 Apr 2014 01:13:52 BST, John Lato wrote:
> I usually use a manually written loop, but you can use Data.Vector for
> this and it should fuse.

Daniel Fischer

unread,

Apr 26, 2014, 9:32:12 PM4/26/14

to haskel...@haskell.org

On Saturday 26 April 2014, 22:40:50, Niklas Hambüchen wrote:
> As you probably now, `forM_ [1..n]` is incredibly slow in Haskell due to
> lack of list fusion [1];

It's not the lack of list fusion per se. If you have a

forM_ [a .. b] $ \n -> do stuff

where the type is Int, GHC is perfectly capable of eliminating the list and
rewriting it to a loop (usually on par with a hand-written loop, although the
core you get from a hand-written loop often is smaller and nicer to look at).

The problem is that you use the same list multiple times, and GHC thinks "hey,
let me re-use that", so it gets floated to the top-level, and the inner "loop"
really uses the list, which apart from the `case`s on the list forces it to
use boxed 'Int's instead of Int#. Then the boxed Ints need to be unboxed to be
used, oops.

If you manage to conceal the fact that the lists are the same from GHC, it
eliminates the lists and you get fast code also with forM_.

In your matmult example, using

forM_ [k-k .. _SIZE-1] $ \j -> do ...

does the trick for GHC-7.0 to 7.8.

That is of course a brittle workaround, future GHCs might rewrite the k-k to 0
and share the list again.

> a manually written loop is 10x faster.
>
> How do you people work around this?

Do the idiomatic thing and write a loop ;)

>
> [1]: https://ghc.haskell.org/trac/ghc/ticket/8763

Cheers,
Daniel

Niklas Hambüchen

unread,

Apr 26, 2014, 9:46:07 PM4/26/14

to Daniel Fischer, haskel...@haskell.org

Thanks for the insight! There's one thing I don't understand yet:

On 27/04/14 02:32, Daniel Fischer wrote:
> The problem is that you use the same list multiple times, and GHC thinks "hey,
> let me re-use that", so it gets floated to the top-level, and the inner "loop"
> really uses the list, which apart from the `case`s on the list forces it to
> use boxed 'Int's instead of Int#. Then the boxed Ints need to be unboxed to be
> used, oops.

When you say that the problem is that I "use the same list multiple
times", do you mean that I use the actual syntactic expression
`[0.._SIZE-1]` multiple times on multiple lines, and suggest that this
should not happen if I use it in only one place?

If so, how far does that go, and would GHC even notice that it is the
same as a potential `[0.._511]` I might write elsewhere?

> Do the idiomatic thing and write a loop ;)

Unfortunately, `forM_ [1..n]` is pretty much the most idiomatic and
beautiful way I can come up with when ask for an iteration over 1 up to
n! So this better be fixed :)

Daniel Fischer

unread,

Apr 26, 2014, 10:12:55 PM4/26/14

to Niklas Hambüchen, haskel...@haskell.org

On Sunday 27 April 2014, 02:46:07, Niklas Hambüchen wrote:
> Thanks for the insight! There's one thing I don't understand yet:
>
> On 27/04/14 02:32, Daniel Fischer wrote:
> > The problem is that you use the same list multiple times, and GHC thinks
> > "hey, let me re-use that", so it gets floated to the top-level, and the
> > inner "loop" really uses the list, which apart from the `case`s on the
> > list forces it to use boxed 'Int's instead of Int#. Then the boxed Ints
> > need to be unboxed to be used, oops.
>
> When you say that the problem is that I "use the same list multiple
> times", do you mean that I use the actual syntactic expression
> `[0.._SIZE-1]` multiple times on multiple lines, and suggest that this
> should not happen if I use it in only one place?

More or less. The `_SIZE - 1` is constant-folded to 511, so writing the list
as [0 .. 511] would probably not prevent the sharing. And using the same list
in a different function may or may not affect the produced code, I don't know
GHC well enough to predict that.

If you write it in a form that GHC doesn't recognise as the same value, GHC
can and does optimise it as if the value were used only once.

>
> If so, how far does that go, and would GHC even notice that it is the
> same as a potential `[0.._511]` I might write elsewhere?

Depends on "elsewhere". I've seen GHC produce multiple top-level values for
the very same thing used in different functions in the same module, so if
"elsewhere" means a different function and inlining doesn't cause GHC to see
the two at the same time, it will probably not notice. But ask somebody who
knows how GHC works if you want to be sure.

>
> > Do the idiomatic thing and write a loop ;)
>
> Unfortunately, `forM_ [1..n]` is pretty much the most idiomatic and
> beautiful way I can come up with when ask for an iteration over 1 up to
> n!

Well, my idiom is looping, because.

> So this better be fixed :)

Agreed.

John Lato

unread,

Apr 26, 2014, 10:28:30 PM4/26/14

to Niklas Hambüchen, haskell-cafe

Hi Niklas,

No, I mean this:

> V.forM_ (V.enumFromN 1 n)

I agree it's less elegant. And like Daniel points out with lists, you definitely don't want the "enumFromN" to be shared. But one of the biggest warts of Haskell IMHO is that it's still very difficult to get really high-performance code without an intimate understanding of the optimizer.

John L.

Carter Schonwald

unread,

Apr 26, 2014, 11:20:54 PM4/26/14

to John Lato, haskell-cafe

John, can't a careful use of the identity monad or the like prevent that CSE? (i'm a bit fuzzy on that piece of ghc lore)

Tom Ellis

unread,

Apr 27, 2014, 3:47:32 AM4/27/14

to haskel...@haskell.org

On Sun, Apr 27, 2014 at 03:32:12AM +0200, Daniel Fischer wrote:
> On Saturday 26 April 2014, 22:40:50, Niklas Hambüchen wrote:
> > As you probably now, `forM_ [1..n]` is incredibly slow in Haskell due to
> > lack of list fusion [1];
>
> It's not the lack of list fusion per se. If you have a
>
> forM_ [a .. b] $ \n -> do stuff
>
> where the type is Int, GHC is perfectly capable of eliminating the list and
> rewriting it to a loop (usually on par with a hand-written loop, although the
> core you get from a hand-written loop often is smaller and nicer to look at).
>
> The problem is that you use the same list multiple times, and GHC thinks "hey,
> let me re-use that", so it gets floated to the top-level

If this is the case, does -fno-full-laziness help?

Daniel Fischer

unread,

Apr 27, 2014, 8:21:34 AM4/27/14

to Tom Ellis, haskel...@haskell.org

On Sunday 27 April 2014, 08:47:32, Tom Ellis wrote:
> On Sun, Apr 27, 2014 at 03:32:12AM +0200, Daniel Fischer wrote:
> > The problem is that you use the same list multiple times, and GHC thinks
> > "hey, let me re-use that", so it gets floated to the top-level
>
> If this is the case, does -fno-full-laziness help?

Yes and no. It prevents the list from being floated to the top-level, but the
overall code you get is slower than with the shared list.

So it works as far as preventing unwanted sharing is concerned, but it
inhibits optimisations at other places.

Niklas Hambüchen

unread,

Apr 27, 2014, 4:10:00 PM4/27/14

to John Lato, haskell-cafe

On 27/04/14 03:28, John Lato wrote:
> No, I mean this:
>
>> V.forM_ (V.enumFromN 1 n)

Hey John,

unfortunately that's not fast, and it eats all the RAMs.

Here comes the sad comparison of looping in Haskell:

- loop (maxBound :: Word32)
3 seconds, constant space

- forM_ [0..maxBound :: Word32]
36 seconds, constant space

- V.forM_ (V.enumFromTo 0 (maxBound :: Word32))
37 seconds, constant space

- V.forM_ (V.enumFromN 0 (fromIntegral (maxBound :: Word32)))
linear space -> crashes

All loops execute `return ()`.

:(

John Lato

unread,

Apr 27, 2014, 6:18:13 PM4/27/14

to Niklas Hambüchen, haskell-cafe

On Apr 27, 2014 1:10 PM, "Niklas Hambüchen" <ma...@nh2.me> wrote:
>
> On 27/04/14 03:28, John Lato wrote:
> > No, I mean this:
> >
> >> V.forM_ (V.enumFromN 1 n)
>
> Hey John,
>
> unfortunately that's not fast, and it eats all the RAMs.
>
> Here comes the sad comparison of looping in Haskell:
>
> - loop (maxBound :: Word32)
> 3 seconds, constant space
>
> - forM_ [0..maxBound :: Word32]
> 36 seconds, constant space
>
> - V.forM_ (V.enumFromTo 0 (maxBound :: Word32))
> 37 seconds, constant space
>
> - V.forM_ (V.enumFromN 0 (fromIntegral (maxBound :: Word32)))
> linear space -> crashes
>
> All loops execute `return ()`.

Hmm. Are you using regular vectors or unboxed? Also what sort of crash? Is it a stack space overflow or are you on a 32-bit platform?

Niklas Hambüchen

unread,

Apr 27, 2014, 6:37:20 PM4/27/14

to John Lato, haskell-cafe

On 27/04/14 23:18, John Lato wrote:> Hmm. Are you using regular vectors

or unboxed? Also what sort of crash?
> Is it a stack space overflow or are you on a 32-bit platform?

I'm using regular Data.Vector as V.

It's just eating my 8GB and then I kill it (or let my ulimit do it). It
can't be the stack space since I'm on 7.6 where stack space is limited
by default.

It's on 64 bit Linux.

John Lato

unread,

Apr 27, 2014, 7:22:26 PM4/27/14

to Niklas Hambüchen, haskell-cafe

On Sun, Apr 27, 2014 at 3:37 PM, Niklas Hambüchen <ma...@nh2.me> wrote:

On 27/04/14 23:18, John Lato wrote:> Hmm. Are you using regular vectors

or unboxed? Also what sort of crash?
> Is it a stack space overflow or are you on a 32-bit platform?

I'm using regular Data.Vector as V.

Are unboxed vectors faster? My rule of thumb is to use them over Data.Vector whenever possible.

It's just eating my 8GB and then I kill it (or let my ulimit do it). It
can't be the stack space since I'm on 7.6 where stack space is limited
by default.

It's on 64 bit Linux.

I would expect it's because you never force the argument. With `enumFromTo` the argument is forced because it needs to be checked for termination, but `enumFromN` is probably building up a big chain of thunks. I guess for this case `enumFromN` has no benefit over `enumFromTo` because the intention is to create a single loop instead of actually allocating the vector, so the warning in the documentation doesn't necessarily apply.

Niklas Hambüchen

unread,

Apr 27, 2014, 7:49:05 PM4/27/14

to John Lato, haskell-cafe

On 28/04/14 00:22, John Lato wrote:
> Are unboxed vectors faster? My rule of thumb is to use them over
> Data.Vector whenever possible.

I haven't checked yet, but should it matter?
Because my goal is that the vector never be created *at all*, and boxed
or not shouldn't make a difference on that!

> I would expect it's because you never force the argument. With
> `enumFromTo` the argument is forced because it needs to be checked for
> termination, but `enumFromN` is probably building up a big chain of
> thunks. I guess for this case `enumFromN` has no benefit over
> `enumFromTo` because the intention is to create a single loop instead of
> actually allocating the vector, so the warning in the documentation
> doesn't necessarily apply.

Also haven't checked that yet, but I suspect that instead of something
thunk-related, the thing plainly allocates the vector.

Just to clarify: `V.enumFromTo` works much better than `V.enumFromN`
because in contrast to the latter it doesn't actually try to create the
fully sized vector.

Conrad Parker

unread,

Apr 27, 2014, 8:00:00 PM4/27/14

to Niklas Hambüchen, haskell-cafe

Niklas,

just for fun, and seeing as your goal is to make something that works elegeantly and efficiently with list syntax, how about making a new type that's an instance of OverloadedLists but never actually allocates anything, so you can basically desugar "X.forM_ [a..b]" to your loop function?

Conrad.

John Lato

unread,

Apr 27, 2014, 8:34:09 PM4/27/14

to Niklas Hambüchen, haskell-cafe

On Sun, Apr 27, 2014 at 4:49 PM, Niklas Hambüchen <ma...@nh2.me> wrote:

On 28/04/14 00:22, John Lato wrote:
> Are unboxed vectors faster? My rule of thumb is to use them over
> Data.Vector whenever possible.

I haven't checked yet, but should it matter?
Because my goal is that the vector never be created *at all*, and boxed
or not shouldn't make a difference on that!

It can make a difference in that, with unboxed vectors, the compiler can statically determine that it is able to use unboxed values, and therefore is more likely to do so. Having finally broken down and run some tests, I can report that on my system using V.enumFromTo with unboxed vectors results in the same performance as the hand-written loop.

> I would expect it's because you never force the argument. With
> `enumFromTo` the argument is forced because it needs to be checked for
> termination, but `enumFromN` is probably building up a big chain of
> thunks. I guess for this case `enumFromN` has no benefit over
> `enumFromTo` because the intention is to create a single loop instead of
> actually allocating the vector, so the warning in the documentation
> doesn't necessarily apply.

Also haven't checked that yet, but I suspect that instead of something
thunk-related, the thing plainly allocates the vector.

Just to clarify: `V.enumFromTo` works much better than `V.enumFromN`
because in contrast to the latter it doesn't actually try to create the
fully sized vector.

I believe that if you check this with ghc-7.6.3 and -O2, you will discover that my analysis is correct :)

However, I like Conrad's suggestion, looks like an interesting library.

Niklas Hambüchen

unread,

Apr 27, 2014, 10:23:54 PM4/27/14

to John Lato, haskell-cafe

I've just uploaded my benchmarks to https://github.com/nh2/loop.

Please take a look. There are some interesting results.

The first thing I don't understand at all is:

http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/results/bench.html

See how w32/loop and w32/unsafeLoop are equally fast. Then look at

http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/results/bench-traverse-w32.html

Here I run the same thing over the whole of Word32.
See how `loop` is faster here than `unsafeLoop`? How does that make sense?

Next thing:

It seems that V.enumFromTo and V.fromList are actually the same:

http://hackage.haskell.org/package/vector-0.10.9.1/docs/src/Data-Vector-Fusion-Stream-Monadic.html#enumFromTo

However in my benchmark, at least for `Int`, the performance is
different - am I overlooking something?

On 28/04/14 01:34, John Lato wrote:
> It can make a difference in that, with unboxed vectors, the compiler can
> statically determine that it is able to use unboxed values, and
> therefore is more likely to do so. Having finally broken down and run
> some tests, I can report that on my system using V.enumFromTo with
> unboxed vectors results in the same performance as the hand-written loop.

I cannot see a difference between Vector.enumFromTo and
Vector.Unboxed.enumFromTo in my benchmark.

Vector.enumFromTo is as fast as the hand-written loop, but only for
`Int`. For `Word32`, it is 5 times slower. Any idea why?

> > I would expect it's because you never force the argument. With
> > `enumFromTo` the argument is forced because it needs to be checked for
> > termination, but `enumFromN` is probably building up a big chain of
> > thunks. I guess for this case `enumFromN` has no benefit over
> > `enumFromTo` because the intention is to create a single loop
> instead of
> > actually allocating the vector, so the warning in the documentation
> > doesn't necessarily apply.
>
> Also haven't checked that yet, but I suspect that instead of something
> thunk-related, the thing plainly allocates the vector.
>
> Just to clarify: `V.enumFromTo` works much better than `V.enumFromN`
> because in contrast to the latter it doesn't actually try to create the
> fully sized vector.
>
>
> I believe that if you check this with ghc-7.6.3 and -O2, you will
> discover that my analysis is correct :)

Ok, I think I understand now what you mean.

Niklas Hambüchen

unread,

Apr 27, 2014, 10:25:25 PM4/27/14

to Conrad Parker, haskell-cafe

Hey Conrad!

That sounds like a nice idea. I might have a look at it after I have
actually figured out what the fastest/best way is (see the benchmarks I
just posted).

Niklas

On 28/04/14 01:00, Conrad Parker wrote:
> Niklas,
>
> just for fun, and seeing as your goal is to make something that works
> elegeantly and efficiently with list syntax, how about making a new type
> that's an instance of OverloadedLists but never actually allocates
> anything, so you can basically desugar "X.forM_ [a..b]" to your loop
> function?
>
> Conrad.

John Lato

unread,

Apr 28, 2014, 12:24:38 AM4/28/14

to Niklas Hambüchen, haskell-cafe

On Sun, Apr 27, 2014 at 7:23 PM, Niklas Hambüchen <ma...@nh2.me> wrote:

I've just uploaded my benchmarks to https://github.com/nh2/loop.

Please take a look. There are some interesting results.

The first thing I don't understand at all is:

http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/results/bench.html

See how w32/loop and w32/unsafeLoop are equally fast. Then look at

http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/results/bench-traverse-w32.html

Here I run the same thing over the whole of Word32.
See how `loop` is faster here than `unsafeLoop`? How does that make sense?

Huh? In the comments you wrote:

-- Note that some types (e.g. Word32) have bounds checks even for

-- `toEnum`.

Doesn't that explain it? For Int, toEnum/fromEnum is a noop, but on Word32 it's not.

Next thing:

It seems that V.enumFromTo and V.fromList are actually the same:

http://hackage.haskell.org/package/vector-0.10.9.1/docs/src/Data-Vector-Fusion-Stream-Monadic.html#enumFromTo

However in my benchmark, at least for `Int`, the performance is
different - am I overlooking something?

Probably with V.fromList, the list gets floated out? Just guessing, check the core!

On 28/04/14 01:34, John Lato wrote:
> It can make a difference in that, with unboxed vectors, the compiler can
> statically determine that it is able to use unboxed values, and
> therefore is more likely to do so. Having finally broken down and run
> some tests, I can report that on my system using V.enumFromTo with
> unboxed vectors results in the same performance as the hand-written loop.

I cannot see a difference between Vector.enumFromTo and
Vector.Unboxed.enumFromTo in my benchmark.

Vector.enumFromTo is as fast as the hand-written loop, but only for
`Int`. For `Word32`, it is 5 times slower. Any idea why?

Ahh, you made me look at the core again. I think this is related to your observation about V.enumFromTo being the same as V.fromList. With Word32 the generated core shows that this goes via a list representation instead of a nice loop. Which makes me suspect there's some RULE that applies to Stream.enumFromTo that is firing in the first case but not the second. And if I build both versions with -ddump-rule-firings, indeed I see that the Int version has

Rule fired: enumFromTo<Int> [Stream]

With nothing comparable for the Word32 version. I'd imagine if you grep for that in the Vector sources, you'd find something interesting.

The EnumFromN version does not seem to suffer from this (but again it's necessary to evaluate the argument).

Mark Thom

unread,

Apr 28, 2014, 1:06:01 PM4/28/14

to haskell-cafe

Has anyone written a blog post to the effect of "Haskell: The Slow Parts"? Or a heuristic for reliably identifying them, maybe? It sounds as though I could really benefit from it. I had no idea about forM.

Don Stewart

unread,

Apr 28, 2014, 1:08:45 PM4/28/14

to Mark Thom, haskell-cafe

http://book.realworldhaskell.org/read/profiling-and-optimization.html

Would be a start

Niklas Hambüchen

unread,

Apr 28, 2014, 8:35:29 PM4/28/14

to John Lato, haskell-cafe

On 28/04/14 05:24, John Lato wrote:
> Doesn't that explain it? For Int, toEnum/fromEnum is a noop, but on
> Word32 it's not.

No, I'm not talking about the comparison between Word32 and Int:
I'm looking at Word32 only, "w32/loop vs w32/unsafeLoop" in the one
benchmark, and "loop vs unsafeLoop" in the other one.
That's the same functions over the same data type, just that the second
one iterates over more of that data type.

> Ahh, you made me look at the core again. I think this is related to
> your observation about V.enumFromTo being the same as V.fromList. With
> Word32 the generated core shows that this goes via a list representation
> instead of a nice loop. Which makes me suspect there's some RULE that
> applies to Stream.enumFromTo that is firing in the first case but not
> the second. And if I build both versions with -ddump-rule-firings,
> indeed I see that the Int version has
>
> Rule fired: enumFromTo<Int> [Stream]
>
> With nothing comparable for the Word32 version. I'd imagine if you grep
> for that in the Vector sources, you'd find something interesting.

Nice observation.

I filed this as https://github.com/haskell/vector/issues/21.

Hopefully something comes out of it.

Niklas Hambüchen

unread,

Apr 28, 2014, 11:39:35 PM4/28/14

to haskell-cafe

After some evaluation, here come some results of my benchmarks, and the
final function I've packaged:

* It is tricky to write an Enum-based general loop function that is fast
for all data types. The way overflow checks are implemented in different
types makes e.g. Int and Word32 behave differently; Word32 is slower.

* [a..b] is nice and can be just as fast as a manual loop - if you are
lucky. Unfortunately, if you don't want to rely on luck for your
program's performance, you can't have your [a..b] (at the time being).

* Vector's equivalent of [a..b] *might* not be as luck dependent, but
suffers from a factor 5 penalty, which is hopefully a but (see John's post).

Current most appealing solution for fast loops
----------------------------------------------

Unfortunately, this seems to be the most robustly fast (across all types
I have tested it with) loop:

forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) ->
m ()
forLoop start cond inc f = go start
where
go !x | cond x = f x >> go (inc x)
| otherwise = return ()

And you can use it as

forLoop 0 (< n) (+1) $ \i -> ...

Now, you probably don't like this, and neither do I, but it really looks
like this is what you should use if you want to write high performance
code :/

What also works is a `numLoop :: (Num a, Eq a, Monad m) => a -> a -> (a
-> m ()) -> m ()` with start and end value like [a..b], but the
`forLoop` aboves gives more flexibility and you can use either (+1) or
`succ` with it depending on whether you want performance or overflow-safety.

Does it really matter?
----------------------

One might think that in many cases the loop incrementing performance
doesn't matter too much, since the loop body should dominate the time spent.

That is true, but it is easy to get into cases where it doesn't. Some
examples:

* You want to write a test to make sure that reinterpret-casting a
Word32 to Float and back gives you the same Word32 [1]. Using `forLoop`
can make the difference if you have to wait 40 seconds for your test to
pass or 3 seconds. (This is how I got into this topic.)

* You want to implement something that does trivial operations on
unboxed vectors, say dot product or matrix multiplication. `forLoop` can
make a 5x performance difference.

Shouldn't I write something this trivial ad-hoc?
------------------------------------------------

I've packed `forLoop` into https://github.com/nh2/loop and uploaded it
to Hackage (http://hackage.haskell.org/package/loop).

* Maintaining the fastest way to loop in one place frees one from
thinking about it, and I plan to keep this updated with the fastest
implementation known (contributions welcome).

* It gives us a place to discuss alternatives and collect benchmarks for
high-performance looping.

Let's hope it becomes useless soon!

[1]: http://stackoverflow.com/a/7002812/263061

Daniel Trstenjak

unread,

Apr 29, 2014, 3:36:00 AM4/29/14

to haskel...@haskell.org

On Tue, Apr 29, 2014 at 04:39:35AM +0100, Niklas Hambüchen wrote:
> Current most appealing solution for fast loops
> ----------------------------------------------
>
> Unfortunately, this seems to be the most robustly fast (across all types
> I have tested it with) loop:
>

> forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m ()
> forLoop start cond inc f = go start
> where
> go !x | cond x = f x >> go (inc x)
> | otherwise = return ()

Regarding the used stack space, wouldn't the following solution be even better?

forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m ()

forLoop start cond inc f = go start (return ())
where
go !x m | cond x = go (inc x, m >> f x)
| otherwise = m

Greetings,
Daniel

Daniel Trstenjak

unread,

Apr 29, 2014, 4:01:41 AM4/29/14

to haskel...@haskell.org

On Tue, Apr 29, 2014 at 09:36:00AM +0200, Daniel Trstenjak wrote:
> go !x m | cond x = go (inc x, m >> f x)

Obviously I'm programming too much C, it should be:

go !x m | cond x = go (inc x) (m >> f x)

Bertram Felgenhauer

unread,

Apr 29, 2014, 9:03:54 AM4/29/14

to haskel...@haskell.org

Daniel Trstenjak wrote:
> Regarding the used stack space, wouldn't the following solution be even better?
>
>
> forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m ()
> forLoop start cond inc f = go start (return ())
> where
> go !x m | cond x = go (inc x, m >> f x)
> | otherwise = m

This version would build a thunk of shape (return () >> f x1 >> ...
before any of the f xi calls get a chance to run; to make matters
worse, evaluating that thunk *will* use the evaluation stack. The

go !x | cond x = f x >> go (inc x)
| otherwise = return ()

version is better. The guideline here is the strictness of (>>): in most
monads, (>>) is strict in its first argument but lazy in its second one;
furthermore, the second argument is used at most once in typical monad
stacks. So the evaluation stack is not used up at all: Evaluating
'go x' will suspend the 'go (inc x)' call while 'f x' is being run;
once 'f x' returns, the 'go (inc x)' is taken care of, which is now
effectively a tail call. Laziness saves the day.

Cheers,

Bertram

Daniel Trstenjak

unread,

Apr 29, 2014, 11:29:26 AM4/29/14

to haskel...@haskell.org

On Tue, Apr 29, 2014 at 03:03:54PM +0200, Bertram Felgenhauer wrote:
> This version would build a thunk of shape (return () >> f x1 >> ...
> before any of the f xi calls get a chance to run; to make matters
> worse, evaluating that thunk *will* use the evaluation stack.

Ok, I can see, that it has to build the whole thunk, because otherwise
'go' couldn't return anything.

But why would the evaluation of the thunk use the stack?
Why does the evaluation of '>>' in this case use the stack and in the other it doesn't?

> The
>
> go !x | cond x = f x >> go (inc x)
> | otherwise = return ()
>
> version is better. The guideline here is the strictness of (>>): in most
> monads, (>>) is strict in its first argument but lazy in its second one;
> furthermore, the second argument is used at most once in typical monad
> stacks. So the evaluation stack is not used up at all: Evaluating
> 'go x' will suspend the 'go (inc x)' call while 'f x' is being run;
> once 'f x' returns, the 'go (inc x)' is taken care of, which is now
> effectively a tail call. Laziness saves the day.

I think that I was mostly irritated by 'mapM', which uses the stack,
but that's because of it's use of '>>=', and it's bound value has
to be kept on the stack.

Thanks!

Greetings,
Daniel

Niklas Hambüchen

unread,

May 26, 2014, 11:21:33 AM5/26/14

to John Lato, haskell-cafe

There has just been a fix for vector that improves on the bad
performance of Word32 as compared to Int
(https://github.com/haskell/vector/issues/21#issuecomment-44154305).

The new benchmarks
(https://rawgit.com/nh2/loop/master/results/bench-vector-bbd726d.html)
now show Vector+Word32 on par, but functions in base, especially forM_,
are still horrendously slow on Word32.

On 28/04/14 05:24, John Lato wrote:

> Ahh, you made me look at the core again. I think this is related to
> your observation about V.enumFromTo being the same as V.fromList. With
> Word32 the generated core shows that this goes via a list representation
> instead of a nice loop. Which makes me suspect there's some RULE that
> applies to Stream.enumFromTo that is firing in the first case but not
> the second. And if I build both versions with -ddump-rule-firings,
> indeed I see that the Int version has
>
> Rule fired: enumFromTo<Int> [Stream]
>
> With nothing comparable for the Word32 version. I'd imagine if you grep
> for that in the Vector sources, you'd find something interesting.
>
> The EnumFromN version does not seem to suffer from this (but again it's
> necessary to evaluate the argument).

Yitzchak Gale

unread,

May 26, 2014, 12:03:45 PM5/26/14

to John Lato, haskell-cafe

John Lato wrote:
> For Int, toEnum/fromEnum is a noop, but on Word32 it's not.

Actually, GHC has type-specialized rewrite rules for
toEnum/fromEnum on types from Data.Word.
So it's a noop on a 32-bit platform and nearly a noop
on 64-bit.

Regards,
Yitz

Reply all

Reply to author

Forward