[Haskell-cafe] FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))

Will Ness

unread,

Dec 28, 2009, 10:39:13 PM12/28/09

to haskel...@haskell.org

apfelmus <apfelmus <at> quantentunnel.de> writes:

>

~~ This is a repost, with apologies to anyone who sees this twice (I've replied
to a two years old thread, and it doesn't show up in GMANE as I thought it
would). ~~

> Dave Bayer wrote:
> > What I'm calling a "venturi"
> >
> > venturi :: Ord a => [[a]] -> [a]
> >
> > merges an infinite list of infinite lists into one list, under the
> > assumption that each list, and the heads of the lists, are in
> > increasing order.
> >
> > I wrote this as an experiment in coding data structures in Haskell's
> > lazy evaluation model, rather than as explicit data. The majority of the
> > work done by this code is done by "merge"; the multiples of each prime
> > percolate up through a tournament consisting of a balanced tree of
> > suspended "merge" function calls. In order to create an infinite lazy
> > balanced tree of lists, the datatype
> >
> > data List a = A a (List a) | B [a]
> >
> > is used as scaffolding. One thinks of the root of the infinite tree as
> > starting at the leftmost child, and crawling up the left spine as
> > necessary.
>
> After some pondering, the List a data structure for merging is really
> ingenious! :) Here's a try to explain how it works:
>
> The task is to merge a number of sorted and infinite lists when it's
> known that their heads are in increasing order. In particular, we want
> to write
>
> primes = (2:) $ diff [3..] $ venturi $ map multiple primes
>
> Thus, we have a (maybe infinite) list
>
> xss = [xs1, xs2, xs3, ...]
>
> of infinite lists with the following properties
>
> all sorted xss
> sorted (map head xss)
>
> where sorted is a function that returns True if the argument is a
> sorted list. A first try to implement the merging function is
>
> venturi xss = foldr1 merge xss
> = xs1 `merge` (xs2 `merge` (xs3 `merge` ...
>
> where merge is the standard function to merge to sorted lists.
>
> However, there are two problems. The first problem is that this doesn't
> work for infinite lists since merge is strict in both arguments. But
> the property head xs1 < head xs2 < head xs3 < ... we missed to exploit
> yet can now be used in the following way
>
> venturi xss = foldr1 merge' xss
>
> merge' (x:xt) ys = x : merge xt ys
>
> In other words, merge' is biased towards the left element
>
> merge' (x:_|_) _|_ = x : _|_
>
> which is correct since we know that (head xs < head ys).
>
> The second problem is that we want the calls to merge to be arranged
> as a balanced binary tree since that gives an efficient heap. It's not
> so difficult to find a good shape for the infinite tree, the real
> problem is to adapt merge' to this situation since it's not associative:
>
> ......
>
> The problem is that the second form doesn't realize that y is also
> smaller than the third argument. In other words, the second form has to
> treat more than one element as "privileged", namely x1,x2,... and y.
> This can be done with the aforementioned list data structure
>
> data People a = VIP a (People a) | Crowd [a]
>
> The people (VIPs and crowd) are assumed to be _sorted_. Now, we can
> start to implement
>
> merge' :: Ord a => People a -> People a -> People a

Hi,

.. replying to a two-years-old post here, :) :) and after consulting the
full "VIP" version in haskellwiki/Prime_Numers#Implicit_Heap ...

It is indeed the major problem with the merged multiples removing code (similar
one to Richard Bird's code from Melissa O'Neill's JFP article) - the linear
nature of foldr, requiring an (:: a->b->b) merge function. To make it freely
composable to rearrange the list into arbitrary form tree it must indeed be
type uniform (:: a->a->a) first, and associative second.

The structure of the folded tree should be chosen to better suit the primes
multiples production. I guestimate the total cost as Sum (1/p)*d, where p is a
generating prime at the leaf, and d the leaf's depth, i.e. the amount of merge
nodes its produced multiple must pass on its way to the top.

The structure used in your VIP code, 1+(2+(4+(8+...))), can actually be
improved upon with another, (2+4)+( (4+8)+( (8+16)+...)), for which the
estimated cost is about 10%-12% lower.

This can be expressed concisely as the following:

primes :: () -> [Integer]
primes () = 2:primes'
where
primes' = [3,5] ++ drop 2 [3,5..] `minus` comps
mults = map (\p-> fromList [p*p,p*p+2*p..]) $ primes'
(comps,_) = tfold mergeSP (pairwise mergeSP mults)
fromList (x:xs) = ([x],xs)

tfold f (a: ~(b: ~(c:xs)))
= (a `f` (b `f` c)) `f` tfold f (pairwise f xs)
pairwise f (x:y:ys) = f x y : pairwise f ys

mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c
in (a ++ bc, merge b' d)
where
spMerge u@(x:xs) w@(y:ys) = case compare x y of
LT -> (x:c,d) where (c,d) = spMerge xs w
EQ -> (x:c,d) where (c,d) = spMerge xs ys
GT -> (y:c,d) where (c,d) = spMerge u ys
spMerge u [] = ([], u)
spMerge [] w = ([], w)

with ''merge'' and ''minus'' defined in the usual way. Its run times are indeed
improved 10%-12% over the VIP code from the haskellwiki page. Testing was done
by running the code, interpreted, inside GHCi. The ordered "split pairs"
representing ordered lists here as pairs of a known, finite (so far) prefix and
the rest of list, form a _monoid_ under mergeSP.

Or with wheel,

primes :: () -> [Integer]
primes () = 2:3:5:7:primes'
where
primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'
(comps,_) = tfold mergeSP (pairwise mergeSP mults)
fromList (x:xs) = ([x],xs)

rollFrom n = let x = (n-11) `mod` 210
(y,_) = span (< x) wheelSums
in roll n $ drop (length y) wheel

wheelSums = roll 0 wdiffs
roll = scanl (+)
wheel = wdiffs ++ wheel
wdiffs = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wdiffs

Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times
faster than Priority Queue based code from Melissa O'Neill's ZIP package
mentioned at the haskellwiki/Prime_Numbers page, with
about half used memory reported, in producing 10,000 to 300,000 primes.

It is faster than BayerPrimes.hs from the ZIP package too, in the tested range,
at about 35 lines of code in total.

~~ This is a repost, with apologies to anyone who sees this twice (I've replied
to a two years old thread, and it doesn't show up in GMANE as I thought it
would). ~~

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Will Ness

unread,

Dec 28, 2009, 10:50:10 PM12/28/09

to haskel...@haskell.org

Will Ness <will_n48 <at> yahoo.com> writes:

>
>
> wheelSums = roll 0 wdiffs
> roll = scanl (+)
> wheel = wdiffs ++ wheel
> wdiffs = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
> 4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wdiffs
>

Apparently that works too, but I meant it to be:

wdiffs = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:

4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:[]

:)

Daniel Fischer

unread,

Dec 29, 2009, 6:28:45 AM12/29/09

to haskel...@haskell.org, Will Ness

Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness:
> Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times
> faster than Priority Queue based code from Melissa O'Neill's ZIP package
> mentioned at the haskellwiki/Prime_Numbers page, with
> about half used memory reported, in producing 10,000 to 300,000 primes.
>
> It is faster than BayerPrimes.hs from the ZIP package too, in the tested
> range, at about 35 lines of code in total.

That's nice. However, the important criterion is how compiled code (-O2) fares. Do the
relations continue to hold? How does it compare to a bitsieve?

Will Ness

unread,

Dec 29, 2009, 8:34:52 AM12/29/09

to haskel...@haskell.org

Haven't gotten to that part yet. :)

But why is it more important? Would that not tell us more about the compiler
performance than the code itself?

This code is just an endpoint (so far) in a short procession of natural stepwise
development of the famous classic Turner's sieve, through the "postponed
filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) and
on to the tree-fold merging, with wheel. I just wanted to see where the simple
"normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.

It's not about writing the fastest code in _advanced_ Haskell. It's about having
clear and simple code that can be understood at a glance - i.e. contributes to
our understanding of a problem - faithfully reflecting its essential elements,
and because of _that_, fast. It's kind of like _not_ using mutable arrays in a
quicksort.

Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel
right to me somehow, especially the claim that going by primes squares is "a
pleasing but minor optimization", what with the postponed filters (which serves
as the framework for all the other variants) achieving the orders of magnitude
speedup and cutting the Turner's O(n^2) right down to O(n^1.5) just by doing
that squares optimization (with the final version hovering around 1.24..1.17 in
the tested range). The Euler's sieve being a special case of Eratosthenes's,
too, doesn't let credence to claims that only the PQ version is somehow uniquely
authentic and "faithful" to it.

Turner's sieve should have been always looked at as just a specification, not a
code, anyway, and actually running it is ridiculous. Postponed filters version,
is the one to be used as a reference point of the basic _code_, precisely
because it _does_ use the primes squares optimization, which _is_ essential to
any basic sieve.

> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe

Eugene Kirpichov

unread,

Dec 29, 2009, 8:37:57 AM12/29/09

to Will Ness, haskel...@haskell.org

2009/12/29 Will Ness <will...@yahoo.com>:

> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
>
>>
>>
>> Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness:
>> > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times
>> > faster than Priority Queue based code from Melissa O'Neill's ZIP package
>> > mentioned at the haskellwiki/Prime_Numbers page, with
>> > about half used memory reported, in producing 10,000 to 300,000 primes.
>> >
>> > It is faster than BayerPrimes.hs from the ZIP package too, in the tested
>> > range, at about 35 lines of code in total.
>>
>> That's nice. However, the important criterion is how compiled code (-O2)
> fares. Do the relations continue to hold? How does it compare to a bitsieve?
>
>
> Haven't gotten to that part yet. :)
>
> But why is it more important? Would that not tell us more about the compiler
> performance than the code itself?
>

If you mean "algorithmic complexity", you shouldn't care about a
difference of 2.5x.
If you mean "actual performance for a particular task", you should
measure the performance in realistic conditions. Namely, if you're
implementing a program that needs efficient generation of primes,
won't you compile it with -O2?

--
Eugene Kirpichov
Web IR developer, market.yandex.ru

Will Ness

unread,

Dec 29, 2009, 8:58:58 AM12/29/09

to haskel...@haskell.org

Eugene Kirpichov <ekirpichov <at> gmail.com> writes:

>
> 2009/12/29 Will Ness <will_n48 <at> yahoo.com>:

> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> >
> >>
> >>
> >> Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness:
> >> > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times
> >> > faster than Priority Queue based code from Melissa O'Neill's ZIP package
> >> > mentioned at the haskellwiki/Prime_Numbers page, with
> >> > about half used memory reported, in producing 10,000 to 300,000 primes.
> >> >
> >> > It is faster than BayerPrimes.hs from the ZIP package too, in the tested
> >> > range, at about 35 lines of code in total.
> >>
> >> That's nice. However, the important criterion is how compiled code (-O2)
> > fares. Do the relations continue to hold? How does it compare to a bitsieve?
> >
> >
> > Haven't gotten to that part yet. :)
> >
> > But why is it more important? Would that not tell us more about the compiler
> > performance than the code itself?
> >
>
> If you mean "algorithmic complexity", you shouldn't care about a
> difference of 2.5x.

It's not just at one point; the asymptotics are _the_same_ across the range that
I've tested (admittedly, somewhat narrow). I measure local behavior simply as
logBase in base of ratio of problem sizes, of the ratio of run times.

> If you mean "actual performance for a particular task", you should
> measure the performance in realistic conditions. Namely, if you're
> implementing a program that needs efficient generation of primes,
> won't you compile it with -O2?

If I realistically needed primes generated in a real life setting, I'd probably
had to use some C for that. If OTOH we're talking about a tutorial code that is
to be as efficient as possible without loosing it clarity, being a reflection of
essentials of the problem, then any overly complicated advanced Haskell wouldn't
be my choice either. And seeing that this overly-complicated (IMO),
steps-jumping PQ-based code was sold to us as the only "faithful" rendering of
the sieve, I wanted to see for myself whether this really holds water.

Will Ness

unread,

Dec 29, 2009, 11:43:27 AM12/29/09

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

OK, I've tested it now. For some reason it runs about 1.3x times slower than
Melissa O'Neill's code when compiled, while taking about the same amount of
memory (going slightly worse actually, 0.97..1.05..1.08), in the range of
100,000..1,000,000..2,000,000 primes produced. The local asymptotic behavior was
about the same, again, in both versions, about O(n^1.20..1.25) - worsening
slightly for the merging version (the ratio of run times going 1.25..1.29..1.32).

I guess that makes the two versions (almost) operationally equivalent in
producing of up to a million primes or two.

>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

Daniel Fischer

unread,

Dec 29, 2009, 2:19:08 PM12/29/09

to haskel...@haskell.org, Will Ness

Am Dienstag 29 Dezember 2009 14:34:03 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Dienstag 29 Dezember 2009 04:38:21 schrieb Will Ness:
> > > Now _this_, when tested as interpreted code in GHCi, runs about 2.5x
> > > times faster than Priority Queue based code from Melissa O'Neill's ZIP
> > > package mentioned at the haskellwiki/Prime_Numbers page, with
> > > about half used memory reported, in producing 10,000 to 300,000 primes.
> > >
> > > It is faster than BayerPrimes.hs from the ZIP package too, in the
> > > tested range, at about 35 lines of code in total.
> >
> > That's nice. However, the important criterion is how compiled code (-O2)
>
> fares. Do the relations continue to hold? How does it compare to a
> bitsieve?
>
>
> Haven't gotten to that part yet. :)
>
> But why is it more important?

I thought the uppercase FASTER in the subject meant you were really interested in speed.
If you're only interested in asymptotics, interpreted may be appropriate.
However, it is possible that optimisation can change the perceived asymptotics of an
algorithm (determined strictness eliminating thunks for example).

While I haven't detected that with the primes code, I find that in my ghci your code is
approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in
scaling observed), while when compiled with -O2, ONeill is approximately three times as
fast as your code and twice as fast as Bayer as an executable, about twice as fast as your
code and slightly slower than Bayer in ghci.
And I have huge memory problems in ghci with your code.
That may be due to my implementation of merge and minus, though. You wrote 'standard' and
I coded the straightforward methods.

> Would that not tell us more about the compiler performance than the code itself?

Unless you write machine code or assembly, don't all performance tests tell us more about
the compiler/interpreter performance than the code itself?
That is, of course, with respect to algorithms with the same scaling behaviour.

>
> This code is just an endpoint (so far) in a short procession of natural
> stepwise development of the famous classic Turner's sieve,

That was

sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs)

, was it?

> through the
> "postponed filters", through to Euler's sieve, the merging sieve (i.e.
> Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted
> to see where the simple "normal" (i.e. _beginner_-friendly) functional code
> can get, in a natural way.

Good.

>
> It's not about writing the fastest code in _advanced_ Haskell. It's about
> having clear and simple code that can be understood at a glance - i.e.
> contributes to our understanding of a problem - faithfully reflecting its
> essential elements, and because of _that_, fast. It's kind of like _not_
> using mutable arrays in a quicksort.

What's wrong with mutable arrays? There are a lot of algorithms which can be easily and
efficiently implemented using mutable unboxed arrays while a comparably efficient
implementation without mutable arrays is hard. For those, I consider STUArrays the natural
choice. Sieving primes falls into that category.

>
> Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't
> feel right to me somehow,

I fully agree.

> especially the claim that going by primes squares
> is "a pleasing but minor optimization",

Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and*
reduces the constant facors significantly.

> what with the postponed filters
> (which serves as the framework for all the other variants) achieving the
> orders of magnitude speedup and cutting the Turner's O(n^2) right down to
> O(n^1.5) just by doing that squares optimization (with the final version
> hovering around 1.24..1.17 in the tested range). The Euler's sieve being a
> special case of Eratosthenes's, too, doesn't let credence to claims that
> only the PQ version is somehow uniquely authentic and "faithful" to it.

I never found that claim convincing either.

Daniel Fischer

unread,

Dec 29, 2009, 2:37:23 PM12/29/09

to haskel...@haskell.org, Will Ness

Gee, seems my mail server reads your posts very thoroughly today :)

If you need the utmost speed, then probably yes. If you can do with a little less, my
STUArray bitsieves take about 35% longer than the equivalent C code and are roughly eight
times faster than ONeillPrimes. I can usually live well with that.

> If OTOH we're talking about a tutorial
> code that is to be as efficient as possible without loosing it clarity,
> being a reflection of essentials of the problem, then any overly
> complicated advanced Haskell wouldn't be my choice either.

+1
Though perhaps we view mutable array code differently. In my view, it's neither advanced
nor complicated.

> And seeing that
> this overly-complicated (IMO), steps-jumping PQ-based code was sold to us
> as the only "faithful" rendering of the sieve, I wanted to see for myself
> whether this really holds water.

I can understand that very well.

Will Ness

unread,

Dec 29, 2009, 7:05:24 PM12/29/09

to haskel...@haskell.org

that was what I was getting at first too, before I've put into my code the
_type_signatures_ and the "specialize" _pragmas_ as per her file. Then it was
only 1.3x slower, when compiled (with about same asymptotics and memory usage).

>and twice as fast as Bayer as an executable, about twice as fast as your code
> and slightly slower than Bayer in ghci.

see, this kind of inconsistencies is exactly why I was concentrating only on
one platform in measuring the speed - the interp'/GHCi combination. Especially
when developing and trying out several approaches, to test with compiler just
takes too long. :) And why should it give (sometimes) wildly different readings
when running inside GHCi or standalone ??

> And I have huge memory problems in ghci with your code.
> That may be due to my implementation of merge and minus, though. You wrote
> 'standard' and I coded the straightforward methods.

Here's what I'm using (BTW I've put it on the primes haskellwiki page too). The
memory reported for interpreted is about half of PQ's (IIRC), and compiled - the
same:

minus a@(x:xs) b@(y:ys) = case compare x y of
LT -> x: xs `minus` b
GT -> a `minus` ys
EQ -> xs `minus` ys
minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
LT -> x: merge xs b
EQ -> x: merge xs ys
GT -> y: merge a ys
merge a b = if null b then a else b

>
> > Would that not tell us more about the compiler performance than the code
> > itself?
>
> Unless you write machine code or assembly, don't all performance tests tell us
more about the compiler/interpreter performance than the code itself?
> That is, of course, with respect to algorithms with the same scaling behaviour.
>
> >
> > This code is just an endpoint (so far) in a short procession of natural
> > stepwise development of the famous classic Turner's sieve,
>
> That was
>
> sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs)
>
> , was it?

right

>
> > through the
> > "postponed filters", through to Euler's sieve, the merging sieve (i.e.
> > Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted
> > to see where the simple "normal" (i.e. _beginner_-friendly) functional code
> > can get, in a natural way.
>
> Good.
>
> >
> > It's not about writing the fastest code in _advanced_ Haskell. It's about
> > having clear and simple code that can be understood at a glance - i.e.
> > contributes to our understanding of a problem - faithfully reflecting its
> > essential elements, and because of _that_, fast. It's kind of like _not_
> > using mutable arrays in a quicksort.
>
> What's wrong with mutable arrays? There are a lot of algorithms which can be
> easily and efficiently implemented using mutable unboxed arrays while a
> comparably efficient implementation without mutable arrays is hard. For
> those, I consider STUArrays the natural choice. Sieving primes falls into
> that category.

It's just that the mutating code tends to be convoluted, like in the example I
mentioned of quicksort. One has to read the C code with good attention to
understand it. "Normal" Haskell is much more visually apparent, like

primes = 2: 3: sieve (tail primes) [5,7..]
where
sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
where (h,~(_:t)) = span (< q) xs
q = p*p

or

primes = 2: 3: sieve [] (tail primes) 5
where
sieve fs (p:ps) x = [i | i<- [x,x+2..q-2], a!i]
++ sieve ((2*p,q):fs') ps (q+2)
where
q = p*p
mults = [ [y+s,y+2*s..q] | (s,y)<- fs]
fs' = [ (s,last ms) | ((s,_),ms)<- zip fs mults]
a = accumArray (\a b->False) True (x,q-2)
[(i,()) | ms<- mults, i<- ms]

>
> >
> > Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't
> > feel right to me somehow,
>
> I fully agree.

:) :) :) :) :)

>
> > especially the claim that going by primes squares
> > is "a pleasing but minor optimization",
>
> Which it is not. It is a major optimisation. It reduces the algorithmic
complexity *and* reduces the constant facors significantly.

Exactly! Seeing this claim was just incredible to me. I've spent a
considerable
time when I first learned Haskell, tweaking the SICP code (as I remembred it;
probably very similar to Turner's) until coming up with an equivalent of the
"postponed sieve" (some years ago, didn't know about "span" yet :) ). But I
assumed that this result was well known. Turner's sieve should long be
regarded
as _specification_, not an actual _code_, I thought.

I think what happened was that Melissa O'Neill thought about the mutable
storage
i.e. "imperative" implementation when she said that, where numbers do get
"crossed off" from the same "canvas". But here in functional code we don't
"cross off" no numbers; we deal with numbers supply and filtering and merging,
and nested function calls with their overhead etc., which costs can't be just
ignored. IOW there's no "crossing off" done by any of extra filters, which
nevertheless are all VERY busy, doing nothing. _Not_ "crossing" the multiples
"off".

>
> > what with the postponed filters
> > (which serves as the framework for all the other variants) achieving the
> > orders of magnitude speedup and cutting the Turner's O(n^2) right down to
> > O(n^1.5) just by doing that squares optimization (with the final version
> > hovering around 1.24..1.17 in the tested range). The Euler's sieve being a
> > special case of Eratosthenes's, too, doesn't let credence to claims that
> > only the PQ version is somehow uniquely authentic and "faithful" to it.
>
> I never found that claim convincing either.

I think what got crossed probably was "faithful to the original algorithm",
with
"faithful" to its typical imperative mutable storage implementation, as in
"having same _complexity_". In that sense of course, linear merging is worse;
it
has worse complexity than "C", but is nevertheless faithful to the original
algorithm, only under the functional setting. It is worse because of linear
nature of lists, and it is all too easy to overlook the possibility of tree
folding and jump to the conclusion that one needs a specialized data structure
for that... But the article didn't even get to that part; instead it was all
about proving rigorously that divisibility testing for primes is very costly
(without actually formulating this conclusion). It was frustrating to read
that
"details of what gets crossed off and how, matter" without these details being
actually spelled out - simply, that primes shouldn't be tested at all. That's
the real insight of the article, IMO.

>
> >
> > Turner's sieve should have been always looked at as just a specification,
> > not a code, anyway, and actually running it is ridiculous. Postponed
> > filters version, is the one to be used as a reference point of the basic
> > _code_, precisely because it _does_ use the primes squares optimization,
> > which _is_ essential to any basic sieve.
>
>
>

Will Ness

unread,

Dec 29, 2009, 7:24:18 PM12/29/09

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Gee, seems my mail server reads your posts very thoroughly today :)

I hope it's not a bad thing. :)

>
> Am Dienstag 29 Dezember 2009 14:58:10 schrieb Will Ness:
> >
> > If I realistically needed primes generated in a real life setting, I'd
> > probably had to use some C for that.
>
> If you need the utmost speed, then probably yes. If you can do with a little
> less, my STUArray bitsieves take about 35% longer than the equivalent C code >
and are roughly eight times faster than ONeillPrimes. I can usually live well >
with that.

Wow! That's fast! (that's the code from haskellwiki's primes page, right?)

>
> > If OTOH we're talking about a tutorial
> > code that is to be as efficient as possible without loosing it clarity,
> > being a reflection of essentials of the problem, then any overly
> > complicated advanced Haskell wouldn't be my choice either.
>
> +1
> Though perhaps we view mutable array code differently. In my view, it's
> neither advanced nor complicated.

convoluted, then. Not using "higher level" concepts, like map and foldr, :) or
head.until isSingleton (pairwise merge).map wrap , that kind of thing. :)

BTW I think a really smart compiler should just get a specification, like
Turner's sieve, and just derive a good code from that by itself.

Another example would be

qq n m = sum $ take n $ cycle [1..m]

which should really get compiled into just a math formula, IMO. Now _that_ I
would call a good compiler. That way I really won't have to learn how to use
STUArray`s you see.

I've seen this question asked a lot, what should be a good programming language?

IMO, English (plus math where needed, and maybe some sketches by hand). :)

>
> > And seeing that
> > this overly-complicated (IMO), steps-jumping PQ-based code was sold to us
> > as the only "faithful" rendering of the sieve, I wanted to see for myself
> > whether this really holds water.
>
> I can understand that very well.

:)

Daniel Fischer

unread,

Dec 29, 2009, 9:18:49 PM12/29/09

to haskel...@haskell.org, Will Ness

Am Mittwoch 30 Dezember 2009 01:04:34 schrieb Will Ness:
>
> > While I haven't detected that with the primes code, I find that in my
> > ghci your code is approximately 2.5 times faster than ONeill or Bayer
> > when interpreted (no difference in scaling observed), while when compiled
> > with -O2, ONeill is approximately three times as fast as your code
>
> that was what I was getting at first too, before I've put into my code the
> _type_signatures_ and the "specialize" _pragmas_ as per her file. Then it
> was only 1.3x slower, when compiled (with about same asymptotics and memory
> usage).
>

Specialising your code to Int makes it half as fast as ONeill here (as an executable).
That is largely due to the fact that your code uses much more memory here (54MB vs. 2MB
for the millionth prime), though, the MUT times have a ratio of about 1.5.

Now an interesting question is, why does it use so much memory here?
Can you send me your exact code so I can see how that behaves here?

> >and twice as fast as Bayer as an executable, about twice as fast as your
> > code and slightly slower than Bayer in ghci.
>
> see, this kind of inconsistencies is exactly why I was concentrating only
> on one platform in measuring the speed - the interp'/GHCi combination.

The problem with that is that one is primarily interested in speed for library functions,
which are mostly used as compiled code.

> Especially when developing and trying out several approaches, to test with
> compiler just takes too long. :) And why should it give (sometimes) wildly
> different readings when running inside GHCi or standalone ??

Good question.

>
> > And I have huge memory problems in ghci with your code.
> > That may be due to my implementation of merge and minus, though. You
> > wrote 'standard' and I coded the straightforward methods.
>
> Here's what I'm using (BTW I've put it on the primes haskellwiki page too).
> The memory reported for interpreted is about half of PQ's (IIRC), and
> compiled - the same:
>
> minus a@(x:xs) b@(y:ys) = case compare x y of
> LT -> x: xs `minus` b
> GT -> a `minus` ys
> EQ -> xs `minus` ys
> minus a b = a
>
> merge a@(x:xs) b@(y:ys) = case compare x y of
> LT -> x: merge xs b
> EQ -> x: merge xs ys
> GT -> y: merge a ys
> merge a b = if null b then a else b

More or less the same that I wrote.

> > What's wrong with mutable arrays? There are a lot of algorithms which can
> > be easily and efficiently implemented using mutable unboxed arrays while
> > a comparably efficient implementation without mutable arrays is hard. For
> > those, I consider STUArrays the natural choice. Sieving primes falls into
> > that category.
>
> It's just that the mutating code tends to be convoluted, like in the
> example I mentioned of quicksort. One has to read the C code with good
> attention to understand it.

Convoluted is (often) an exaggeration. But I agree that the specification of 'what' is
usually easier to understand than that of 'how'.

> "Normal" Haskell is much more visually apparent, like
>
> primes = 2: 3: sieve (tail primes) [5,7..]
> where
> sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
> where (h,~(_:t)) = span (< q) xs
> q = p*p
>

Yes.

> or
>
> primes = 2: 3: sieve [] (tail primes) 5
> where
> sieve fs (p:ps) x = [i | i<- [x,x+2..q-2], a!i]
> ++ sieve ((2*p,q):fs') ps (q+2)
> where
> q = p*p
> mults = [ [y+s,y+2*s..q] | (s,y)<- fs]
> fs' = [ (s,last ms) | ((s,_),ms)<- zip fs mults]
> a = accumArray (\a b->False) True (x,q-2)
> [(i,()) | ms<- mults, i<- ms]
>

Umm, really?
I'd think if you see what that does, you won't have difficulties with a mutable array
sieve.

Daniel Fischer

unread,

Dec 29, 2009, 10:05:09 PM12/29/09

to haskel...@haskell.org, Will Ness

Am Mittwoch 30 Dezember 2009 01:23:32 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Gee, seems my mail server reads your posts very thoroughly today :)
>
> I hope it's not a bad thing. :)
>

It means, twenty minutes after I replied to the previous, I got your hours old post which
mentions points I could have included in the aforementioned reply.

Not really bad, but somewhat inconvenient.

> > Am Dienstag 29 Dezember 2009 14:58:10 schrieb Will Ness:
> > > If I realistically needed primes generated in a real life setting, I'd
> > > probably had to use some C for that.
> >
> > If you need the utmost speed, then probably yes. If you can do with a
> > little less, my STUArray bitsieves take about 35% longer than the
> > equivalent C code and are roughly eight times faster than ONeillPrimes.
> > I can usually live well with that.
>
>
> Wow! That's fast! (that's the code from haskellwiki's primes page, right?)
>

No, it's my own code. Nothing elaborate, just sieving numbers 6k�1, twice as fast as the
haskellwiki code (here) and uses only 1/3 the memory. For the record:

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module SievePrimes where

import Data.Array.Base (unsafeRead, unsafeWrite, unsafeAt)
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad (when)
import Data.Bits

primesUpTo :: Integer -> [Integer]
primesUpTo bound
= 2:3:[fromIntegral (3*i + (if testBit i 0 then 4 else 5))
| i <- [0 .. bd], par `unsafeAt` i]
where
bd0 = 2*(bound `div` 6) - case bound `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 }
bd = fromInteger bd0
sr = floor (sqrt (fromIntegral bound))
sbd = 2*(sr `div` 6) - case sr `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 }
par :: UArray Int Bool
par = runSTUArray $ do
ar <- newArray (0,bd) True
let tick i s1 s2
| bd < i = return ()
| otherwise = do
p <- unsafeRead ar i
when p (unsafeWrite ar i False)
tick (i+s1) s2 s1
sift i
| i > sbd = return ar
| otherwise = do
p <- unsafeRead ar i
when p (let (!start,!s1,!s2) = if even i then
(i*(3*i+10)+7,2*i+3,4*i+7) else (i*(3*i+8)+4,4*i+5,2*i+3) in tick start s1 s2)
sift (i+1)
sift 0
----------------------------------------------------------------------

The index magic is admittedly not obvious, but if you take that on faith, the rest is
pretty clear. To find the n-th prime,

----------------------------------------------------------------------
module Main (main) where

import SievePrimes (primesUpTo)
import System.Environment (getArgs)

printNthPrime :: Int -> IO ()
printNthPrime n = print (n, primesUpTo (estim n) !! (n-1))

main = do
args <- getArgs
printNthPrime $ read $ head args

estim :: Int -> Integer
estim k
| n < 13 = 37
| n < 70 = 349
| otherwise = round x
where
n = fromIntegral k :: Double
x = n * (log n + log (log n) - 1 + 1.05 * log (log n) / log n)
----------------------------------------------------------------------

If I don't construct the prime list but count directly on the array, it's ~6% faster.

> > > If OTOH we're talking about a tutorial
> > > code that is to be as efficient as possible without loosing it clarity,
> > > being a reflection of essentials of the problem, then any overly
> > > complicated advanced Haskell wouldn't be my choice either.
> >
> > +1
> > Though perhaps we view mutable array code differently. In my view, it's
> > neither advanced nor complicated.
>
> convoluted, then. Not using "higher level" concepts, like map and foldr, :)
> or head.until isSingleton (pairwise merge).map wrap , that kind of thing.
> :)
>
> BTW I think a really smart compiler should just get a specification, like
> Turner's sieve, and just derive a good code from that by itself.

Go ahead and write one. I would love such a beast.

>
> Another example would be
>
> qq n m = sum $ take n $ cycle [1..m]
>
> which should really get compiled into just a math formula, IMO. Now _that_
> I would call a good compiler.

Dream on, dream on, with hope in your heart.

> That way I really won't have to learn how to use STUArray`s you see.
>
> I've seen this question asked a lot, what should be a good programming
> language?
>
> IMO, English (plus math where needed, and maybe some sketches by hand). :)
>

Maybe you'd like http://en.wikipedia.org/wiki/Shakespeare_(programming_language) ?

Will Ness

unread,

Dec 30, 2009, 4:58:28 AM12/30/09

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Mittwoch 30 Dezember 2009 01:04:34 schrieb Will Ness:
> >
> > > While I haven't detected that with the primes code, I find that in my
> > > ghci your code is approximately 2.5 times faster than ONeill or Bayer
> > > when interpreted (no difference in scaling observed), while when
> > > compiled with -O2, ONeill is approximately three times as fast as
> > > your code
> >
> > that was what I was getting at first too, before I've put into my code the
> > _type_signatures_ and the "specialize" _pragmas_ as per her file. Then it
> > was only 1.3x slower, when compiled (with about same asymptotics and
> > memory usage).
> >
>
> Specialising your code to Int makes it half as fast as ONeill here
> (as an executable).
> That is largely due to the fact that your code uses much more memory here
> (54MB vs. 2MB for the millionth prime), though, the MUT times have a ratio
> of about 1.5.

I'm an unsophisticated tester. I just use

GHC -O2 -c filename.hs
GHCi filename

and then it says ( for primes()!!1000000 )

(8.24 secs, 1906813836 bytes) for my code, and
(6.09 secs, 1800873864 bytes) for O'Neill's

But now when I've looked at system resources I see this too. Well, it means
we've found where the PQ code is better. OK.

> Now an interesting question is, why does it use so much memory here?
> Can you send me your exact code so I can see how that behaves here?

will do. It's probably doing a lot more bookkeeping. Or it might be some impl
issue with scanl or span etc., and it'll go away if we'd recode it directly,
who knows. We can only guess if we don't know the compiler in and out. That's
exactly what kept me off using the compiler. Guessing. Sheesh.

(and I did see a 2.0x speedup once when replacing one simple code snippet for
its operationally equivalent twin).

>
> > >and twice as fast as Bayer as an executable, about twice as fast as your
> > > code and slightly slower than Bayer in ghci.
> >
> > see, this kind of inconsistencies is exactly why I was concentrating only
> > on one platform in measuring the speed - the interp'/GHCi combination.
>
> The problem with that is that one is primarily interested in speed for
> library functions, which are mostly used as compiled code.

right and I used it as a measure of code's "fitness" to the problem so it was
only comparative to me.

>
> > Especially when developing and trying out several approaches, to test with
> > compiler just takes too long. :) And why should it give (sometimes) wildly
> > different readings when running inside GHCi or standalone ??
>
> Good question.
>
> >
> > > And I have huge memory problems in ghci with your code.
> > > That may be due to my implementation of merge and minus, though. You
> > > wrote 'standard' and I coded the straightforward methods.
> >

> > Here's what I'm using ...

>
> More or less the same that I wrote.

The rest is exactly as I've posted it, I've only added type signatures and
specialize pragmas, as per Melissa's code,

{-# SPECIALIZE primes :: () -> [Int] #-}
{-# SPECIALIZE primes :: () -> [Integer] #-}
primes :: Integral a => () -> [a]

primes () = 2:3:5:7:primes'
where
primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps

(comps,_) = tfold mergeSP (pairwise mergeSP mults)

mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'

maybe it's about memoization of primes'. She writes something about it in
her code.

> >
> > It's just that the mutating code tends to be convoluted, like in the
> > example I mentioned of quicksort. One has to read the C code with good
> > attention to understand it.
>
> Convoluted is (often) an exaggeration. But I agree that the specification
> of 'what' is usually easier to understand than that of 'how'.

well put.

>
> > "Normal" Haskell is much more visually apparent, like
> >
> > primes = 2: 3: sieve (tail primes) [5,7..]
> > where
> > sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
> > where (h,~(_:t)) = span (< q) xs
> > q = p*p
> >
>
> Yes.
>
> > or
> >
> > primes = 2: 3: sieve [] (tail primes) 5
> > where
> > sieve fs (p:ps) x = [i | i<- [x,x+2..q-2], a!i]
> > ++ sieve ((2*p,q):fs') ps (q+2)
> > where
> > q = p*p
> > mults = [ [y+s,y+2*s..q] | (s,y)<- fs]
> > fs' = [ (s,last ms) | ((s,_),ms)<- zip fs mults]
> > a = accumArray (\a b->False) True (x,q-2)
> > [(i,()) | ms<- mults, i<- ms]
> >
>
> Umm, really?
> I'd think if you see what that does, you won't have difficulties with a
> mutable array sieve.

You're right, bad example. :)

Will Ness

unread,

Dec 30, 2009, 5:06:10 AM12/30/09

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>

No, it's my own code. Nothing elaborate, just sieving numbers 6k±1, twice as
fast as the haskellwiki code (here) and uses only 1/3 the memory. For the record:

....

thanks! will need to sift through it thoroughly... :) :)

> >
> > BTW I think a really smart compiler should just get a specification, like
> > Turner's sieve, and just derive a good code from that by itself.
>
> Go ahead and write one. I would love such a beast.
>
> >
> > Another example would be
> >
> > qq n m = sum $ take n $ cycle [1..m]
> >
> > which should really get compiled into just a math formula, IMO. Now _that_
> > I would call a good compiler.
>
> Dream on, dream on, with hope in your heart.

Those who can't do, dream. And rant. :)

>
> Maybe you'd like http://en.wikipedia.org/wiki/Shakespeare_(programming_language) ?

niice.

:)

>
>
>
> _______________________________________________
> Haskell-Cafe mailing list

> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

Daniel Fischer

unread,

Dec 30, 2009, 5:44:54 AM12/30/09

to haskel...@haskell.org

Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer:
> > especially the claim that going by primes squares
> > is "a pleasing but minor optimization",
>
> Which it is not. It is a major optimisation. It reduces the algorithmic

> complexity *and* reduces the constant factors significantly.

D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper.
It doesn't change the complexity, and the constant factors are reduced far less than I
thought. The huge performance differences I had must have been due to cache misses (unless
you use a segmented sieve, starting at the square reduces the number of cache misses
significantly).

Will Ness

unread,

Dec 30, 2009, 2:48:08 PM12/30/09

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer:
> > > especially the claim that going by primes squares
> > > is "a pleasing but minor optimization",
> >
> > Which it is not. It is a major optimisation. It reduces the algorithmic
> > complexity *and* reduces the constant factors significantly.
>
> D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper.
> It doesn't change the complexity, and the constant factors are reduced
> far less than I thought.

I do not understand. Turner's sieve is

primes = sieve [2..]
where
sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]

and the Postponed Filters is

primes = 2: 3: sieve (tail primes) [5,7..]
where

sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
where (h,~(_:t)) = span (< p*p) xs

Are you saying they both exhibit same complexity? I was under impression that
the first shows O(n^2) approx., and the second one O(n^1.5) (for n primes
produced).

Daniel Fischer

unread,

Dec 30, 2009, 7:30:56 PM12/30/09

to haskel...@haskell.org, Will Ness

Am Mittwoch 30 Dezember 2009 20:46:57 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer:
> > > > especially the claim that going by primes squares
> > > > is "a pleasing but minor optimization",
> > >
> > > Which it is not. It is a major optimisation. It reduces the algorithmic
> > > complexity *and* reduces the constant factors significantly.
> >
> > D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper.
> > It doesn't change the complexity, and the constant factors are reduced
> > far less than I thought.
>
> I do not understand. Turner's sieve is
>
> primes = sieve [2..]
> where
> sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
>
> and the Postponed Filters is
>
> primes = 2: 3: sieve (tail primes) [5,7..]
> where
> sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
> where (h,~(_:t)) = span (< p*p) xs
>
> Are you saying they both exhibit same complexity?

No. They don't.
But if you're looking at an imperative (mutable array) sieve (that's simpler to analyse
because you don't have to take the book-keeping costs of your priority queue, heap or
whatever into account), if you start crossing out the multiples of p with 2*p, you have

sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]

crossings-out, that is Theta(bound*log (log bound)). If you eliminate multiples of some
small primes a priori (wheel), you can reduce the constant factor significantly, but the
complexity remains the same (you drop a few terms from the front of the sum and multiply
the remaining terms with phi(n)/n, where n is the product of the excluded primes).

If you start crossing out at p^2, the number is

sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].

The difference is basically sum (takeWhile (<= sqrt bound) primes), which I stupidly - I
don't remember how - believed to cancel out the main term.
It doesn't, it's O(bound/log bound), so the complexity is the same.

Now if you take a stream of numbers from which you remove composites, having a priority
queue of multiples of primes, things are a little different.
If you start crossing out at 2*p, when you are looking at n, you have more multiples in
your PQ than if you start crossing out at p^2 (about pi(n/2) vs. pi(sqrt n)), so updating
the PQ will be more expensive. But updating the PQ is O(log size), I believe, and log
pi(n) is O(log pi(sqrt n)), so I think it shouldn't change the complexity here either. I
think this would have complexity O(bound*log bound*log (log bound)).

> I was under impression that the first shows O(n^2) approx., and the second one O(n^1.5)
(for n
> primes produced).

In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right,
it is a different algorithm. In the above, we match each prime only with its multiples
(plus the next composite in the PQ version). In Turner's algorithm, we match each prime
with all smaller primes (and each composite with all primes <= its smallest prime factor,
but that's less work than the primes). That gives indeed a complexity of O(pi(bound)^2).
In the postponed filters, we match each prime p with all primes <= sqrt p (again, the
composites contribute less). I think that gives a complexity of
O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.

Steve

unread,

Dec 30, 2009, 10:55:45 PM12/30/09

to haskel...@haskell.org

On Wed, 2009-12-30 at 11:09 -0500, haskell-ca...@haskell.org
wrote:

> Am Mittwoch 30 Dezember 2009 01:23:32 schrieb Will Ness:
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:

...

> No, it's my own code. Nothing elaborate, just sieving numbers 6k1,

> twice as fast as the
> haskellwiki code (here) and uses only 1/3 the memory. For the record:
>

The speed difference varies with the value of 'bound' - the difference
between counting directly on the array and creating a list from the
array increases as 'bound' increases. I tested it for 10^8 and 10^9
(using Int not Integer) and using the array without creating a list is
50+% faster.
It looks like a very fast method for creating the 'set' of primes up to
bound.

Steve

Will Ness

unread,

Jan 2, 2010, 8:14:16 AM1/2/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Mittwoch 30 Dezember 2009 20:46:57 schrieb Will Ness:

> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:

> > > Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer:

> > > > > especially the claim that going by primes squares
> > > > > is "a pleasing but minor optimization",
> > > >
> > > > Which it is not. It is a major optimisation. It reduces the algorithmic

> > > > complexity *and* reduces the constant factors significantly.
> > >
> > > D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper.
> > > It doesn't change the complexity, and the constant factors are reduced
> > > far less than I thought.
> >
> > I do not understand. Turner's sieve is
> >
> > primes = sieve [2..]
> > where
> > sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
> >
> > and the Postponed Filters is
> >

> > primes = 2: 3: sieve (tail primes) [5,7..]
> > where

> > sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
> > where (h,~(_:t)) = span (< p*p) xs
> >
> > Are you saying they both exhibit same complexity?
>
> No. They don't.
> But if you're looking at an imperative (mutable array) sieve (that's simpler
> to analyse because you don't have to take the book-keeping costs of your
> priority queue, heap or whatever into account), if you start crossing out

The key question then, is _*WHEN*_ and not _*WHAT*_. As is clearly demonstrated
by the case of Turner/Postponed filters, the work that is done (of crossing
numbers off) is the same - _when_ it is actually done - but Turner's starts out
so _prematurely_ that it is busy doing nothing most of the time. Thus its
function call overhead costs pile up enormously, overstaging the actual
calculation.

So analyzing that calculation in the premature execution setting is missing the
point, although helpful after we fix this, with the Postponed Filters. _Only
then_ the finer points of this algorithm's analysis can be applied - namely, of
avoiding testing primes divisibility altogether. And _if_ a fast cheap primality
test were to have existed, the filtering versions would win over, because they
progressively cull the input sequence so there would be no double hits as we
have when merging the multiples (whether from lists or inside the PQ).

> the multiples of p with 2*p, you have
>
> sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]
>
> crossings-out, that is Theta(bound*log (log bound)). If you eliminate
> multiples of some small primes a priori (wheel), you can reduce the constant
> factor significantly, but the complexity remains the same (you drop a few
> terms from the front of the sum and multiply the remaining terms with
> phi(n)/n, where n is the product of the excluded primes).
>
> If you start crossing out at p^2, the number is
>
> sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].
>
> The difference is basically sum (takeWhile (<= sqrt bound) primes), which I
> stupidly - I don't remember how - believed to cancel out the main term.
> It doesn't, it's O(bound/log bound), so the complexity is the same.
>
> Now if you take a stream of numbers from which you remove composites, having
> a priority queue of multiples of primes, things are a little different.
> If you start crossing out at 2*p, when you are looking at n, you have more
> multiples in your PQ than if you start crossing out at p^2 (about pi(n/2)
> vs. pi(sqrt n)), so updating the PQ will be more expensive. But updating the
> PQ is O(log size), I believe, and log pi(n) is O(log pi(sqrt n)), so I think
> it shouldn't change the complexity here either. I think this would have
> complexity O(bound*log bound*log (log bound)).

There are two questions here - where to start crossing off numbers, and when. If
you'd start at 2*p maybe the overall complexity would remain the same but it'll
add enormous overhead with all those duplicate multiples. No, the question is
not where to start, but when. PQ might hide the problem until the memory blows
up. Anything that we add that won't have any chance of contributing to the final
result, is added for nothing and only drives the total cost up needlessly.

>
> > I was under impression that the first shows O(n^2) approx., and the second
> > one O(n^1.5) (for n primes produced).
>
> In Turner/postponed filters, things are really different. Actually, Ms.
> O'Neill is right, it is a different algorithm. In the above, we match each

what _is_ different is divisibility testing vs composites removal, which follows
from her in-depth analysis although is never quite formulated in such words in
her article. But nothing matters until the premature starting up is eliminated,
and that key observation is missing for the article either - worse, it is
brushed off with the casual remark that it is "a pleasing but minor
optimization". Which remark, as you show here, is true in the imperative,
mutable-storage setting, but is made in an article abut functional code, in
regard to the functional code of Turner's sieve. So the key understanding is
overlooked.

Her article even adds the primes into the PQ prematurely itself, as soon as the
prime is discovered (she fixes that in her ZIP package). With the PQ keeping
these prematurely added elements deep inside its guts, the problem might not
even manifest itself immediately, but the memory blow-up would eventually hit
the wall (having the PQ contain all the preceding primes, instead of just those
below the square root of a limit - all the entries above the square root not
contributing to the calculation at all).

And what we're after here is the insight anyway. Skipping steps of natural
development does not contribute to garnering an insight. Most prominent problem
with Turner's /code/ _specification_, is the premature start ups, and
divisibility testing of primes (as Melissa O'Neill's analysis shows). Fix one,
and you get Postponed Filters (which should really be used as a basis reference
point for all the rest). Fix the other - and you've got the Euler's sieve:

primes = 2: 3: sieve (tail primes) [5,7..]
where

sieve (p:ps) xs = h ++ sieve ps [t `minus` tail [p*p, p*p+2*p..]]
where (h,~(_:t)) = span (< p*p) xs

Clear, succinct, and plenty efficient for an introductory textbook functional
lazy code, of the same order of magnitude performance as PQ code on odds only.

Now comparing the PQ performance against _that_ would only be fare, and IMO
would only add to its value - it is faster, has better asymptotics, and is
greatly amenable to the wheel optimization right away.

> prime only with its multiples (plus the next composite in the PQ version).
> In Turner's algorithm, we match each prime with all smaller primes (and each
> composite with all primes <= its smallest prime factor, but that's less work
> than the primes). That gives indeed a complexity of O(pi(bound)^2).
>
> In the postponed filters, we match each prime p with all primes <= sqrt p
> (again, the composites contribute less). I think that gives a complexity of
> O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes
> produced.

Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged multiples
removal around 1.25..1.17 .

Daniel Fischer

unread,

Jan 2, 2010, 11:10:57 PM1/2/10

to haskel...@haskell.org, Will Ness

Crossing off is only part of the work. Most of the work is checking whether to cross off
the number in this round. And Turner does a lot more of that than the postponed filters.

> - _when_ it is actually done - but
> Turner's starts out so _prematurely_ that it is busy doing nothing most of
> the time.

It's not doing nothing. It's just doing a lot of superfluous work. It is _achieveing_
nothing most of the time, though.
Take 7919, the thousandth prime. The postponed filters decide to keep it when fitering out
the multiples of 89, the twenty-fourth prime. Turner also divides it by all 975 primes in
between. That is a lot of real but futile work.

> Thus its function call overhead costs pile up enormously,
> overstaging the actual calculation.

It's not only the function calls. Division is expensive, too.

>
> So analyzing that calculation in the premature execution setting is missing
> the point, although helpful after we fix this, with the Postponed Filters.
> _Only then_ the finer points of this algorithm's analysis can be applied -
> namely, of avoiding testing primes divisibility altogether. And _if_ a fast
> cheap primality test were to have existed, the filtering versions would win

Sorry, I can't follow. What's the point of a primality test here? Every number whose
multiples we want to remove is prime, what's left to test?

> over, because they progressively cull the input sequence so there would be
> no double hits as we have when merging the multiples (whether from lists or
> inside the PQ).

But there's a lot of list constructuion and deconstruction necessary for the Euler sieve.
That may be more work than the multiple hits cause.

>
> > the multiples of p with 2*p, you have
> >
> > sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]
> >
> > crossings-out, that is Theta(bound*log (log bound)). If you eliminate
> > multiples of some small primes a priori (wheel), you can reduce the
> > constant factor significantly, but the complexity remains the same (you
> > drop a few terms from the front of the sum and multiply the remaining
> > terms with phi(n)/n, where n is the product of the excluded primes).
> >
> > If you start crossing out at p^2, the number is
> >
> > sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].
> >
> > The difference is basically sum (takeWhile (<= sqrt bound) primes), which
> > I stupidly - I don't remember how - believed to cancel out the main term.
> > It doesn't, it's O(bound/log bound), so the complexity is the same.
> >
> > Now if you take a stream of numbers from which you remove composites,
> > having a priority queue of multiples of primes, things are a little
> > different. If you start crossing out at 2*p, when you are looking at n,
> > you have more multiples in your PQ than if you start crossing out at p^2
> > (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more
> > expensive. But updating the PQ is O(log size), I believe, and log pi(n)
> > is O(log pi(sqrt n)), so I think it shouldn't change the complexity here
> > either. I think this would have complexity O(bound*log bound*log (log
> > bound)).
>
> There are two questions here - where to start crossing off numbers, and
> when. If you'd start at 2*p maybe the overall complexity would remain the
> same but it'll add enormous overhead with all those duplicate multiples.

The additional duplicate multiples aren't the problem. Sure, the numbers having a prime
divisor larger than the square root would be crossed off one additional time, but that
isn't so much per se. The additional crossings off are O(bound), so they don't harm the
time complexity. But they can have effects which multiply the running time by a large
constant.

> No, the question is not where to start, but when. PQ might hide the problem
> until the memory blows up. Anything that we add that won't have any chance
> of contributing to the final result, is added for nothing and only drives
> the total cost up needlessly.

Well, if you start crossing off at p^2 and feed your PQ from a separate prime generator,
the memory blow-up is far away. E.g., when the main sieve is at 10^12, the PQ contains
less than 80000 multiples. No space problem in sight. At 10^14, the PQ's size is still
less than 700000. That's noticeable, but there's still plenty of space.
If, on the other hand, you start crossung off at 2*p, when the main sieve is at 10^7, the
size of the PQ is > 650000, at 10^8, the size is more than 5.5 million. That starts to
become a memory problem rather soon.

>
> > > I was under impression that the first shows O(n^2) approx., and the
> > > second one O(n^1.5) (for n primes produced).
> >
> > In Turner/postponed filters, things are really different. Actually, Ms.
> > O'Neill is right, it is a different algorithm. In the above, we match
> > each
>
> what _is_ different is divisibility testing vs composites removal, which
> follows from her in-depth analysis although is never quite formulated in
> such words in her article. But nothing matters until the premature starting
> up is eliminated, and that key observation is missing for the article
> either - worse, it is brushed off with the casual remark that it is "a
> pleasing but minor optimization". Which remark, as you show here, is true
> in the imperative, mutable-storage setting, but is made in an article abut
> functional code, in regard to the functional code of Turner's sieve.

I think that remark was meant to apply to composite removal, not Turner's sieve.
But even then, the 'minor' isn't adequate considering her PQ approach where, although it
doesn't change time complexity (in theory), it changes space complexity - and that may
affect running time drastically. Probably she was thinking of the typical array sieve when
she made that remark.

I should really stop doing those calculations in my head, it seems I'm getting too old for
that. Doing it properly, on paper, the cost for identifying p_k is pi(sqrt p_k) [trial
division by all primes less than sqrt p_k].
p_k is approximately k*log k, so pi(sqrt p_k) is approximately
(sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).
Summing that, the cost for identifying the first n primes is Theta(n^1.5/log n).
Using a feeder, i.e. primes = 2:3.sieve feederprimes [5, 7 .. ], where feederprimes are
generated as above, to reduce memory pressure and GC impact, that fits pretty well with my
measurements of the time to find p_k for several k between 500,000 and 20,000,000.

Will Ness

unread,

Jan 3, 2010, 3:55:42 AM1/3/10

to haskel...@haskell.org

Exactly the point I tried to make. :)

> > - _when_ it is actually done - but
> > Turner's starts out so _prematurely_ that it is busy doing nothing most of
> > the time.
>
> It's not doing nothing. It's just doing a lot of superfluous work. It is
> _achieveing_ nothing most of the time, though.

again, yes. :)

> Take 7919, the thousandth prime. The postponed filters decide to keep it when
> fitering out the multiples of 89, the twenty-fourth prime. Turner also
> divides it by all 975 primes in between. That is a lot of real but futile
> work.

yes.

> > Thus its function call overhead costs pile up enormously,
> > overstaging the actual calculation.
>
> It's not only the function calls. Division is expensive, too.

yes, that's what I meant - the cost of calling all the fuctions that - we know
in advance will - have nothing to do eventually.

> > So analyzing that calculation in the premature execution setting is missing
> > the point, although helpful after we fix this, with the Postponed Filters.
> > _Only then_ the finer points of this algorithm's analysis can be applied -
> > namely, of avoiding testing primes divisibility altogether. And _if_ a fast
> > cheap primality test were to have existed, the filtering versions would win
>
> Sorry, I can't follow. What's the point of a primality test here? Every
> number whose multiples we want to remove is prime, what's left to test?

sorry for being obstruse. I meant in a filtering sieve (i.e. postponed filters)
vs the multiples removing sieves, if only the cheap primality test have existed
(by some _magic_) which _could_ be run on _every_ numer (unlike the costly
divisiility test), than the filtering sieves would win. Hypothetically.

> > over, because they progressively cull the input sequence so there would be
> > no double hits as we have when merging the multiples (whether from lists or
> > inside the PQ).
>
> But there's a lot of list constructuion and deconstruction necessary for the
> Euler sieve.

yes. Unless, of course, s smart compiler recognizes there's only one consumer
for the values each multiples-list is producing, and keeps it stripped down to
a generator function, and its current value. I keep thinkig a smart compiler
could eliminate all these "span" calls and replace them with just some pointers
manipulating...

> That may be more work than the multiple hits cause.

so that too would make filters win; only _if_ the cheap primality test
existed!..

> >
> > > the multiples of p with 2*p, you have

> > > ...

> >
> > There are two questions here - where to start crossing off numbers, and
> > when. If you'd start at 2*p maybe the overall complexity would remain the
> > same but it'll add enormous overhead with all those duplicate multiples.
>
> The additional duplicate multiples aren't the problem. Sure, the numbers
> having a prime divisor larger than the square root would be crossed off one
> additional time, but that isn't so much per se. The additional crossings off
> are O(bound), so they don't harm the time complexity. But they can have
> effects which multiply the running time by a large constant.

yes, exactly what I wanted to say. :)

> > No, the question is not where to start, but when. PQ might hide the problem
> > until the memory blows up. Anything that we add that won't have any chance
> > of contributing to the final result, is added for nothing and only drives
> > the total cost up needlessly.
>
> Well, if you start crossing off at p^2 and feed your PQ from a separate prime
> generator, the memory blow-up is far away. E.g., when the main sieve is at
> 10^12, the PQ contains less than 80000 multiples. No space problem in sight.
> At 10^14, the PQ's size is still less than 700000. That's noticeable, but
> there's still plenty of space.

again, what I mean is, not _where_ I start crossing them off in a PQ, but
_when_. The article's code starts crossing them off _at_ p^2 - by adding p^2+2p
into the PQ - _as_ _soon_ as p itself is reached. It won't surface until p^2
will be considered for a prime; it'll lay dormant deep inside the queue's guts.
When reaching 7919, the thousand (i.e. pi(7919) ) entries will hang out inside
the PQ - instead of just 24. A memory blowup. (this is of course fixed in
Melissa's ZIP package). Of course due to the nature of PQ it might actually not
hurt the performance for a while, depending on partcular PQ implementation.
Complexity _and_ constant factors.

> If, on the other hand, you start crossung off at 2*p, when the main sieve is
> at 10^7, the size of the PQ is > 650000, at 10^8, the size is more than 5.5
> million. That starts to become a memory problem rather soon.

here you don't have a choice or when to add it - you have to add it at p
itself - so the problem is clear. But even when you cross at p^2, the question
remains, of when you add the p^2 entry into the PQ. That was my point.

Postponed Filters code makes this clear, and thus hard to err on.
Unfortunately, it wasn't present the article.

> >
> > > > I was under impression that the first shows O(n^2) approx., and the
> > > > second one O(n^1.5) (for n primes produced).
> > >
> > > In Turner/postponed filters, things are really different. Actually, Ms.
> > > O'Neill is right, it is a different algorithm. In the above, we match
> > > each
> >
> > what _is_ different is divisibility testing vs composites removal, which
> > follows from her in-depth analysis although is never quite formulated in
> > such words in her article. But nothing matters until the premature starting
> > up is eliminated, and that key observation is missing for the article
> > either - worse, it is brushed off with the casual remark that it is "a
> > pleasing but minor optimization". Which remark, as you show here, is true
> > in the imperative, mutable-storage setting, but is made in an article abut
> > functional code, in regard to the functional code of Turner's sieve.
>
> I think that remark was meant to apply to composite removal, not Turner's
> sieve.

It is right there on page 2, right when the Turner's sieve is presented and
discussed. The only explanation that I see is that she thought of it in regards
to the imperative code, just as her analysis concentrates only on calculation
aspects of the imperative code itself.

> But even then, the 'minor' isn't adequate considering her PQ approach where,
> although it doesn't change time complexity (in theory), it changes space
> complexity - and that may affect running time drastically. Probably she was
> thinking of the typical array sieve when she made that remark.

In short, that remark was not referring to what it seemingly refers to, in the
article, and as such was very confusing. It was a big STOP sign on the way to
Postponed Filters - Euler's - Bird's merged multiples - tree-merging (with
wheel) road of little steps, and used as a justification for her to make a big
leap across the chasm towards the PQ code.

So to speak. :)

> > So the key understanding is overlooked.
> >

> > And what we're after here is the insight anyway. Skipping steps of natural
> > development does not contribute to garnering an insight.

. and that is of course a matter of personal preference. A genius is perfectly
capable of making big leaps across chasms. Heck they might even be able to
fly :)

> > > In the postponed filters, we match each prime p with all primes <= sqrt p
> > > (again, the composites contribute less). I think that gives a complexity
> > > of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes
> > > produced.
> >
> > Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged
> > multiples removal around 1.25..1.17 .
>
> I should really stop doing those calculations in my head, it seems I'm
> getting too old for that. Doing it properly, on paper, the cost for
> identifying p_k is pi(sqrt p_k) [trial division by all primes less than
> sqrt p_k].
> p_k is approximately k*log k, so pi(sqrt p_k) is approximately
> (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).
> Summing that, the cost for identifying the first n primes is
> Theta(n^1.5/log n).

that would correspond to what I've seen much better.

> Using a feeder, i.e. primes = 2:3.sieve feederprimes [5, 7 .. ], where
> feederprimes are generated as above, to reduce memory pressure and GC
> impact, that fits pretty well with my measurements of the time to find
> p_k for several k between 500,000 and 20,000,000.

The quesion of a memory blowup with the treefolding merge still remains. For
some reason using its second copy for a feeder doesn't reduce the memory (as
reported by standalone compiled program, GHCi reported values are useless) - it
causes it to increase twice.....

Will Ness

unread,

Jan 3, 2010, 7:17:29 AM1/3/10

to haskel...@haskell.org

Will Ness <will_n48 <at> yahoo.com> writes:

>
> ... It was a big STOP sign on the way to

> Postponed Filters - Euler's - Bird's merged multiples - tree-merging (with
> wheel) road of little steps, and used as a justification for her to make a
> big leap across the chasm towards the PQ code.

correction: "across the /supposed/ chasm".

There is no chasm. There is a nice straight freeway, with rest stops and
gas stations, and exits to local roads going across the county in every
which way. :)

> .. and that is of course a matter of personal preference. A genius is

> perfectly capable of making big leaps across chasms. Heck they might even
> be able to fly :)

_______________________________________________

Daniel Fischer

unread,

Jan 3, 2010, 7:31:23 PM1/3/10

to haskel...@haskell.org, Will Ness

Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness:
>
> Exactly the point I tried to make. :)
>

>
> again, yes. :)
>

>
> yes.
>

>
> yes, that's what I meant - the cost of calling all the fuctions that - we
> know in advance will - have nothing to do eventually.
>

8-)

> >
> > But there's a lot of list constructuion and deconstruction necessary for
> > the Euler sieve.
>
> yes. Unless, of course, s smart compiler recognizes there's only one
> consumer for the values each multiples-list is producing, and keeps it
> stripped down to a generator function, and its current value. I keep
> thinkig a smart compiler could eliminate all these "span" calls and replace
> them with just some pointers manipulating...
>

Of course I'm no smart compiler, but I don't see how it could be even possible to replace
the span calls with pointer manipulation when dealing with lazily generated (infinite, if
we're really mean) lists. Even when you're dealing only with strict finite lists, it's not
trivial to do efficiently.

> > That may be more work than the multiple hits cause.
>
> so that too would make filters win; only _if_ the cheap primality test
> existed!..
>
> > > > the multiples of p with 2*p, you have
> > > > ...
> > >
> > > There are two questions here - where to start crossing off numbers, and
> > > when. If you'd start at 2*p maybe the overall complexity would remain
> > > the same but it'll add enormous overhead with all those duplicate
> > > multiples.
> >
> > The additional duplicate multiples aren't the problem. Sure, the numbers
> > having a prime divisor larger than the square root would be crossed off
> > one additional time, but that isn't so much per se. The additional
> > crossings off are O(bound), so they don't harm the time complexity. But
> > they can have effects which multiply the running time by a large
> > constant.
>
> yes, exactly what I wanted to say. :)
>

*g* I wondered what you meant by the distinction between where and when to cross off.

> > > No, the question is not where to start, but when. PQ might hide the
> > > problem until the memory blows up. Anything that we add that won't have
> > > any chance of contributing to the final result, is added for nothing
> > > and only drives the total cost up needlessly.
> >
> > Well, if you start crossing off at p^2 and feed your PQ from a separate
> > prime generator, the memory blow-up is far away. E.g., when the main
> > sieve is at 10^12, the PQ contains less than 80000 multiples. No space
> > problem in sight. At 10^14, the PQ's size is still less than 700000.
> > That's noticeable, but there's still plenty of space.
>
> again, what I mean is, not _where_ I start crossing them off in a PQ, but
> _when_. The article's code starts crossing them off _at_ p^2 - by adding
> p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface
> until p^2 will be considered for a prime; it'll lay dormant deep inside the
> queue's guts. When reaching 7919, the thousand (i.e. pi(7919) ) entries
> will hang out inside the PQ - instead of just 24. A memory blowup. (this is
> of course fixed in Melissa's ZIP package). Of course due to the nature of
> PQ it might actually not hurt the performance for a while, depending on
> partcular PQ implementation. Complexity _and_ constant factors.
>

It will start to have an impact pretty soon. Assuming at least one of the relevant PQ
operations to be Theta(log size), each composite between ~400 and ~400000 (rough
estimates) will take something like twice as long to handle. It will start to hurt really
badly only a while later, though, as a guesstimate, with more than a million primes in the
PQ, memory will have a hard time.

To be fair, she writes:

"Let us ﬁrst describe the original “by hand” sieve algorithm as practiced by
Eratosthenes.
..
The starting point of p^2 is a pleasing but minor optimization, which can be made
because lower multiples will have already been crossed oﬀ when we found the primes
prior to p.
.. (This optimization does not aﬀect the
time complexity of the sieve, however, so its absence from the code in Section 1 is
not our cause for worry.)"

So it's in context of the imperative code (although rather numbers on paper than bits in
RAM).
For imperative (be it array or paper), it is indeed minor; for her PQ sieve, it can be
considered minor from a theoretical point of view (doesn't change time complexity), but
practically, anything that changes performance by a factor of two or more is only 'minor'
for hopelessly inadequate algorithms (whether a computation would take ten or twenty
billion years is indeed a minor difference; one hour or two is a major difference).
However, as you say, the important point is not whether p's multiples get crossed off
starting from 2*p or from p^2, but whether p's multiples get inserted into the queue when
you look at p or at p^2.

> > But even then, the 'minor' isn't adequate considering her PQ approach
> > where, although it doesn't change time complexity (in theory), it changes
> > space complexity - and that may affect running time drastically. Probably
> > she was thinking of the typical array sieve when she made that remark.
>
> In short, that remark was not referring to what it seemingly refers to, in
> the article, and as such was very confusing. It was a big STOP sign on the
> way to Postponed Filters - Euler's - Bird's merged multiples - tree-merging
> (with wheel) road of little steps, and used as a justification for her to
> make a big leap across the chasm towards the PQ code.
>
> So to speak. :)
>
> > > So the key understanding is overlooked.
> > >
> > > And what we're after here is the insight anyway. Skipping steps of
> > > natural development does not contribute to garnering an insight.
>

> .. and that is of course a matter of personal preference. A genius is

> perfectly capable of making big leaps across chasms. Heck they might even
> be able to fly :)
>
> > > > In the postponed filters, we match each prime p with all primes <=
> > > > sqrt p (again, the composites contribute less). I think that gives a
> > > > complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log
> > > > n)) for n primes produced.
> > >
> > > Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged
> > > multiples removal around 1.25..1.17 .
> >
> > I should really stop doing those calculations in my head, it seems I'm
> > getting too old for that. Doing it properly, on paper, the cost for
> > identifying p_k is pi(sqrt p_k) [trial division by all primes less than
> > sqrt p_k].
> > p_k is approximately k*log k, so pi(sqrt p_k) is approximately
> > (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).
> > Summing that, the cost for identifying the first n primes is
> > Theta(n^1.5/log n).
>
> that would correspond to what I've seen much better.
>

It should. Though empirical measurements don't always adhere exactly to the theoretical
results, the deviation is usually small (for large enough input).

> > Using a feeder, i.e. primes = 2:3.sieve feederprimes [5, 7 .. ], where
> > feederprimes are generated as above, to reduce memory pressure and GC
> > impact, that fits pretty well with my measurements of the time to find
> > p_k for several k between 500,000 and 20,000,000.
>
> The quesion of a memory blowup with the treefolding merge still remains.
> For some reason using its second copy for a feeder doesn't reduce the
> memory (as reported by standalone compiled program, GHCi reported values
> are useless) - it causes it to increase twice.....
>

I have a partial solution. The big problem is that the feeder holds on to the beginning of
comps while the main runner holds on to a later part. Thus the entire segment between
these two points must be in memory.
So have two lists of composites (yeah, you know that, but it didn't work so far).
But you have to force the compiler not to share them: enter -fno-cse.
The attached code does that (I've also expanded the wheel), it reduces the memory
requirements much (a small part is due to the larger wheel, a factor of ~5 due to the non-
sharing).
It still uses much more memory than the PQ, and while the PQ's memory requirements grow
very slowly, the tree-fold merge's still grow rather fast (don't go much beyond the
10,000,000th prime), I'm not sure why.

V13Primes.hs

Will Ness

unread,

Jan 4, 2010, 7:26:33 AM1/4/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness:
>

> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> >

> > > But there's a lot of list constructuion and deconstruction necessary for
> > > the Euler sieve.
> >
> > yes. Unless, of course, s smart compiler recognizes there's only one
> > consumer for the values each multiples-list is producing, and keeps it
> > stripped down to a generator function, and its current value. I keep
> > thinkig a smart compiler could eliminate all these "span" calls and replace
> > them with just some pointers manipulating...
> >
>
> Of course I'm no smart compiler, but I don't see how it could be even
> possible to replace the span calls with pointer manipulation when dealing
> with lazily generated (infinite, if we're really mean) lists. Even when
> you're dealing only with strict finite lists, it's not trivial to do
> efficiently.

I keep thinking that storage duplication with span, filter etc. is not really
necessary, and can be replaced with some pointer chasing - especially when
there's only one consumer present for the generated values.

What I mean is thinking of lists in terms of produce/consumer paradigm, as
objects supporting the { pull, peek } interface, keeping the generator inside
that would produce the next value on 'pull' request and keep it ready for
any 'peek's.

Euler's sieve is

sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
where (h,t) = span (< p*p) xs

Everything lives only through access, so (sieve (tail primes) [5,7]) would
create an object with the generator which has the 'span' logic inlined:

sieve ps xs = make producer such that
p := pull ps -- alter ps as well (actually pull a value from it)
q := p*p
peek = x := peek xs
if x < q then x else peek (remake self)
pull = x := peek xs
if x < q then pull xs else pull (remake self)
remake = ys := minus xs (intsFromBy q (2*p))
self := sieve ps ys

Here the only thing that gets created are the 'minus' nodes which essentially
maintain pointers into the two streams that they consume. 'intsFromBy' only has
to maintain two integers inside it (currentVal and step) as there's no need for
it to maintain any storage for its results, as they are immediately consumed. A
persistent list would be represented by a different kind of producer which
would be given a storage to operate on, upon creation (as would the top level
variable like 'primes').

The real difference here is between those producers whose values will only be
consumed once, by one specific consumer, and those which values may be needed
more than once, so need really to be maintained in some storage. If not - span,
filter, map, whatever - they all are just little modifiers on top of the real
producers, which may or may not also have an actual storage maintained by them.

> > again, what I mean is, not _where_ I start crossing them off in a PQ, but
> > _when_. The article's code starts crossing them off _at_ p^2 - by adding
> > p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface
> > until p^2 will be considered for a prime; it'll lay dormant deep inside the
> > queue's guts. When reaching 7919, the thousand (i.e. pi(7919) ) entries
> > will hang out inside the PQ - instead of just 24. A memory blowup. (this is
> > of course fixed in Melissa's ZIP package). Of course due to the nature of
> > PQ it might actually not hurt the performance for a while, depending on
> > partcular PQ implementation. Complexity _and_ constant factors.
> >
>
> It will start to have an impact pretty soon. Assuming at least one of the
> relevant PQ operations to be Theta(log size), each composite between ~400
> and ~400000 (rough estimates) will take something like twice as long to
> handle. It will start to hurt really badly only a while later, though, as
> a guesstimate, with more than a million primes in the PQ, memory will have
> a hard time.

Exactly!

> > > If, on the other hand, you start crossung off at 2*p, when the main sieve
> > > is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more
> > > than 5.5 million. That starts to become a memory problem rather soon.
> >
> > here you don't have a choice or when to add it - you have to add it at p
> > itself - so the problem is clear. But even when you cross at p^2, the
> > question remains, of when you add the p^2 entry into the PQ. That was my
> > point.
> >
> > Postponed Filters code makes this clear, and thus hard to err on.

> > Unfortunately, it wasn't present _in_ the article.

> > > I think that remark was meant to apply to composite removal, not Turner's
> > > sieve.
> >
> > It is right there on page 2, right when the Turner's sieve is presented and
> > discussed. The only explanation that I see is that she thought of it in
> > regards to the imperative code, just as her analysis concentrates only on
> > calculation aspects of the imperative code itself.
> >
>
> To be fair, she writes:
>
> "Let us ﬁrst describe the original “by hand” sieve algorithm as practiced by
> Eratosthenes.

> ...

> The starting point of p^2 is a pleasing but minor optimization, which can
> be made
> because lower multiples will have already been crossed oﬀ when we found
> the primes prior to p.

> ... (This optimization does not aﬀect the time complexity of the sieve,

> however, so its absence from the code in Section 1

> *is not our cause for worry*.)"

A-HA!

But its absense from _*that*_ _*code*_ WAS the *major* cause for worry, as
dealing with it worked wonders on its complexity and constant factors.

This remark only makes sense in the imperative, mutable-storage setting, as
we've already estalished.

> So it's in context of the imperative code (although rather numbers on paper
> than bits in RAM).
> For imperative (be it array or paper), it is indeed minor; for her PQ sieve,
> it can be considered minor from a theoretical point of view (doesn't change
> time complexity), but practically, anything that changes performance by a
> factor of two or more is only 'minor' for hopelessly inadequate algorithms
> (whether a computation would take ten or twenty billion years is indeed a
> minor difference; one hour or two is a major difference).

++

> However, as you say, the important point is not whether p's multiples get
> crossed off starting from 2*p or from p^2, but whether p's multiples get

> inserted into the queue *when* you look at p or at p^2.

Exactly! Not _where_, but _when_ (in a lifetime of a computational process).

And this (I would say, important) observation was missing from the article
precisely because of her focusing on the analysis of the _imperative_ algorithm.

But if we focus on functional, it is a logical thing to realise, and when
implemented shows us the road to the next improvement by a gradual stepwise
development. A matter of personal preference. :)

> > > ... the cost for

> > > identifying p_k is pi(sqrt p_k) [trial division by all primes less than
> > > sqrt p_k].
> > > p_k is approximately k*log k, so pi(sqrt p_k) is approximately
> > > (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).
> > > Summing that, the cost for identifying the first n primes is
> > > Theta(n^1.5/log n).
> >
> > that would correspond to what I've seen much better.
> >
>
> It should.

:)

> > The quesion of a memory blowup with the treefolding merge still remains.
> > For some reason using its second copy for a feeder doesn't reduce the
> > memory (as reported by standalone compiled program, GHCi reported values
> > are useless) - it causes it to increase twice.....
> >
>
> I have a partial solution. The big problem is that the feeder holds on to
> the beginning of comps while the main runner holds on to a later part.
> Thus the entire segment between these two points must be in memory.
> So have two lists of composites (yeah, you know that, but it didn't
> work so far).

I did. I duplicated everything. The reported memory was twice bigger. :|

> But you have to force the compiler not to share them: enter -fno-cse.
> The attached code does that (I've also expanded the wheel), it reduces the
> memory requirements much (a small part is due to the larger wheel, a factor
> of ~5 due to the non-sharing).
> It still uses much more memory than the PQ, and while the PQ's memory
> requirements grow very slowly, the tree-fold merge's still grow rather
> fast (don't go much beyond the 10,000,000th prime), I'm not sure why.
>

Great! I will look into the code. Thanks! :) :)

>
>
>
> Attachment (V13Primes.hs): text/x-haskell, 3621 bytes

Heinrich Apfelmus

unread,

Jan 4, 2010, 8:30:56 AM1/4/10

to haskel...@haskell.org

Will Ness wrote:
>
> I keep thinking that storage duplication with span, filter etc. is not really
> necessary, and can be replaced with some pointer chasing - especially when
> there's only one consumer present for the generated values.
>
> What I mean is thinking of lists in terms of produce/consumer paradigm, as
> objects supporting the { pull, peek } interface, keeping the generator inside
> that would produce the next value on 'pull' request and keep it ready for
> any 'peek's.
>
> Euler's sieve is
>
> sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
> where (h,t) = span (< p*p) xs
>

> [...]

>
> The real difference here is between those producers whose values will only be
> consumed once, by one specific consumer, and those which values may be needed
> more than once, so need really to be maintained in some storage. If not - span,
> filter, map, whatever - they all are just little modifiers on top of the real
> producers, which may or may not also have an actual storage maintained by them.

(I haven't followed the whole thread, but hopefully I have enough grasp
of it to make a useful remark. :))

Concerning lists as producer/consumer, I think that's exactly what lazy
evaluation is doing. Neither filter , map or span evaluate and
store more list elements that strictly necessary.

Sure, creating a list head only to immediately consume it is somewhat
inefficient -- and the target of stream fusion[1] -- but this is an
overhead of how list elements are stored, not how many.

You can try to implement the Euler sieve with producers by using a type like

data Producer a = forall s. Producer {
state :: !s, next :: s -> s, value :: s -> a }

but I think this will be quite difficult; it's not clear what and thus
how big the state will be. (See [1] for choosing a good type.)

Concerning the sieves, there is a fundamental difference between the
imperative sieve and the functional sieves, regardless of whether the
latter start at p or p^2 or use a priority queue. Namely, the imperative
sieve makes essential use of *pointer arithmetic*. The key point is that
in order to cross off the multiples

p, 2*p, 3*p, ...

of a prime, the algorithm can directly jump from the (k*p)-th to the
(k*p+p)-th array element by adding p to the index. The functional
versions can never beat that because they can't just jump over p
constructors of a data structure in O(1) time.

[1]: http://www.cse.unsw.edu.au/~dons/papers/CLS07.html

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

Will Ness

unread,

Jan 4, 2010, 10:31:16 AM1/4/10

to haskel...@haskell.org

I laways suspected as much, but was once told that Chris Okasaki has shown that
any filter etc must allocate its own storage. With the peek/pull they don't
have to, if they are nested, and the deepest one from the real storage gets
pulled through some pointer chasing eventually. Span isn't so easily compiled
out too or is it? But that might be a minor point.

For me, a real smart compiler is one that would take in e.g. (sum $ take n $
cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs
maybe (just an aside).

Such a smart compiler might even be able to derive a well performing code right
from the Turner's sieve. :)

> Sure, creating a list head only to immediately consume it is somewhat
> inefficient -- and the target of stream fusion[1] -- but this is an
> overhead of how list elements are stored, not how many.

it might be equivalent to the (imagined) producer's storing its 'current' value
inside its frame.

How much can we rely on the run-time to actually destroy all the passed-over
elements and not hang on to them for some time? Is this that compiler switch
that Daniel mentioned? Is it reliable?

>
> You can try to implement the Euler sieve with producers by using a type like
>
> data Producer a = forall s. Producer {
> state :: !s, next :: s -> s, value :: s -> a }
>
> but I think this will be quite difficult; it's not clear what and thus
> how big the state will be. (See [1] for choosing a good type.)

I did that once in Scheme, as per SICP, with 'next' hanging in a stream's tail.
Put filters and maps on top of that (inside the new 'next' actually). But that
used the Scheme's lists as sorage. Another one was actual producers/modifiers
with {pull,peek} interface. It even produced some primes, and some Hamming
numbers. Then I saw Haskell, and thought I'd get all that for free with its
equational reasoning.

But I get the impression that GHC isn't working through equational reasoning?..
I see all this talk about thunks etc.

> Concerning the sieves, there is a fundamental difference between the
> imperative sieve and the functional sieves, regardless of whether the
> latter start at p or p^2 or use a priority queue. Namely, the imperative
> sieve makes essential use of *pointer arithmetic*. The key point is that
> in order to cross off the multiples
>
> p, 2*p, 3*p, ...
>
> of a prime, the algorithm can directly jump from the (k*p)-th to the
> (k*p+p)-th array element by adding p to the index. The functional
> versions can never beat that because they can't just jump over p
> constructors of a data structure in O(1) time.

We can directy jump to the next multiple too, it is called (+). :) :)

But seriously, the real issue is that we have to merge the produced streams of
multiples, while the mutable-storage code works on same one, so its "merging
cost" is zero. And even if we are smart to merge them in a tree-like fashion,
we still have no (or little) control over the compiler's representation of
lists and retention of their content and whether it performs stream fusion or
not (if we use lists).

If you could take a look at the tree-merging primes and why it uses too much
memory, it would be great. The code is in Daniel's post to which I replied, or
on haselwiki Prime_numbers page (there in its rudimentary form). It's a tangent
to your VIP code, where instead of People structure an ordered list is just
maintained as a split pair, of its known (so far, finite) prefix and the rest
of it. Then under merging these split pairs form a monoid, s can be rearranged
in a tree. If you have'nt seen it yet, it uses a different folding structure to
your code, with a lower total cost of multiples production (estimated as Sum
(1/p)*depth):

tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` pairwise f xs
comps = tfold $ pairwise mergeSP multips

But aside from the memory problem (about 50M vs Melissa's 2M), for the first
few million primes produced it has almost exactly the same asymptotics as her
PQ code and runs on par with it, compiled (and 2.5x faster when interpreted, in
GHCi). It is also clear and concise. :) :)

I think I'll have to try out your code (amended with a new folding structure)
and see if it has less memory problems maybe. I assume it is your code on the
haskellwiki page. (?)

Cheers,

Emil Axelsson

unread,

Jan 4, 2010, 11:42:17 AM1/4/10

to Will Ness, haskel...@haskell.org

> For me, a real smart compiler is one that would take in e.g. (sum $ take n $
> cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs
> maybe (just an aside).

(Also an aside, I couldn't resist...)

Then I'm sure you'd say that Feldspar [1] has a smart compiler :)

The above expression written in Feldspar and the resulting C code can be
found here:

http://hpaste.org/fastcgi/hpaste.fcgi/view?id=15592#a15593

The C code is somewhat complicated by the fact that Feldspar doesn't
have infinite vectors.

Feldspar usually works well on small examples like this one, but we're
very much lacking bigger examples, so I can't advise you to use it for
prime numbers just yet.

/ Emil

[1] http://feldspar.sourceforge.net/

Daniel Fischer

unread,

Jan 4, 2010, 12:04:35 PM1/4/10

to haskel...@haskell.org, Will Ness

Not quite. That's a variant of the postponed filters, it crosses off e.g. 45 twice, once
as 3*15 and once as 5*9 (okay, it has already been removed by the first, so let's say 45
appears in two lists of numbers to be removed if present).
Euler's sieve is never attempting to remove a number more than once, that's the
outstanding feature of it. Unfortunately, that also makes it hard to implement
efficiently. The problem is that you can't forget the primes between p and p^2, you must
remember them for the removal of multiples of p later on.

An (inefficient but faithful) implementation would be

euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)

Its performance is really horrible though. A much better implementation is

primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4]))
where
hprs = 5:drop 3 primes
euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus` comps)
where
(h,t) = span (< p*p) cs
comps = map (*p) (acc ++ cs)

still really bad. I don't yet see how to eat the cake and have it here.

> > > > I think that remark was meant to apply to composite removal, not
> > > > Turner's sieve.
> > >
> > > It is right there on page 2, right when the Turner's sieve is presented
> > > and discussed. The only explanation that I see is that she thought of
> > > it in regards to the imperative code, just as her analysis concentrates
> > > only on calculation aspects of the imperative code itself.
> >
> > To be fair, she writes:
> >
> > "Let us ﬁrst describe the original “by hand” sieve algorithm as practiced
> > by Eratosthenes.
> > ...
> > The starting point of p^2 is a pleasing but minor optimization, which can
> > be made
> > because lower multiples will have already been crossed oﬀ when we found
> > the primes prior to p.
> > ... (This optimization does not aﬀect the time complexity of the sieve,
> > however, so its absence from the code in Section 1
> > *is not our cause for worry*.)"
>
> A-HA!
>
> But its absense from _*that*_ _*code*_ WAS the *major* cause for worry, as
> dealing with it worked wonders on its complexity and constant factors.

I think you're overinterpreting it. Sure, it's not perfectly worded, but her concern was
primarily that it's not an Eratosthenes sieve but trial division (and woefully inefficient
at that), I think. Now, because it's trial division and not Eratosthenes, it has a major
impact, thus the absence of the optimisation is a reinforcing consequence of the original
cause for worry.

I may be overdefending her, though, we can't really know what she meant.

Will Ness

unread,

Jan 4, 2010, 12:09:27 PM1/4/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

I don't understand. What is there to be shared? Each multiples list is consumed
only at one point; there's nothing to be shared. Do you mean the compiler still
hangs on to them? If so, why??

I used the switch; it didn't help at all. The only thing I can see is different
is that all my interim data which I named with inner vars you moved out to the
top level as functions. Is that what did the trick? What would be the reason to
hang on to the already consumed data that is inaccessible to any active
consumer? Why not make the forgetful behaviour the norm - especially where
remembering is pointless??

> It still uses much more memory than the PQ, and while the PQ's memory
> requirements grow very slowly, the tree-fold merge's still grow rather fast
> (don't go much beyond the 10,000,000th prime), I'm not sure why.

You did it! It's now 7M for 1,000,000th prime, instead of 52M before. Making
the pattern lazy in mergeSP was probably an important fix too. :)

Unfortunately it grows, as you've said - 23MB for 2 mln. :|

PQ stays at just 2MB.

>
> Attachment (V13Primes.hs): text/x-haskell, 3621 bytes
>

Will Ness

unread,

Jan 4, 2010, 1:17:18 PM1/4/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Montag 04 Januar 2010 13:25:47 schrieb Will Ness:
>
> > Euler's sieve is
> >
> > sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
> > where (h,t) = span (< p*p) xs
>
> Not quite. That's a variant of the postponed filters, it crosses off e.g.
> 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed by
> the first, so let's say 45 appears in two lists of numbers to be removed if
> present).

there won't be any such. whatever is removed first, is just skipped second (and
third, etc). 45 does appear twice on the two multiples ists (3- and 5-). But it
is "removed" by first, and skipped by second. And there's no removal here in
the first place. There is no pre-filled storage. All there is, is some lists
comparison, and lists are just generators (I so keep hoping).

I don't see any problem here. As Melissa (and yourself, I think) have shown,
double hits on multiples are few and far between.

Also, it uses no filters, i.e. no individual number divisibility testing.
The "filters" refers first of all to testing an individual number to decide
whether to keep it in or not. Euler's sieve removes multiples in advance, so
there's no testing and no filtering, only comparison. It follows the
_Postponed_ Filter's framework in postponing the action until the right moment;
the action itself is two lists comparison and skipping of the equals (i.e.
the "minus" action).

> Euler's sieve is never attempting to remove a number more than once, that's

How's that possible? On wikipedia is says, it removes multiples of 3; then
multiples of 5; etc. So it just looks for each number it is removing, if it is
there, and if so, removes it. I would argue that looking whether to remove or
not, is part of attempting to remove. It can't have foresight, right?

> the outstanding feature of it. Unfortunately, that also makes it hard to
> implement efficiently. The problem is that you can't forget the primes
> between p and p^2, you must remember them for the removal of multiples of p
> later on.

you're right, every formulation of these algorithms is always done in the
imperative, mutable-storage setting. They always speak of "removing" numbers
etc.

The more pressing problem with that code is its linear structure of course
which gets addressed by the tree-folding merge etc.

> An (inefficient but faithful) implementation would be
>
> euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)

I think it does exactly the same thing, computationally, except it does so,
again, _prematurely_ exactly in Turner's sieve's fashion - for each prime
found, _as_ _soon_ as it is found. If it is faithful, so is my code. :)

> Its performance is really horrible though.

makes sense, as for Turner's. See, the _when_ thing really helps to see what's
going on here.

> A much better implementation is
>
> primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4]))
> where
> hprs = 5:drop 3 primes
> euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus` comps)
> where
> (h,t) = span (< p*p) cs
> comps = map (*p) (acc ++ cs)

this look like {2,3} wheel reimplemented and inlined. No point in improving
anything until the linear structure isn't turned into a tree.

> > >
> > > To be fair, she writes:
> > >
> > > ... (This optimization does not aﬀect the time complexity of the sieve,

> > > however, so its absence from _the_ _code_ in Section 1
> > > is _not_ our cause for worry.)"

> >
> > A-HA!
> >
> > But its absense from _*that*_ _*code*_ WAS the *major* cause for worry, as
> > dealing with it worked wonders on its complexity and constant factors.
>
> I think you're overinterpreting it.

I might have. I don't mean to neatpick, honest. It's just how it looked to
me: "Turner's? - bad. Squares? - won't help, it's a _minor_ thing; don't even
go there!" etc.

As I've said, geniuses leap, and fly. I just wanted to _walk_ down that road,
not skipping any steps. And it turned out, the very first step _really_ pays up
_big_. So big in fact, it might be argued that any subsequent improvement is
just an advanced optimization, in terms of presenting an introductory code
which isn't horribly inefficient and yet is short and clear.

Will Ness

unread,

Jan 4, 2010, 1:26:15 PM1/4/10

to haskel...@haskell.org

Emil Axelsson <emax <at> chalmers.se> writes:

>
> > For me, a real smart compiler is one that would take in e.g. (sum $
> > take n $
> > cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs
> > maybe (just an aside).
>
> (Also an aside, I couldn't resist...)
>
> Then I'm sure you'd say that Feldspar [1] has a smart compiler :)

but it didn't produce

f n m = if n < m then n*(n+1)/2 else
let (q,r)=quotRem n m
in q*(m*(m+1)/2) + r*(r+1)/2

:)

> The above expression written in Feldspar and the resulting C code can be
> found here:
>
> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=15592#a15593
>
>

Daniel Fischer

unread,

Jan 4, 2010, 4:27:30 PM1/4/10

to haskel...@haskell.org, Will Ness

Am Montag 04 Januar 2010 18:08:42 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness:
> > > The quesion of a memory blowup with the treefolding merge still
> > > remains. For some reason using its second copy for a feeder doesn't
> > > reduce the memory (as reported by standalone compiled program, GHCi
> > > reported values are useless) - it causes it to increase twice.....
> >
> > I have a partial solution. The big problem is that the feeder holds on to
> > the beginning of comps while the main runner holds on to a later part.
> > Thus the entire segment between these two points must be in memory.
> >
> > So have two lists of composites (yeah, you know that, but it didn't work
> > so far).
> >
> > But you have to force the compiler not to share them: enter -fno-cse.
> > The attached code does that (I've also expanded the wheel), it reduces
> > the memory requirements much (a small part is due to the larger wheel, a
> > factor of ~5 due to the non-sharing).
>
> I don't understand. What is there to be shared? Each multiples list is
> consumed only at one point; there's nothing to be shared. Do you mean the
> compiler still hangs on to them? If so, why??

Okay, let's look at the code.
Your Primes.hs says

{-# SPECIALIZE primes :: () -> [Int] #-}
{-# SPECIALIZE primes :: () -> [Integer] #-}
primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
�� where
�� primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
�� (comps,_)� = tfold mergeSP (pairwise mergeSP mults)
�� mults�� = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'

�� fromList (x:xs) = ([x],xs)

What's going on evaluating primes'?
First, we know primes' is 11 : 13 : more.
Then we can start looking at 'more'. For that, we need
rollFrom 11 `minus` comps.
rollFrom 11 is easy. What about comps? We need the start of primes' for that. Cool, we
already know the first two elements, start tfolding, we know comps is
121:143:169:notYetKnown

So the rollFrom 11 produces happily, minus doesn't veto anything for a while, then says no
to 121, 143, 169. Now rollFrom 11 says "look at 173". Fine, what's next in comps?
To find that out, we need the multiples of 17 and 19 (and 23 and 29, but those can still
be deferred for a while). That gives ([289,323,361],whatever), then the mergeSP with the
merged multiples of 11 and 13 produces (187:209:221:...) for notYetKnown.
Everything's going on fine, but the problem is, when primes' produces primes in the region
of n, it needs the comps in that region. Those contain the multiples of primes in the
region of sqrt n, thus the small primes cannot yet be released.
Bottom line, to produce p, all the primes from about sqrt p to p must be in memory. Oops.

Enter the feeder. Say

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
�� where

�� primes' = (rollFrom 11) `minus` comps
�� primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps

�� (comps,_)� = tfold mergeSP (pairwise mergeSP mults)

�� mults�� = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
�� fromList (x:xs) = ([x],xs)

Now primes'', the feeder of composites advances much more slowly and for primes' to
produce p, primes'' must produce the next prime after sqrt p, thus holds on to all primes
between sqrt (sqrt p) and sqrt p - not much of a problem. primes' doesn't hold on to
smaller primes, great.
But! primes' needs the part of comps around p. primes'' needs the part of comps around
sqrt p. That means we must keep the part of the list of composites from sqrt p to p in
memory. Even after removing the multiples of 2, 3, 5 and 7, it doesn't take long until
there are far more composites between sqrt p and p than primes. Double oops.

So we must make sure that the list of composites that primes' consumes is not the same as
that which primes'' consumes.

You can try

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
�� where

�� primes' = (rollFrom 11) `minus` comps
�� primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps2
�� (comps,_)� = tfold mergeSP (pairwise mergeSP mults)
�� mults�� = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
�� (comps2,_)� = tfold mergeSP (pairwise mergeSP mults2)
�� mults2�� = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
�� fromList (x:xs) = ([x],xs)

If you compile without optimisations, it works. It just is rather slow. Unfortunately, the
optimiser sees that comps2 and comps are the same and thinks, "why should I produce it
twice if they're the same?" - whoops, back to square two.
Tell the compiler that you do *not* want it shared via -fno-cse and it isn't shared, you
really have two distinct lists of composites, happiness ensues - almost.

>
> I used the switch; it didn't help at all. The only thing I can see is
> different is that all my interim data which I named with inner vars you
> moved out to the top level as functions. Is that what did the trick?

Not really, the above works too, with inner local bindings.
If you introduce separate names for the lists of multiples and for the lists of
composites, -fno-cse ought to do the trick. Maybe it would even work with one list of
multiples. Can I see what you did that -fno-cse didn't manage to separate the lists?

> What would be the reason to hang on to the already consumed data that is
> inaccessible to any active consumer?

No reason. And that's not what happened. We always had two consumers gnawing on the same
list, one proceeding fast, the other slow. The part the slow consumer was done with could
be dropped (and probably was) - the part between the slow and the fast: Not.

> Why not make the forgetful behaviour the norm -
> especially where remembering is pointless??

It is.

>
> > It still uses much more memory than the PQ, and while the PQ's memory
> > requirements grow very slowly, the tree-fold merge's still grow rather
> > fast (don't go much beyond the 10,000,000th prime), I'm not sure why.
>
> You did it! It's now 7M for 1,000,000th prime, instead of 52M before.
> Making the pattern lazy in mergeSP was probably an important fix too. :)

Did I forget to remove them? (checks) Yes.
No, I put them in without thinking, then realised "Hey, stupid, where-bindings *are*
already lazy!" But somehow I have forgotten to remove them.
Since let- and where-bindings are lazy, all those could do is make the compiler say
"Sheesh".

>
> Unfortunately it grows, as you've said - 23MB for 2 mln. :|

And I've found out why. Change the definition of tfold to

tfold f (a: ~(b: ~(c:xs)))
�� = (a `f` (b `f` c)) `f` tfold f xs

and memory stays low (things are going much slower, though).
By always doing a (pairwise f xs), you're getting the composites faster, but at the
expense of pretty huge thunks (I think, if the merging isn't lazy enough, you get some
pretty long lists).

You can make a compromise by using the above tfold (which is no longer a tree-fold) and
grouping (and merging) the multiples in a slower-growing manner, e.g.

compos ps = fst (tfold mergeSP $ nwise 1 mergeSP (pairwise mergeSP (multip ps)))

tfold f (a: ~(b: ~(c:xs)))
�� = (a `f` (b `f` c)) `f` tfold f xs

nwise k f xs = let (ys,zs) = splitAt k xs in rfold f ys : nwise (k+1) f zs

rfold f [x] = x
rfold f (x:xs) = x `f` rfold f xs

memory still grows, but much slower, in my tests, due to the much smaller GC time, it's a
bit faster than the version with the original tfold.

>
> PQ stays at just 2MB.
>

No doubling the size of merge trees, no big thunks, just occasionally add a filter to the
PQ. Memory requirements are O(pi(sqrt n)) [off the top of my head, may be wrong again],

Daniel Fischer

unread,

Jan 4, 2010, 6:35:01 PM1/4/10

to haskel...@haskell.org, Will Ness

Am Montag 04 Januar 2010 19:16:32 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Montag 04 Januar 2010 13:25:47 schrieb Will Ness:
> > > Euler's sieve is
> > >
> > > sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
> > > where (h,t) = span (< p*p) xs
> >
> > Not quite. That's a variant of the postponed filters, it crosses off e.g.
> > 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed
> > by the first, so let's say 45 appears in two lists of numbers to be
> > removed if present).
>
> there won't be any such. whatever is removed first, is just skipped second
> (and third, etc). 45 does appear twice on the two multiples ists (3- and
> 5-). But it is "removed" by first, and skipped by second. And there's no
> removal here in the first place. There is no pre-filled storage. All there
> is, is some lists comparison, and lists are just generators (I so keep
> hoping).

((45:(offer 47 when demanded)) `minus` (45:(next will be 51 when demanded))) `minus` (45:
(next will be 55 when demanded))

So there are two attempts to tell the generator to not output 45. To the second, it
answers "I already knew that", but the request is made nevertheless.

Lists sometimes are data structures occupying physical storage (cons cell, point to value,
pointer to next...), sometimes generators (when asked for the next value, I'll answer
this). I say a value is removed from a list regardless of whether it takes relinking
pointers in a list of cons cells or it involves telling the generator not to give out that
value. I could also say 'eliminated'. There are two attempts to eliminate 45.

>
> I don't see any problem here. As Melissa (and yourself, I think) have
> shown, double hits on multiples are few and far between.

It's not a problem, it just means it's not Euler's sieve, because that attempts to
eliminate each composite exactly once.

>
> Also, it uses no filters, i.e. no individual number divisibility testing.
> The "filters" refers first of all to testing an individual number to decide
> whether to keep it in or not.

Umm, the postponed filters says keep everything until p^2, then eliminate (filter out,
remove) multiples of p in the remainder, after that, pick next prime.
That's precisely what the above does. It doesn't do the filtering out by divisibility
testing but by minus (hence more efficiently). I would say that's a variant of the
postponed filters.

> Euler's sieve removes multiples in advance,
> so there's no testing and no filtering, only comparison. It follows the
> _Postponed_ Filter's framework in postponing the action until the right
> moment; the action itself is two lists comparison and skipping of the
> equals (i.e. the "minus" action).
>
> > Euler's sieve is never attempting to remove a number more than once,
> > that's
>
> How's that possible?

http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Euler.27s_Sieve

"Euler in his Proof of the Euler product formula for the Riemann zeta function came up
with a version of the sieve of Eratosthenes, *better in the sense that each number was
eliminated exactly once*."

C) The number after the previous prime is also a prime. *Multiply each number in the list
starting from this prime by this prime and discard the products*.

> On wikipedia is says, it removes multiples of 3; then
> multiples of 5; etc. So it just looks for each number it is removing, if it
> is there, and if so, removes it. I would argue that looking whether to
> remove or not, is part of attempting to remove. It can't have foresight,
> right?
>

But it has :) By only trying to eliminates products of the prime p currently under
consideration *with numbers (>= p) which have not yet been eliminated from the list*, it
is known in advance that all these products are still in the list.

When p is under consideration, the list contains (apart from the primes < p) precisely the
numbers whose smallest prime factor is >= p.

> > the outstanding feature of it. Unfortunately, that also makes it hard to
> > implement efficiently. The problem is that you can't forget the primes
> > between p and p^2, you must remember them for the removal of multiples of
> > p later on.
>
> you're right, every formulation of these algorithms is always done in the
> imperative, mutable-storage setting. They always speak of "removing"
> numbers etc.

sed s/remove/eliminate/ ?

>
> The more pressing problem with that code is its linear structure of course
> which gets addressed by the tree-folding merge etc.
>

Which unfortunately introduces its own space problem :(

> > An (inefficient but faithful) implementation would be
> >
> > euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
>
> I think it does exactly the same thing, computationally,

No, in map (*p) ks, there are no numbers with a smaller prime factor than p, while in
map (*p) [p, p+2 .. ] there are (for p > 3)

> except it does so,
> again, _prematurely_ exactly in Turner's sieve's fashion - for each prime
> found, _as_ _soon_ as it is found. If it is faithful, so is my code. :)
>

It is faithful in that it tries to eliminate each composite exactly once.

Try a different minus:

xxs@(x:xs) `minus` yys@(y:ys)

= case compare x y of

LT -> x : xs `minus` yys

EQ -> xs `minus` ys

GT -> error ("trying to remove " ++ show y ++ " a second time")

Your code is not. It is, however, much faster.

> > Its performance is really horrible though.
>
> makes sense, as for Turner's.

It's far worse than Turner's. The "map (*p) ks" holds on to things really really long.
Measure it, just for teh lulz.

> See, the _when_ thing really helps to see
> what's going on here.
>
> > A much better implementation is
> >
> > primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4]))
> > where
> > hprs = 5:drop 3 primes
> > euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus`
> > comps) where
> > (h,t) = span (< p*p) cs
> > comps = map (*p) (acc ++ cs)
>
> this look like {2,3} wheel reimplemented and inlined.

The {2,3} wheel thing isn't important. It could equally well (well, a bit slower) be

primes = euler hprs [] [2 .. ]
where
hprs = 2:tail primes

It just occured to me that the accumulation list is just (takeWhile (< head cs) (p:ps)),
so we could improve it as

euler (p:ps) cs = h ++ euler ps (t `minus` comps)

where
(h,t) = span (< p*p) cs

comps = map (*p) (takeWhile (< head cs) (p:ps) ++ cs)

Still Bad.

> No point in improving
> anything until the linear structure isn't turned into a tree.
>

You're a tree hugger? :D
No, the point isn't linear vs. tree, it's that Euler's sieve is a) great for theoretical
reasoning (look at a proof of Euler's product formula for the (Riemann) Zeta-function to
appreciate it) b) reasonably good for sieving with a list of numbers written on paper, but
c) not good to implement on a computer. For the computer, keeping track of which numbers
are left is so much more complicated than just ticking off numbers a couple of times that
it isn't even worth attempting it.

Daniel Fischer

unread,

Jan 4, 2010, 7:51:57 PM1/4/10

to haskel...@haskell.org, Will Ness

Am Montag 04 Januar 2010 16:30:18 schrieb Will Ness:
> Heinrich Apfelmus <apfelmus <at> quantentunnel.de> writes:
> >
> > (I haven't followed the whole thread, but hopefully I have enough grasp
> > of it to make a useful remark. :))
> >
> > Concerning lists as producer/consumer, I think that's exactly what lazy
> > evaluation is doing. Neither filter , map or span evaluate and
> > store more list elements that strictly necessary.
>
> I laways suspected as much, but was once told that Chris Okasaki has shown
> that any filter etc must allocate its own storage. With the peek/pull they
> don't have to, if they are nested, and the deepest one from the real
> storage gets pulled through some pointer chasing eventually. Span isn't so
> easily compiled out too or is it? But that might be a minor point.
>
> For me, a real smart compiler is one that would take in e.g. (sum $ take n
> $ cycle $ [1..m]) and spill out a straight up math formula, inside a few
> ifs maybe (just an aside).

Such things just aren't common enough. If you start letting the compiler look for patterns
such as this which can be transformed into a simple formula, compile times would explode.

>
> Such a smart compiler might even be able to derive a well performing code
> right from the Turner's sieve. :)
>
> > Sure, creating a list head only to immediately consume it is somewhat
> > inefficient -- and the target of stream fusion[1] -- but this is an
> > overhead of how list elements are stored, not how many.
>
> it might be equivalent to the (imagined) producer's storing its 'current'
> value inside its frame.
>
> How much can we rely on the run-time to actually destroy all the
> passed-over elements and not hang on to them for some time?

If they're really passed over, they very probably won't survive the next GC.

> Is this that compiler switch that Daniel mentioned? Is it reliable?
>

No, it's something different. Not entirely unrelated.
The -fno-cse turns off Common Subexpression Elimination (rather sharing than elimination).
That is, if you have

f x = g (expensive x) y z (h (expensive x))

the compiler can see the common subexpression (expensive x) on the RHS and
decide to share it, i.e. calculate it only once:

f x = let e = expensive x in g e y z (h e)

That may or may not be a good idea. If e is small (say an Int) and expensive isn't a
blatant lie, it is a *very* good idea. But if e is a loong list that expensive x lazily
produces and that is consumed by g and h in the right manner, it is often a bad idea
because h might not start to consume before g has demanded a long prefix of the list and
kaboom, Heap exhausted.

If you turn on optimisations, the compiler does some looking for common subexpressions
to share. There are some heuristics to decide when to share, but they're far from
infallible. So, if you have a situation like above, and the compiler heuristics indicate
that sharing would be good although it isn't, you're hosed. So you have the option to tell
the compiler "try hard to optimise, but don't do CSE (because I know it would be bad
here)" {- that can be done on a per-file basis, it would be cool if it could be done on a
per-function basis -}.

Now if you have a list producer and a consumer, without fusion, it goes like
Consumer: Gimme
Producer: creates cons cell, puts value, next cons cell (how to produce more)
Consumer: takes value, does something with it, gimme more.

Stream fusion is about eliminating the cons cell creating and value passing, to fuse
production and consumption into a nice loop. That is of course impossible if the produced
list is shared between two (or more) consumers.

Daniel Fischer

unread,

Jan 4, 2010, 8:45:42 PM1/4/10

to haskel...@haskell.org

Am Montag 04 Januar 2010 22:25:28 schrieb Daniel Fischer:
> compos ps = fst (tfold mergeSP $ nwise 1 mergeSP (pairwise mergeSP (multip
> ps)))
>
> tfold f (a: ~(b: ~(c:xs)))

> ï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œï¿œ = (a `f` (b `f` c)) `f` tfold f xs

>
> nwise k f xs = let (ys,zs) = splitAt k xs in rfold f ys : nwise (k+1) f zs
>
> rfold f [x] = x
> rfold f (x:xs) = x `f` rfold f xs
>
> memory still grows, but much slower, in my tests, due to the much smaller
> GC time, it's a bit faster than the version with the original tfold.

Not for larger inputs (but not so large that the tree-fold dies OOM).
Fix rfold:

rfold f [x] = x

rfold f xs = rfold f (pairwise f xs)

and it's faster also for those.

Will Ness

unread,

Jan 5, 2010, 4:34:13 AM1/5/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Montag 04 Januar 2010 16:30:18 schrieb Will Ness:
> >
> > For me, a real smart compiler is one that would take in e.g. (sum $ take n
> > $ cycle $ [1..m]) and spill out a straight up math formula, inside a few
> > ifs maybe (just an aside).
>
> Such things just aren't common enough. If you start letting the compiler
> look for patterns such as this which can be transformed into a simple
> formula, compile times would explode.

I was thinking more along the lines of inferencing compiler, proving new
theorems about the types and applying that knowledge in simplifying the
expressions. This would take time, so it should be a part of some interactive
system, maybe kind of like Lisp has.

In such a setting, the underlying compiler could first produce quick-n-dirty
version, and would continue working in the background whenever the system is
not busy, trying to improve the executable. Such a system would probably have
to distinguish, at the type level, between [1..m] ; cycle [1..m] ; take n
[1..m] ; etc. These would all be not just fuctions, but parts of a type's (here
list) behaviour with automatically deduced semantics.

What would such a type system be called?

> The -fno-cse turns off Common Subexpression Elimination (rather sharing
> than elimination).

> That is, if you have
>
> f x = g (expensive x) y z (h (expensive x))
>
> the compiler can see the common subexpression (expensive x) on the RHS and
> decide to share it, i.e. calculate it only once:
>
> f x = let e = expensive x in g e y z (h e)
>

> ........

thanks for the in-depth explanation! :)

> Now if you have a list producer and a consumer, without fusion, it goes like
> Consumer: Gimme
> Producer: creates cons cell, puts value, next cons cell (how to produce more)
> Consumer: takes value, does something with it, gimme more.
>
> Stream fusion is about eliminating the cons cell creating and value
> passing, to fuse production and consumption into a nice loop. That is of
> course impossible if the produced list is shared between two (or more)
> consumers.

I would imagine so. Do I get this fusion on lists for free from the compiler,
or do I have to recode for that? (haven't yet time to look into the article
mentioned).

>
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

Daniel Fischer

unread,

Jan 5, 2010, 5:33:12 AM1/5/10

to haskel...@haskell.org, Will Ness

Am Dienstag 05 Januar 2010 10:33:19 schrieb Will Ness:
> Such a
> system would probably have to distinguish, at the type level, between
> [1..m] ; cycle [1..m] ; take n [1..m] ; etc. These would all be not just
> fuctions, but parts of a type's (here list) behaviour with automatically
> deduced semantics.
>
> What would such a type system be called?
>

Seriously complicated, I think.
Don't get me wrong, a system with such capabilities would be A.W.E.S.O.M.E.
I just can't see it happening anytime soon.

>
> I would imagine so. Do I get this fusion on lists for free from the
> compiler, or do I have to recode for that? (haven't yet time to look into
> the article mentioned).
>

Without optimisations, hardly ever if at all (needs a compiler expert to know).
With optimisations, sometimes. But it doesn't always see the possibility where it would
with recoding in the right style (dons is expert in that).
With stream fusion, more often. But I think it would again be not very hard to hide
opportunities for fusion by your coding style.

Will Ness

unread,

Jan 5, 2010, 8:52:59 AM1/5/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Montag 04 Januar 2010 19:16:32 schrieb Will Ness:
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > > Am Montag 04 Januar 2010 13:25:47 schrieb Will Ness:
> > > > Euler's sieve is
> > > >
> > > > sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
> > > > where (h,t) = span (< p*p) xs
> > >
> > > Not quite. That's a variant of the postponed filters, it crosses off e.g.
> > > 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed
> > > by the first, so let's say 45 appears in two lists of numbers to be
> > > removed if present).
> >
> > there won't be any such. whatever is removed first, is just skipped second
> > (and third, etc).
>

> ((45:(offer 47 when demanded)) `minus` (45:(next will be 51 when demanded)))
> `minus` (45:(next will be 55 when demanded))
>
> So there are two attempts to tell the generator to not output 45. To the
> second, it answers "I already knew that", but the request is made
> nevertheless.

yes, of course.

> ... There are two attempts to eliminate 45.

I would say there are two requests to not have 45 in the output.

> > I don't see any problem here. As Melissa (and yourself, I think) have
> > shown, double hits on multiples are few and far between.
>
> It's not a problem, it just means it's not Euler's sieve, because that
> attempts to eliminate each composite exactly once.

yes I see now. My bad. Wasn't reading that wikipedia page with enough attention
to detail. It uses the modified (culled, so far) numbers to find out the next
multiples to be eliminated, not just the prime itself like my code does.

You solution is indeed, exponential:

euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)

primes = 2:euler [3,5..]

primes
= 2:euler (as@[3,5..])
3:euler (bs@(tail as `minus` map (3*) as))
5:euler (cs@(tail bs `minus` map (5*) bs))

There are two separate look-back pointers to /as/ in /bs/, and there are two
such in /cs/, to /bs/. The book-keeping machinery explodes.

> > Also, it uses no filters, i.e. no individual number divisibility testing.
> > The "filters" refers first of all to testing an individual number to decide
> > whether to keep it in or not.
>
> Umm, the postponed filters says keep everything until p^2, then eliminate
> (filter out, remove) multiples of p in the remainder, after that, pick next
> prime.
> That's precisely what the above does. It doesn't do the filtering out by
> divisibility testing but by minus (hence more efficiently). I would say
> that's a variant of the postponed filters.
>

Filter is usually (as in Haskell's 'filter') is about testing individual
elements by a predicate function. There is of course a filtering effect in two
lists elts' comparison that 'minus' performs, so that's debatable. Even the PQ
code performs "filtering" in this wider sense.

> > > Euler's sieve is never attempting to remove a number more than once,
> > > that's
> >
> > How's that possible?
>
> http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Euler.27s_Sieve
>

> C) The number after the previous prime is also a prime. *Multiply each

> number /that's left/

> in the list starting from this prime by this prime and discard the products*.

yes. Wasn't paying attention to that, more to the "intent" of it.

There's of course enourmous vagueness in how exactly it is to be performed, in
the unbounded case, which you uncovered here.

> > It can't have foresight, right?
> >
>
> But it has :) By only trying to eliminates products of the prime p

> currently under consideration *with numbers (>= p) /which have not/
> /yet been eliminated/ from the list*, it is known in advance that all these

> products are still in the list.

missed that.

> When p is under consideration, the list contains (apart from the primes < p)
precisely the numbers whose smallest prime factor is >= p.
>
> > > the outstanding feature of it. Unfortunately, that also makes it hard to
> > > implement efficiently. The problem is that you can't forget the primes
> > > between p and p^2, you must remember them for the removal of multiples of
> > > p later on.

not just primes - all the composites not yet removed, too. So it can't even be
implemented on shared mutable storage if we want to delay the removal (as we
must, in unbounded case) - the composites will get removed so their multiples
must actually all be produced first!

> > The more pressing problem with that code is its linear structure of course
> > which gets addressed by the tree-folding merge etc.
> >
>
> Which unfortunately introduces its own space problem :(

> Try a different minus:
>
> xxs <at> (x:xs) `minus` yys <at> (y:ys)

> = case compare x y of
> LT -> x : xs `minus` yys
> EQ -> xs `minus` ys
> GT -> error ("trying to remove " ++ show y ++ " a second time")
>
> Your code is not. It is, however, much faster.

I understand now. Thanks!

> > > Its performance is really horrible though.

exponential, empyrically as well.

> It just occured to me that the accumulation list is just
> (takeWhile (< head cs) (p:ps)), so we could improve it as
>
> euler (p:ps) cs = h ++ euler ps (t `minus` comps)
> where
> (h,t) = span (< p*p) cs
> comps = map (*p) (takeWhile (< head cs) (p:ps) ++ cs)
>
> Still Bad.

yes, _both_ /t/ and /comps/ have their back-pointers into cs.

> > No point in improving
> > anything until the linear structure isn't turned into a tree.
> >
> You're a tree hugger? :D
> No, the point isn't linear vs. tree, it's that Euler's sieve is a) great for
> theoretical reasoning (look at a proof of Euler's product formula for the
> (Riemann) Zeta-function to appreciate it) b) reasonably good for sieving
> with a list of numbers written on paper, but c) not good to implement on a
> computer. For the computer, keeping track of which numbers are left is so
> much more complicated than just ticking off numbers a couple of times that
> it isn't even worth attempting it.

>
>
>
>

Daniel Fischer

unread,

Jan 5, 2010, 9:38:11 AM1/5/10

to haskel...@haskell.org, Will Ness

Am Dienstag 05 Januar 2010 14:49:58 schrieb Will Ness:
> > ... There are two attempts to eliminate 45.
>
> I would say there are two requests to not have 45 in the output.
>

Thers are many possible ways to phrase it.

> > > I don't see any problem here. As Melissa (and yourself, I think) have
> > > shown, double hits on multiples are few and far between.
> >
> > It's not a problem, it just means it's not Euler's sieve, because that
> > attempts to eliminate each composite exactly once.
>
> yes I see now. My bad. Wasn't reading that wikipedia page with enough
> attention to detail. It uses the modified (culled, so far) numbers to find
> out the next multiples to be eliminated,

Minor factual error, no big deal.

> not just the prime itself like my code does.

Which is operationally far better because it's much simpler :)

>
> You solution is indeed, exponential:
>
> euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
> primes = 2:euler [3,5..]
>
>
> primes
> = 2:euler (as@[3,5..])
> 3:euler (bs@(tail as `minus` map (3*) as))
> 5:euler (cs@(tail bs `minus` map (5*) bs))
>
>
> There are two separate look-back pointers to /as/ in /bs/, and there are
> two such in /cs/, to /bs/. The book-keeping machinery explodes.
>
> > > Also, it uses no filters, i.e. no individual number divisibility
> > > testing. The "filters" refers first of all to testing an individual
> > > number to decide whether to keep it in or not.
> >
> > Umm, the postponed filters says keep everything until p^2, then eliminate
> > (filter out, remove) multiples of p in the remainder, after that, pick
> > next prime.
> > That's precisely what the above does. It doesn't do the filtering out by
> > divisibility testing but by minus (hence more efficiently). I would say
> > that's a variant of the postponed filters.
>
> Filter is usually (as in Haskell's 'filter') is about testing individual
> elements by a predicate function. There is of course a filtering effect in
> two lists elts' comparison that 'minus' performs, so that's debatable. Even
> the PQ code performs "filtering" in this wider sense.
>

I understood the 'filters' in postponed filters in that wider sense. And I would also find
it perfectly acceptable to say that the PQ code filters out the composites from the
stream. Of course if you're used to using 'filter' only in the stricter sense, you
wouldn't call the `minus` thing a variant of the postponed filters.

>
> not just primes - all the composites not yet removed, too.

Between p and p^2, there are only primes left, fortunately.

> So it can't even be implemented on shared mutable storage if we
> want to delay the removal (as we must, in unbounded case) -

Yes. And it's not nice in the bounded case either.

Will Ness

unread,

Jan 5, 2010, 3:44:13 PM1/5/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Dienstag 05 Januar 2010 14:49:58 schrieb Will Ness:
> > > ... There are two attempts to eliminate 45.
> >
> > I would say there are two requests to not have 45 in the output.
> >
> Thers are many possible ways to phrase it.
>
> >

> > You solution is indeed, exponential:
> >
> > euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
> > primes = 2:euler [3,5..]
> >
> >
> > primes
> > = 2:euler (as@[3,5..])
> > 3:euler (bs@(tail as `minus` map (3*) as))
> > 5:euler (cs@(tail bs `minus` map (5*) bs))
> >
> >

Re-write:

euler s = head s:euler (tail s `minus` map(head s*) s)
primes = euler [2..]

primes = euler $ rollFrom [2] 1
= 2:euler ( rollFrom [3] 1 `minus` map(2*) (rollFrom [2] 1)) )
rollFrom [3,4] 2 `minus` rollFrom [4] 2
-- rollFrom [3] 2 --
= 2:3:euler (rollFrom [5] 2 `minus` map(3*) (rollFrom [3] 2))
rollFrom [5,7,9] 6 `minus` rollFrom [9] 6
-- rollFrom [5,7] 6 --
= 2:3:5:euler (rollFrom [7,11] 6 `minus` rollFrom [25,35] 30)
[7,11, 13,17, 19,23, 25,29, 31,35] 30
-- rollFrom [7,11,13,17,19,23,29,31] 30 --
= .....

where
rollOnce (x:xs) by = (x, tail xs ++ [x+by])
rollFrom xs by = concat $ iterate (map (+ by)) (xs)
multRoll xs@(x:_) by p = takeWhile (< (x+p*by)) $ rollFrom xs by

so, reifying, we get

data Roll a = Roll [a] a

rollOnce (Roll (x:xs) by)
= (x,Roll (xs ++ [x+by]) by)

rollFrom (Roll xs by)
= concat $ iterate (map (+ by)) (xs)

multRoll r@(Roll (x:_) by) p
= Roll (takeWhile (< (x+p*by)) $ rollFrom r) (by*p)

primes = euler $ Roll [2] 1
euler r@(Roll xs _)
= x:euler (Roll (mxs `minus` map (x*) xs) mby)
where
(x,r') = rollOnce r
(Roll mxs mby) = multRoll r' x

There's much extra primes pre-calculated inside the Roll, of course (upto p^2 in
fact, for p = primes!!n ), so this needs to be somehow tapped into, by writing a
specialized

nthPrime n = ....

to be called instead of (!! n), and also

primesUpTo n = ....

This calculated primes !! 1000 == 7927 in 3 seconds for me, interpreted, on my
old slowish laptop. It's supposed to have all the primes upto 7927^2 = 62837329
inside the Roll (if I'm not mistaken - or is it?). That's about 3.72 millionth
prime, according to WolframAlpha. (nah, it cant be that much). But it is surely
not horribly slow.

Is this, in fact, the wheels' "spiral"?

>
> >
> > not just primes - all the composites not yet removed, too.
>
> Between p and p^2, there are only primes left, fortunately.

but (map (*p) ks) needs access to the original, non-culled numbers - primes,
composites and all. (?)

>
> > So it can't even be implemented on shared mutable storage if we
> > want to delay the removal (as we must, in unbounded case) -
>
> Yes. And it's not nice in the bounded case either.
>
>
>

Will Ness

unread,

Jan 5, 2010, 5:17:17 PM1/5/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

> So we must make sure that the list of composites that primes' consumes is
> not the same as that which primes'' consumes.

yes that is what I had done too. Duplicated everything. Turns out, it works
exactly as you told it would when using the compiler switch, -fno-cse, thanks!

> > I used the switch; it didn't help at all. The only thing I can see is

wrong. I didn't. When I did, it worked.

> > Unfortunately it grows, as you've said - 23MB for 2 mln. :|
>
> And I've found out why. Change the definition of tfold to
>
> tfold f (a: ~(b: ~(c:xs)))
> = (a `f` (b `f` c)) `f` tfold f xs
>
> and memory stays low (things are going much slower, though).

(forced by gmane poster to delete unusually many of your comments today...)

Interesting... As for the structure, I chose it trying to minimize the
estimated average cost of a composite production, Sum (1/p)*depth.

> You can make a compromise by using the above tfold (which is no longer a
> tree-fold) and grouping (and merging) the multiples in a slower-growing
> manner,

> .......

>
> memory still grows, but much slower, in my tests, due to the much smaller
> GC time, it's a bit faster than the version with the original tfold.

Great! :)

Will Ness

unread,

Jan 5, 2010, 6:09:54 PM1/5/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

Niiice!!!! This is just great! :)

I tried a two-step feed BTW (that's three separate sets of lists) , with the
original structure. It ran with same speed as your new version (10..20% faster)
but with the memory of the previous one :) (80M for 8 mil primes vs the new
one's 10M). But your new structure is just great! I hoped there is something
better, that's why I posted it here in the first place.

'pairwise' puts odd leafs higher on the right. It might be better if it was so
on the left, for the frequency of production is higher.

Thanks a lot for your comments!

Daniel Fischer

unread,

Jan 5, 2010, 9:25:54 PM1/5/10

to haskel...@haskell.org, Will Ness

Am Mittwoch 06 Januar 2010 00:09:07 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Montag 04 Januar 2010 22:25:28 schrieb Daniel Fischer:
> > > memory still grows, but much slower, in my tests, due to the much
> > > smaller GC time, it's a bit faster than the version with the original
> > > tfold.
> >
> > Not for larger inputs (but not so large that the tree-fold dies OOM).
> > Fix rfold:
> >
> > rfold f [x] = x
> > rfold f xs = rfold f (pairwise f xs)
> >
> > and it's faster also for those.
>
> Niiice!!!! This is just great! :)
>
> I tried a two-step feed BTW (that's three separate sets of lists) , with
> the original structure. It ran with same speed as your new version (10..20%
> faster) but with the memory of the previous one :) (80M for 8 mil primes vs
> the new one's 10M).

The memory is almost completely due to the tree-merging of the multiples for the fastest
runner. While it produces faster than flat merging, the exponential growth of the trees
makes a bad memory citizen.
With the nwise and rfold, a two-step (or even three-step) feeding doesn't gain nearly as
much (I found about 1% speedup).

> But your new structure is just great! I hoped there is
> something better, that's why I posted it here in the first place.
>

I have two more goodies :)
1. now that the trees don't grow so fast, don't use lazy patterns in the definition of
tfold:

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` tfold f xs

gains something like 6-7% here (and uses a little less memory).

2. Now we have a big wheel, 5760 differences per period. Then dropping a few thousand
elements from the wheel after calculating how many in rollFrom is slow:

rollFrom n = go ((n-17) `rem` 30030) wheel13
where
go r xxs@(x:xs)
| r < x = roll n xxs
| otherwise = go (r-x) xs

gains another couple of percents for large targets (~1% for the 10M prime, ~2% for 20M, I
expect that to stay in th region of 1.5-3% for larger targets).

> 'pairwise' puts odd leafs higher on the right. It might be better if it was
> so on the left, for the frequency of production is higher.

Maybe. But how would you do it? I tried passing the length to rfold, so when there was an
odd numberof trees in the list, it would move the first out of the recursion. Any possible
gains in production have been more than eaten up by the control code (not a big
difference, but it was there).

Will Ness

unread,

Jan 6, 2010, 9:25:39 AM1/6/10

to haskel...@haskell.org

Will Ness <will_n48 <at> yahoo.com> writes:

>
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
>
> > Am Dienstag 05 Januar 2010 14:49:58 schrieb Will Ness:
> > >
> > > euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
> > > primes = 2:euler [3,5..]
> > >
> > >
>

> Re-write:

>
> primes = euler $ rollFrom [2] 1
> = 2:euler ( rollFrom [3] 1 `minus` map(2*) (rollFrom [2] 1)) )
> rollFrom [3,4] 2 `minus` rollFrom [4] 2
> -- rollFrom [3] 2 --
> = 2:3:euler (rollFrom [5] 2 `minus` map(3*) (rollFrom [3] 2))
> rollFrom [5,7,9] 6 `minus` rollFrom [9] 6
> -- rollFrom [5,7] 6 --
> = 2:3:5:euler (rollFrom [7,11] 6 `minus` rollFrom [25,35] 30)
> [7,11, 13,17, 19,23, 25,29, 31,35] 30
> -- rollFrom [7,11,13,17,19,23,29,31] 30 --
> = .....
>

correction:

where
rollOnce (x:xs) by = (x, xs ++ [x+by])

rollFrom xs by = concat $ iterate (map (+ by)) (xs)
multRoll xs@(x:_) by p = takeWhile (< (x+p*by)) $ rollFrom xs by

> so, reifying, we get
>
> data Roll a = Roll [a] a
>
> rollOnce (Roll (x:xs) by) = (x,Roll (xs ++ [x+by]) by)
> rollFrom (Roll xs by) = concat $ iterate (map (+ by)) (xs)
> multRoll r@(Roll (x:_) by) p
> = Roll (takeWhile (< (x+p*by)) $ rollFrom r) (by*p)
>
> primes = euler $ Roll [2] 1
> euler r@(Roll xs _)
> = x:euler (Roll (mxs `minus` map (x*) xs) mby)
> where
> (x,r') = rollOnce r
> (Roll mxs mby) = multRoll r' x
>

There's much extra primes pre-calculated inside the Roll, of course.

For any (Roll xs@(x:_) _), (takeWhile (< x*x) xs) are all primes too.

When these are used, the code's complexity is around O(n^1.5), and it runs
about 1.8x slower than Postponed Filters.

The "faithful sieve"'s empirical complexity is above 2.10..2.25 and rising. So
it might not be exponential, bbut is worse than power it seems anyway.

Will Ness

unread,

Jan 6, 2010, 1:13:51 PM1/6/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Mittwoch 06 Januar 2010 00:09:07 schrieb Will Ness:
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > > Am Montag 04 Januar 2010 22:25:28 schrieb Daniel Fischer:
> > > Fix rfold:
> > >
> > > rfold f [x] = x
> > > rfold f xs = rfold f (pairwise f xs)
> > >
> > > and it's faster also for those.
> >
>

> The memory is almost completely due to the tree-merging of the multiples for
> the fastest runner. While it produces faster than flat merging, the
> exponential growth of the trees makes a bad memory citizen.

Isn't the number of nodes the same in any two trees with the same number of
leafs?

BTW using

compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ map pmults ps

instead of

compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP
$ pairwise mergeSP $ map pmults ps

brings down memory consumption further by 25%..45% on 1..8 mln primes produced,
while slowing it down by about 0%..2% (that's after eliminating the lazy
pattern in tfold as per your advice).

> > 'pairwise' puts odd leafs higher on the right. It might be better if it was
> > so on the left, for the frequency of production is higher.
>
> Maybe. But how would you do it? I tried passing the length to rfold, so when
> there was an odd numberof trees in the list, it would move the first out of
> the recursion. Any possible gains in production have been more than eaten up
> by the control code (not a big difference, but it was there).

yes I've seen this too now. BTW, at a price of further slowing down, memory can
be lowered yet more with

compos ps = fst $ tfold mergeSP $ nwise 1 0.4 mergeSP $ map pmults ps
nwise k d f xs = let (ys,zs) = splitAt (round k) xs
in rfold f ys : nwise (k+d) d f zs

It really looks like the nearer the structure is to linear list, the lower the
memory consumption becomes. Of course using 0.0 in place of 0.4 would make it
into a plain list.

Daniel Fischer

unread,

Jan 6, 2010, 8:01:13 PM1/6/10

to haskel...@haskell.org, Will Ness

Am Mittwoch 06 Januar 2010 19:13:01 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Mittwoch 06 Januar 2010 00:09:07 schrieb Will Ness:
> > > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > > > Am Montag 04 Januar 2010 22:25:28 schrieb Daniel Fischer:
> > > > Fix rfold:
> > > >
> > > > rfold f [x] = x
> > > > rfold f xs = rfold f (pairwise f xs)
> > > >
> > > > and it's faster also for those.
> >
> > The memory is almost completely due to the tree-merging of the multiples
> > for the fastest runner. While it produces faster than flat merging, the
> > exponential growth of the trees makes a bad memory citizen.
>
> Isn't the number of nodes the same in any two trees with the same number of
> leafs?

Sure. And I don't know why it takes *that much* memory, but since a flat merge
doesn't consume much memory, it must be the trees.

>
> BTW using
>
> compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ map pmults ps
>
> instead of
>
> compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP
> $ pairwise mergeSP $ map pmults ps
>
> brings down memory consumption further by 25%..45% on 1..8 mln primes
> produced, while slowing it down by about 0%..2% (that's after eliminating
> the lazy pattern in tfold as per your advice).
>

So much? Wow. I forgot the numbers, but I had tried that too, I thought the memory gain
wasn't worth the speed loss. Another thing that reduces memory usage is

compos :: [a] -> [a]
compos ps = case pairwise mergeSP (multip ps) of
((a,b):cs) -> a ++ funMerge b (nwise 1 mergeSP $ pairwise mergeSP cs)

funMerge b (x:y:zs) = let (c,d) = mergeSP x y
in mfun b c d zs

mfun u@(x:xs) w@(y:ys) d l = case compare x y of
LT -> x:mfun xs w d l
EQ -> x:mfun xs ys d l
GT -> y:mfun u ys d l
mfun u [] d l = funMerge (merge u d) l

That's about the same speed as the latter of the above here and uses about 25% less
memory. Removing one or two 'pairwise's brings memory down further, but makes it slower.

Heinrich Apfelmus

unread,

Jan 7, 2010, 5:44:30 AM1/7/10

to haskel...@haskell.org

Will Ness wrote:

> Heinrich Apfelmus writes:
>
>> Concerning lists as producer/consumer, I think that's exactly what lazy
>> evaluation is doing. Neither filter , map or span evaluate and
>> store more list elements that strictly necessary.
>
> I laways suspected as much, but was once told that Chris Okasaki has shown that
> any filter etc must allocate its own storage. With the peek/pull they don't
> have to, if they are nested, and the deepest one from the real storage gets
> pulled through some pointer chasing eventually. Span isn't so easily compiled
> out too or is it? But that might be a minor point.

Hm? In my world view, there is only reduction to normal form and I don't
see how "allocate its own storage" fits in there. Okasaki having shown
something to that end would be new to me?

> I did that once in Scheme, as per SICP, with 'next' hanging in a stream's tail.
> Put filters and maps on top of that (inside the new 'next' actually). But that
> used the Scheme's lists as sorage. Another one was actual producers/modifiers
> with {pull,peek} interface. It even produced some primes, and some Hamming
> numbers. Then I saw Haskell, and thought I'd get all that for free with its
> equational reasoning.
>
> But I get the impression that GHC isn't working through equational reasoning?..
> I see all this talk about thunks etc.

Sure it does. Concerning the thunks, they're part of the implementation
of the reduction model (call-by-need aka lazy evaluation).

>> Concerning the sieves, there is a fundamental difference between the
>> imperative sieve and the functional sieves, regardless of whether the

>> latter start at p or p^2 or use a priority queue. [...]

>
> We can directy jump to the next multiple too, it is called (+). :) :)
>
> But seriously, the real issue is that we have to merge the produced streams of
> multiples, while the mutable-storage code works on same one, so its "merging
> cost" is zero. And even if we are smart to merge them in a tree-like fashion,
> we still have no (or little) control over the compiler's representation of
> lists and retention of their content and whether it performs stream fusion or
> not (if we use lists).

Not quite, I think there are two things going on:

1. In contrast to the standard imperative version, we are implementing
an on-line algorithm that produces arbitrarily many primes on demand.
Any imperative on-line version will need to think about storage for
pending prime filters as well.

2. Even the imperative algorithm can be interpreted as "merging" arrays,
just that the composites are magically merged at the right place (at
distance p from each other) because arrays offer O(1) "jumps". In
contrast, the functional version needs O(log something) time to
calculate where to put the next composite.

> If you could take a look at the tree-merging primes and why it uses too much
> memory, it would be great.

Fortunately, Daniel already took a detailed look. :) But I also have a
few remarks.

> The code is in Daniel's post to which I replied, or

> on haskellwiki Prime_numbers page (there in its rudimentary form). It's a tangent

> to your VIP code, where instead of People structure an ordered list is just
> maintained as a split pair, of its known (so far, finite) prefix and the rest
> of it.

Aww, why remove the cutesy name? The VIPs will be angry for being ignored!

Jest aside, note that putting the crowd at the end of the list where it
belongs has a reason: you have a guarantee that crowds won't take any
memory until they actually appear after all the VIPs. If you put both
VIPs and the crowd into a pair, it's easier to get space leaks.

In particular, I think that your mergeSP has a laziness bug, it should be

mergeSP ~(a,b) ~(c,d) = ...

This is a cure for the (potential) problem that spMerge first performs
a pattern match / comparison and then returns a pair constructor instead
of the other way round. For instance, your second case

spMerge u [] = ([], u)

should actually behave like

spMerge u v = (a,b)
where
a = if null v then [] else ...
b = if null v then u else ...

(I don't recommend rewriting spMerge in this fashion, the lazy pattern
in mergeSP will do the trick.)

Now, I'm not sure whether this is actually a problem or not. But the
version shown here, with two lazy patterns instead of just one, is much
closer to how the original People structure behaves.

> Then under merging these split pairs form a monoid, s can be rearranged

> in a tree. If you haven't seen it yet, it uses a different folding structure to

> your code, with a lower total cost of multiples production (estimated as Sum
> (1/p)*depth):
>
> tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` pairwise f xs
> comps = tfold $ pairwise mergeSP multips

The idea being that each prime produces 1/p composites each "turn" and
it takes time depth to trickle it to the top? Sounds reasonable, but
remember that a prime p does not start to generate composites until
the "turn count" has reached p^2, so it might occupy a place of low
depth in the tree that is better spent on the previous primes. But your
experiments seem to tell that it works anyway.

By the way, I think that both tfold and pairwise need to be lazier.

tfold f ~(x: ~(y: ~(z:zs))) = (x `f` (y `f` z)) `f` pairwise f zs

pairwise f ~(x: ~(y:ys)) = f x y : pairwise f ys

Otherwise, large parts of the tree might be generated too early and hog
memory.

Also, a small remark on style: I don't like the repetition of mergeSP in

compos = fst $ tfold mergeSP (pairwise mergeSP (multip ps))

After all, the right hand side is just another tree fold and I would
clearly mark it as such:

compos = fst . superSmartTreeFold mergeSP . multip $ ps

superSmartTreeFold f = tfold f . pairwise f

;)

> I think I'll have to try out your code (amended with a new folding structure)
> and see if it has less memory problems maybe. > I assume it is your code on the haskellwiki page. (?)

Yes, I put the code snippet on the wiki. And never tested it on large
numbers. ;)

Daniel Fischer

unread,

Jan 7, 2010, 11:15:48 AM1/7/10

to haskel...@haskell.org, Heinrich Apfelmus, Will Ness

Am Donnerstag 07 Januar 2010 11:43:44 schrieb Heinrich Apfelmus:
> Will Ness wrote:
> > Heinrich Apfelmus writes:
> >> Concerning lists as producer/consumer, I think that's exactly what lazy
> >> evaluation is doing. Neither filter , map or span evaluate and
> >> store more list elements that strictly necessary.
> >
> > I laways suspected as much, but was once told that Chris Okasaki has
> > shown that any filter etc must allocate its own storage. With the
> > peek/pull they don't have to, if they are nested, and the deepest one
> > from the real storage gets pulled through some pointer chasing
> > eventually. Span isn't so easily compiled out too or is it? But that
> > might be a minor point.
>
> Hm? In my world view, there is only reduction to normal form and I don't
> see how "allocate its own storage" fits in there. Okasaki having shown
> something to that end would be new to me?
>

Perhaps what was meant was "storage must be allocated for each filter" (which, however,
seems trivial).

> >
> > We can directy jump to the next multiple too, it is called (+). :) :)
> >
> > But seriously, the real issue is that we have to merge the produced
> > streams of multiples, while the mutable-storage code works on same one,
> > so its "merging cost" is zero. And even if we are smart to merge them in
> > a tree-like fashion, we still have no (or little) control over the
> > compiler's representation of lists and retention of their content and
> > whether it performs stream fusion or not (if we use lists).
>
> Not quite, I think there are two things going on:
>
> 1. In contrast to the standard imperative version, we are implementing
> an on-line algorithm that produces arbitrarily many primes on demand.
> Any imperative on-line version will need to think about storage for
> pending prime filters as well.

If you consider a segmented array sieve an on-line algorithm, at the expense of speed, you
can omit storage for prime filters. It will not be a real Eratosthenes sieve anymore, but
it won't be too horribly slow, I think. If you're using an Atkin sieve, the performance
penalty should be even lower.

>
> 2. Even the imperative algorithm can be interpreted as "merging" arrays,
> just that the composites are magically merged at the right place (at

And they're merged asynchronously. Yet more magic :)

> distance p from each other) because arrays offer O(1) "jumps". In
> contrast, the functional version needs O(log something) time to
> calculate where to put the next composite.
>
> > If you could take a look at the tree-merging primes and why it uses too
> > much memory, it would be great.
>
> Fortunately, Daniel already took a detailed look. :) But I also have a
> few remarks.
>
> > The code is in Daniel's post to which I replied, or
> > on haskellwiki Prime_numbers page (there in its rudimentary form). It's a
> > tangent to your VIP code, where instead of People structure an ordered
> > list is just maintained as a split pair, of its known (so far, finite)
> > prefix and the rest of it.
>
> Aww, why remove the cutesy name? The VIPs will be angry for being ignored!

You'll be happy to learn that they return, albeit in a different form, then :)

>
> Jest aside, note that putting the crowd at the end of the list where it
> belongs has a reason: you have a guarantee that crowds won't take any
> memory until they actually appear after all the VIPs. If you put both
> VIPs and the crowd into a pair, it's easier to get space leaks.
>

Indeed. It's the combination of pairs and the bushy merge trees that lets the memory
explode. I still don't completely understand the behaviour, but I managed to squash the
space leak :D (see below). Thanks for the idea.

> In particular, I think that your mergeSP has a laziness bug, it should be
>
> mergeSP ~(a,b) ~(c,d) = ...
>

That doesn't help. The pattern match on the first argument doesn't hurt.

> This is a cure for the (potential) problem that spMerge first performs
> a pattern match / comparison and then returns a pair constructor instead
> of the other way round. For instance, your second case
>
> spMerge u [] = ([], u)
>
> should actually behave like
>
> spMerge u v = (a,b)
> where
> a = if null v then [] else ...
> b = if null v then u else ...
>
> (I don't recommend rewriting spMerge in this fashion, the lazy pattern
> in mergeSP will do the trick.)

It didn't. Probably rewriting spMerge thus would do, but I think I'm too lazy to find out.

>
> Now, I'm not sure whether this is actually a problem or not. But the
> version shown here, with two lazy patterns instead of just one, is much
> closer to how the original People structure behaves.
>
> > Then under merging these split pairs form a monoid, s can be rearranged
> > in a tree. If you haven't seen it yet, it uses a different folding
> > structure to your code, with a lower total cost of multiples production
> > (estimated as Sum (1/p)*depth):
> >
> > tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` pairwise f xs
> > comps = tfold $ pairwise mergeSP multips
>
> The idea being that each prime produces 1/p composites each "turn" and
> it takes time depth to trickle it to the top? Sounds reasonable, but
> remember that a prime p does not start to generate composites until
> the "turn count" has reached p^2, so it might occupy a place of low
> depth in the tree that is better spent on the previous primes. But your
> experiments seem to tell that it works anyway.
>
>
> By the way, I think that both tfold and pairwise need to be lazier.
>
> tfold f ~(x: ~(y: ~(z:zs))) = (x `f` (y `f` z)) `f` pairwise f zs
>
> pairwise f ~(x: ~(y:ys)) = f x y : pairwise f ys
>
> Otherwise, large parts of the tree might be generated too early and hog
> memory.
>

As it turns out, I don't know why, it's the other way round. Each lazy pattern increases
memory usage. Not very intuitive.

>
> Also, a small remark on style: I don't like the repetition of mergeSP in
>
> compos = fst $ tfold mergeSP (pairwise mergeSP (multip ps))

Noted and acted upon.

>
> After all, the right hand side is just another tree fold and I would
> clearly mark it as such:
>
> compos = fst . superSmartTreeFold mergeSP . multip $ ps
>
> superSmartTreeFold f = tfold f . pairwise f
>
> ;)
>
> > I think I'll have to try out your code (amended with a new folding
> > structure) and see if it has less memory problems maybe. > I assume it is
> > your code on the haskellwiki page. (?)
>
> Yes, I put the code snippet on the wiki. And never tested it on large
> numbers. ;)
>
>
> Regards,
> Heinrich Apfelmus

The below code is now a well-behaved memory citizen (3MB for the 100 millionth prime,
about the same as the PQ code). It still is considerably slower than the PQ code.
In terms of MUT times as reported by +RTS -sstderr - as well as (MUT+GC) times - (measured
once for prime No. 5*10^5, 10^6, 2*10^6, 4*10^6, 10^7 to get a rough tendency), it seems
to scale a wee bit better than any of the other tfold versions I created, but a little
worse than the PQ versions.
The relation of MUT times isn't too bad, but the GC times are rather abysmal (30-40%).

----------------------------------------------------------------------
{-# LANGUAGE ScopedTypeVariables #-}
{-# OPTIONS_GHC -O2 -fno-cse #-}
module VippyPrimes (primes) where

data People a = P { vips :: [a], dorks :: [a] }

{-# SPECIALISE celebrate :: Int -> People Int -> People Int #-}
{-# SPECIALISE celebrate :: Integer -> People Integer -> People Integer #-}
celebrate :: a -> People a -> People a
celebrate x p = P (x:vips p) (dorks p)

{-# SPECIALISE primes :: () -> [Int] #-}
{-# SPECIALISE primes :: () -> [Integer] #-}
primes :: forall a. Integral a => () -> [a]
primes () = 2:3:5:7:11:13:primes'
where
primes' = roll 17 wheel13 `minus` compos primes'''
primes'' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''
primes''' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''

pmults :: a -> People a
pmults p = case map (*p) (rollFrom p) of
(x:xs) -> P [x] xs

multip :: [a] -> [People a]
multip ps = map pmults ps

compos :: [a] -> [a]
compos = vips . itfold mergeSP . multip

rollFrom n = go ((n-17) `rem` 30030) wheel13
where
go r xxs@(x:xs)
| r < x = roll n xxs
| otherwise = go (r-x) xs

smartfold f = tfold f . pairwise f

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs

pairwise f (x:y:ys) = f x y : pairwise f ys

{-# SPECIALISE mergeSP :: People Int -> People Int -> People Int #-}
{-# SPECIALISE mergeSP :: People Integer -> People Integer -> People Integer #-}
mergeSP :: Integral a => People a -> People a -> People a
mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
where
mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
spMerge l1 [] l3 = P [] (merge l1 l3)
spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
LT -> celebrate x (spMerge xs l2 l3)
EQ -> celebrate x (spMerge xs ys l3)
GT -> celebrate y (spMerge l1 ys l3)

-- do the sepcialisations below make a difference at all?
{-# SPECIALISE roll :: Int -> [Int] -> [Int] #-}
{-# SPECIALISE roll :: Integer -> [Integer] -> [Integer] #-}
roll :: Integral a => a -> [a] -> [a]
roll = scanl (+)

{-# SPECIALISE wheel :: [Int] #-}
{-# SPECIALISE wheel :: [Integer] #-}
wheel :: Integral a => [a]
wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel

{-# SPECIALISE wheel11 :: [Int] #-}
{-# SPECIALISE wheel11 :: [Integer] #-}
wheel11 :: Integral a => [a]
wheel11 = res
where
snms = scanl (+) 11 (take 47 wheel)
nums = tail $ scanl (+) 11 (take 529 wheel)
cops = nums `minus` map (*11) snms
diffs = zipWith (-) (tail cops) cops
res = foldr (:) res diffs

{-# SPECIALISE wheel13 :: [Int] #-}
{-# SPECIALISE wheel13 :: [Integer] #-}
wheel13 :: Integral a => [a]
wheel13 = res
where
snms = take 480 $ scanl (+) 13 wheel11
nums = take (480*13+1) . tail $ scanl (+) 13 wheel11
cops = nums `minus` map (*13) snms
diffs = zipWith (-) (tail cops) cops
res = foldr (:) res diffs

{-# SPECIALISE minus :: [Int] -> [Int] -> [Int] #-}
{-# SPECIALISE minus :: [Integer] -> [Integer] -> [Integer] #-}
minus :: Integral a => [a] -> [a] -> [a]
minus a@(x:xs) b@(y:ys) = case compare x y of
LT -> x: xs `minus` b
GT -> a `minus` ys

EQ -> xs `minus` ys

minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
       LT -> x: merge xs b
       EQ -> x: merge xs ys
       GT -> y: merge a ys
merge a b = if null b then a else b
----------------------------------------------------------------------

Daniel Fischer

unread,

Jan 7, 2010, 2:27:56 PM1/7/10

to haskel...@haskell.org

Am Donnerstag 07 Januar 2010 17:13:38 schrieb Daniel Fischer:
> compos :: [a] -> [a]
> compos = vips . itfold mergeSP . multip

Sigh! That's what I get for editing the code in the mail editor. I decided
to change the really stupid 'itfold' to 'smartfold' and forgot this
occurrence.

Will Ness

unread,

Jan 8, 2010, 12:38:11 PM1/8/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Donnerstag 07 Januar 2010 11:43:44 schrieb Heinrich Apfelmus:
> > Will Ness wrote:
> >
> > Hm? In my world view, there is only reduction to normal form and I don't
> > see how "allocate its own storage" fits in there. Okasaki having shown
> > something to that end would be new to me?
> >
> Perhaps what was meant was "storage must be allocated for each filter"
> (which, however, seems trivial).

I still contend that in case of nested filters, where each filter has only one
consumer, even that isn't ultimately necessary (a chain of pointers could be
formed). That's because there's no "peek"s, only "pull"s there. If the filters
are used in merge situation, then there will be some "peek"s so current value
must be somehow stored, though it's better to be made explicit and thus static
(the place for it, I mean, instead of the "rolling" cons cell). Such
implementation technique would prevent some extra gc.

> > > Then under merging these split pairs form a monoid, so can be rearranged

> > > in a tree. If you haven't seen it yet, it uses a different folding
> > > structure to your code, with a lower total cost of multiples production
> > > (estimated as Sum (1/p)*depth):

correction:

> > > tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` tfold f (pairwise f xs)
> > > comps = tfold mergeSP $ pairwise mergeSP multips

> >
> > The idea being that each prime produces 1/p composites each "turn" and
> > it takes time depth to trickle it to the top? Sounds reasonable, but
> > remember that a prime p does not start to generate composites until
> > the "turn count" has reached p^2, so it might occupy a place of low
> > depth in the tree that is better spent on the previous primes.

That might be why Daniel's structure is better: it plunges down faster than
mine.

"treefold" structure was:
(2+4) + ( (4+8) + ( (8+16) + ( (16+32) + ( (32+64) + ....... ))))
dpths: 3 4 4 5 5 6 6 7 7 8

daniel's:
(2+(4+6)) + ( (8+(10+12)) + ( (14+(16+18)) + ( (20+(22+24)) + .... ))
3 5 5.4 6 7.8 7.9 8 9 9.5 9.6 10.7 10.8

> primes () = 2:3:5:7:11:13:primes'
> where
> primes' = roll 17 wheel13 `minus` compos primes'''
> primes'' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''
> primes''' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''

Haven't read through the whole thing yet. :) :) I thought there was a typo
there. There isn't.

BTW using the no-share switch isn't necessary if we just write it down twice
with some variations, like

primes''' = let (h,t)=span (< 17^2) roll 17 wheel13
in h++t `minus` compos primes''

As we've found out, compilers aren't _that_ smart yet.

Will Ness

unread,

Jan 8, 2010, 1:46:36 PM1/8/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> The below code is now a well-behaved memory citizen (3MB for the 100
millionth prime, about the same as the PQ code). It still is considerably
slower than the PQ code.
> In terms of MUT times as reported by +RTS -sstderr - as well as (MUT+GC)
times - (measured once for prime No. 5*10^5, 10^6, 2*10^6, 4*10^6, 10^7 to get
a rough tendency), it seems to scale a wee bit better than any of the other
tfold versions I created, but a little worse than the PQ versions.
> The relation of MUT times isn't too bad, but the GC times are rather abysmal
(30-40%).
>
> ----------------------------------------------------------------------

> data People a = P { vips :: [a], dorks :: [a] }
>

> celebrate :: a -> People a -> People a
> celebrate x p = P (x:vips p) (dorks p)
>

> primes :: forall a. Integral a => () -> [a]
> primes () = 2:3:5:7:11:13:primes'
> where
> primes' = roll 17 wheel13 `minus` compos primes'''

primes'' = 17:19:23:29:31:rollFrom 37 `minus` compos primes''

> primes''' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''
>
> pmults :: a -> People a
> pmults p = case map (*p) (rollFrom p) of
> (x:xs) -> P [x] xs
>
> multip :: [a] -> [People a]
> multip ps = map pmults ps
>
> compos :: [a] -> [a]

compos = vips . smartfold mergeSP . multip

>
>
> smartfold f = tfold f . pairwise f
>
> tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs
>
> pairwise f (x:y:ys) = f x y : pairwise f ys
>

> mergeSP :: Integral a => People a -> People a -> People a
> mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
> where
> mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
> spMerge l1 [] l3 = P [] (merge l1 l3)
> spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
> LT -> celebrate x (spMerge xs l2 l3)
> EQ -> celebrate x (spMerge xs ys l3)
> GT -> celebrate y (spMerge l1 ys l3)
>

> ----------------------------------------------------------------------
>

Hi Daniel,

Is it so that you went back to my fold structure? Was it better for really big
numbers of primes?

I had the following for ages (well, at least two weeks) but I thought it was
slower and took more memory (that was _before_ the 'no-share' and 'feeder'
stuff). I can see the only difference in that you've re-written spMerge in a
tail-recursive style with explicitly deconstructed parts; mine was relying on
the compiler (8-L) to de-couple the two pipes and recognize that the second
just passes along the final result, unchanged.

The two versions seem to me to be _exactly_ operationally equivalent. All this
searching for the code better understood by the compiler is _*very*_
frustrating, as it doesn't reflect on the semantics of the code, or even the
operational semantics of the code. :-[

Weren't the P's supposed to disappear completely in the compiled code? Aren't
types just a _behavioral_ definitions??? Aren't we supposed to be able to
substitute equals for equals dammit??

Is this the state of our _best_ Haskell compiler????

module Primes8 where

import Data.Monoid

data (Ord a) => SplitList a = P [a] [a]

instance (Ord a) => Monoid (SplitList a) where
mempty = P [] []
-- {x | x::SplitList a} form a monoid under mappend
mappend (P a b) ~(P c d) = let P bc b' = spMerge b c
in P (a ++ bc) (merge b' d)
where
spMerge :: (Ord a) => [a] -> [a] -> SplitList a
spMerge u@(x:xs) w@(y:ys) = case compare x y of
LT -> P (x:c) d where (P c d) = spMerge xs w
EQ -> P (x:c) d where (P c d) = spMerge xs ys
GT -> P (y:c) d where (P c d) = spMerge u ys
spMerge u [] = P [] u
spMerge [] w = P [] w
mconcat ms = fold mappend (pairwise mappend ms)
where

fold f (a: ~(b: ~(c:xs)))

= (a `f` (b `f` c)) `f` fold f (pairwise f xs)

pairwise f (x:y:ys) = f x y:pairwise f ys

primes :: Integral a => () -> [a]

primes () = 2:3:5:7:primes'
where
primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps

mults = map (\p-> P [p*p] [p*n | n<- tail $ rollFrom p]) $ primes'
P comps _ = mconcat mults

Will Ness

unread,

Jan 8, 2010, 3:06:09 PM1/8/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

> roll = scanl (+)

> wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
> 4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel

> wheel11 = res
> where
> snms = scanl (+) 11 (take 47 wheel)
> nums = tail $ scanl (+) 11 (take 529 wheel)
> cops = nums `minus` map (*11) snms
> diffs = zipWith (-) (tail cops) cops
> res = foldr (:) res diffs

> wheel13 = res
> where
> snms = take 480 $ scanl (+) 13 wheel11
> nums = take (480*13+1) . tail $ scanl (+) 13 wheel11
> cops = nums `minus` map (*13) snms
> diffs = zipWith (-) (tail cops) cops
> res = foldr (:) res diffs
>

BTW have you seen my take on the "faithful Euler's sieve"? It shows another way
to look at the wheels, which for me was really the first time I really
understood what's going on there.

It also makes for easier wheel extention (IMO):

> euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)

> primes = euler [2..]

The essence of Euler's sieve is the wheel: after each step we're left with
lists of non-multiples of the preceding primes:

primes = euler $ listFrom [2] 1
= 2:euler ( listFrom [3] 1 `minus` map(2*) (listFrom [2] 1)) )
listFrom [3,4] 2 `minus` listFrom [4] 2
-- listFrom [3] 2 --
= 2:3:euler (listFrom [5] 2 `minus` map(3*) (listFrom [3] 2))
listFrom [5,7,9] 6 `minus` listFrom [9] 6
-- listFrom [5,7] 6 --
= 2:3:5:euler (listFrom [7,11] 6 `minus` listFrom [25,35] 30)

[7,11, 13,17, 19,23, 25,29, 31,35] 30

-- listFrom [7,11,13,17,19,23,29,31] 30 --
= .....

where

listFrom xs by = concat $ iterate (map (+ by)) xs

so

-- startRoll = ([2],1)

nextRoll r@(xs@(x:xt),b)
= ( (x,r') , r')
where
ys@(y:_) = xt ++ [x + b]
r' = (xs',b')
b' = x*b
xs' = takeWhile (< y + b') (listFrom ys b)
`minus` map (x*) xs

rolls = unfoldr (Just . nextRoll) ([2],1)

nthWheel n = let (ps,rs) = unzip $ take n rolls
(x:xs,b) = last rs
in ((ps, x), zipWith (-) (xs++[x+b]) (x:xs))

{-
*Main> mapM print $ take 4 rolls
(2,([3],2))
(3,([5,7],6))
(5,([7,11,13,17,19,23,29,31],30))
(7,([11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,
101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,167,169,
173,179,181,187,191,193,197,199,209,211],210))

*Main> nthWheel 3
(([2,3,5],7),[4,2,4,2,4,6,2,6])

*Main> nthWheel 4
(([2,3,5,7],11),[2,4,2,4,6,2,6,4,2,4,6,6,2,6,4,2,6,4,6,8,4,2,4,2,
4,8,6,4,6,2,4,6,2,6,6,4,2,4,6,2,6,4,2,4,2,10,2,10])
-}

Coincidentally, the function specified by

eulerPrimes n = let (ps,rs) = unzip $ take n rolls
(qs@(q:_),b) = last rs
in ps ++ takeWhile (< q^2) qs

can be used to write the specialized nthEulerPrime etc., whose complexity seems
to be about O(n^1.5).

Maybe this reinvents some spiraling wheels or somethn'. :)

Will Ness

unread,

Jan 8, 2010, 3:12:40 PM1/8/10

to haskel...@haskell.org

Will Ness <will_n48 <at> yahoo.com> writes:

>
>
> That might be why Daniel's structure is better: it plunges down faster than
> mine.
>
> "treefold" structure was:
> (2+4) + ( (4+8) + ( (8+16) + ( (16+32) + ( (32+64) + ....... ))))
> dpths: 3 4 4 5 5 6 6 7 7 8

this should of course have been

dpths: 3 4 5 6 7 8 9 10 11 12

>
> daniel's:
> (2+(4+6)) + ( (8+(10+12)) + ( (14+(16+18)) + ( (20+(22+24)) + .... ))
> 3 5 5.4 6 7.8 7.9 8 9 9.5 9.6 10.7 10.8
>

hmm. :|

Daniel Fischer

unread,

Jan 8, 2010, 10:54:50 PM1/8/10

to haskel...@haskell.org, Will Ness

Am Freitag 08 Januar 2010 19:45:47 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:

> >
> > mergeSP :: Integral a => People a -> People a -> People a
> > mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
> > where
> > mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
> > spMerge l1 [] l3 = P [] (merge l1 l3)
> > spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
> > LT -> celebrate x (spMerge xs l2 l3)
> > EQ -> celebrate x (spMerge xs ys l3)
> > GT -> celebrate y (spMerge l1 ys l3)
> >
> > ----------------------------------------------------------------------
>
> Hi Daniel,
>
> Is it so that you went back to my fold structure?

Yes.

> Was it better for really big numbers of primes?

Yes, it is slower for <= 20 million primes (and some beyond), but faster
for >= 50 million (and some before), according to the few tests I made.

>
> I had the following for ages (well, at least two weeks) but I thought it
> was slower and took more memory (that was _before_ the 'no-share' and
> 'feeder' stuff). I can see the only difference in that you've re-written
> spMerge in a tail-recursive style with explicitly deconstructed parts;

It's not tail-recursive, the recursive call is inside a celebrate.

As it turns out, the important things are

1. a feeder and separate lists of multiples for the feeder and the runner,
for the reasons detailed earlier (two-step feeding and larger wheel are
pleasing but minor optimisations).

2. a strict pattern in tfold

3. moving the merge inside spMerge (you can have

mergeSP (a,b) ~(c,d) = (a ++ bc, m)
where
(bc,m) = spMerge b c
spMerge u [] = ([],merge u d)
...

without exploding memory, but it's *much* slower than letting spMerge take
three arguments)

Now, I have to admit, the only point I _really_ understand is 1. (and why
the three-argument spMerge is faster than the two-argument one taking the
merge-partner from mergeSP's binding :).

Why has

mergeSP (a,b) ~(c,d)
= let (bc,b') = spMerge b c in (a ++ bc, merge b' d)

a memory leak, but

mergeSP (a,b) ~(c,d)
= let (bc,m) = spMerge' b c d in (a ++ bc, m)

not?

Well, looking at the core for mergeSP, the fog clears somewhat. The former
is translated roughly to

mergeSP (a,b) pair
= let sm = spMerge b (fst pair)
in (a ++ fst sm, merge (snd sm) (snd pair))

It holds on to the pair until the result of the merge is demanded, that is
until the entire (a ++ fst sm) is consumed. Only then is the pair released
and can be collected. On top of that, as soon as a is consumed and (fst sm)
[or bc] is demanded, spMerge forces the evaluation of (fst pair) [c]. After
a few levels, the evaluated list will take more space than the thunk. It
cannot yet be collected, because pair is still alive. The elements have to
be duplicated for (fst sm), because they're intertwined with those of b.
On the next level of merging, they have to be duplicated again.

The latter is translated roughly to

mergeSP (a,b) pair
= let sm = spMerge' b (fst pair) (snd pair)
in (a ++ fst sm, snd sm)

The pair is released as soon as the result of the spMerge' is demanded,
that is, when a is consumed. Then the elements of (fst pair) need not be
duplicated and they can be discarded almost immediately after they have
been produced [for small primes, multiples of larger primes live a little
longer, but those are fewer].

So, no duplication, shorter lifespan => no leak.
Having seen that, the question is, why can't the compiler see that
deconstructing the pair when the first component is needed is better? The
first component of the pair is used in no other place, so keeping the pair
can't have any advantage, can it?

And why does

tfold f (a: ~(b: ~(c:xs))) = ...

leak, but not

tfold f (a:b:c:xs) = ...

?

I guess it's similar.

tfold f (a: ~(b: ~(c:xs)))
= (a `f` (b `f` c)) `f` tfold f ([pairwise f] xs)

is

tfold f (a:zs)
= (a `f` (head zs `f` (head $ tail zs)))
`f` tfold f (pairwise f $ drop 2 zs)

the latter part holds on to the beginning of zs, leading again to data
duplication and too long lifespans.

> mine was relying on the compiler (8-L) to de-couple the two pipes and
> recognize that the second just passes along the final result, unchanged.
>
> The two versions seem to me to be _exactly_ operationally equivalent.

Well, they're not. The main difference is what we told the compiler _when_
to deconstruct the pairs. You said "at the latest possible moment", I said
"as soon as we need the first component".

It's not entirely obvious, but it is frequently said that understanding the
space (and time) behaviour of lazy evaluation isn't always easy.

> All this searching for the code better understood by the compiler is
> _*very*_ frustrating, as it doesn't reflect on the semantics of the
> code, or even the operational semantics of the code. :-[
>
> Weren't the P's supposed to disappear completely in the compiled code?

No. Constructors for data types can't disappear completely, only newtype
constructors can (and do).

> Aren't types just a _behavioral_ definitions??? Aren't we supposed to be
> able to substitute equals for equals dammit??

We can. But substituting equals for equals can alter space and time
complexity.

fromInteger 0 ==
fromInteger (let xs = [1 .. 10^8] in (product xs + sum xs) `mod` 10)

>
> Is this the state of our _best_ Haskell compiler????
>

Yes. It's still a "do what I tell you to" compiler, even if a pretty slick
one, not a "do what I mean" compiler. Sometimes, what you tell the compiler
isn't what you wanted.
It's easier to predict when you give detailed step by step instructions.

Will Ness

unread,

Jan 9, 2010, 2:05:11 AM1/9/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
> Am Freitag 08 Januar 2010 19:45:47 schrieb Will Ness:
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
>
>
> It's not tail-recursive, the recursive call is inside a celebrate.

It is (spMerge that is). It calls tail-recursive celebrate in a tail position.
What you've done, is to eliminate the outstanding context, buy moving it
inward. Your detailed explanation is more clear than that. :)

BTW when I run VIP code it is consistently slower than using just pairs,
modified with wheel and feeder and all. So what's needed is to re-implement
your approach for pairs:

mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
in (a ++ bc, bd)
where
spMerge u [] d = ([], merge u d)
spMerge u@(x:xs) w@(y:ys) d = case compare x y of
LT -> consSP x $ spMerge xs w d
EQ -> consSP x $ spMerge xs ys d
GT -> consSP y $ spMerge u ys d

consSP x ~(a,b) = (x:a,b) -- don't forget that magic `~` !!!

BTW I'm able to eliminate sharing without a compiler switch by using

mtwprimes () = 2:3:5:7:primes
where
primes = doPrimes 121 primes

doPrimes n prs = let (h,t) = span (< n) $ rollFrom 11
in h ++ t `diff` comps prs
doPrimes2 n prs = let (h,t) = span (< n) $ rollFrom (12-1)
in h ++ t `diff` comps prs

mtw2primes () = 2:3:5:7:primes
where
primes = doPrimes 26 primes2
primes2 = doPrimes2 121 primes2

Using 'splitAt 26' in place of 'span (< 121)' didn't work though.

How about them wheels? :)

Will Ness

unread,

Jan 9, 2010, 2:43:43 AM1/9/10

to haskel...@haskell.org

Heinrich Apfelmus <apfelmus <at> quantentunnel.de> writes:

>
> Will Ness wrote:
> > But I get the impression that GHC isn't working through equational
> > reasoning?..
> > I see all this talk about thunks etc.
>
> Sure it does. Concerning the thunks, they're part of the implementation
> of the reduction model (call-by-need aka lazy evaluation).

At run-time? I meant to eliminate as much calculation as possible, pre-run-time.
I would expect the best efforts of the best minds to go into searching for ways
how to eliminate computations altogether, instead of how to perform them better.

> >> Concerning the sieves, there is a fundamental difference between the
> >> imperative sieve and the functional sieves, regardless of whether the
> >> latter start at p or p^2 or use a priority queue. [...]
> >
> > We can directy jump to the next multiple too, it is called (+). :) :)
> >
> > But seriously, the real issue is that we have to merge the produced
> > streams of multiples, while the mutable-storage code works on same one,
> > so its "merging cost" is zero.
>

> Not quite, I think there are two things going on:
>
> 1. In contrast to the standard imperative version, we are implementing
> an on-line algorithm that produces arbitrarily many primes on demand.
> Any imperative on-line version will need to think about storage for
> pending prime filters as well.

True.

> 2. Even the imperative algorithm can be interpreted as "merging" arrays,
> just that the composites are magically merged at the right place (at
> distance p from each other) because arrays offer O(1) "jumps".

i.e. with a merging cost of zero. :)

> In contrast, the functional version needs O(log something) time to
> calculate where to put the next composite.

when thinking it terms of the finally produced sequence, yes. We have to
produce numbers one by one and take care of their ordering ourselves; _they_
just /throw/ the numbers at the shared canvas and let _it_ take care of
ordering them automagically, _later_, on the final sweep through. ISWYM.

>
> > If you could take a look at the tree-merging primes and why it uses
> > too much memory, it would be great.
>
> Fortunately, Daniel already took a detailed look. :)

Yes he really came through! He finally found, and fixed, the space leak. It was
hiding in

mergeSP_BAD (a,b) ~(c,d) = let (bc,b') = spMerge b c

in (a ++ bc, merge b' d)

where
spMerge :: (Ord a) => [a] -> [a] -> ([a],[a])
spMerge a@(x:xs) b@(y:ys) = case compare x y of
LT -> (x:c,d) where (c,d) = spMerge xs b
EQ -> (x:c,d) where (c,d) = spMerge xs ys
GT -> (y:c,d) where (c,d) = spMerge a ys
spMerge a [] = ([] ,a)
spMerge [] b = ([] ,b)

which really ought to have been

mergeSP (a,b) ~(c,d) = let ~(bc,bd) = spMerge b c d

in (a ++ bc, bd)
where
spMerge u [] d = ([], merge u d)
spMerge u@(x:xs) w@(y:ys) d = case compare x y of

LT -> spCons x $ spMerge xs w d
EQ -> spCons x $ spMerge xs ys d
GT -> spCons y $ spMerge u ys d
spCons x ~(a,b) = (x:a,b)

Can you spot the difference? :) :)

> Aww, why remove the cutesy name? The VIPs will be angry for being ignored!

It runs faster on plain pairs, and on less memory, equals for equals. For some
reason. :)

Will Ness

unread,

Jan 9, 2010, 4:52:32 AM1/9/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
>
> Am Donnerstag 07 Januar 2010 11:43:44 schrieb Heinrich Apfelmus:
> > Will Ness wrote:
> > > Heinrich Apfelmus writes:
>
> The below code is now a well-behaved memory citizen (3MB for the 100
millionth prime, about the same as the PQ code). It still is considerably
slower than the PQ code.
> In terms of MUT times as reported by +RTS -sstderr - as well as (MUT+GC)
times - (measured once for prime No. 5*10^5, 10^6, 2*10^6, 4*10^6, 10^7 to get
a rough tendency), it seems to scale a wee bit better than any of the other
tfold versions I created, but a little worse than the PQ versions.
> The relation of MUT times isn't too bad, but the GC times are rather abysmal
(30-40%).
>
> ----------------------------------------------------------------------
>

> data People a = P { vips :: [a], dorks :: [a] }
>

> celebrate x p = P (x:vips p) (dorks p)
>

> mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
> where
> mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
> spMerge l1 [] l3 = P [] (merge l1 l3)
> spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
> LT -> celebrate x (spMerge xs l2 l3)
> EQ -> celebrate x (spMerge xs ys l3)
> GT -> celebrate y (spMerge l1 ys l3)
>

I forgot to say something *very* important. :) Here it is.

Yippee-hurray!

:)

Daniel Fischer

unread,

Jan 9, 2010, 9:01:06 AM1/9/10

to haskel...@haskell.org, Will Ness

Am Samstag 09 Januar 2010 08:04:20 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Freitag 08 Januar 2010 19:45:47 schrieb Will Ness:
> > > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> >
> > It's not tail-recursive, the recursive call is inside a celebrate.
>
> It is (spMerge that is).

No.
"In computer science, tail recursion (or tail-end recursion) is a special
case of recursion in which the last operation of the function, the tail
call, is a recursive call."

The last operation of spMerge is a call to celebrate or the pair
constructor (be that P or (,)). Doesn't matter, though, as for lazy
languages, tail recursion isn't very important.

> It calls tail-recursive celebrate in a tail

> position. What you've done, is to eliminate the outstanding context, by

> moving it inward. Your detailed explanation is more clear than that. :)
>
> BTW when I run VIP code it is consistently slower than using just pairs,

I can't reproduce that. Ceteris paribus, I get the exact same allocation
and GC figures whether I use People or (,), running times identical enough
(difference between People and (,) is smaller than the difference between
runs of the same; the difference between the fastest and the slowest run of
the two is less than 0.5%). I think it must be the other changes you made.

> modified with wheel and feeder and all. So what's needed is to
> re-implement your approach for pairs:
>
> mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
> in (a ++ bc, bd)
> where
> spMerge u [] d = ([], merge u d)
> spMerge u@(x:xs) w@(y:ys) d = case compare x y of
> LT -> consSP x $ spMerge xs w d
> EQ -> consSP x $ spMerge xs ys d
> GT -> consSP y $ spMerge u ys d
>
> consSP x ~(a,b) = (x:a,b) -- don't forget that magic `~` !!!

I called that (<:).

>
>
> BTW I'm able to eliminate sharing without a compiler switch by using
>

Yes, I can too. But it's easy to make a false step and trigger sharing.

I can get a nice speedup (~15%, mostly due to much less garbage collecting)
by doing the final merge in a function without unnecessarily wrapping the
result in a pair (whose secondcomponent is ignored):

-- Doesn't need -fno-cse anymore,
-- but it needs -XScopedTypeVariables for the local type signatures

primes :: forall a. Integral a => () -> [a]

primes () = 2:3:5:7:11:13:calcPrimes 17 primes''
�� where
calcPrimes s cs = rollFrom s `minus` compos cs
bootstrap = 17:19:23:29:31:37:calcPrimes 41 bootstrap
primes' = calcPrimes 17 bootstrap
primes'' = calcPrimes 17 primes'

pmults :: a -> ([a],[a])

pmults p = case map (*p) (rollFrom p) of

(x:xs) -> ([x],xs)

multip :: [a] -> [([a],[a])]

multip ps = map pmults ps

compos :: [a] -> [a]

compos ps = case pairwise mergeSP (multip ps) of

((a,b):cs) -> a ++ funMerge b (pairwise mergeSP cs)

funMerge b (x:y:zs) = let (c,d) = mergeSP x y

in mfun b c d (pairwise mergeSP zs)

mfun u@(x:xs) w@(y:ys) d l = case compare x y of
LT -> x:mfun xs w d l
EQ -> x:mfun xs ys d l
GT -> y:mfun u ys d l
mfun u [] d l = funMerge (merge u d) l

This uses a different folding structure again, which seems to give slightly
better performance than the original tree-fold structure. In contrast to
the VippyPrimes, it profits much from a larger allocation area, running
with +RTS -A2M gives a >10% speedup for prime # 10M/20M, +RTS -A8M nearly
20%. -A16M and -A32M buy a little more, but in that range at least, it's
not much (may be significant for larger targets).
Still way slower than PQ, but the gap is narrowing.

>
> mtwprimes () = 2:3:5:7:primes
> where
> primes = doPrimes 121 primes
>
> doPrimes n prs = let (h,t) = span (< n) $ rollFrom 11
> in h ++ t `diff` comps prs
> doPrimes2 n prs = let (h,t) = span (< n) $ rollFrom (12-1)
> in h ++ t `diff` comps prs
>
> mtw2primes () = 2:3:5:7:primes
> where
> primes = doPrimes 26 primes2
> primes2 = doPrimes2 121 primes2
>
>
> Using 'splitAt 26' in place of 'span (< 121)' didn't work though.
>
>
> How about them wheels? :)
>

Well, what about them?

Daniel Fischer

unread,

Jan 9, 2010, 9:08:17 AM1/9/10

to haskel...@haskell.org, Will Ness

Am Freitag 08 Januar 2010 18:37:21 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Donnerstag 07 Januar 2010 11:43:44 schrieb Heinrich Apfelmus:
> > > Will Ness wrote:
> > >
> > > Hm? In my world view, there is only reduction to normal form and I
> > > don't see how "allocate its own storage" fits in there. Okasaki
> > > having shown something to that end would be new to me?
> >
> > Perhaps what was meant was "storage must be allocated for each filter"
> > (which, however, seems trivial).
>
> I still contend that in case of nested filters, where each filter has
> only one consumer, even that isn't ultimately necessary (a chain of
> pointers could be formed). That's because there's no "peek"s, only
> "pull"s there. If the filters are used in merge situation, then there
> will be some "peek"s so current value must be somehow stored, though
> it's better to be made explicit and thus static (the place for it, I
> mean, instead of the "rolling" cons cell). Such implementation technique
> would prevent some extra gc.
>

Well, for each filter we need to store

1. a pointer to where in the wheel we are
2. the prime by which to multiply the steps of the wheel
3. the current value of the filter (which number to reject next)

I don't see how to use less. It's not much, but it's not zero storage.

Will Ness

unread,

Jan 9, 2010, 11:25:48 AM1/9/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
> Am Samstag 09 Januar 2010 08:04:20 schrieb Will Ness:
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > > Am Freitag 08 Januar 2010 19:45:47 schrieb Will Ness:
> > > > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > >
> > > It's not tail-recursive, the recursive call is inside a celebrate.
> >
> > It is (spMerge that is).
>
> No.
> "In computer science, tail recursion (or tail-end recursion) is a special
> case of recursion in which the last operation of the function, the tail
> call, is a recursive call."

As far as I understand it, when a function makes a tail call to a tail
recursive function (be it itself or some other function) it is itself tail
recursive. I.e. that call may be replaced with a direct jump, with no new
context to be created. That is what your version accomplishes, too. Mine really
held to that pair ~(c,d) when it wanted to call (merge _ d) _after_ the call to
spMerge. By passing the pre-decomposed part of a pair, 'd', into the process
environment of spMerge, you've made it tail recursive - it carried along all
the context it needed. That's what've eliminated the space leak, so I'd say
tail recursion does play a role under lazy evaluation - when a compiler isn't
smart enough to do _that_ for us by itself. _Were_ it reliably smart, even non-
recursive functions like my initial variant would work just as well.

> The last operation of spMerge is a call to celebrate or the pair
> constructor (be that P or (,)). Doesn't matter, though, as for lazy
> languages, tail recursion isn't very important.
>
> > It calls tail-recursive celebrate in a tail
> > position. What you've done, is to eliminate the outstanding context, by
> > moving it inward. Your detailed explanation is more clear than that. :)
> >
> > BTW when I run VIP code it is consistently slower than using just pairs,
>
> I can't reproduce that. Ceteris paribus, I get the exact same allocation
> and GC figures whether I use People or (,), running times identical enough
> (difference between People and (,) is smaller than the difference between
> runs of the same; the difference between the fastest and the slowest run of
> the two is less than 0.5%). I think it must be the other changes you made.

I just take the VIP code as it is on a web page, and my intial variant without
the wheel, and compare. Then I add the wheel in the same fashion, and then
feeder, and compare again. When I tested that Monid instance code I too
compared it to the straight pairs, and it was slower. Don't know why.

> > modified with wheel and feeder and all. So what's needed is to
> > re-implement your approach for pairs:
> >
> > mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
> > in (a ++ bc, bd)
> > where

> > spMerge b [] d = ([], merge b d)
> > spMerge b@(x:xs) c@(y:ys) d = case compare x y of
> > LT -> consSP x $ spMerge xs c d

> > EQ -> consSP x $ spMerge xs ys d

> > GT -> consSP y $ spMerge b ys d
> >
> > consSP x ~(bc,bd) = (x:bc,bd) -- don't forget that magic `~` !!!

>
> I called that (<:).
>
> >
> > BTW I'm able to eliminate sharing without a compiler switch by using
> >
>
> Yes, I can too. But it's easy to make a false step and trigger sharing.

yes indeed. It's seems unpredictable. Fortunately GHC couldn't tell that (12-1)
was 11 by the looks of it. :) Your idea certainly seems right, that there ought
to be some control over sharing on a per-function basis somehow without these
ridiculous code tricks.

> I can get a nice speedup (~15%, mostly due to much less garbage collecting)
by doing the final merge in a function without unnecessarily wrapping the
result in a pair

Will have to wrap my head around _that_. But that would be fighting with the
compiler again. I don't like that, I much rather fight with the problem at
hand. There shouldn't be any pairs in the compiled code in the first place.
They just guide the staging of (++) and (merge) intertwined between the
producer streams. At each finite step, when the second part of a pair comes
into play, it is only after its first part was completely consumed. I guess the
next thing to try would be to actually create a data type MergeNode and arrange
_those_ in a tree and see if that helps. That would be the next half-step
towards the PQ itself.

> This uses a different folding structure again,

which I am yet to decipher. :)

> > How about them wheels? :)
>
> Well, what about them?

I dunno, it makes for a real easy wheel derivation, and coming out of our
discussion of euler's sieve it's a nice cross-pollination. :) Having yet
another list representation suddenly cleared up the whole issue (two of them).
I'll repost it one last time as I've made some corrections to it:

> euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
> primes = euler [2..]

> primes = euler $ listFrom [2] 1

> = 2:euler ( listFrom [3] 1 `minus` map(2*) (listFrom [2] 1)) )
> listFrom [3,4] 2 `minus` listFrom [4] 2

> == listFrom [3] 2 ==

> = 2:3:euler (listFrom [5] 2 `minus` map(3*) (listFrom [3] 2))
> listFrom [5,7,9] 6 `minus` listFrom [9] 6

> == listFrom [5,7] 6 ==

listFrom xs by = concat $ iterate (map (+ by)) xs

rolls = unfoldr (Just . nextRoll) ([2],1)

nextRoll r@(xs@(p:xt),b) = ( (p,r') , r')
where
r' = (xs',p*b)
xs' = (concat $ take p $ iterate (map (+ b)) $ xt ++ [p+b])
`minus` map (p*) xs

nthWheel n = let (ps,rs) = unzip $ take n rolls
(x:xs,b) = last rs
in ((ps, x), zipWith (-) (xs++[x+b]) (x:xs))

eulerPrimes n = let (ps,rs) = unzip $ take n rolls

(qs@(q:_),b) = last rs
in ps ++ takeWhile (< q^2) qs

(BTW I've noticed that when I reply in Gmane, all the text below double hyphen,
if present in the post to which I'm replying, just disappears. This may be an
artefact of some specific browser.)

Will Ness

unread,

Jan 10, 2010, 11:28:48 AM1/10/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
> Am Freitag 08 Januar 2010 19:45:47 schrieb Will Ness:
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
>
> > >
> > > mergeSP :: Integral a => People a -> People a -> People a
> > > mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
> > > where
> > > mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
> > > spMerge l1 [] l3 = P [] (merge l1 l3)
> > > spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
> > > LT -> celebrate x (spMerge xs l2 l3)
> > > EQ -> celebrate x (spMerge xs ys l3)
> > > GT -> celebrate y (spMerge l1 ys l3)
> > >
> > > ----------------------------------------------------------------------

Actually, the minimal edit that does the trick (of eliminating the space leak
that you've identified) for my original code is just

mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
in (a ++ bc, bd)
where

spMerge b [] d = ([] ,merge b d)
spMerge b@(x:xs) c@(y:ys) d = case compare x y of
LT -> (x:u,v) where (u,v) = spMerge xs c d
EQ -> (x:u,v) where (u,v) = spMerge xs ys d
GT -> (y:u,v) where (u,v) = spMerge b ys d
spMerge [] c d = ([] ,merge c d)

which hardly looks at all different at the first glance. Just for reference, it
was

{-

mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c
in (a ++ bc, merge b' d)

where
spMerge b@(x:xs) c@(y:ys) = case compare x y of
LT -> (x:u,v) where (u,v) = spMerge xs c
EQ -> (x:u,v) where (u,v) = spMerge xs ys
GT -> (y:u,v) where (u,v) = spMerge b ys

spMerge b [] = ([] ,b)

spMerge [] c = ([] ,c)
-}

spMerge of course is not tail recursive here in both versions if seen through
the imperative eyes. But lazy evaluation makes it effectively so. The important
thing is, when the final point is reached, there's no outstanding context -
everything is present. There should be a name for such concept. This is very
similar to late instantiation in Prolog (programming with "holes"), and I think
this *would* pass as a tail-recursive function /there/.

Even in the new code the compiler could've internally held on to the original
pair and only deconstructed the 'd' out of it at the final call to merge,
recreating the space leak. It could just as well have recognized that 'd' isn't
changed inside spMerge (we're pure in Haskell after all) and deconstructed the
pair in the original code. Something is missing here.

> As it turns out, the important things are
>
> 1. a feeder and separate lists of multiples for the feeder and the runner,
> for the reasons detailed earlier (two-step feeding and larger wheel are
> pleasing but minor optimisations).
>
> 2. a strict pattern in tfold
>
> 3. moving the merge inside spMerge

Heinrich Apfelmus

unread,

Jan 12, 2010, 5:31:25 AM1/12/10

to haskel...@haskell.org

Tricky stuff. It is known that pairs/records are prone to unwanted
retention, see for example the recent thread

http://thread.gmane.org/gmane.comp.lang.haskell.cafe/66903/focus=67871

or

Jan Sparud. Fixing some space leaks without a garbage collector.
http://citeseer.ist.psu.edu/240848.html

It is exactly because these troubles that I'm advocating the original
VIP data structure that buries the dorks (that name is awesome :D) deep
inside the structure. :)

In any case, it appears to me that the lazy pattern match in mergeSP
is actually unnecessary! This is because mergeSP returns a pair
constructor immediately, so infinite nesting works even when the second
argument is demanded. In other words,

mergeSP :: Integral a => People a -> People a -> People a

mergeSP (P a b) (P c d) = P (a ++ bc) (merge b' d)
where
P bc b' = spMerge b c
spMerge [] ys = P [] ys
spMerge xs [] = P [] xs
spMerge xs@(x:xt) ys@(y:yt) = case compare x y of
LT -> celebrate x (spMerge xt ys)
EQ -> celebrate x (spMerge xt yt)
GT -> celebrate y (spMerge xs yt)

should work fine and do away with the memory issue because the pair is
now deconstructed immediately.

Hm, a strict second argument might however mean that the tree produced
by tfold is demanded too early.

> And why does
>
> tfold f (a: ~(b: ~(c:xs))) = ...
>
> leak, but not
>
> tfold f (a:b:c:xs) = ...
>
> ?
>
> I guess it's similar.
>
> tfold f (a: ~(b: ~(c:xs)))
> = (a `f` (b `f` c)) `f` tfold f ([pairwise f] xs)
>
> is
>
> tfold f (a:zs)
> = (a `f` (head zs `f` (head $ tail zs)))
> `f` tfold f (pairwise f $ drop 2 zs)
>
> the latter part holds on to the beginning of zs, leading again to data
> duplication and too long lifespans.

Same issue with retaining pairs while only the components are needed.

Ideally, we ought to be able to "couple" the two components, i.e. the
idea is that tail zs is reduced to a pointer to the tail whenever
head zs is demanded and vice versa. I think that's what Sparud does in
his paper, not sure how well it's implemented in GHC, I vaguely remember
a bug reports. Alternatively, retaining zs in any way might already be
too much.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

_______________________________________________

Daniel Fischer

unread,

Jan 12, 2010, 8:02:30 AM1/12/10

to haskel...@haskell.org, Heinrich Apfelmus

Am Dienstag 12 Januar 2010 11:30:07 schrieb Heinrich Apfelmus:
> Tricky stuff. It is known that pairs/records are prone to unwanted
> retention, see for example the recent thread
>
> http://thread.gmane.org/gmane.comp.lang.haskell.cafe/66903/focus=67871
>
> or
>
> Jan Sparud. Fixing some space leaks without a garbage collector.
> http://citeseer.ist.psu.edu/240848.html

"CiteSeerx is momentarily busy, Please try us again, in a few moments.
We apologize for any inconvenience."

:( tried and re-tried.

Took http://bert.md.chalmers.se/pub/cs-reports/papers/sparud/spaceleak-
fix.ps.gz finally

>
> It is exactly because these troubles that I'm advocating the original
> VIP data structure that buries the dorks (that name is awesome :D) deep
> inside the structure. :)
>
>
> In any case, it appears to me that the lazy pattern match in mergeSP
> is actually unnecessary! This is because mergeSP returns a pair
> constructor immediately, so infinite nesting works even when the second
> argument is demanded. In other words,
>
> mergeSP :: Integral a => People a -> People a -> People a
> mergeSP (P a b) (P c d) = P (a ++ bc) (merge b' d)
> where
> P bc b' = spMerge b c
> spMerge [] ys = P [] ys
> spMerge xs [] = P [] xs
> spMerge xs@(x:xt) ys@(y:yt) = case compare x y of
> LT -> celebrate x (spMerge xt ys)
> EQ -> celebrate x (spMerge xt yt)
> GT -> celebrate y (spMerge xs yt)
>
> should work fine and do away with the memory issue because the pair is
> now deconstructed immediately.

No, <<loop>>. For the reason you stated below.
In

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs

, it must compute smartfold f xs too early to match it against P c d. The
compiler can't see that smartfold mergeSP always returns a P [well, it
might return _|_, mightn't it?].

>
> Hm, a strict second argument might however mean that the tree produced
> by tfold is demanded too early.
>

>
>

Heinrich Apfelmus

unread,

Jan 13, 2010, 4:44:37 AM1/13/10

to haskel...@haskell.org

Daniel Fischer wrote:

> Heinrich Apfelmus wrote:
>>
>> It is exactly because these troubles that I'm advocating the original
>> VIP data structure that buries the dorks (that name is awesome :D) deep
>> inside the structure. :)

In fact, your transformation that fixes the space leaks pretty much
emulates the behavior of the old

data People a = VIP a (People a) | Dorks [a]

structure. This becomes apparent if we put the last two arguments of
spMerge back into a pair:

mergeSP (P a b) ~(P c d) =
let P bc m = spMerge b (P c d) in P (a ++ bc) m
where
spMerge b (P [] d) = P [] (merge b d)
spMerge xs@(x:xt) cd@(P (y:yt) d) = case compare x y of
LT -> celebrate x (spMerge xt cd )
EQ -> celebrate x (spMerge xt (P yt d))
GT -> celebrate y (spMerge xs (P yt d))

In particular, forcing dorks (mergeSP ...) also forces the complete
VIP list.

I wonder whether it's really the liveness of pair in

mergeSP (a,b) pair
= let sm = spMerge b (fst pair)
in (a ++ fst sm, merge (snd sm) (snd pair))

that is responsible for the space leak, for chances are that Sparud's
technique applies and pair is properly disposed of. Rather, it could
be that we need the stronger property that forcing the second component
will evaluate the first to NF.

>> In any case, it appears to me that the lazy pattern match in mergeSP
>> is actually unnecessary! This is because mergeSP returns a pair
>> constructor immediately, so infinite nesting works even when the second

>> argument is demanded. [...]

>
> No, <<loop>>. For the reason you stated below.
> In
>
> tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs
>
> , it must compute smartfold f xs too early to match it against P c d. The
> compiler can't see that smartfold mergeSP always returns a P [well, it
> might return _|_, mightn't it?].

Oops, silly me! Mea culpa, of course mergeSP a (mergeSP b (mergeSP c
.))) or any infinite nesting is not going to work with a strict
pattern. >_> <_<

Daniel Fischer

unread,

Jan 13, 2010, 6:35:24 AM1/13/10

to haskel...@haskell.org, Heinrich Apfelmus

Am Mittwoch 13 Januar 2010 10:43:42 schrieb Heinrich Apfelmus:
> I wonder whether it's really the liveness of pair in
>
> mergeSP (a,b) pair
> = let sm = spMerge b (fst pair)
> in (a ++ fst sm, merge (snd sm) (snd pair))
>
> that is responsible for the space leak, for chances are that Sparud's
> technique applies and pair is properly disposed of. Rather, it could
> be that we need the stronger property that forcing the second component
> will evaluate the first to NF.

I think that is responsible. At least that's how I understand the core:

mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d)
where
(bc, b') = spMerge b c
spMerge ...
----------------------------------------------------------------------
OldMerge.$wmergeSP :: [GHC.Types.Int]
-> [GHC.Types.Int]
-> ([GHC.Types.Int], [GHC.Types.Int])
-> (# [GHC.Types.Int], [GHC.Types.Int] #)
GblId
[Arity 3
Str: DmdType LLL]
OldMerge.$wmergeSP =
\ (ww_sny :: [GHC.Types.Int])
(ww1_snz :: [GHC.Types.Int])
(w_snB :: ([GHC.Types.Int], [GHC.Types.Int])) ->
let {
ds_so7 [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int])
LclId
[Str: DmdType]
ds_so7 =
case w_snB of _ { (c_adj, _) ->
case OldMerge.$wspMerge ww1_snz c_adj
of _ { (# ww3_snH, ww4_snI #) ->
(ww3_snH, ww4_snI)
}
} } in
(# GHC.Base.++
@ GHC.Types.Int
ww_sny
(case ds_so7 of _ { (bc_ajQ, _) -> bc_ajQ }),
case ds_so7 of _ { (_, b'_ajS) ->
case w_snB of _ { (_, d_adk) -> OldMerge.merge b'_ajS d_adk }

-- Here, in the second component of the result,
-- we reference the entire pair to get the dorks

} #)

OldMerge.mergeSP :: ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
GblId
[Arity 2
Worker OldMerge.$wmergeSP
Str: DmdType U(LL)Lm]
OldMerge.mergeSP =
__inline_me (\ (w_snw :: ([GHC.Types.Int], [GHC.Types.Int]))
(w1_snB :: ([GHC.Types.Int], [GHC.Types.Int])) ->
case w_snw of _ { (ww_sny, ww1_snz) ->
case OldMerge.$wmergeSP ww_sny ww1_snz w1_snB
of _ { (# ww3_snN, ww4_snO #) ->
(ww3_snN, ww4_snO)
}
})
----------------------------------------------------------------------

vs.

mergeSP (a,b) ~(c,d) = (a ++ bc, m)
where

(bc,m) = spMerge b c d

spMerge ...
----------------------------------------------------------------------
NewMerge.$wmergeSP :: [GHC.Types.Int]
-> [GHC.Types.Int]
-> ([GHC.Types.Int], [GHC.Types.Int])
-> (# [GHC.Types.Int], [GHC.Types.Int] #)
GblId
[Arity 3
Str: DmdType LLL]
NewMerge.$wmergeSP =
\ (ww_snB :: [GHC.Types.Int])
(ww1_snC :: [GHC.Types.Int])
(w_snE :: ([GHC.Types.Int], [GHC.Types.Int])) ->
let {
ds_soa [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int])
LclId
[Str: DmdType]
ds_soa =
case w_snE of _ { (c_adj, d_adk) ->

-- There's no reference to the pair after this

case NewMerge.$wspMerge ww1_snC c_adj d_adk
of _ { (# ww3_snK, ww4_snL #) ->
(ww3_snK, ww4_snL)
}
} } in
(# GHC.Base.++
@ GHC.Types.Int
ww_snB
(case ds_soa of _ { (bc_ajT, _) -> bc_ajT }),
case ds_soa of _ { (_, b'_ajV) -> b'_ajV } #)

NewMerge.mergeSP :: ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
-> ([GHC.Types.Int], [GHC.Types.Int])
GblId
[Arity 2
Worker NewMerge.$wmergeSP
Str: DmdType U(LL)Lm]
NewMerge.mergeSP =
__inline_me (\ (w_snz :: ([GHC.Types.Int], [GHC.Types.Int]))
(w1_snE :: ([GHC.Types.Int], [GHC.Types.Int])) ->
case w_snz of _ { (ww_snB, ww1_snC) ->
case NewMerge.$wmergeSP ww_snB ww1_snC w1_snE
of _ { (# ww3_snQ, ww4_snR #) ->
(ww3_snQ, ww4_snR)
}
})
----------------------------------------------------------------------

Will Ness

unread,

Jan 14, 2010, 2:26:34 AM1/14/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
> Am Mittwoch 13 Januar 2010 10:43:42 schrieb Heinrich Apfelmus:
> > I wonder whether it's really the liveness of pair in
> >
> > mergeSP (a,b) pair
> > = let sm = spMerge b (fst pair)
> > in (a ++ fst sm, merge (snd sm) (snd pair))
> >
> > that is responsible for the space leak, for chances are that Sparud's
> > technique applies and pair is properly disposed of. Rather, it could
> > be that we need the stronger property that forcing the second component
> > will evaluate the first to NF.
>
> I think that is responsible. At least that's how I understand the core:
>
> mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d)
> where
> (bc, b') = spMerge b c
> spMerge ...

That is equivalent to

first (a++) . second (`merge`d) $ spMerge b c

and Daniel's fix is equivalent to

first (a++) $ spMerge b c d

Now, when compiler sees the first variant, it probably treats spMerge as
opaque. I.e. although in reality spMerge only contributes to the
first "channel" while it is progressively instantiated, and (`merge`d) will
only be called upon when spMerge's final clause is reached, that is (most
likely) not known to the compiler at this stage. When looking at just the first
expression itself, it has to assume that spMerge may contribute to both
channels (parts of a pair) while working, and so can't know _when_ /d/ will get
called upon to contribute to the data, as it is consumed finally at access.

So /d/ is gotten hold of prematurely, _before_ going into spMerge.

The second variant passes the responsibility for actually accessing its inputs
to spMerge itself, and _it_ is clear about needing /d/ only in the very end.

Just a theory. :)

Does that make sense?

Daniel Fischer

unread,

Jan 14, 2010, 6:04:41 AM1/14/10

to haskel...@haskell.org

No, the problem is that d itself is gotten hold of too late, it is
accessed only indirectly via the pair, so we keep an unnecessary reference
to c via d.

>
> The second variant passes the responsibility for actually accessing its
> inputs to spMerge itself, and _it_ is clear about needing /d/ only in
> the very end.

The second variant is clear about not needing the _pair_ anymore once
spMerge is entered. Thus d doesn't reference c anymore.

Will Ness

unread,

Jan 16, 2010, 12:54:22 PM1/16/10

to haskel...@haskell.org

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

>
> Am Donnerstag 14 Januar 2010 08:25:48 schrieb Will Ness:
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > > Am Mittwoch 13 Januar 2010 10:43:42 schrieb Heinrich Apfelmus:
> > > > I wonder whether it's really the liveness of pair in
> > > >
> > > > mergeSP (a,b) pair
> > > > = let sm = spMerge b (fst pair)
> > > > in (a ++ fst sm, merge (snd sm) (snd pair))
> > > >

> > > > that is responsible for the space leak, ...

> > >
> > > I think that is responsible. At least that's how I understand the
> > > core:
> > >
> > > mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d)
> > > where
> > > (bc, b') = spMerge b c
> > > spMerge ...
> >
> > That is equivalent to
> >
> > first (a++) . second (`merge`d) $ spMerge b c
> >
> > and Daniel's fix is equivalent to
> >
> > first (a++) $ spMerge b c d
> >

That should've been

mergeSP (a,b) p = first(a++) . second(`merge`snd p) $ spMerge b (fst p)

and

mergeSP (a,b) p = first(a++) $ spMerge b (fst p) (snd p)

The code fragments you've posted are essentially

mergeSP (a,b) p = let res = case p of (c,_) ->
case spMerge b c of (# x,y #) ->
(x,y)
in
(# (++) a (case res of (bc,_)-> bc) ,
case res of (_,b') ->
case p of (_,d) -> merge b' d #)

and

mergeSP (a,b) p = let res = case p of (c,d) ->
case spMerge b c d of (# x,y #) ->
(x,y)
in
(# (++) a (case res of (bc,_)-> bc) ,
case res of (_,b') -> b' #)

This looks like Haskell to me, with many detailes explicitely written out,
probaly serving as immediate input to the compiler - not its output. So it
can't say to us much about how this is actually implemented on the lower level.
(?)

Your theory would certainly hold if the translation was done one-to-one without
any further code rearrangements. But it could've been further re-arranged by
the compiler at some later stage (is there one?) into an equivalent of, e.g.

mergeSP (a,b) p = let (bc,b') = case p of (c,_) ->
case spMerge b c of (x,y) ->
(x,y)
in
(# (++) a bc ,
case p of (_,d) -> merge b' d #)

and further,

mergeSP (a,b) p = let (c,d) = case p of (x,y) -> (x,y)
(bc,b') = case spMerge b c of (x,y) ->
(x,y)
in
(# (++) a bc , merge b' d #)

could it? This would take hold on /d/ and /c/ at the same time, right?

What is that code that you've shown exactly? At which stage is it produced and
is it subject to any further manipulation? I apologise if these are obvious
questions, I don't know anything about GHC. I also don't know what (# x,y #)
means?

One thing seems certain - we should not hold explicit references to same
entities in different parts of our code, to avoid space leaks with more
confidence. To make code look as much tail-recursive as possible, so to speak.

Does that make sense?

Anyway that was a very educational (for me) and fruitful discussion, and I
greatly appreciate your help, and fixing and improving of the code.

Thanks!

Daniel Fischer

unread,

Jan 16, 2010, 3:40:05 PM1/16/10

to haskel...@haskell.org, Will Ness

It's 'core', the intermediate language GHC uses and does its
transformations upon. It's more or less a slimmed down version of Haskell.

That came from -ddump-simpl, thus it's the output of the simplifier, after
numerous transformation/optimisation passes. I think that is then fairly
directly translated to assembly (via Cmm), the STG to STG and Cmm passes do
not much optimisation anymore (I may be wrong, what know I about the
compiler. However, I've found that the -ddump-simpl output corresponds well
to the actual behaviour whenever I look.).

I find that much more redable than the -fext-core output
(http://www.haskell.org/ghc/docs/6.12.1/html/users_guide/ext-core.html
"GHC can dump its optimized intermediate code (said to be in “Core” format)
to a file as a side-effect of compilation."), which says the same, only
more verbose and less readable.

Of course, if I could read assembly, that would exactly reveal what
happens.

>
> Your theory would certainly hold if the translation was done one-to-one
> without any further code rearrangements. But it could've been further
> re-arranged by the compiler at some later stage (is there one?) into an
> equivalent of, e.g.
>
>
> mergeSP (a,b) p = let (bc,b') = case p of (c,_) ->
> case spMerge b c of (x,y) ->
> (x,y)
> in
> (# (++) a bc ,
> case p of (_,d) -> merge b' d #)
>
>
> and further,
>
>
> mergeSP (a,b) p = let (c,d) = case p of (x,y) -> (x,y)
> (bc,b') = case spMerge b c of (x,y) ->
> (x,y)
> in
> (# (++) a bc , merge b' d #)
>
>
> could it? This would take hold on /d/ and /c/ at the same time, right?

It would. But AFAICT, after the simplifier is done, no such rearrangements
occur anymore.

>
> What is that code that you've shown exactly? At which stage is it
> produced and is it subject to any further manipulation?

I'm no GHC expert either, so I don't know what it does when exactly.
But after parsing and desugaring, there come a few iterations of the
simplifier, intermingled with specialising, demand-analysis, CSE, let-
floating, worker-wrapper-split, ... .
At the end of all that, the Tidy Core is generated (part of which I
posted). What happens from then on, well ...

> I apologise if
> these are obvious questions, I don't know anything about GHC. I also
> don't know what (# x,y #) means?

Unboxed tuple (pair in this case). That means, have the components for
themselves, don't wrap them in a (,) constructor.

>
> One thing seems certain - we should not hold explicit references to same
> entities in different parts of our code, to avoid space leaks with more
> confidence.

Except when it's good to share ;)

> To make code look as much tail-recursive as possible, so to speak.

Tail recursion isn't really important (for GHC at least, I think for lazy
languages in general), due to different evaluation models (cf.
http://www.haskell.org/pipermail/haskell-cafe/2009-March/058607.html and
the thread starting with
http://www.haskell.org/pipermail/haskell-cafe/2007-May/025570.html).

>
> Does that make sense?

In general, yes.

>
> Anyway that was a very educational (for me) and fruitful discussion, and
> I greatly appreciate your help, and fixing and improving of the code.
>
> Thanks!

You're welcome.