Running out of memory when using filter?


Paul Mooser

Dec 4, 2008, 6:20:37 PM
to Clojure
I've been writing a few different functions that return and operate on
some very long sequences. Everything I've been doing has been written
in terms of underlying operations that are described as lazy, so I
assumed everything would be fine as long as I don't retain the head of
any sequences.

However, I'm running out of heap space using the following function to
filter my sequences:

(defn documents-from-vendors [doc-seq m]
  (filter
    (fn [document]
      (let [item (. document get Constants/ITEM_ID)]
        (contains? m item)))
    doc-seq))

I know that filter is lazy, and I see that the documentation of
lazy-cons says that lazy-conses are cached. Cached for what duration? Does
this mean that filter is not appropriate for use on long sequences, or
is there something else likely to be going on here?
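The caching behavior in question can be illustrated with a small sketch (hypothetical names; written against current Clojure, where lazy-seq-based map plays the role of lazy-cons, but the principle is the same): realized cells stay cached exactly as long as something still references the head.

```clojure
;; Sketch (not from the thread): realized elements of a lazy seq stay
;; cached only while a reference to the head survives.

(defn last-of-fresh []
  ;; the head is a local that can be cleared as `last` walks forward,
  ;; so realized cells become garbage as we go
  (last (map inc (range 1000000))))

(def big-seq (map inc (range 1000000)))   ; head retained by the var

(defn last-of-held []
  ;; walking big-seq realizes and caches every cell, and the var keeps
  ;; them all reachable for as long as it stays rooted
  (last big-seq))
```

With a large enough input and a small heap, the second form is the one that risks an OutOfMemoryError.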

Thanks for your help!

Stuart Sierra

Dec 4, 2008, 7:09:28 PM
to Clojure
On Dec 4, 6:20 pm, Paul Mooser <taron...@gmail.com> wrote:
> However, I'm running out of heap space using the following function to
> filter my sequences:
...
> I know that filter is lazy, and I see that the documentation of lazy-
> cons says that lazy-conses are cached. Cached for what duration ? Does
> this mean that filter is not appropriate for use on long sequences, or
> is there something else likely to be going on here?


Hi Paul,
I think a lazy cons is cached only as long as you hold a pointer to
the head. Are you sure you don't have any pointers to doc-seq hanging
around in another piece of code? Or, alternately, pointers to the
filter output?

-S

Paul Mooser

Dec 4, 2008, 8:09:03 PM
to Clojure
Hi Stuart,

On Dec 4, 4:09 pm, Stuart Sierra <the.stuart.sie...@gmail.com> wrote:
> I think a lazy cons is cached only as long as you hold a pointer to
> the head.  Are you sure you don't have any pointers to doc-seq hanging
> around in another piece of code?  Or, alternately, pointers to the
> filter output?

You're probably right with regard to lazy-cons not being the problem,
because looking at the source code for the map function, it is built
on top of lazy-conses as well, and I'm using it fine for some
sequences with millions of items.

I don't think I'm retaining either the head or the sequence it is
creating, because I only use this in the context of doseq (here's a
brief snippet):

(doseq [document (documents-from-vendors (document-seq path) vendors)

When I'm filtering using documents-from-vendors (as above), it
actually runs out of memory quite quickly - it occurred to me that
perhaps there was some small memory overhead, but looking at the code
I could not reason about why it would blow up after a few tens of
thousands of documents, instead of many millions.

If I remove that filtering line, and instead do:

(doseq [document (document-seq path)

it doesn't blow up, and it will get through the complete sequence
without exhausting the available heap space (but of course without the
filtering I'm including things in the set of items under consideration
that I don't actually want there). I can manually filter when I'm
consuming documents from one of my other (working) sequences, but I
like the elegance of composing filtering operations on top of lazy
sequences.

Thanks for the feedback and ideas.

Paul Mooser

Dec 5, 2008, 1:48:11 PM
to Clojure
I'm continuing to try to suss this out, so I decided to run with a
memory profiler. I'm seeing tens of thousands of lazy conses
accounting for hundreds of megabytes of memory, which perhaps implies
I'm holding on to a reference to them somewhere, but I just don't see
how, since as I showed above, I am only referring to these sequences
in the context of doseq.

I'll post an update if I find a smoking gun.

Paul Mooser

Dec 5, 2008, 2:38:36 PM
to Clojure
My operating theory was that the anonymous function being used by
filter was closing over both parameters to the enclosing function, but
making a simple modification to avoid that didn't seem to address the
problem.

Paul Mooser

Dec 5, 2008, 4:59:57 PM
to Clojure
The memory profiler says that the following object is a GC root which
is holding onto the collection being passed into the filter call:

clojure.core$filter__3364$fn__3367

I'm not familiar enough with clojure's internals to speculate about
what that means, beyond what I've already mentioned previously in the
thread. Does anyone else have any ideas?

Stuart Sierra

Dec 5, 2008, 10:37:41 PM
to Clojure
On Dec 5, 4:59 pm, Paul Mooser <taron...@gmail.com> wrote:
> The memory profiler says that the following object is a GC root which
> is holding onto the collection being passed into the filter call:
>
> clojure.core$filter__3364$fn__3367

That class should be the instance of the anonymous fn you defined in
your filter.

Here's an experiment -- make the filter a normal function, i.e.:

(def *m* ... fill in your map here ...)

(defn my-filter [document]
  (let [item (. document get Constants/ITEM_ID)]
    (contains? *m* item)))

(defn documents-from-vendors [doc-seq]
  (filter my-filter doc-seq))

Does it still blow up?

-Stuart Sierra

Paul Mooser

Dec 6, 2008, 4:52:08 AM
to Clojure
On Dec 5, 7:37 pm, Stuart Sierra <the.stuart.sie...@gmail.com> wrote:
> Here's an experiment -- make the filter a normal function, i.e.:
> <snip>

Yeah, I tried that in various forms previously - I wanted to make sure
that I wasn't implicitly capturing the parameters of documents-from-
vendors, so I did basically exactly what you suggested in order to
remove that possibility (that was what I was alluding to in my 11:38AM
post). Unfortunately, it had no effect.

I'm redoing the heap analysis right now, using this modified function
just to verify that the underlying issue is the same.

Ok, even after the change precisely as you described, I still get the
same result. The following object is a GC root of a graph worth
several hundreds of megabytes of memory, consisting basically of a
giant chain of lazy-conses:

clojure.core$filter__3364$fn__3367

Chouser

Dec 6, 2008, 8:45:07 AM
to clo...@googlegroups.com
On Sat, Dec 6, 2008 at 4:52 AM, Paul Mooser <taro...@gmail.com> wrote:
>
> Ok, even after the change precisely as you described, I still get the
> same result. The following object is a GC root of a graph worth
> several hundreds of megabytes of memory, consisting basically of a
> giant chain of lazy-conses:
>
> clojure.core$filter__3364$fn__3367

This may not be worth much, but can you see the data members of that
object? It's not itself the head of a cons chain, presumably, so I'm
wondering if the data member that *is* at the head has a useful name.

--Chouser

MikeM

Dec 6, 2008, 12:37:28 PM
to Clojure
A while back I posted an experimental patch to capture each form that
is compiled into a fn. This might be useful in your case, where you
have identified a function but don't have details on it. Here's the
thread:
http://groups.google.com/group/clojure/browse_frm/thread/e63e48a71935e31a?q=

Also, you can name your anonymous fn's to help track a problem back to
your source (Clojure will include your fn name in the class name).

Aaron Cohen

Dec 6, 2008, 12:41:39 PM
to clo...@googlegroups.com
Would it be possible to make anonymous fns have a "toString" which
incorporates the file and line number they're defined on?

Paul Mooser

Dec 6, 2008, 12:58:43 PM
to Clojure
On Dec 6, 5:45 am, Chouser <chou...@gmail.com> wrote:
> This may not be worth much, but can you see the data members of that
> object? It's not itself the head of a cons chain, presumably, so I'm
> wondering if the data member that *is* at the head has a useful name.

I can, and it has the elements you might expect for something defined
in the context of filter:

coll
pred

coll is a lazy-cons, and is the start of a chain of a large number
of lazy conses. One thing that is interesting is that it does not
appear to be the head of the sequence - it's near the head, but not
the head, as far as I can tell.

pred is the predicate I passed in, and I can see that it definitely
does not itself maintain a reference to the sequence, at least
according to the heap profiler.

Paul Mooser

Dec 6, 2008, 12:59:59 PM
to Clojure
On Dec 6, 9:37 am, MikeM <michael.messini...@invista.com> wrote:
> A while back I posted an experimental patch to capture each form that
> is compiled into a fn. This might be useful in your case, where you
> have identified a function but don't have details on it. Here's the
> thread:http://groups.google.com/group/clojure/browse_frm/thread/e63e48a71935...

Thanks Mike, this sounds like a possible avenue for me to explore. At
this point I'm not even introducing any anonymous functions of my own
anymore, as far as I know - maybe I can use your patch to help debug
what is really going on. Thanks for the help.

Paul Mooser

Dec 6, 2008, 6:21:17 PM
to Clojure
Well, I've dug around on this some more, and I'm unfortunately no
closer to finding an answer. I decided to try to whittle down the
source code to the minimal set which exhibits the problem, and post
the result here, so that at least there's a higher chance that if I'm
making a mistake, someone might be able to identify it and point it
out.

Without further ado, here it is:

(import '(org.apache.lucene.store FSDirectory)
        '(org.apache.lucene.index IndexReader)
        '(org.apache.lucene.search IndexSearcher))

(def *vendors* #{"1211" "7784"})

(defn document-seq [index-path]
  (let [directory (. FSDirectory (getDirectory index-path))
        searcher (new IndexSearcher directory)
        reader (. searcher getIndexReader)
        numDocs (. reader numDocs)]
    (map (fn [i] (. reader document i)) (range 0 numDocs))))

(defn my-filter-pred [document]
  (let [item (. document get Constants/ITEM_ID)]
    (contains? *vendors* item)))

(defn splode [index-path]
  (with-local-vars [doc-count 0]
    (doseq [document (filter my-filter-pred (document-seq index-path))]
      (var-set doc-count (inc @doc-count)))
    'done))

I did the same heap analysis on this, and found exactly the same
results. That is to say that

clojure.core$filter__3364$fn__3367

is a stack local (which I think is what makes it a GC root), and it
has a reference to "coll", which is a variable used in the definition
of filter, which refers to the whole gigantic list of lazy-conses. You
can see that I'm operating on lucene indexes - I actually tried to
rewrite this all using something more fundamental (like a sequence of
random strings), but I could not come up with something that caused
the heap to blow out this way.

I would love to find that I'm making some mistake in how I've written
my sequences, but I'm running out of ideas.

Stephen C. Gilardi

Dec 6, 2008, 7:35:28 PM
to clo...@googlegroups.com

On Dec 6, 2008, at 6:21 PM, Paul Mooser wrote:

(defn splode [index-path]
  (with-local-vars [doc-count 0]
    (doseq [document (filter my-filter-pred (document-seq index-path))]
      (var-set doc-count (inc @doc-count)))
    'done))

The fn in question is likely the one in the macro expansion of the lazy-cons inside filter. As a next step, I would try replacing the call to "doseq" with the equivalent loop/recur:

(loop [document (filter my-filter-pred (document-seq index-path))]
  (var-set doc-count (inc !doc-count))
  (recur (filter my-filter-pred (document-seq index-path))))

This will rule in or out a problem with doseq.

--Steve

Paul Mooser

Dec 6, 2008, 8:28:58 PM
to Clojure
On Dec 6, 4:35 pm, "Stephen C. Gilardi" <squee...@mac.com> wrote:
> The fn in question is likely the one in the macro expansion of the  
> lazy-cons inside filter. As a next step, I would try replacing the  
> call to "doseq" with the equivalent loop/recur:

That's a good idea, Steve - I didn't totally understand the code you
included, but I do always forget that clojure has destructuring in its
binding forms, so I rewrote it like this, which I believe should be
fine (correct me if I am wrong):

(defn splode2 [index-path]
  (with-local-vars [doc-count 0]
    (loop [[document & rest-documents] (filter my-filter-pred
                                               (document-seq index-path))]
      (when document
        (var-set doc-count (inc @doc-count))
        (recur rest-documents)))
    'done))

This blows up exactly the same, with exactly the same object
(clojure.core$filter__3364$fn__3367) referencing the big chain of lazy-
conses.

Stephen C. Gilardi

Dec 6, 2008, 8:49:35 PM
to clo...@googlegroups.com

On Dec 6, 2008, at 8:28 PM, Paul Mooser wrote:

> That's a good idea, Steve - I didn't totally understand the code you
> included, but I do always forget that clojure has destructuring in its
> binding forms, so I rewrote it like this, which I believe should be
> fine (correct me if I am wrong):

Thanks for the polite correction to my incorrect code. :)

What you have looks correct to me.

> (defn splode2 [index-path]
>   (with-local-vars [doc-count 0]
>     (loop [[document & rest-documents] (filter my-filter-pred
>                                                (document-seq index-path))]
>       (when document
>         (var-set doc-count (inc @doc-count))
>         (recur rest-documents)))
>     'done))

For more simplifications with nearly equivalent effect:

(defn splode3 [index-path]
  (loop [[document & rest-documents] (filter my-filter-pred
                                             (document-seq index-path))]
    (when document
      (recur rest-documents))))

(defn splode4 [index-path]
  (loop [myseq (filter my-filter-pred (document-seq index-path))]
    (when (first myseq)
      (recur (rest myseq)))))

(defn splode5 [index-path]
  (dorun (filter my-filter-pred (document-seq index-path))))

--Steve

MikeM

Dec 6, 2008, 8:58:02 PM
to Clojure
I think I've been able to duplicate your results with a simpler
example.

I first tried this:

(defn splode [n]
  (with-local-vars [doc-count 0]
    (doseq [document (filter #(= % 1) (range n))]
      (var-set doc-count (inc @doc-count)))
    'done))

and it doesn't blow up with (splode 1000000000).

Next, I tried this (since your app filters the seq from a map):

(defn splode2 [n]
  (with-local-vars [doc-count 0]
    (doseq [document (filter #(= % 1) (map inc (range n)))]
      (var-set doc-count (inc @doc-count)))
    'done))

and it blows up (out of memory) with (splode2 10000000)

So it seems that (filter pred (map ...)) is the issue.


MikeM

Dec 6, 2008, 9:29:04 PM
to Clojure
Some more experimentation:

(defn splode3 [n]
  (with-local-vars [doc-count 0]
    (doseq [document (filter #(= % (- n 1)) (map inc (range n)))]
      (var-set doc-count (inc @doc-count)))
    'done))

which does not blow up with (splode3 1000000000).

Stephen C. Gilardi

Dec 6, 2008, 9:45:34 PM
to clo...@googlegroups.com
This also fails:

(defn splode [n]
  (doseq [document (filter #(= % 20) (map inc (range n)))]))

Looking at the heap dump, I see that the first item for which the filter returns true is the root of the chain of lazy conses that's being kept.

The filter is constructing a lazy-cons from the first of a lazy-cons returned by map.

--Steve



Paul Mooser

Dec 6, 2008, 9:48:53 PM
to Clojure
You reproduced it for sure --

On Dec 6, 5:58 pm, MikeM <michael.messini...@invista.com> wrote:
> Next, I tried this (since your app filters the seq from a map):
>
> (defn splode2 [n]
>   (with-local-vars [doc-count 0]
>     (doseq [document (filter #(= % 1) (map inc (range n)))]
>       (var-set doc-count (inc @doc-count)))
>     'done))
>
> and it blows up (out of memory) with (splode2 10000000)

I went ahead and dug around through the heap again, and found it's
exactly the same thing that's holding on to all the memory, so this is
definitely the same problem.

I also saw your subsequent example which uses a different anonymous
function which does NOT blow up, and that's very interesting. I'm not
sure why this would be, but it seems that filter ends up holding on to
the collection it's filtering internally from the point at which it
first matches - I think the second one doesn't blow up because the
first match doesn't happen until almost the end.

Stephen C. Gilardi

Dec 6, 2008, 9:57:25 PM
to clo...@googlegroups.com

On Dec 6, 2008, at 9:48 PM, Paul Mooser wrote:

I also saw your subsequent example which uses a different anonymous
function which does NOT blow up, and that's very interesting. I'm not
sure why this would be, but it seems that filter ends up holding on to
the collection its filtering internally from the point at which it
first matches - I think the second one doesn't blow up because it
doesn't happen until almost the end.

I agree with this as the reason why the second example worked.

When filter finds a match it tries to construct a lazy cons of the first of the match (which in this case is one of the members of the sequence) and the rest, which it then goes on to fully realize. The first is still hooked up to that rest, and since it's being retained, the entire sequence stays in memory.

Somehow the "first" needs to shed its rest before we construct the new lazy_cons. I'm still thinking about the details of that.
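The shape being described can be sketched like so (illustrative only, using lazy-seq in place of the lazy-cons of this era, with hypothetical names; not the actual 2008 source):

```clojure
;; Illustrative reconstruction of the problematic shape: the thunk that
;; computes the rest closes over `coll` itself, so while the rest is
;; being realized (possibly skipping many items), the cells from the
;; first match onward remain reachable through the closure.
(defn leaky-filter [pred coll]
  (when (seq coll)
    (if (pred (first coll))
      (cons (first coll)
            ;; the lazy-seq body is a closure over `coll`
            (lazy-seq (leaky-filter pred (rest coll))))
      (recur pred (rest coll)))))
```

The result is still correct; the issue is purely what the closure keeps alive while the rest is being computed.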

--Steve

Paul Mooser

Dec 6, 2008, 10:27:36 PM
to Clojure
I think I understand. The lazy-cons that filter is constructing
maintains a reference to the whole coll in its tail so that it can
evaluate (rest coll) when it is forced. Hmmm!

Stephen C. Gilardi

Dec 6, 2008, 10:43:24 PM
to clo...@googlegroups.com

In the current implementation of filter, coll is held the entire time
that (rest coll) is calculated on the first match.

The following separation into two functions appears to solve it. I'll
be looking at simplifying it.

If you use a definition of filter like this in your test, I think it
will succeed:

(defn filter-iter
  [pred coll]
  (when (seq coll)
    (if (pred (first coll))
      [(first coll) (rest coll)]
      (recur pred (rest coll)))))

(defn filter
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (let [result (filter-iter pred coll)]
    (when result
      (lazy-cons (result 0) (result 1)))))

--Steve

Mark Engelberg

Dec 6, 2008, 10:52:15 PM
to clo...@googlegroups.com
Except your version of filter doesn't do any filtering on the rest in
the case where the first satisfies the predicate.

Stephen C. Gilardi

Dec 6, 2008, 10:54:35 PM
to clo...@googlegroups.com

On Dec 6, 2008, at 10:43 PM, Stephen C. Gilardi wrote:

The following separation into two functions appears to solve it. I'll be looking at simplifying it.

Well it doesn't run out of memory, but it's not an implementation of filter either... ah well.

--Steve

Stephen C. Gilardi

Dec 6, 2008, 10:59:34 PM
to clo...@googlegroups.com

On Dec 6, 2008, at 10:52 PM, Mark Engelberg wrote:

Except your version of filter doesn't do any filtering on the rest in
the case where the first satisfies the predicate.

Right. It's looking to me now like this may have to be solved in Java. I don't see how to write this in Clojure without retaining "coll" through the entire implementation of filter--including the call to (filter pred (rest coll)) and that's enough to make it fail.

--Steve

Mark Engelberg

Dec 6, 2008, 11:08:09 PM
to clo...@googlegroups.com
Well, part of the puzzle is to figure out why filter works just fine
on the output of the range function, but not on the output of the map
function.

I'm starting to wonder whether there might be a fundamental bug in the
java implementation of LazyCons. Maybe it doesn't implement "first"
correctly?

puzzler

Dec 6, 2008, 11:38:43 PM
to Clojure
OK, I see what you're saying now. Range doesn't cause problems
because it's not coded in a way that links to a bunch of other cells.
So it's plausible that the problem is the way filter hangs on to the
collection while generating the rest (since it's not a tail-recursive
call in that case).

This is a real problem. It means that a whole range of things can't
be coded naturally with lazy sequences.

Maybe LazyCons shouldn't cache. Make LazyCons something that executes
its function every time. For most things, it's not a problem because
sequences are often traversed only once. If a person wants to cache
it for multiple traversals, he can explicitly call cache-seq on it.

Stephen C. Gilardi

Dec 6, 2008, 11:39:49 PM
to clo...@googlegroups.com

On Dec 6, 2008, at 11:08 PM, Mark Engelberg wrote:

> Well, part of the puzzle is to figure out why filter works just fine
> on the output of the range function, but not on the output of the map
> function.

Good point and my previous analysis was all wet. lazy-cons is a macro.
Its arguments aren't evaluated until later.

It's fun seeing some more detail about how it all works.

I look forward to seeing a solution!

--Steve

Paul Mooser

Dec 7, 2008, 1:02:35 AM
to Clojure
On Dec 6, 8:38 pm, puzzler <mark.engelb...@gmail.com> wrote:
> Maybe LazyCons shouldn't cache.  Make LazyCons something that executes
> its function every time.  For most things, it's not a problem because
> sequences are often traversed only once.  If a person wants to cache
> it for multiple traversals, he can explicitly call cache-seq on it.

Is caching really the problem here? I'm curious.

Looking at LazyCons.java, I see that it uses an IFn "f" to generate the
rest, and saves the value (a seq) in _rest. I assume that the function
f is what is actually holding a reference to the whole collection (it
may actually be the version of the function that the _rest contains,
but I'll get there in a moment). At this point, once we have a value
for _first and _rest, it doesn't seem like there's any value to having
a reference to f. The code seems to agree with me, because once _rest
has been set, it sets f to null.

However, clearly there is still a problem here, based on all the
different code samples in the thread which exhibit the bad behavior.
I'm just not sure which of our analyses are correct yet, either.

RZez...@gmail.com

Dec 7, 2008, 1:16:16 AM
to Clojure
I'm also running into what I believe to be the same problem. Every
time I run the following code I get "java.lang.OutOfMemoryError: Java
heap space".

(use 'clojure.contrib.duck-streams)
(count (line-seq (reader "big.csv")))

If I change "count" to "dorun" then it will return without problem.

I uploaded a file called bigcsv.zip which contains the big.csv file.
As Gilardi pointed out to me in #clojure, a text file with a million
line feeds would probably have sufficed too, but I already uploaded it
so take your pick.


Chouser

Dec 7, 2008, 1:52:09 AM
to clo...@googlegroups.com
On Sun, Dec 7, 2008 at 1:16 AM, rzez...@gmail.com <RZez...@gmail.com> wrote:
>
> I'm also running into, what I believe to be, the same problem. Every
> time I run the following code I get "java.lang.OutOfMemoryError: Java
> heap space".
>
> (use 'clojure.contrib.duck-streams)
> (count (line-seq (reader "big.csv")))
>
> If I change "count" to "dorun" then it will return without problem.

I think I can reproduce this one like so:

user=> (count (take 15000 (iterate #(str % "more") "some")))
java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:0)

As with yours, I can replace 'count' with 'dorun' and it works fine.
I can also use 'last':

user=> (.length (last (take 15000 (iterate #(str % "more") "some"))))
60000

I think the problem in this case is 'count', which for all
IPersistentCollections (including lazy sequences) calls the 'count'
method of the instance. ASeq's count method is a tight for() loop,
but since it's an instance method it must retain a 'this' reference to
the head of the seq.

Fixing this is hard because RT.count() is holding onto the head as
well. I've attached a patch that fixes the problem, but it's pretty
ugly, perhaps only useful to demonstrate that this is the problem.

--Chouser

count-lazy.patch
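One head-safe way to walk and count a long seq, consistent with Chouser's analysis, is an explicit loop that rebinds the seq local at each step, so no instance method pins `this` at the head (`count-walking` is a hypothetical helper, sketched in modern Clojure, not code from the thread):

```clojure
;; Sketch: counting a lazy seq without calling its own .count method.
;; The loop local is rebound to the rest on each iteration, so the head
;; can be collected as we advance (assumes the usual locals-clearing
;; behavior of loop/recur).
(defn count-walking [s]
  (loop [s (seq s), n 0]
    (if s
      (recur (next s) (inc n))
      n)))
```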

RZez...@gmail.com

Dec 7, 2008, 3:59:33 AM
to Clojure


On Dec 7, 1:52 am, Chouser <chou...@gmail.com> wrote:
>  count-lazy.patch

I started a new thread since these problems may be independent of each
other. I don't want to get this one off track.

http://groups.google.com/group/clojure/browse_thread/thread/41bc83dbe3cccc74

Mark Engelberg

Dec 8, 2008, 7:45:08 PM
to clo...@googlegroups.com
Has anyone made progress on this bug?

The simplest form of the bug was this:
(defn splode [n]
  (doseq [i (filter #(= % 20) (map inc (range n)))]))

This blows the heap, but it shouldn't.

I find this deeply troubling, because if this doesn't work, it
undermines my faith in the implementation of lazy sequences, which are
quite ubiquitous in Clojure. map and filter are written in the most
natural way in boot.clj, so if they don't work properly, it means that
anything written with lazy-cons is suspect.

I have an idea to try, but I'm not set up to build the java sources on
my computer, so maybe someone else can run with it:

Right now, LazyCons.java takes one function which includes both the
knowledge of how to create the first and the rest.
Perhaps LazyCons.java needs to be implemented so that the constructor
take two separate functions, one that generates the first, and one
that generates the rest. This way, as soon as the first is generated,
it can be cached, and the first-generating-function can be set to
null. Similarly, when the rest is generated, it will be cached, and
its generating function can be set to null. So if the problem is
being caused by the first-generating-function keeping something alive
that needs to be garbage collected, this would solve it, by releasing
it as soon as possible.

The lazy-cons macro in boot.clj would also need to be changed to call
the new LazyCons constructor with two functions, rather than just one.

Of course, if the problem is that the rest-generating-function is
holding onto something, this will not solve the problem.

--Mark

Paul Mooser

Dec 8, 2008, 8:30:37 PM
to Clojure
Is there a place we should file an official bug, so that Rich and the
rest of the clojure people are aware of it? I imagine that they may
have read this thread, but I'm not sure if there is an "official"
process to make sure these things get addressed.

As I said in a previous reply, it's not clear (to me, at least!) how
it matters if the function is holding on to the rest, since once first
and rest have been evaluated for a lazy cons cell, it sets "f" (the
function) to null. I realize that the sequence that represents rest at
that time might have a derivative version of the original sequence
(presumably one link further down the list!), but it has been my
assumption that it would not hold on to the whole thing forever, just
as long as it needs to, which seems correct.

If someone understands exactly what it is that is holding on to a
reference to the whole sequence, once we have a first match, can
someone explain the exact mechanism that is causing it, and why
nulling out "f" has no effect?


MikeM

Dec 8, 2008, 8:35:55 PM
to Clojure
I share your concern about the LazyCons problem - hopefully Rich and
others are looking into this.

I have continued to experiment to see if I can gain some understanding
that might help with a solution.

The following is something I thought of today, and I'd like to see if
others get the same result:

(defmacro my-lzy-cons
  "lazy cons by proxy"
  [fbody rbody]
  `(proxy [clojure.lang.ISeq] []
     (first [] ~fbody)
     (rest [] ~rbody)
     (seq [] ~'this)))

(defn my-map
  "not fully a replacement for map, but enough to test splode"
  [f coll]
  (when (seq coll)
    (my-lzy-cons (f (first coll)) (my-map f (rest coll)))))

(defn my-splode [n]
  (doseq [i (filter #(= % 20) (my-map inc (range n)))]))

my-splode does not blow the heap for me with (my-splode 10000000);
(splode 10000000) does.

my-lzy-cons appears to be slow compared to LazyCons, so this probably
isn't a solution, but perhaps this will give someone some insight to
help improve LazyCons.

Stephen C. Gilardi

Dec 8, 2008, 8:40:05 PM
to clo...@googlegroups.com

On Dec 8, 2008, at 7:45 PM, Mark Engelberg wrote:

I have an idea to try, but I'm not set up to build the java sources on
my computer, so maybe someone else can run with it:

This looked very promising to me. For one thing, I remembered that the root of the big chain of LazyCons objects in memory (as displayed by the YourKit profiler) was "f".

I gave it a try, though, and it didn't solve the problem.

I encourage someone else to try it as well. I'll continue to play with it later tonight.

My attempt is included below.

--Steve

Index: LazyCons.java
===================================================================
--- LazyCons.java (revision 1147)
+++ LazyCons.java (working copy)
@@ -15,11 +15,13 @@
 final public class LazyCons extends ASeq{
 final private static ISeq sentinel = new Cons(null, null);
 IFn f;
+IFn r;
 Object _first;
 ISeq _rest;
 
-public LazyCons(IFn f){
+public LazyCons(IFn f, IFn r){
  this.f = f;
+ this.r = r;
  this._first = sentinel;
  this._rest = sentinel;
 }
@@ -43,6 +45,7 @@
  {
  throw new RuntimeException(ex);
  }
+        f = null;
  }
  return _first;
 }
@@ -57,13 +60,13 @@
  //force sequential evaluation
  if(_first == sentinel)
  first();
- _rest = RT.seq(f.invoke(null));
+ _rest = RT.seq(r.invoke());
  }
  catch(Exception ex)
  {
  throw new RuntimeException(ex);
  }
- f = null;
+ r = null;
  }
  return _rest;
 }


Index: core.clj
===================================================================
--- core.clj (revision 1147)
+++ core.clj (working copy)
@@ -410,7 +410,7 @@
   same node of the seq evaluates first/rest-expr once - the values they yield are
   cached."
  [first-expr & rest-expr]
- (list 'new 'clojure.lang.LazyCons (list `fn (list [] first-expr) (list* [(gensym)] rest-expr))))
+ (list 'new 'clojure.lang.LazyCons (list `fn [] first-expr) (list* `fn [] rest-expr)))
 
 ;(defmacro lazy-seq
 ;  "Expands to code which produces a seq object whose first is the

Stephen C. Gilardi

Dec 8, 2008, 8:46:29 PM
to clo...@googlegroups.com

On Dec 8, 2008, at 8:40 PM, Stephen C. Gilardi wrote:

This looked very promising to me. For one thing, I remembered that the root of the big chain of LazyCons objects in memory (as displayed by the YourKit profiler) was "f".

Now it's "r" and is listed as a "stack local".

--Steve

Stephen C. Gilardi

Dec 8, 2008, 8:56:26 PM
to clo...@googlegroups.com
I think I finally see the problem. The "rest expression" in filter's
call to lazy-cons has a reference to "coll" in it. That's all it takes
for coll to be retained during the entire calculation of the rest.

(defn filter
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (when (seq coll)
    (if (pred (first coll))
      (lazy-cons (first coll) (filter pred (rest coll)))
      (recur pred (rest coll)))))

The style of fix that occurs to me involves peeling off coll from
(frest coll) and (rrest coll) before continuing the lazy evaluation.

Posting so someone else can beat me to the answer. :-)

--Steve
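The capture described above can be made concrete with a sketch (hypothetical shape, not the actual macro expansion): the rest expression ends up as the body of a fn object, and that fn object pins every free variable it closes over, including coll.

```clojure
;; Sketch of the capture: the rest expression of filter's lazy cons
;; becomes a function whose closed-over environment includes `coll`.
;; As long as this fn is alive (i.e. until the rest is forced), the
;; entire remaining input stays reachable. Hypothetical names; shape
;; only, written against the current clojure.core/filter.
(defn rest-thunk [pred coll]
  (fn [] (filter pred (rest coll))))   ; `coll` pinned by the closure

;; Forcing the thunk later still yields the right answer, but in the
;; meantime the closure kept `coll` rooted:
;; ((rest-thunk even? (range 10)))
```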

Mark Engelberg

Dec 8, 2008, 9:04:14 PM
to clo...@googlegroups.com
On Mon, Dec 8, 2008 at 5:56 PM, Stephen C. Gilardi <sque...@mac.com> wrote:
> I think I finally see the problem. The "rest expression" in filter's call to
> lazy-cons has a reference to "coll" in it. That's all it takes for coll to
> be retained during the entire calculation of the rest.
>

Well, I had already tried this, eagerly evaluating the first and rest,
and it didn't help:

(defn map2
  [f coll]
  (when (seq coll)
    (let [fcoll (first coll) rcoll (rest coll) ffcoll (f fcoll)]
      (lazy-cons ffcoll (map2 f rcoll)))))

(defn filter2
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (when (seq coll)
    (let [fcoll (first coll) rcoll (rest coll)]
      (if (pred fcoll)
        (lazy-cons fcoll (filter2 pred rcoll))
        (recur pred rcoll)))))

(defn splode [n]
  (doseq [document (filter2 #(= % 20) (map2 inc (range n)))]))

Rich Hickey

unread,
Dec 8, 2008, 9:51:23 PM12/8/08
to clo...@googlegroups.com

I'm sorry I haven't chimed in sooner. I fully understand this. Yes,
it's the closure over coll in the rest portion, which means that after
finding some match, a subsequent call to rest that needs to skip a lot
will create a window over the interval where the seq will be realized.

Eagerly evaluating (rest coll) won't work - the result will still be
in the closure while the recursion occurs. The only solutions involve
mutation in filter. Here's one way:

(defn filter
  [pred coll]
  (let [sa (atom (seq coll))
        step (fn step []
               (when-let [s @sa]
                 (let [x (first s)]
                   (if (pred x)
                     (lazy-cons x (do (swap! sa rest) (step)))
                     (do (swap! sa rest)
                         (recur))))))]
    (step)))
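
With this version the rest thunk mentions only sa and step, never the
seq itself, so each cell should be released as soon as swap! advances
the atom. A splode-style check of it (a sketch, reusing the test shape
from earlier in the thread) would be:

```clojure
;; with the head released via swap!, even a large skip window
;; between matches should not be retained
(doseq [i (filter #(= % 20) (map inc (range 10000000)))])
```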

But it's not pretty. Fortunately, this segues with work I have been
doing on I/O and generators/streams. This will let you write things
like:

(defn filter-stream
  "Returns a stream of the items in strm for which
  (pred item) returns true. pred must be free of side-effects."
  [pred strm]
  (stream
    #(let [x (next! strm)]
       (if (or (eos? x) (pred x)) x (recur))))))

(defn filter
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (stream-seq (filter-stream pred (stream coll))))

I'm still not done with this yet, so anyone who is stuck can use the
former version.

Rich

Mark Engelberg

unread,
Dec 12, 2008, 12:29:07 AM12/12/08
to clo...@googlegroups.com
On Mon, Dec 8, 2008 at 6:51 PM, Rich Hickey <richh...@gmail.com> wrote:
I don't have the latest build of Clojure with atoms, so I
reimplemented Rich's filter solution using refs, turning:

> (defn filter
>   [pred coll]
>   (let [sa (atom (seq coll))
>         step (fn step []
>                (when-let [s @sa]
>                  (let [x (first s)]
>                    (if (pred x)
>                      (lazy-cons x (do (swap! sa rest) (step)))
>                      (do (swap! sa rest)
>                          (recur))))))]
>     (step)))
>

into:

(defn safe-filter
  [pred coll]
  (let [sa (ref (seq coll)),
        step (fn step []
               (when-let [s @sa]
                 (let [x (first s)]
                   (if (pred x)
                     (lazy-cons x (do
                                    (dosync (ref-set sa (rest s)))
                                    (step)))
                     (do
                       (dosync (ref-set sa (rest s)))
                       (recur))))))]
    (step)))

But splode still splodes!:

(defn splode [n]
(doseq [i (safe-filter #(= % 20) (map inc (range n)))]))

Any idea why the ref version wouldn't be working?

> But it's not pretty. Fortunately, this segues with work I have been
> doing on I/O and generators/streams.

I'm really looking forward to seeing how this all turns out. Cached
lazy sequences seem to be a bad default for all the standard sequence
operations. You have to be very careful to not retain the head of one
of these sequences (can't give names to intermediate results, for
example), and it's very hard to predict when the head of one of these
sequences might be unintentionally held. This seems to make code more
brittle. Probably the best solution is to default to sequences that
always freshly generate the results, and you have to intentionally
cache the sequence if that is what you want.

Mark

Rich Hickey

unread,
Dec 12, 2008, 9:37:26 AM12/12/08
to Clojure


On Dec 12, 12:29 am, "Mark Engelberg" <mark.engelb...@gmail.com>
wrote:
Yes, because that is not the same - your version still closes over s.
swap! passes a fn to transform the current value, ref-set does not.
But alter and commute do:

(defn safe-filter
  [pred coll]
  (let [sa (ref (seq coll)),
        step (fn step []
               (when-let [s @sa]
                 (let [x (first s)]
                   (if (pred x)
                     (lazy-cons x (do
                                    (dosync (commute sa rest))
                                    (step)))
                     (do
                       (dosync (commute sa rest))
                       (recur))))))]
    (step)))
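
A sketch of the two update styles side by side (a hypothetical REPL
snippet, not from the thread's code):

```clojure
;; ref-set needs the new value, so the caller must hold the seq;
;; commute/alter are handed a function of the old value instead
(def sa (ref (range 5)))
(dosync (ref-set sa (rest @sa)))  ; new value built by the caller
(dosync (commute sa rest))        ; only the function is passed in
@sa                               ; => (2 3 4)
```

Because the rest thunk can be written without mentioning s at all, s is
never captured by the closure.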

> > But it's not pretty. Fortunately, this segues with work I have been
> > doing on I/O and generators/streams.
>
> I'm really looking forward to seeing how this all turns out. Cached
> lazy sequences seems to be a bad default for all the standard sequence
> operations. You have to be very careful to not retain the head of one
> of these sequences (can't give names to intermediate results, for
> example), and it's very hard to predict when the head of one of these
> sequences might be unintentionally held. This seems to make code more
> brittle. Probably the best solution is to default to sequences that
> always freshly generate the results, and you have to intentionally
> cache the sequence if that is what you want.
>

I appreciate the time you and others have spent on this, and will
improve filter, but I'm not sure where you are getting your
presumptions about lazy sequences. They are not a magic bullet that
makes working with data bigger than memory transparent. They do make
it possible in some cases, but that is not their intent. Latent
runtime memory consumption is a fact of life of laziness, and is
present in Haskell etc. I can make it go away for the specific case of
filter and the rest of the library, but not generally.

Not caching would lead to excessive recalculation in many contexts, and
then people would be complaining about performance. There are many
benefits of the thread-safe once-and-only-once evaluation promise of
lazy-cons that you may not be considering. It is certainly not a bad
default.

There will soon be a streams/generator library, intended for i/o, that
will do one-pass non-cached iteration. It will offer a different set
of tradeoffs and benefits, including reduced ease-of use - not being
persistent removes a lot of things like first/rest/nth, destructuring
etc.

But the important thing is tradeoffs/benefits. They always come
together.

Rich



Paul Mooser

unread,
Dec 12, 2008, 1:03:00 PM12/12/08
to Clojure
On Dec 12, 6:37 am, Rich Hickey <richhic...@gmail.com> wrote:
> I appreciate the time you and others have spent on this, and will
> improve filter, but I'm not sure where you are getting your
> presumptions about lazy sequences. They are not a magic bullet that
> makes working with data bigger than memory transparent. They do make
> it possible in some cases, but that is not their intent. Latent
> runtime memory consumption is a fact of life of laziness, and is
> present in Haskell etc. I can make it go away for the specific case of
> filter and the rest of the library, but not generally.

I think some people have been assuming the caching is the culprit
(perhaps partially motivated by my own questions in the initial post
regarding the lifetime or extent of the caching of lazy-conses), but I
actually think a more fundamental issue here is being able to state
with confidence what variables are being closed over, and when.

Perhaps it should be obvious to me, but presumably clojure's model is
not (for example) like scheme's typical simple model of nested
environments, because in the above example, I'd assume that the lazy-
cons would capture the environment introduced by let, which presumably
extends the environment of the when-let, which extends the environment
of the function which itself includes coll.

Does clojure analyze the expression and close over any bound
identifiers in the expression? Is that why the version of filter that
Rich posted works and manages not to capture coll ? Is there any
documentation for clojure which describes clojure's closure or
environment semantics, and are there are interactive ways to examine a
function and determine what it is closing over?

Thanks for your patience - I'm looking forward to understanding
clojure better.

Mark Engelberg

unread,
Dec 12, 2008, 6:15:26 PM12/12/08
to clo...@googlegroups.com
On Fri, Dec 12, 2008 at 6:37 AM, Rich Hickey <richh...@gmail.com> wrote:
> I appreciate the time you and others have spent on this, and will
> improve filter, but I'm not sure where you are getting your
> presumptions about lazy sequences. They are not a magic bullet that
> makes working with data bigger than memory transparent. They do make
> it possible in some cases, but that is not their intent. Latent
> runtime memory consumption is a fact of life of laziness, and is
> present in Haskell etc. I can make it go away for the specific case of
> filter and the rest of the library, but not generally.

I think this is a great discussion to have, so let me see if I can
articulate my thoughts a little better.

First, let's talk about laziness, because I think we're in agreement
here. Laziness is undoubtedly a very powerful tool. In Haskell,
laziness is the default. Every single function call is automatically
lazy. In LISP/Scheme, laziness is optional. It's something you have
to specifically opt into using delay and force. Which is better?

Peter Van Roy, the designer of the Oz programming language, has argued
in many places that laziness is not a good default for a
general-purpose programming language. This is also a point he
discusses in his excellent book "Concepts, Techniques, and Models of
Computer Programming". He has probably explained this point better
than I can, so I won't belabor it here, but in a nutshell, it all
comes down to the fact that laziness makes the performance (especially
space performance) of your program much harder to analyze. It is all
too easy to write a Haskell program that blows up or performs poorly,
and you have no idea why. You can get back some performance by
opting-out of the laziness with strictness annotations, but it can be
difficult to figure out how or where to do this. To put it simply,
it's easier to write high-performing code when you opt-in to laziness
when you need it, rather than trying to figure out how to opt out when
it's wrecking things for you.

I assume that you agree with this, since you have chosen the explicit
force/delay "opt-in laziness" model for Clojure.

So now let's talk about sequences, specifically the behavior of
lazily-generated sequences. A similar choice needs to be considered.
It is easy to imagine two possible implementations of lazy-cons.
lazy-cons-cached would work exactly as lazy-cons currently does.
lazy-cons-uncached would be similar, but would not cache its first and
rest values when computed. Sequences built with either version of
lazy-cons would both respond to the same sequence interface, and would
therefore behave the same, producing the same results as long as the
generating functions don't refer to mutable data, but possibly with
different performance profiles. lazy-cons-cached sequences might run
faster when traversing the same lazy list more than once, but also
might crash your program by consuming too much memory in places where
lazy-cons-uncached would not. It's a tradeoff, and both versions of
lazy-cons are potentially useful, but which should be the default, or
more specifically, which should be the default used by all of
Clojure's library functions like map, filter, etc.? Should these
lazily-generated sequences be automatically cached, or not?
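
A rough sketch of what the lazy-cons-uncached variant could look like
(hypothetical; the names and representation are invented for this
argument, not proposed API):

```clojure
;; hypothetical uncached cell: the thunks are re-run on every access
;; instead of being evaluated once and cached
(defmacro lazy-cons-uncached [first-expr & rest-expr]
  `{:first-fn (fn [] ~first-expr)
    :rest-fn  (fn [] ~@rest-expr)})

(defn ufirst [c] ((:first-fn c)))  ; recomputes first each call
(defn urest  [c] ((:rest-fn c)))   ; recomputes rest each call
```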

The first question is, does it matter? I would say that yes, this
design decision matters a great deal. I think I remember you saying
somewhere (in a post here, or possibly in one of your presentations)
that sequences, especially lazy sequences, have become a far more
important part of Clojure programming than you envisioned at the
outset. And it is easy to see why lazy sequences take on a
significant role in a fairly pure functional programming language:

People coming from imperative languages often ask how you code up
certain kinds of for/while loops that are used to traverse data
structures and build up new ones. Let's say you want to build a list
of all the even squares of whole numbers up to 100, inclusively. One
literal way to formulate this problem in an imperative language
(taking Python 2.6 syntax as a sample) would be:
l = []
for i in xrange(100):
    square = i * i
    if square % 2 == 0:
        l.append(square)

Now in Clojure, things like this can be written using loop/recur. But
any newcomer to Clojure will certainly be told that there's a more
elegant way to express this concept:

(def l (filter even? (map #(* % %) (range 100))))

or the equivalent comprehension.

The newcomer's first reaction is to say, "Whoa, but won't that have
poor performance? You're generating an awful lot of big, intermediate
lists." And the answer is usually, "Well, it's lazy, so no, in fact,
this elegant form behaves like the iterative version. You can have
your cake and eat it too."

You said, "I'm not sure where you are getting your presumptions about
lazy sequences. They are not a magic bullet that makes working with
data bigger than memory transparent." Well, actually, I know that
cached lazy sequences do not always behave like their iterative
counterparts. But this is where these presumptions are born.
Iterative stuff is ugly in functional languages. We are told to use
comprehensions, map, filter, tricks like Stuart Holloway's blog about
writing the indexed function which makes a temporary sequence of pairs
of indices and values to avoid certain imperative patterns. And we
expect the behavior to be similar, because most of the time it is.
When something blows up from some subtle capturing of a lazy list, we
feel frustrated and betrayed.

Next, let's consider whether it is easier to opt-in or opt-out of
caching your sequences. Opting out of sequence caching is very
difficult. Right now, all the built-in sequence functions use a
cached lazy-cons, and there is no way to undo that caching. The first
line of defense is to make sure that you don't give an intermediate
temporary name to any of the lazy lists you are transforming. In
other words, you should never do something like:

(def squares (map #(* % %) (range 100)))
(def even-squares (filter even? squares))

This kind of thing could crash your program with large lists. It's
irritating to have to be careful about this, because it seems
intuitive that giving a temporary name to something shouldn't be the
kind of thing that could crash your program, but okay, let's say you
learn to be really careful, and to carefully avoid naming your lists.
But then, sometimes your lists get captured by closures, or other
subtle corner cases, and it just gets frustrating. Using lazy
sequences, which are pervasive in Clojure, should not be so difficult.
Right now the only solution is to rewrite one's code to use
loop/recur, and it sounds like eventually there will be an alternative
"generators" system which you say you're envisioning more as an I/O
tool with a special interface, rather than as a general-purpose form
of sequences.

On the other hand, opting in to sequence caching is very easy. You
only gain the benefits of a cached sequence when you use it twice,
which inherently means that you have to name it or bind it to
something in order to refer to it multiple times. So if I'm going to
be traversing a sequence multiple times, I just opt in by calling a
function to turn it into a cached sequence. (Let's call it
"cached-seq". I know there is such a function in the Clojure API
already, but I haven't really verified that it works the way I mean
here. So if I'm using the term in a different way from the way that
cached-seq actually works in the current API, please bear with me for
the sake of argument.)

So with an opt-in system, where range, map and filter use
lazy-cons-uncached, you could easily write something like:
(def squares (map #(* % %) (range 100)))
(def even-squares (cached-seq (filter even? squares)))

This means that squares behaves exactly like its iterative
counterpart, despite the fact that I named it, and even-squares I'm
setting up with caching because I intend to use it repeatedly later in
my program.

Is the opt-out or opt-in system more intuitive, and which is easier to
analyze from a correctness and performance standpoint? I'd argue that
the opt-out system has serious problems with respect to being
intuitive, and ease of confirming correctness and performance bounds.
Programmers really want these things (most of the time) to behave like
the iterative loops they are replacing. It can be very subtle to
detect the kinds of things that can cause an explosion of memory
consumption. The filter-map explosion issue is a great case in point.
You can test it on a wide variety of inputs, even big inputs, and it
seems to work fine. But then when you pass it a combination of filter
and map that results in a large "window" between filtered elements in
the mapped list, it blows up. Even if you patch the specific behavior
of filter, this is indicative of a larger problem. filter is
currently written in the most natural way using lazy-cons. Other
programmers are going to write similar functions using lazy-cons, and
these programs will all be brittle and prone to unpredictable failure
on certain kinds of large inputs.

On the other hand, the opt-in system is fairly straightforward to
understand. If a sequence is not explicitly cached, you can expect no
memory to be consumed by a traversal, and the traversal time will be
the same every time you execute it. Caching a sequence becomes
explicit, and is then clearly identified in the code as such.

>
> Not caching would lead to excessive recalculation in many contexts, and
> then people would be complaining about performance.

Let's talk about performance. First, I would claim that the vast
majority of lazy sequences (especially the really big ones), are used
as one-off temporary sequences to more elegantly express something
that would be expressed via looping in an imperative language. So for
these cases (which I believe to be the most common case), you gain
nothing by caching, and in fact, you lose something (a small amount of
increased memory consumption / garbage collection, and increased
brittleness and unpredictability).

Then there are some sequences which are traversed more than once, but
the computation is fairly trivial (e.g., the output of range). In
these situations, it's probably a wash. I remember reading a paper
not long ago (can't remember exactly which one, sorry) that pointed
out that most programmers' intuitions about the need to cache
intermediate results is often wrong, and simple computations should
just be recomputed. This is because chips have gotten really fast,
and one of the biggest performance hits these days is a cache miss, so
if you store something, and it requires the program to go off and look
in the "slow part of memory" to find it, you're much worse off than if
you had just recomputed.

Sometimes, if you know you are going to be traversing a sequence more
than once, you would be better off converting it to an explicitly
realized list or putting the data in a vector.
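
For example, a sketch of realizing once and traversing many times:

```clojure
;; realize the squares into a vector once; later traversals
;; neither recompute nor rely on lazy-cons caching
(def sq (vec (map #(* % %) (range 1000))))
(reduce + sq)             ; first traversal
(count (filter even? sq)) ; second traversal, no recomputation
```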

So this leaves what I believe to be a small set of cases where you
genuinely need a cached-but-lazy list. Yes, make this a possibility;
I don't deny that it is useful. But I think your fears about people
complaining about performance if you change the default behavior are
unfounded. It's relatively easy to say to people, "If you need better
performance when traversing a lazy sequence multiple times, you may
benefit from explicitly realizing the result of the intermediate lazy
computations, or using a cached lazy sequence if that's what you
need." On the other hand, if you keep things as they are, I can
pretty much guarantee that you will be faced with ongoing posts to the
list of, "Help! I've got this program that is giving me an out of
memory error, and I can't figure out why."

One issue is that you'd want to make sure that cached-seq behaves
intelligently when someone tries to cache something that's essentially
already cached. That way people could more easily write generic code
that safely calls cached-seq on various collections to guarantee
certain kinds of time-performance bounds with repeat traversals. For
example, calling cached-seq on a vector shouldn't really do anything.
But there are already plenty of precedents and facilities in Clojure
for handling this sort of thing. We already expect seq to "do the
right thing" and not add extra layers of "sequifying" to something
that already is a sequence. One idea is to have a meta tag of :cached
that applies to things like vectors, sets, maps, lists, but not lazy
lists built with the lazy-cons-uncached variant I've proposed in this
post. cached-seq acts like an ordinary seq on things with the :cached
tag. cached-seq of a lazy-cons-uncached cell would essentially just
construct a lazy-cons-cached cell with the same first and rest
functions, but with the cache fields added. In the general case, cached-seq
would generate a lazy list built with lazy-cons-cached. Of course,
lazy-cons-cached cells would be marked with a :cached meta tag, and
would guarantee that the rest, when realized, is also made into a
cached sequence.

> There are many
> benefits of the thread-safe once-and-only-once evaluation promise of
> lazy-cons that you may not be considering. It is certainly not a bad
> default.

Well, I've tried as best as I can to articulate my reasoning that
lazy-cons-cached is in fact a bad default. It is potentially
unintuitive, hard to verify correctness and performance bounds, and
difficult to opt out of. The Scala programming language offers both
Streams (uncached lazy list), and LazyList (cached lazy list), and
generally uses Streams as the default. F# uses Seqs (uncached lazy
list), and LazyList (cached lazy list), and generally uses Seqs as the
default. Both of these languages are among the closest comparisons to
Clojure, in terms of meshing practical functional programming within
an existing imperative VM, and they have both made this design choice
to favor uncached lazy lists. And in fact, it turns out that in those
languages, uncached lazy lists end up rarely used. And of course,
languages like Python get along just fine with "generators" and no
built-in cached variant at all, other than explicitly realizing to a
list. Haskell, of course, is the one exception. They use
cached-lazy-lists by default, but then again, they really have to,
because it's the only sensible thing in a language where everything is
lazy by default.

> There will soon be a streams/generator library, intended for i/o, that
> will do one-pass non-cached iteration. It will offer a different set
> of tradeoffs and benefits, including reduced ease-of use - not being
> persistent removes a lot of things like first/rest/nth, destructuring
> etc.

Right, I'm talking about making something that works just like any
other sequence, supporting first/rest/nth and destructuring.

>
> But the important thing is tradeoffs/benefits. They always come
> together.

Yes, but hopefully I can convince you that in this case, the tradeoffs
fall clearly on the side of defaulting to uncached lazy lists.

--Mark

Randall R Schulz

unread,
Dec 12, 2008, 6:36:47 PM12/12/08
to clo...@googlegroups.com
On Friday 12 December 2008 15:15, Mark Engelberg wrote:
> ...
>
> --Mark

Not being nearly sophisticated enough in Clojure, FP or the relevant
concepts to say anything other than "that all makes complete sense to
me," I wonder only what would be the impact on existing programs were
the default to be switched as you suggest? Or, relatedly, how would you
go about making this transition while simultaneously minimizing
breaking changes? And lastly, has Clojure reached the point where
breaking changes are taboo?


Randall Schulz

Rich Hickey

unread,
Dec 12, 2008, 6:44:33 PM12/12/08
to Clojure


On Dec 12, 6:15 pm, "Mark Engelberg" <mark.engelb...@gmail.com> wrote:
I'd appreciate it if you could make more succinct arguments.

The argument you've made, however, is theoretical.

I've tried what you said. In fact, it's sitting there in the
implementation right now - just uncomment lazy-seq. cached-seq is
there, as is a non-caching LazySeq class implementing ISeq, and all of
the library functions can be defined in terms of it. I also did that,
and then tried it and some common code.

You should too, and report back your actual experience.

My experience was:

Everything with side-effects breaks. There are lots of performance
problems, because it is not as simple as you say: not just clear
multiple traversals, but often casual multiple local references to
(rest x) and other parts of the seq, causing multiple evaluations and
corresponding runtime multipliers. In short, there is an entirely
different set of things you need to be careful of, and far more often.
I became so concerned about having to support it that I commented out
lazy-seq.

The short of it is, the problem you encountered is not, in fact,
common. But I encourage you to try your idea, as I said, I've already
implemented it and you can try it right now.

Rich


Rich Hickey

unread,
Dec 12, 2008, 7:31:45 PM12/12/08
to Clojure
Clojure has normal lexical scope and closures do what you expect,
although there aren't reified environments like you might find in an
interpreter. Because it uses lazy sequences and doesn't have tail
calls, it does local variable and argument nulling prior to tail
calls, to prevent any lingering references on the stack. In
particular, you can normally hold on to a seq head in a local or
argument and pass it to a tail call without retaining the head.
Closures are lexical and visible, only capturing references from the
enclosing scope actually used in their bodies, not entire
environments, and each individually releasable.

Rich

Paul Mooser

unread,
Dec 12, 2008, 8:28:18 PM12/12/08
to Clojure
On Dec 12, 3:15 pm, "Mark Engelberg" <mark.engelb...@gmail.com> wrote:
>And in fact, it turns out that in those languages, uncached lazy lists end up rarely used.

Did you mean that the cached lazy lists are rarely used? Or does
everyone actually choose to use the cached ones?

Mark Engelberg

unread,
Dec 12, 2008, 10:09:06 PM12/12/08
to clo...@googlegroups.com

Sorry, in that particular sentence I said the opposite of what I
meant. I meant that cached lazy lists are rarely used in those
languages.

Although I'm a relative newbie to Clojure, I've spent a lot of time
using a wide variety of languages. When Rich's reaction was
"everything mutable breaks without caching", my initial reaction was
astonishment that he perceives that to be the common case in a
language where mutability is mostly shunned. I have trouble thinking
of a case where I'd want to put something mutable in a lazy list. But
of course, he's spent more time with Clojure than anyone, so I don't
doubt his experience on this matter. So then the question becomes,
why is my experience so different in other languages?

Taking Scala as an example, I can think of two things that might make
it more suitable in that language to work with non-cached lazy lists
as a default.

First, in Scala there is richer syntactic support for mutable,
imperative-style programming when you need it. So you mostly use
laziness, comprehensions, etc. with your immutable, functional-style
code. But when you're working with mutable stuff (like interop with
Java), you go ahead and code with for/while loops, assignment, and it
doesn't feel particularly gross.

Second, Scala has a more polymorphic approach to things like map and
filter. If you map a vector, you get back a vector. If you filter a
concrete list, you get back a concrete list. Comprehension syntax is
essentially a macro that expands to combinations of map, filter, and
flatMap (analogous to Clojure's mapcat, I think), so your
comprehension output is also determined by the type of the first
collection in the comprehension. So if you're working with mutable
data, you'd be storing it in a different kind of collection anyway
(like a vector, or a cached lazy list), and all the map, filter, etc.
would work the way you'd expect it to.

The first point can just be chalked up to rather fundamental design
differences in the two languages. Incorporating this kind of
"imperative subsystem" would clutter up Clojure's elegance.

As to the second point, it's not inconceivable to do something like
that in Clojure. Clojure's multimethods can certainly support such a
thing. But certainly Scala's approach has a downside because
sometimes you don't want a comprehension to build the same thing as
the source collection, and converting between them can be inefficient.
There's something rather nice about the way Clojure always returns a
bland sequence that can essentially be "realized" into anything you
want. It just seems unfortunate to me that a consequence of this is
that all these output sequences are automatically of the cached
variety.

Perhaps there's some sort of middle ground where Clojure can always
return a lazy sequence, but be a bit more intelligent about choosing
the right variety depending on the nature of the input, but it's not
immediately apparent to me how one would do that. If I get a chance,
I'll definitely play around with some of these ideas, as Rich
suggested, although my "common case" programming seems to be different
from his. I got burned by the filter issue on my very first Clojure
program, when I tried to filter a lazy stream of 10-digit permutations
to find all the permutations with a certain property. The
permutations which satisfied the property were far enough apart that
it caused a problem. This is the kind of program I typically write.

In the meantime, I'm definitely looking forward to seeing Rich's new
generator approach. Maybe having another way to tackle the problem
cases will make a lot of my worries about this issue go away.

--Mark

P.S. I don't want to sound too negative, so I'll mention here that
there are several things I *love* about Clojure. First, the way
operations on so many of the data structures are unified through the
sequence interface. Second, multimethods (haven't seen much
multimethod action since Dylan, and I've always loved them). Third,
many little touches like making maps, sets, vectors, and keywords also
act like functions, which contribute to readable brevity.

Rich Hickey

unread,
Dec 13, 2008, 12:28:24 AM12/13/08
to clo...@googlegroups.com
On Fri, Dec 12, 2008 at 10:09 PM, Mark Engelberg
<mark.en...@gmail.com> wrote:
>
> On Fri, Dec 12, 2008 at 5:28 PM, Paul Mooser <taro...@gmail.com> wrote:
>>
>> On Dec 12, 3:15 pm, "Mark Engelberg" <mark.engelb...@gmail.com> wrote:
>>>And in fact, it turns out that in those languages, uncached lazy lists end up rarely used.
>>
>> Did you mean that the cached lazy lists are rarely used? Or does
>> everyone actually choose to use the cached ones?
>
> Sorry, in that particular sentence I said the opposite of what I
> meant. I meant that cached lazy lists are rarely used in those
> languages.
>
> Although I'm a relative newbie to Clojure, I've spent a lot of time
> using a wide variety of languages. When Rich's reaction was
> "everything mutable breaks without caching", my initial reaction was
> astonishment that he perceives that to be the common case in a
> language where mutability is mostly shunned. I have trouble thinking
> of a case where I'd want to put something mutable in a lazy list. But
> of course, he's spent more time with Clojure than anyone, so I don't
> doubt his experience on this matter.

I wasn't talking at all about lists of mutable things. In Clojure,
people often build a sequence from an imperative/ephemeral source. The
caching becomes very important in these cases.

> As to the second point, it's not inconceivable to do something like
> that in Clojure. Clojure's multimethods can certainly support such a
> thing. But certainly Scala's approach has a downside because
> sometimes you don't want a comprehension to build the same thing as
> the source collection, and converting between them can be inefficient.
> There's something rather nice about the way Clojure always returns a
> bland sequence that can essentially be "realized" into anything you
> want. It just seems unfortunate to me that a consequence of this is
> that all these output sequences are automatically of the cached
> variety.
>
> Perhaps there's some sort of middle ground where Clojure can always
> return a lazy sequence, but be a bit more intelligent about choosing
> the right variety depending on the nature of the input, but it's not
> immediately apparent to me how one would do that. If I get a chance,
> I'll definitely play around with some of these ideas, as Rich
> suggested, although my "common case" programming seems to be different
> from his. I got burned by the filter issue on my very first Clojure
> program, when I tried to filter a lazy stream of 10-digit permutations
> to find all the permutations with a certain property. The
> permutations which satisfied the property were far enough apart that
> it caused a problem. This is the kind of program I typically write.
>

I'm sorry you encountered a bug, and will fix, but that's not an
indictment of Clojure's approach. Scala et al have had their own
problems:

http://lampsvn.epfl.ch/trac/scala/ticket/692
http://lampsvn.epfl.ch/trac/scala/ticket/498
http://groups.google.com/group/cal_language/browse_thread/thread/728a3d4ff0f77b00

Note that Clojure does do locals nulling on tail calls. It certainly
has paid attention to laziness implementation issues, a bug
notwithstanding.

> In the meantime, I'm definitely looking forward to seeing Rich's new
> generator approach. Maybe having another way to tackle the problem
> cases will make a lot of my worries about this issue go away.
>

I think it's very important not to conflate different notions of
sequences. Clojure's model is a very specific abstraction, the Lisp
list, originally implemented as a singly-linked list of cons cells. It
is a persistent abstraction (first/second/third/rest, etc.); it is not
a stream, nor an iterator. Lifting that abstraction off of cons cells
doesn't change its persistent nature, nor does lazily realizing it.
After my experimentation with a non-caching version, I am convinced it
is incompatible with the abstraction. If a seq was originally
generated from an imperative source, you need it to be cached in order
to get repeatable-read persistence; if it was calculated at some
expense, you need to cache it in order to get the performance
characteristics you would expect from a persistent list. An
abstraction is more than just an interface.
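[An editor's sketch of the repeatable-read point, not from the original
thread; the file name is arbitrary. `line-seq` walks a mutating reader,
yet because lazy seqs cache each element on first realization, a second
traversal of the same seq sees identical data.]

```clojure
(require '[clojure.java.io :as io])

;; The reader is ephemeral: once consumed, it cannot be re-read.
;; Caching is what lets the seq built over it behave persistently.
(with-open [r (io/reader "notes.txt")]   ; any text file
  (let [lines (line-seq r)]
    ;; The first count consumes the reader; the second walks the
    ;; cached cells and returns the same number.
    [(count lines) (count lines)]))
```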

That said, I think there is certainly room for a stream/generator
model, especially for I/O, but also for more efficient collection
processing. Such a thing is explicitly one-pass and ephemeral. It will
not have the interface of first/rest, nor Java's thread-unsafe
hasNext/next iterator model (shared by Scala). You can obviously build
seqs from streams/generators, and in my model, with a single
definition you will get both a stream and seq version of functions
like map and filter, as I showed here:

http://groups.google.com/group/clojure/msg/53227004728d6c54

Note also that filter/map etc are not part of these abstractions,
though they can be defined on top of both.

Stream/generators and a corresponding map/filter library on them will
give you other options for one-pass processing, but lazy seqs work
pretty well right now.

Rich

Mark Engelberg

Dec 13, 2008, 2:18:03 AM12/13/08
to clo...@googlegroups.com
On Fri, Dec 12, 2008 at 9:28 PM, Rich Hickey <richh...@gmail.com> wrote:
> I think it's very important not to conflate different notions of
> sequences. Clojure's model a very specific abstraction, the Lisp list,
> originally implemented as a singly-linked list of cons cells. It is a
> persistent abstraction, first/second/third/rest etc,

OK, I think I see where you're going with this. It sounds like you're
saying that one of the key ideas here is that the first/rest interface
is meant to guarantee a certain kind of persistence. If I say (first
(rest coll)), it should always give me the same thing. If you designed
first/rest to work on uncached sequences, most would work this way,
but there is certainly no guarantee, depending on the nature of the
generating function. Since you want the interface to imply this sort
of guarantee, you have no choice but to cache anything involving
first/rest (not including things like "seq"ified vectors which are
inherently cached). This makes sense.
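[The guarantee Mark describes can be demonstrated directly — an
editor's sketch, not from the thread. Even over a side-effecting
generator, a cached lazy seq gives repeatable reads.]

```clojure
;; rand is impure, but the seq caches each element the first time it
;; is realized, so (first (rest s)) yields the same value every time.
(let [s (repeatedly 5 rand)]
  (= (first (rest s))
     (first (rest s))))   ; => true, thanks to caching
```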

So the piece you're working on is to provide better support for things
that are determined by their generating functions. You are
intentionally avoiding using the terms first/rest for these "streams"
for the reasons above. Since streams will be easily convertible to
seqs, we'll be able to get the best of both worlds. We can manipulate
streams, and eventually when we pass the stream to a function that
requires seqs, it will do the conversion, and stability will be
guaranteed.

It sounds like a very promising approach, and I'm looking forward to
seeing the results. It seems like ideally, it should be as easy as
possible to manipulate the streams in stream form, so if you're
working with a stream source you can go as long as possible without
converting a stream to seq. Having map/filter/mapcat/comprehensions
work fluidly over streams as well as seqs could be hugely beneficial.

> That said, I think there is certainly room for a stream/generator
> model, especially for I/O, but also for more efficient collection
> processing. Such a thing is explicitly one-pass and ephemeral. It will
> not have the interface of first/rest, nor Java's thread-unsafe
> hasNext/next iterator model (shared by Scala). You can obviously build
> seqs from streams/generators, and in my model, with a single
> definition you will get both a stream and seq version of functions
> like map and filter, as I showed here:
> http://groups.google.com/group/clojure/msg/53227004728d6c54

OK, looking back over your sneak preview example, I have a couple
quick comments/questions.

1. You are using stream-seq to convert streams to seqs. Can't you
just make the seq function work automatically on streams to produce
the sequence, just like it does on vectors, sets, etc., rather than
have a special name for converting streams to sequences? That way,
you can pass streams to all the functions that begin with (seq coll)
and you'll get the desired behavior.
2. After you've added streams, what will the range function produce?
Consider making range produce a stream rather than a sequence.
(Somewhat relatedly, in Python, range produced a concrete list and
xrange produced a generated stream. range was so rarely used that one
of the breaking changes they made in Python 3.0 was to get rid of the
list-producing range, and start using the name "range" for the
generator-variant rather than "xrange").
3. In your sneak-preview filter function, it always produces a seq.
In the spirit of making it easy to work with streams as long as
possible without conversion, how about making filter return a stream
in the case that the input is a stream, rather than having separate
filter-stream and filter public functions? Or is it not in the spirit
of Clojure to have functions overloaded in this way?

--Mark

Rich Hickey

Dec 13, 2008, 8:51:03 AM12/13/08
to clo...@googlegroups.com

On Dec 13, 2008, at 2:18 AM, Mark Engelberg wrote:

>
> On Fri, Dec 12, 2008 at 9:28 PM, Rich Hickey <richh...@gmail.com>
> wrote:
>> I think it's very important not to conflate different notions of
>> sequences. Clojure's model a very specific abstraction, the Lisp
>> list,
>> originally implemented as a singly-linked list of cons cells. It is a
>> persistent abstraction, first/second/third/rest etc,
>
> OK, I think I see where you're going with this. It sounds like you're
> saying that one of the key ideas here is that the first/rest interface
> is meant to guarantee a certain kind of persistence. If I say (first
> (rest coll)), it should aways give me the same thing. If you designed
> first/rest to work on uncached sequences, most would work this way,
> but there is certainly no guarantee, depending on the nature of the
> generating function. Since you want the interface to imply this sort
> of guarantee, you have no choice but to cache anything involving
> first/rest (not including things like "seq"ified vectors which are
> inherently cached). This makes sense.
>

There are other attributes as well, like the concurrency properties of
lazy-cons, which guarantees once-only evaluation of first and rest.
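[An editor's sketch of that once-only guarantee, not from the thread;
`lazy-seq`, which replaced lazy-cons after this discussion, behaves the
same way. The element computation runs once per element no matter how
many times the seq is walked.]

```clojure
(def calls (atom 0))

;; Each element's computation increments the counter as a side effect.
(def xs (map (fn [x] (swap! calls inc) (* x x)) (range 5)))

(dorun xs)   ; first traversal computes every element
(dorun xs)   ; second traversal hits the cache
@calls       ; => 5, not 10
```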

> So the piece you're working on is to provide better support for things
> that are determined by their generating functions. You are
> intentionally avoiding using the terms first/rest for these "streams"
> for the reasons above. Since streams will be easily convertable to
> seqs, we'll be able to get the best of both words. We can manipulate
> streams, and eventually when we pass the stream to a function that
> requires seqs, it will do the conversion, and stability will be
> guaranteed.
>
> It sounds like a very promising approach, and I'm looking forward to
> seeing the results. It seems like ideally, it should be as easy as
> possible to manipulate the streams in stream form, so if you're
> working with a stream source you can go as long as possible without
> converting a stream to seq. Having map/filter/mapcat/comprehensions
> work fluidly over streams as well as seqs could be hugely beneficial.
>

Yes, those things will be available for streams.

>> That said, I think there is certainly room for a stream/generator
>> model, especially for I/O, but also for more efficient collection
>> processing. Such a thing is explicitly one-pass and ephemeral. It
>> will
>> not have the interface of first/rest, nor Java's thread-unsafe
>> hasNext/next iterator model (shared by Scala). You can obviously
>> build
>> seqs from streams/generators, and in my model, with a single
>> definition you will get both a stream and seq version of functions
>> like map and filter, as I showed here:
>> http://groups.google.com/group/clojure/msg/53227004728d6c54
>
> OK, looking back over your sneak preview example, I have a couple
> quick comments/questions.
>
> 1. You are using stream-seq to convert streams to seqs. Can't you
> just make the seq function work automatically on streams to produce
> the sequence, just like it does on vectors, sets, etc., rather than
> have a special name for converting streams to sequences? That way,
> you can pass streams to all the functions that begin with (seq coll)
> and you'll get the desired behavior.

No you can't, for the same reasons you can't for Iterator or
Enumeration seqs. Again it comes down to abstractions, and the
abstraction for (seq x) is one on persistent collections. It presumes
that (seq x) is referentially transparent, which it isn't for
ephemeral sources - i.e. streams aren't colls. It would become quite
fragile if people had to adopt call-once-only semantics for seq -
ephemerality would pollute the world. That said, there are a growing
number of these ephemeral-source-seq functions which could be folded
into one multimethod.

Also note that converting a stream to a seq may involve I/O (as may
any consumption of a stream, and stream->seq will call next!), and
thus will have different concurrency semantics than (seq coll).

>
> 2. After you've added streams, what will the range function produce?
> Consider making range produce a stream rather than a sequence.
> (Somewhat relatedly, in Python, range produced a concrete list and
> xrange produced a generated stream. range was so rarely used that one
> of the breaking changes they made in Python 3.0 was to get rid of the
> list-producing range, and start using the name "range" for the
> generator-variant rather than "xrange").

If you look in the svn you'll see that Range already implements
Streamable, and so can be used in either context.

>
> 3. In your sneak-preview filter function, it always produces a seq.
> In the spirit of making it easy to work with streams as long as
> possible without conversion, how about making filter return a stream
> in the case that the input is a stream, rather than having separate
> filter-stream and filter public functions? Or is it not in the spirit
> of Clojure to have functions overloaded in this way?

I'm still on the fence about overloading by type here. I'm not fond of
sticking "-stream" on everything either, though any prefix/suffix will
be short. What you get back from (map f aseq) and (map f astream)
would be two very different things, and without declared types you
won't know from looking at the code what you are dealing with. That's
tricky. Also, as discussed earlier, when transitioning from streams to
the easier-to-use seqs there will need to be a once-only explicit
conversion. There are also issues about being able to partition by
stream(able)/seq(able) - will nothing be both?

Anyway, these are just details, the design is pretty well along (e.g.
the stream system will be queue and timeout savvy) and I don't see any
impediments other than needing to cut a release soon :)

Rich

Mark Engelberg

Dec 15, 2008, 1:48:07 AM12/15/08
to clo...@googlegroups.com
On Sat, Dec 13, 2008 at 5:51 AM, Rich Hickey <richh...@gmail.com> wrote:
> No you can't, for the same reasons you can't for Iterator or
> Enumeration seqs. Again it comes down to abstractions, and the
> abstraction for (seq x) is one on persistent collections. It presumes
> that (seq x) is referentially transparent, which it isn't for
> ephemeral sources - i.e. streams aren't colls. It would become quite
> fragile if people had to adopt call-once-only semantics for seq -
> ephemerality would pollute the world. That said, there are a growing
> number of these ephemeral-source-seq functions which could be folded
> into one multimethod.

Okay, so you've got one abstraction (seq) for persistent collections
(and therefore caches to guarantee persistence), and one abstraction
(streams) that is for ephemeral sources, and you want to build a
barrier between them (so the user has to explicitly call stream-seq to
make the conversion).

But don't forget that there are collections which are persistent (and
therefore, it would be intuitive to use the first/rest sequence
abstraction with them), yet too big to cache (e.g., sequence of all
permutations of an 11 item list), and/or so cheap to compute that it
makes no sense to cache (e.g., (range 100)). I am particularly
interested to see what happens with these "persistent streams" for
which a sequence abstraction DOES make sense. Perhaps one avenue is
to annotate certain streams as being from non-ephemeral sources, and
therefore they can automatically be passed to seq-based functions, and
remain uncached through various transformations like
rest/map/filter/etc.

Paul Mooser

Jan 5, 2009, 8:11:25 PM1/5/09
to Clojure
Sorry to resurrect this, but I noticed that there isn't an issue to
track this - is this something unlikely to be fixed officially for
1.0? The workaround you posted certainly works for me, but I just
wanted to make sure the actual core.clj filter implementation receives
the fix eventually.

On Dec 8 2008, 6:51 pm, Rich Hickey <richhic...@gmail.com> wrote:
> On Dec 8, 2008, at 8:56 PM, Stephen C. Gilardi wrote:
>
>
>
> > I think I finally see the problem. The "rest expression" in filter's
> > call to lazy-cons has a reference to "coll" in it. That's all it
> > takes for coll to be retained during the entire calculation of the
> > rest.
>
> > (defn filter
> >  "Returns a lazy seq of the items in coll for which
> >  (pred item) returns true. pred must be free of side-effects."
> >  [pred coll]
> >    (when (seq coll)
> >      (if (pred (first coll))
> >        (lazy-cons (first coll) (filter pred (rest coll)))
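[For readers following along today: the fix that eventually landed has
the shape below — an editor's sketch, not the actual patch. `lazy-seq`
defers the whole step, so the closure holds `coll` only until the step
runs, and the compiler's locals clearing lets the consumed head be
collected. The real clojure.core/filter is chunked and more involved;
this is the unchunked idea.]

```clojure
(defn my-filter
  "Unchunked sketch of a head-safe lazy filter."
  [pred coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (if (pred (first s))
        (cons (first s) (my-filter pred (rest s)))
        (my-filter pred (rest s))))))

;; Sparse matches over a long range no longer pin the scanned items:
(take 2 (my-filter #(zero? (rem % 1000000)) (range 10000000)))
;; => (0 1000000)
```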