This may not be worth much, but can you see the data members of that
object? It's not itself the head of a cons chain, presumably, so I'm
wondering if the data member that *is* at the head has a useful name.
--Chouser
(defn splode [index-path]
  (with-local-vars [doc-count 0]
    (doseq [document (filter my-filter-pred (document-seq index-path))]
      (var-set doc-count (inc @doc-count)))
    'done))
> That's a good idea, Steve - I didn't totally understand the code you
> included, but I do always forget that clojure has destructuring in its
> binding forms, so I rewrote it like this, which I believe should be
> fine (correct me if I am wrong):
Thanks for the polite correction to my incorrect code. :)
What you have looks correct to me.
> (defn splode2 [index-path]
>   (with-local-vars [doc-count 0]
>     (loop [[document & rest-documents]
>            (filter my-filter-pred (document-seq index-path))]
>       (when document
>         (var-set doc-count (inc @doc-count))
>         (recur rest-documents)))
>     'done))
Some further simplifications with nearly equivalent effect:
(defn splode3 [index-path]
  (loop [[document & rest-documents]
         (filter my-filter-pred (document-seq index-path))]
    (when document
      (recur rest-documents))))
(defn splode4 [index-path]
  (loop [myseq (filter my-filter-pred (document-seq index-path))]
    (when (first myseq)
      (recur (rest myseq)))))

(defn splode5 [index-path]
  (dorun (filter my-filter-pred (document-seq index-path))))
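A quick illustration of the distinction splode5 relies on: dorun walks a
lazy seq purely for side effects and returns nil without retaining the
head, while doall forces the seq and returns it, keeping every element
alive.

```clojure
;; dorun consumes the seq element by element and returns nil;
;; the head is not retained, so realized elements can be GC'd
(dorun (map #(* % %) (range 5)))   ;=> nil

;; doall forces the seq and returns it, so the whole result is retained
(doall (map #(* % %) (range 5)))   ;=> (0 1 4 9 16)
```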
--Steve
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
To unsubscribe from this group, send email to clojure+u...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---
I also saw your subsequent example, which uses a different anonymous
function and does NOT blow up, and that's very interesting. I'm not
sure why this would be, but it seems that filter ends up holding on to
the collection it's filtering, internally, from the point at which it
first matches - I think the second one doesn't blow up because the
first match doesn't happen until almost the end.
In the current implementation of filter, coll is held the entire time
that (rest coll) is calculated on the first match.
The following separation into two functions appears to solve it. I'll
be looking at simplifying it.
If you use a definition of filter like this in your test, I think it
will succeed:
(defn filter-iter
  [pred coll]
  (when (seq coll)
    (if (pred (first coll))
      [(first coll) (rest coll)]
      (recur pred (rest coll)))))

(defn filter
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (let [result (filter-iter pred coll)]
    (when result
      (lazy-cons (result 0) (result 1)))))
--Steve
> The following separation into two functions appears to solve it.
> I'll be looking at simplifying it.
Except your version of filter doesn't do any filtering on the rest in
the case where the first satisfies the predicate.
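Concretely (an editorial reconstruction with hypothetical names, using
today's lazy-seq in place of the era's lazy-cons): once the first
element matches, the unfiltered tail is spliced in directly.

```clojure
;; reconstruction of the bug: `more` is consed on without being
;; filtered, so once one element matches, everything after it leaks
(defn filter-iter [pred coll]
  (when (seq coll)
    (if (pred (first coll))
      [(first coll) (rest coll)]
      (recur pred (rest coll)))))

(defn broken-filter [pred coll]
  (lazy-seq
    (when-let [[x more] (filter-iter pred coll)]
      (cons x more))))            ; bug: tail passes through unfiltered

(broken-filter even? [2 3 4 5])   ;=> (2 3 4 5), should be (2 4)
```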
I'm starting to wonder whether there might be a fundamental bug in the
java implementation of LazyCons. Maybe it doesn't implement "first"
correctly?
> Well, part of the puzzle is to figure out why filter works just fine
> on the output of the range function, but not on the output of the map
> function.
Good point and my previous analysis was all wet. lazy-cons is a macro.
Its arguments aren't evaluated until later.
It's fun seeing some more detail about how it all works.
I look forward to seeing a solution!
--Steve
I think I can reproduce this one like so:
user=> (count (take 15000 (iterate #(str % "more") "some")))
java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:0)
As with yours, I can replace 'count' with 'dorun' and it works fine.
I can also use 'last':
user=> (.length (last (take 15000 (iterate #(str % "more") "some"))))
60000
I think the problem in this case is 'count', which for all
IPersistentCollections (including lazy sequences) calls the 'count'
method of the instance. ASeq's count method is a tight for() loop,
but since it's an instance method it must retain a 'this' reference to
the head of the seq.
Fixing this is hard because RT.count() is holding onto the head as
well. I've attached a patch that fixes the problem, but it's pretty
ugly, perhaps only useful to demonstrate that this is the problem.
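The patch itself isn't reproduced here, but the shape of the fix can be
sketched (hypothetical helper, not the actual patch): a count written as
an ordinary loop rebinds its local to the tail each step, so nothing
keeps a reference to the head.

```clojure
;; hypothetical head-safe count: s is rebound to the tail on every
;; iteration, so earlier cells become unreachable as we go
(defn safe-count [coll]
  (loop [s (seq coll), n 0]
    (if s
      (recur (next s) (inc n))
      n)))

(safe-count (range 100000))   ;=> 100000
```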
--Chouser
The simplest form of the bug was this:
(defn splode [n]
  (doseq [i (filter #(= % 20) (map inc (range n)))]))
This blows the heap, but it shouldn't.
I find this deeply troubling, because if this doesn't work, it
undermines my faith in the implementation of lazy sequences, which are
quite ubiquitous in Clojure. map and filter are written in the most
natural way in boot.clj, so if they don't work properly, it means that
anything written with lazy-cons is suspect.
I have an idea to try, but I'm not set up to build the java sources on
my computer, so maybe someone else can run with it:
Right now, LazyCons.java takes one function which includes both the
knowledge of how to create the first and the rest.
Perhaps LazyCons.java needs to be implemented so that the constructor
takes two separate functions, one that generates the first, and one
that generates the rest. This way, as soon as the first is generated,
it can be cached, and the first-generating-function can be set to
null. Similarly, when the rest is generated, it will be cached, and
its generating function can be set to null. So if the problem is
being caused by the first-generating-function keeping something alive
that needs to be garbage collected, this would solve it, by releasing
it as soon as possible.
The lazy-cons macro in boot.clj would also need to be changed to call
the new LazyCons constructor with two functions, rather than just one.
Of course, if the problem is that the rest-generating-function is
holding onto something, this will not solve the problem.
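A rough Clojure sketch of the cache-then-release behavior described
above (an editorial illustration with hypothetical names, ignoring the
thread-safety the real LazyCons.java would need):

```clojure
;; a one-shot cell: computes its value on first call, caches it, and
;; drops the generating function so anything it closed over can be GC'd
(defn releasing-delay [f]
  (let [state (atom {:f f})]
    (fn []
      (let [{g :f :as s} @state]
        (if g
          (let [v (g)]
            (reset! state {:v v})   ; thunk released here
            v)
          (:v s))))))

(def cell (releasing-delay #(+ 1 2)))
(cell)   ;=> 3, computed once; later calls are served from the cache
```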
--Mark
> I have an idea to try, but I'm not set up to build the java sources
> on my computer, so maybe someone else can run with it:
This looked very promising to me. For one thing, I remembered that
the root of the big chain of LazyCons objects in memory (as displayed
by the YourKit profiler) was "f".
(defn filter
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (when (seq coll)
    (if (pred (first coll))
      (lazy-cons (first coll) (filter pred (rest coll)))
      (recur pred (rest coll)))))
The style of fix that occurs to me involves peeling off coll from
(frest coll) and (rrest coll) before continuing the lazy evaluation.
Posting so someone else can beat me to the answer. :-)
--Steve
Well, I had already tried this, eagerly evaluating the first and rest,
and it didn't help:
(defn map2
  [f coll]
  (when (seq coll)
    (let [fcoll (first coll) rcoll (rest coll) ffcoll (f fcoll)]
      (lazy-cons ffcoll (map2 f rcoll)))))

(defn filter2
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (when (seq coll)
    (let [fcoll (first coll) rcoll (rest coll)]
      (if (pred fcoll)
        (lazy-cons fcoll (filter2 pred rcoll))
        (recur pred rcoll)))))

(defn splode [n]
  (doseq [document (filter2 #(= % 20) (map2 inc (range n)))]))
> Well, I had already tried this, eagerly evaluating the first and rest,
> and it didn't help: [...]
I'm sorry I haven't chimed in sooner. I fully understand this. Yes,
it's the closure over coll in the rest portion, which means that after
finding some match, a subsequent call to rest that needs to skip a lot
will create a window over the interval where the seq will be realized.
Eagerly evaluating (rest coll) won't work - the result will still be
in the closure while the recursion occurs. The only solutions involve
mutation in filter. Here's one way:
(defn filter
  [pred coll]
  (let [sa (atom (seq coll))
        step (fn step []
               (when-let [s @sa]
                 (let [x (first s)]
                   (if (pred x)
                     (lazy-cons x (do (swap! sa rest) (step)))
                     (do (swap! sa rest)
                         (recur))))))]
    (step)))
But it's not pretty. Fortunately, this segues with work I have been
doing on I/O and generators/streams. This will let you write things
like:
(defn filter-stream
  "Returns a stream of the items in strm for which
  (pred item) returns true. pred must be free of side-effects."
  [pred strm]
  (stream
    #(let [x (next! strm)]
       (if (or (eos? x) (pred x)) x (recur)))))

(defn filter
  "Returns a lazy seq of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (stream-seq (filter-stream pred (stream coll))))
I'm still not done with this yet, so anyone who is stuck can use the
former version.
Rich
> (defn filter
>   [pred coll]
>   (let [sa (atom (seq coll))
>         step (fn step []
>                (when-let [s @sa]
>                  (let [x (first s)]
>                    (if (pred x)
>                      (lazy-cons x (do (swap! sa rest) (step)))
>                      (do (swap! sa rest)
>                          (recur))))))]
>     (step)))
>
I converted it into:
(defn safe-filter
  [pred coll]
  (let [sa (ref (seq coll))
        step (fn step []
               (when-let [s @sa]
                 (let [x (first s)]
                   (if (pred x)
                     (lazy-cons x (do
                                    (dosync (ref-set sa (rest s)))
                                    (step)))
                     (do
                       (dosync (ref-set sa (rest s)))
                       (recur))))))]
    (step)))
But splode still splodes!

(defn splode [n]
  (doseq [i (safe-filter #(= % 20) (map inc (range n)))]))
Any idea why the ref version wouldn't be working?
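One conjecture (an editorial guess, not an answer from the thread): the
rest thunk in safe-filter closes over `s` itself via `(rest s)`, which
keeps the chain of heads reachable, whereas Rich's version only ever
applies `rest` to the ref's *current* value inside `swap!`. A hedged
sketch of a ref-based variant that advances the ref before building the
lazy cell, written with today's lazy-seq/next in place of the era's
lazy-cons/rest:

```clojure
;; hypothetical: the lazy tail closes over x and step only, never s
(defn safe-filter2 [pred coll]
  (let [sa (ref (seq coll))
        step (fn step []
               (when-let [s @sa]
                 (let [x (first s)]
                   (dosync (alter sa next))   ; advance before capturing
                   (if (pred x)
                     (lazy-seq (cons x (step)))
                     (recur)))))]
    (step)))

(safe-filter2 even? (range 10))   ;=> (0 2 4 6 8)
```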
> But it's not pretty. Fortunately, this segues with work I have been
> doing on I/O and generators/streams.
I'm really looking forward to seeing how this all turns out. Cached
lazy sequences seem to be a bad default for all the standard sequence
operations. You have to be very careful not to retain the head of one
of these sequences (can't give names to intermediate results, for
example), and it's very hard to predict when the head of one of these
sequences might be unintentionally held. This seems to make code more
brittle. Probably the best solution is to default to sequences that
always freshly generate the results, and you have to intentionally
cache the sequence if that is what you want.
Mark
I think this is a great discussion to have, so let me see if I can
articulate my thoughts a little better.
First, let's talk about laziness, because I think we're in agreement
here. Laziness is undoubtedly a very powerful tool. In Haskell,
laziness is the default. Every single function call is automatically
lazy. In LISP/Scheme, laziness is optional. It's something you have
to specifically opt into using delay and force. Which is better?
Peter Van Roy, the designer of the Oz programming language, has argued
in many places that laziness is not a good default for a
general-purpose programming language. This is also a point he
discusses in his excellent book "Concepts, Techniques, and Models of
Computer Programming". He has probably explained this point better
than I can, so I won't belabor it here, but in a nutshell, it all
comes down to the fact that laziness makes the performance (especially
space performance) of your program much harder to analyze. It is all
too easy to write a Haskell program that blows up or performs poorly,
and you have no idea why. You can get back some performance by
opting-out of the laziness with strictness annotations, but it can be
difficult to figure out how or where to do this. To put it simply,
it's easier to write high-performing code when you opt-in to laziness
when you need it, rather than trying to figure out how to opt out when
it's wrecking things for you.
I assume that you agree with this, since you have chosen the explicit
force/delay "opt-in laziness" model for Clojure.
So now let's talk about sequences, specifically the behavior of
lazily-generated sequences. A similar choice needs to be considered.
It is easy to imagine two possible implementations of lazy-cons.
lazy-cons-cached would work exactly as lazy-cons currently does.
lazy-cons-uncached would be similar, but would not cache its first and
rest values when computed. Sequences built with either version of
lazy-cons would both respond to the same sequence interface, and would
therefore behave the same, producing the same results as long as the
generating functions don't refer to mutable data, but possibly with
different performance profiles. lazy-cons-cached sequences might run
faster when traversing the same lazy list more than once, but also
might crash your program by consuming too much memory in places where
lazy-cons-uncached would not. It's a tradeoff, and both versions of
lazy-cons are potentially useful, but which should be the default, or
more specifically, which should be the default used by all of
Clojure's library functions like map, filter, etc.? Should these
lazily-generated sequences be automatically cached, or not?
The first question is, does it matter? I would say that yes, this
design decision matters a great deal. I think I remember you saying
somewhere (in a post here, or possibly in one of your presentations)
that sequences, especially lazy sequences, have become a far more
important part of Clojure programming than you envisioned at the
outset. And it is easy to see why lazy sequences take on a
significant role in a fairly pure functional programming language:
People coming from imperative languages often ask how you code up
certain kinds of for/while loops that are used to traverse data
structures and build up new ones. Let's say you want to build a list
of the even squares of the whole numbers below 100. One literal way
to formulate this problem in an imperative language (taking Python 2.6
syntax as a sample) would be:
l = []
for i in xrange(100):
    square = i * i
    if square % 2 == 0:
        l.append(square)
Now in Clojure, things like this can be written using loop/recur. But
any newcomer to Clojure will certainly be told that there's a more
elegant way to express this concept:
(def l (filter even? (map #(* % %) (range 100))))
or the equivalent comprehension.
The newcomer's first reaction is to say, "Whoa, but won't that have
poor performance? You're generating an awful lot of big, intermediate
lists." And the answer is usually, "Well, it's lazy, so no, in fact,
this elegant form behaves like the iterative version. You can have
your cake and eat it too."
You said, "I'm not sure where you are getting your presumptions about
lazy sequences. They are not a magic bullet that makes working with
data bigger than memory transparent." Well, actually, I know that
cached lazy sequences do not always behave like their iterative
counterparts. But this is where these presumptions are born.
Iterative stuff is ugly in functional languages. We are told to use
comprehensions, map, filter, and tricks like the one in Stuart
Holloway's blog post about writing an indexed function, which builds a
temporary sequence of index/value pairs to avoid certain imperative
patterns. And we expect the behavior to be similar, because most of
the time it is.
When something blows up from some subtle capturing of a lazy list, we
feel frustrated and betrayed.
Next, let's consider whether it is easier to opt-in or opt-out of
caching your sequences. Opting out of sequence caching is very
difficult. Right now, all the built-in sequence functions use a
cached lazy-cons, and there is no way to undo that caching. The first
line of defense is to make sure that you don't give an intermediate
temporary name to any of the lazy lists you are transforming. In
other words, you should never do something like:
(def squares (map #(* % %) (range 100)))
(def even-squares (filter even? squares))
This kind of thing could crash your program with large lists. It's
irritating to have to be careful about this, because it seems
intuitive that giving a temporary name to something shouldn't be the
kind of thing that could crash your program, but okay, let's say you
learn to be really careful and to avoid naming your lists.
But then, sometimes your lists get captured by closures, or other
subtle corner cases, and it just gets frustrating. Using lazy
sequences, which are pervasive in Clojure, should not be so difficult.
Right now the only solution is to rewrite one's code to use
loop/recur, and it sounds like eventually there will be an alternative
"generators" system which you say you're envisioning more as an I/O
tool with a special interface, rather than as a general-purpose form
of sequences.
On the other hand, opting in to sequence caching is very easy. You
only gain the benefits of a cached sequence when you use it twice,
which inherently means that you have to name it or bind it to
something in order to refer to it multiple times. So if I'm going to
be traversing a sequence multiple times, I just opt in by calling a
function to turn it into a cached sequence. (Let's call it
"cached-seq". I know there is such a function in the Clojure API
already, but I haven't really verified that it works the way I mean
here. So if I'm using the term in a different way from the way that
cached-seq actually works in the current API, please bear with me for
the sake of argument.)
So with an opt-in system, where range, map and filter use
lazy-cons-uncached, you could easily write something like:
(def squares (map #(* % %) (range 100)))
(def even-squares (cached-seq (filter even? squares)))
This means that squares behaves exactly like its iterative
counterpart, despite the fact that I named it, and even-squares I'm
setting up with caching because I intend to use it repeatedly later in
my program.
Is the opt-out or opt-in system more intuitive, and which is easier
to analyze from a correctness and performance standpoint? I'd argue
that the opt-out system has serious problems on both counts: it is
unintuitive, and it makes correctness and performance bounds hard to
confirm.
Programmers really want these things (most of the time) to behave like
the iterative loops they are replacing. It can be very subtle to
detect the kinds of things that can cause an explosion of memory
consumption. The filter-map explosion issue is a great case in point.
You can test it on a wide variety of inputs, even big inputs, and it
seems to work fine. But then when you pass it a combination of filter
and map that results in a large "window" between filtered elements in
the mapped list, it blows up. Even if you patch the specific behavior
of filter, this is indicative of a larger problem. filter is
currently written in the most natural way using lazy-cons. Other
programmers are going to write similar functions using lazy-cons, and
these programs will all be brittle and prone to unpredictable failure
on certain kinds of large inputs.
On the other hand, the opt-in system is fairly straightforward to
understand. If a sequence is not explicitly cached, you can expect no
memory to be consumed by a traversal, and the traversal time will be
the same every time you execute it. Caching a sequence becomes
explicit, and is then clearly identified in the code as such.
>
> Not caching would lead to excessive recalculation in many contexts, and
> then people would be complaining about performance.
Let's talk about performance. First, I would claim that the vast
majority of lazy sequences (especially the really big ones), are used
as one-off temporary sequences to more elegantly express something
that would be expressed via looping in an imperative language. So for
these cases (which I believe to be the most common case), you gain
nothing by caching, and in fact, you lose something (a small amount of
increased memory consumption / garbage collection, and increased
brittleness and unpredictability).
Then there are some sequences which are traversed more than once, but
the computation is fairly trivial (e.g., the output of range). In
these situations, it's probably a wash. I remember reading a paper
not long ago (can't remember exactly which one, sorry) that pointed
out that most programmers' intuitions about the need to cache
intermediate results are often wrong, and simple computations should
just be recomputed. This is because chips have gotten really fast,
and one of the biggest performance hits these days is a cache miss, so
if you store something, and it requires the program to go off and look
in the "slow part of memory" to find it, you're much worse off than if
you had just recomputed.
Sometimes, if you know you are going to be traversing a sequence more
than once, you would be better off converting it to an explicitly
realized list or putting the data in a vector.
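For example (an editorial sketch of the point above):

```clojure
;; realize once into a vector: repeated traversals and random access
;; are then cheap, and the cost is easy to reason about
(def even-squares (vec (filter even? (map #(* % %) (range 100)))))

(count even-squares)   ;=> 50
(nth even-squares 3)   ;=> 36
```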
So this leaves what I believe to be a small set of cases where you
genuinely need a cached-but-lazy list. Yes, make this a possibility;
I don't deny that it is useful. But I think your fears about people
complaining about performance if you change the default behavior are
unfounded. It's relatively easy to say to people, "If you need better
performance when traversing a lazy sequence multiple times, you may
benefit from explicitly realizing the result of the intermediate lazy
computations, or using a cached lazy sequence if that's what you
need." On the other hand, if you keep things as they are, I can
pretty much guarantee that you will be faced with ongoing posts to the
list of, "Help! I've got this program that is giving me an out of
memory error, and I can't figure out why."
One issue is that you'd want to make sure that cached-seq behaves
intelligently when someone tries to cache something that's essentially
already cached. That way people could more easily write generic code
that safely calls cached-seq on various collections to guarantee
certain kinds of time-performance bounds with repeat traversals. For
example, calling cached-seq on a vector shouldn't really do anything.
But there are already plenty of precedents and facilities in Clojure
for handling this sort of thing. We already expect seq to "do the
right thing" and not add extra layers of "sequifying" to something
that already is a sequence. One idea is to have a meta tag of :cached
that applies to things like vectors, sets, maps, lists, but not lazy
lists built with the lazy-cons-uncached variant I've proposed in this
post. cached-seq acts like an ordinary seq on things with the :cached
tag. cached-seq of a lazy-cons-uncached cell would essentially just
construct a lazy-cons-cached cell with the same first and rest
function, but with the cache fields added. In the general case, cached-seq
would generate a lazy list built with lazy-cons-cached. Of course,
lazy-cons-cached cells would be marked with a :cached meta tag, and
would guarantee that the rest, when realized, is also made into a
cached sequence.
> There are many
> benefits of the thread-safe once-and-only-once evaluation promise of
> lazy-cons that you may not be considering. It is certainly not a bad
> default.
Well, I've tried as best as I can to articulate my reasoning that
lazy-cons-cached is in fact a bad default. It is potentially
unintuitive, hard to verify correctness and performance bounds, and
difficult to opt out of. The Scala programming language offers both
Streams (uncached lazy lists) and LazyLists (cached lazy lists), and
generally uses Streams as the default. F# uses Seqs (uncached lazy
lists) and LazyLists (cached lazy lists), and generally uses Seqs as
the default. Both of these languages are among the closest comparisons to
Clojure, in terms of meshing practical functional programming within
an existing imperative VM, and they have both made this design choice
to favor uncached lazy lists. And in fact, it turns out that in those
languages, uncached lazy lists end up rarely used. And of course,
languages like Python get along just fine with "generators" and no
built-in cached variant at all, other than explicitly realizing to a
list. Haskell, of course, is the one exception. They use
cached-lazy-lists by default, but then again, they really have to,
because it's the only sensible thing in a language where everything is
lazy by default.
> There will soon be a streams/generator library, intended for i/o, that
> will do one-pass non-cached iteration. It will offer a different set
> of tradeoffs and benefits, including reduced ease-of-use - not being
> persistent removes a lot of things like first/rest/nth, destructuring
> etc.
Right, I'm talking about making something that works just like any
other sequence, supporting first/rest/nth and destructuring.
>
> But the important thing is tradeoffs/benefits. They always come
> together.
Yes, but hopefully I can convince you that in this case, the tradeoffs
fall clearly on the side of defaulting to uncached lazy lists.
--Mark
Not being nearly sophisticated enough in Clojure, FP or the relevant
concepts to say anything other than "that all makes complete sense to
me," I wonder only what would be the impact on existing programs were
the default to be switched as you suggest? Or, relatedly, how would you
go about making this transition while simultaneously minimizing
breaking changes? And lastly, has Clojure reached the point where
breaking changes are taboo?
Randall Schulz
Sorry, in that particular sentence I said the opposite of what I
meant. I meant that cached lazy lists are rarely used in those
languages.
Although I'm a relative newbie to Clojure, I've spent a lot of time
using a wide variety of languages. When Rich's reaction was
"everything mutable breaks without caching", my initial reaction was
astonishment that he perceives that to be the common case in a
language where mutability is mostly shunned. I have trouble thinking
of a case where I'd want to put something mutable in a lazy list. But
of course, he's spent more time with Clojure than anyone, so I don't
doubt his experience on this matter. So then the question becomes,
why is my experience so different in other languages?
Taking Scala as an example, I can think of two things that might make
it more suitable in that language to work with non-cached lazy lists
as a default.
First, in Scala there is richer syntactic support for mutable,
imperative-style programming when you need it. So you mostly use
laziness, comprehensions, etc. with your immutable, functional-style
code. But when you're working with mutable stuff (like interop with
Java), you go ahead and code with for/while loops, assignment, and it
doesn't feel particularly gross.
Second, Scala has a more polymorphic approach to things like map and
filter. If you map a vector, you get back a vector. If you filter a
concrete list, you get back a concrete list. Comprehension syntax is
essentially a macro that expands to combinations of map, filter, and
flatMap (analogous to Clojure's mapcat, I think), so your
comprehension output is also determined by the type of the first
collection in the comprehension. So if you're working with mutable
data, you'd be storing it in a different kind of collection anyway
(like a vector, or a cached lazy list), and all the map, filter, etc.
would work the way you'd expect it to.
The first point can just be chalked up to rather fundamental design
differences in the two languages. Incorporating this kind of
"imperative subsystem" would clutter up Clojure's elegance.
As to the second point, it's not inconceivable to do something like
that in Clojure. Clojure's multimethods can certainly support such a
thing. But certainly Scala's approach has a downside because
sometimes you don't want a comprehension to build the same thing as
the source collection, and converting between them can be inefficient.
There's something rather nice about the way Clojure always returns a
bland sequence that can essentially be "realized" into anything you
want. It just seems unfortunate to me that a consequence of this is
that all these output sequences are automatically of the cached
variety.
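As a sketch of what the multimethod approach mentioned above could look
like (hypothetical names, an editorial illustration only):

```clojure
;; hypothetical type-preserving map: dispatch on the collection's class
(defmulti map-like (fn [f coll] (class coll)))

;; vectors come back as vectors
(defmethod map-like clojure.lang.IPersistentVector
  [f coll] (into [] (map f coll)))

;; everything else falls back to an ordinary seq
(defmethod map-like :default
  [f coll] (map f coll))

(map-like inc [1 2 3])    ;=> [2 3 4], still a vector
(map-like inc '(1 2 3))   ;=> (2 3 4), a seq
```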
Perhaps there's some sort of middle ground where Clojure can always
return a lazy sequence, but be a bit more intelligent about choosing
the right variety depending on the nature of the input, but it's not
immediately apparent to me how one would do that. If I get a chance,
I'll definitely play around with some of these ideas, as Rich
suggested, although my "common case" programming seems to be different
from his. I got burned by the filter issue on my very first Clojure
program, when I tried to filter a lazy stream of 10-digit permutations
to find all the permutations with a certain property. The
permutations which satisfied the property were far enough apart that
it caused a problem. This is the kind of program I typically write.
In the meantime, I'm definitely looking forward to seeing Rich's new
generator approach. Maybe having another way to tackle the problem
cases will make a lot of my worries about this issue go away.
--Mark
P.S. I don't want to sound too negative, so I'll mention here that
there are several things I *love* about Clojure. First, the way
operations on so many of the data structures are unified through the
sequence interface. Second, multimethods (haven't seen much
multimethod action since Dylan, and I've always loved them). Third,
many little touches like making maps, sets, vectors, and keywords
also act like functions, which contributes to readable brevity.
I wasn't talking at all about lists of mutable things. In Clojure,
people often build a sequence from an imperative/ephemeral source. The
caching becomes very important in these cases.
> As to the second point, it's not inconceivable to do something like
> that in Clojure. Clojure's multimethods can certainly support such a
> thing. But certainly Scala's approach has a downside because
> sometimes you don't want a comprehension to build the same thing as
> the source collection, and converting between them can be inefficient.
> There's something rather nice about the way Clojure always returns a
> bland sequence that can essentially be "realized" into anything you
> want. It just seems unfortunate to me that a consequence of this is
> that all these output sequences are automatically of the cached
> variety.
>
> Perhaps there's some sort of middle ground where Clojure can always
> return a lazy sequence, but be a bit more intelligent about choosing
> the right variety depending on the nature of the input, but it's not
> immediately apparent to me how one would do that. If I get a chance,
> I'll definitely play around with some of these ideas, as Rich
> suggested, although my "common case" programming seems to be different
> from his. I got burned by the filter issue on my very first Clojure
> program, when I tried to filter a lazy stream of 10-digit permutations
> to find all the permutations with a certain property. The
> permutations which satisfied the property were far enough apart that
> it caused a problem. This is the kind of program I typically write.
>
I'm sorry you encountered a bug, and will fix, but that's not an
indictment of Clojure's approach. Scala et al have had their own
problems:
http://lampsvn.epfl.ch/trac/scala/ticket/692
http://lampsvn.epfl.ch/trac/scala/ticket/498
http://groups.google.com/group/cal_language/browse_thread/thread/728a3d4ff0f77b00
Note that Clojure does do locals nulling on tail calls. It certainly
has paid attention to laziness implementation issues, a bug
notwithstanding.
> In the meantime, I'm definitely looking forward to seeing Rich's new
> generator approach. Maybe having another way to tackle the problem
> cases will make a lot of my worries about this issue go away.
>
I think it's very important not to conflate different notions of
sequences. Clojure's model is a very specific abstraction, the Lisp
list, originally implemented as a singly-linked list of cons cells. It
is a persistent abstraction (first/second/third/rest etc.); it is not
a stream, nor an iterator. Lifting that abstraction off of cons cells
doesn't change its persistent nature, nor does lazily realizing it.
After my experimentation with a non-caching version, I am convinced it
is incompatible with the abstraction. If a seq was originally
generated from an imperative source, you need it to be cached in order
to get repeatable-read persistence; if it was calculated at some
expense, you need to cache it in order to get the performance
characteristics you would expect from a persistent list. An
abstraction is more than just an interface.
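For example, here is a small REPL sketch of the repeatable-read point,
using iterator-seq over a java.util iterator (an ephemeral, one-pass
source):

```clojure
;; A Java iterator is an ephemeral, one-pass source. Wrapping it with
;; iterator-seq caches each element as it is realized, so the resulting
;; seq supports repeatable reads even after the iterator is spent.
(def src (java.util.ArrayList. [1 2 3]))
(def s (iterator-seq (.iterator src)))

(println (first s))      ; realizes and caches the first element
(println (= s [1 2 3]))  ; first full traversal
(println (= s [1 2 3]))  ; second traversal: same answer, from the cache
```

Without the caching, the second traversal would find the iterator
already exhausted and read differently from the first.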
That said, I think there is certainly room for a stream/generator
model, especially for I/O, but also for more efficient collection
processing. Such a thing is explicitly one-pass and ephemeral. It will
not have the interface of first/rest, nor Java's thread-unsafe
hasNext/next iterator model (shared by Scala). You can obviously build
seqs from streams/generators, and in my model, with a single
definition you will get both a stream and seq version of functions
like map and filter, as I showed here:
http://groups.google.com/group/clojure/msg/53227004728d6c54
Note also that filter/map etc are not part of these abstractions,
though they can be defined on top of both.
Stream/generators and a corresponding map/filter library on them will
give you other options for one-pass processing, but lazy seqs work
pretty well right now.
Rich
OK, I think I see where you're going with this. It sounds like you're
saying that one of the key ideas here is that the first/rest interface
is meant to guarantee a certain kind of persistence. If I say (first
(rest coll)), it should always give me the same thing. If you designed
first/rest to work on uncached sequences, most would work this way,
but there is certainly no guarantee, depending on the nature of the
generating function. Since you want the interface to imply this sort
of guarantee, you have no choice but to cache anything involving
first/rest (not including things like "seq"ified vectors which are
inherently cached). This makes sense.
So the piece you're working on is to provide better support for things
that are determined by their generating functions. You are
intentionally avoiding using the terms first/rest for these "streams"
for the reasons above. Since streams will be easily convertible to
seqs, we'll be able to get the best of both worlds. We can manipulate
streams, and eventually when we pass the stream to a function that
requires seqs, it will do the conversion, and stability will be
guaranteed.
It sounds like a very promising approach, and I'm looking forward to
seeing the results. It seems like ideally, it should be as easy as
possible to manipulate the streams in stream form, so if you're
working with a stream source you can go as long as possible without
converting a stream to seq. Having map/filter/mapcat/comprehensions
work fluidly over streams as well as seqs could be hugely beneficial.
> That said, I think there is certainly room for a stream/generator
> model, especially for I/O, but also for more efficient collection
> processing. Such a thing is explicitly one-pass and ephemeral. It will
> not have the interface of first/rest, nor Java's thread-unsafe
> hasNext/next iterator model (shared by Scala). You can obviously build
> seqs from streams/generators, and in my model, with a single
> definition you will get both a stream and seq version of functions
> like map and filter, as I showed here:
> http://groups.google.com/group/clojure/msg/53227004728d6c54
OK, looking back over your sneak preview example, I have a couple
quick comments/questions.
1. You are using stream-seq to convert streams to seqs. Can't you
just make the seq function work automatically on streams to produce
the sequence, just like it does on vectors, sets, etc., rather than
have a special name for converting streams to sequences? That way,
you can pass streams to all the functions that begin with (seq coll)
and you'll get the desired behavior.
2. After you've added streams, what will the range function produce?
Consider making range produce a stream rather than a sequence.
(Somewhat relatedly, in Python, range produced a concrete list and
xrange produced a generated stream. range was so rarely used that one
of the breaking changes they made in Python 3.0 was to get rid of the
list-producing range, and start using the name "range" for the
generator-variant rather than "xrange").
3. In your sneak-preview filter function, it always produces a seq.
In the spirit of making it easy to work with streams as long as
possible without conversion, how about making filter return a stream
in the case that the input is a stream, rather than having separate
filter-stream and filter public functions? Or is it not in the spirit
of Clojure to have functions overloaded in this way?
--Mark
>
> On Fri, Dec 12, 2008 at 9:28 PM, Rich Hickey <richh...@gmail.com>
> wrote:
>> I think it's very important not to conflate different notions of
>> sequences. Clojure's seqs model a very specific abstraction, the
>> Lisp
>> list,
>> originally implemented as a singly-linked list of cons cells. It is a
>> persistent abstraction, first/second/third/rest etc,
>
> OK, I think I see where you're going with this. It sounds like you're
> saying that one of the key ideas here is that the first/rest interface
> is meant to guarantee a certain kind of persistence. If I say (first
> (rest coll)), it should always give me the same thing. If you designed
> first/rest to work on uncached sequences, most would work this way,
> but there is certainly no guarantee, depending on the nature of the
> generating function. Since you want the interface to imply this sort
> of guarantee, you have no choice but to cache anything involving
> first/rest (not including things like "seq"ified vectors which are
> inherently cached). This makes sense.
>
There are other attributes as well, like the concurrency properties of
lazy-cons, which guarantees once-only evaluation of first and rest.
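A quick REPL illustration of that once-only guarantee (a
side-effecting step function, traversed twice):

```clojure
;; Each element's computation runs exactly once, no matter how many
;; times the cached seq is traversed afterwards.
(def calls (atom 0))
(def s (map (fn [x] (swap! calls inc) (* x x)) (range 5)))

(dorun s)         ; forces the whole seq: 5 computations
(dorun s)         ; traverses the cache: no new computations
(println @calls)  ; → 5
```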
> So the piece you're working on is to provide better support for things
> that are determined by their generating functions. You are
> intentionally avoiding using the terms first/rest for these "streams"
> for the reasons above. Since streams will be easily convertible to
> seqs, we'll be able to get the best of both worlds. We can manipulate
> streams, and eventually when we pass the stream to a function that
> requires seqs, it will do the conversion, and stability will be
> guaranteed.
>
> It sounds like a very promising approach, and I'm looking forward to
> seeing the results. It seems like ideally, it should be as easy as
> possible to manipulate the streams in stream form, so if you're
> working with a stream source you can go as long as possible without
> converting a stream to seq. Having map/filter/mapcat/comprehensions
> work fluidly over streams as well as seqs could be hugely beneficial.
>
Yes, those things will be available for streams.
>> That said, I think there is certainly room for a stream/generator
>> model, especially for I/O, but also for more efficient collection
>> processing. Such a thing is explicitly one-pass and ephemeral. It
>> will
>> not have the interface of first/rest, nor Java's thread-unsafe
>> hasNext/next iterator model (shared by Scala). You can obviously
>> build
>> seqs from streams/generators, and in my model, with a single
>> definition you will get both a stream and seq version of functions
>> like map and filter, as I showed here:
>> http://groups.google.com/group/clojure/msg/53227004728d6c54
>
> OK, looking back over your sneak preview example, I have a couple
> quick comments/questions.
>
> 1. You are using stream-seq to convert streams to seqs. Can't you
> just make the seq function work automatically on streams to produce
> the sequence, just like it does on vectors, sets, etc., rather than
> have a special name for converting streams to sequences? That way,
> you can pass streams to all the functions that begin with (seq coll)
> and you'll get the desired behavior.
No you can't, for the same reasons you can't for Iterator or
Enumeration seqs. Again it comes down to abstractions, and the
abstraction for (seq x) is one on persistent collections. It presumes
that (seq x) is referentially transparent, which it isn't for
ephemeral sources - i.e. streams aren't colls. It would become quite
fragile if people had to adopt call-once-only semantics for seq -
ephemerality would pollute the world. That said, there are a growing
number of these ephemeral-source-seq functions which could be folded
into one multimethod.
Also note that converting a stream to a seq may involve I/O (as may
any consumption of a stream, and stream->seq will call next!), and
thus will have different concurrency semantics than (seq coll).
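To make the shape of this concrete, here is a hypothetical sketch --
counter-stream and my-stream-seq are invented names here, not the
actual stream API -- of an ephemeral next!-style source and an
explicit, once-only conversion to a cached seq:

```clojure
;; An ephemeral "stream" modeled as a stateful next! function:
;; each call consumes one element, returning nil when exhausted.
(defn counter-stream [n]
  (let [state (atom 0)]
    (fn next! []
      (let [v (swap! state inc)]
        (when (<= v n) v)))))

;; Explicit conversion: pulls from the stream once, caching as a seq.
;; After this, the stream is spent, but the seq reads repeatably.
(defn my-stream-seq [next!]
  (lazy-seq
    (when-let [v (next!)]
      (cons v (my-stream-seq next!)))))

(def s (my-stream-seq (counter-stream 3)))
(println (= s [1 2 3]))  ; first traversal pulls from the stream
(println (= s [1 2 3]))  ; second traversal reads the cache
```

Calling my-stream-seq a second time on the same spent stream would
yield an empty seq, which is exactly the call-once-only fragility
described above.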
>
> 2. After you've added streams, what will the range function produce?
> Consider making range produce a stream rather than a sequence.
> (Somewhat relatedly, in Python, range produced a concrete list and
> xrange produced a generated stream. range was so rarely used that one
> of the breaking changes they made in Python 3.0 was to get rid of the
> list-producing range, and start using the name "range" for the
> generator-variant rather than "xrange").
If you look in the svn you'll see that Range already implements
Streamable, and so can be used in either context.
>
> 3. In your sneak-preview filter function, it always produces a seq.
> In the spirit of making it easy to work with streams as long as
> possible without conversion, how about making filter return a stream
> in the case that the input is a stream, rather than having separate
> filter-stream and filter public functions? Or is it not in the spirit
> of Clojure to have functions overloaded in this way?
I'm still on the fence about overloading by type here. I'm not fond of
sticking "-stream" on everything either, though any prefix/suffix will
be short. What you get back from (map f aseq) and (map f astream)
would be two very different things, and without declared types you
won't know from looking at the code what you are dealing with. That's
tricky. Also, as discussed earlier, when transitioning from streams to
the easier-to-use seqs there will need to be a once-only explicit
conversion. There are also issues about being able to partition by
stream(able)/seq(able) - will nothing be both?
Anyway, these are just details, the design is pretty well along (e.g.
the stream system will be queue and timeout savvy) and I don't see any
impediments other than needing to cut a release soon :)
Rich
Okay, so you've got one abstraction (seq) for persistent collections
(and therefore caches to guarantee persistence), and one abstraction
(streams) that is for ephemeral sources, and you want to build a
barrier between them (so the user has to explicitly call stream-seq to
make the conversion).
But don't forget that there are collections which are persistent (and
therefore, it would be intuitive to use the first/rest sequence
abstraction with them), yet too big to cache (e.g., sequence of all
permutations of an 11-item list), and/or so cheap to compute that it
makes no sense to cache (e.g., (range 100)). I am particularly
interested to see what happens with these "persistent streams" for
which a sequence abstraction DOES make sense. Perhaps one avenue is
to annotate certain streams as being from non-ephemeral sources, and
therefore they can automatically be passed to seq-based functions, and
remain uncached through various transformations like
rest/map/filter/etc.
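In the meantime, one workaround for such collections is to hold on to
the pure generating function rather than a realized seq, rebuilding
the sequence fresh for each traversal so that nothing ever needs to
stay cached (squares here is just an illustrative example):

```clojure
;; For a pure but too-big-to-cache collection, keep the generating
;; function and call it per traversal; each call returns a fresh lazy
;; seq, and nothing from earlier traversals is retained.
(defn squares [] (map #(* % %) (range)))

(println (take 5 (squares)))     ; → (0 1 4 9 16)
(println (nth (squares) 100000)) ; a second, independent traversal
```

Since squares is referentially transparent, every traversal reads the
same values -- you get the persistence guarantee from purity instead
of from caching.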