Chunking is making my life more difficult.

255 views
Skip to first unread message

ehanneken

unread,
Dec 31, 2010, 12:25:15 AM12/31/10
to Clojure
I spent a long time debugging some Clojure code yesterday. The
essence of it looked similar to this:

(defn items []
(mapcat expensive-function (range 0 4000 100)))

. . . (take 5 (items)) . . .

expensive-function is a function that issues an HTTP GET to retrieve a
vector of data. Since range's docstring says it returns a lazy
sequence, and since mapcat is defined in terms of map and concat,
which are both supposed to return lazy sequences, I expected (take 5
(items)) to cause only one HTTP GET. In reality, it causes 32 GETs.
That's kind of costly in time and space, considering I merely wanted
the first 5 of the 100 items returned in the response to the first
GET.

This behavior was baffling to me at first, but after some research I
found section 12.3 of _The Joy of Clojure_, which mentions that ever
since Clojure 1.1 some functions (such as range) which are advertised
as lazy are actually moderately eager, realizing chunks of up to 32
elements at a time.

I retrieved the release notes for Clojure 1.1. In the section about
chunking, it says, "Please share any use cases where chunked
processing has resulted in behavioral differences that matter to you
on the Clojure Google group." That's why I posted this.

Ken Wesson

unread,
Dec 31, 2010, 12:48:08 AM12/31/10
to clo...@googlegroups.com
On Fri, Dec 31, 2010 at 12:25 AM, ehanneken <ehan...@pobox.com> wrote:
> I spent a long time debugging some Clojure code yesterday.  The
> essence of it looked similar to this:
>
> (defn items []
>  (mapcat expensive-function (range 0 4000 100)))
>
> . . . (take 5 (items)) . . .
>
> expensive-function is a function that issues an HTTP GET to retrieve a
> vector of data.  Since range's docstring says it returns a lazy
> sequence, and since mapcat is defined in terms of map and concat,
> which are both supposed to return lazy sequences, I expected (take 5
> (items)) to cause only one HTTP GET.  In reality, it causes 32 GETs.
> That's kind of costly in time and space, considering I merely wanted
> the first 5 of the 100 items returned in the response to the first
> GET.
>
> This behavior was baffling to me at first, but after some research I
> found section 12.3 of _The Joy of Clojure_, which mentions that ever
> since Clojure 1.1 some functions (such as range) which are advertised
> as lazy are actually moderately eager, realizing chunks of up to 32
> elements at a time.

Interesting. I find it odd, though, that that would result in
expensive-function being called 32 times. I'd expect mapcat to cause
it to be called once; so you'd realize the first 32 elements of (range
0 4000 100) and consume one, and realize the first 100 elements of
(mapcat expensive-function (range 0 4000 100)) and consume five.

Is mapcat also semi-eager, then?

Chas Emerick

unread,
Dec 31, 2010, 1:49:05 AM12/31/10
to clo...@googlegroups.com
On Fri, Dec 31, 2010 at 12:25 AM, ehanneken <ehan...@pobox.com> wrote:
> I spent a long time debugging some Clojure code yesterday. The
> essence of it looked similar to this:
>
> (defn items []
> (mapcat expensive-function (range 0 4000 100)))
>
> . . . (take 5 (items)) . . .
>
> expensive-function is a function that issues an HTTP GET to retrieve a
> vector of data. Since range's docstring says it returns a lazy
> sequence, and since mapcat is defined in terms of map and concat,
> which are both supposed to return lazy sequences, I expected (take 5
> (items)) to cause only one HTTP GET. In reality, it causes 32 GETs.
> That's kind of costly in time and space, considering I merely wanted
> the first 5 of the 100 items returned in the response to the first
> GET.
>
> This behavior was baffling to me at first, but after some research I
> found section 12.3 of _The Joy of Clojure_, which mentions that ever
> since Clojure 1.1 some functions (such as range) which are advertised
> as lazy are actually moderately eager, realizing chunks of up to 32
> elements at a time.


Chunking is dependent upon the type of seq being traversed, which is in turn dependent upon the type of collection underlying the seq. Ranges always produce chunked seqs, as do non-empty vectors for example. If chunking is a concern, you can always fall back to seqs "grounded" in lists, which are always unchunked (and therefore yield one-at-a-time behaviour with e.g. map):

=> (->> (range 50)
(map println)
first)
0
1
2
...
31
nil

vs…

=> (->> (range 50)
(mapcat list)
(map println)
first)
0
nil

I'm not sure that this means that map, concat, etc are eager in practical terms. I view the chunking in much the same light as transients vis á vis immutability of persistent data structures -- the result is a relative perf improvement that doesn't impact semantics in the vast majority of cases. In my experience, chunking is only detrimental when mapping side-effecting (usually IO-related) functions across collections, as you're doing. Given that, using a (mapcat list) interstitial to get unchunked seqs is inconsequential w.r.t. perf, etc. I'd be interested in hearing any different perspectives.

FYI, chunked-seq? can be used to determine if a seq supports chunking or not.

- Chas

ehanneken

unread,
Dec 31, 2010, 10:52:18 AM12/31/10
to Clojure
On Dec 31, 12:48 am, Ken Wesson <kwess...@gmail.com> wrote:
> Is mapcat also semi-eager, then?

I guess so. The Clojure 1.1 release notes also say, "Some of the
sequence processing functions (like map and
filter) are now chunk-aware and leverage this efficiency." I should
have mentioned that.

Steven E. Harris

unread,
Dec 31, 2010, 11:53:31 AM12/31/10
to clo...@googlegroups.com
ehanneken <ehan...@pobox.com> writes:

> I spent a long time debugging some Clojure code yesterday. The
> essence of it looked similar to this:
>
> (defn items []
> (mapcat expensive-function (range 0 4000 100)))
>
> . . . (take 5 (items)) . . .

I tried to distill the problem down by defining a non-chunking range
function (a simple variant), then working outward from there to see what
else is evaluating a surprising number of times.

,----
| (defn non-chunked-range
| [start end]
| (prn (format "In non-chunked-range with %d and %d." start end))
| (lazy-seq
| (when-not (= start end)
| (cons start (non-chunked-range (inc start) end)))))
`----

Note that `mapcat` is defined in terms of `map` and `concat`. First
let's confirm that `map` is not eager:

,----
| ;; Draws one, and evaluates lazy sequence function twice:
| (take 1
| (map #(list %)
| (non-chunked-range 0 10)))
`----

Experimenting with the argument to `take` shows that the lazy sequence
function is evaluated as expected: a number of times equal to the
argument plus one for the terminal case (n + 1).

Now add `concat` into the mix to make sure it's not eager:

,----
| ;; Draws two, and evaluates lazy sequence function three times:
| (concat (take 2 (non-chunked-range 0 10)))
`----

That works as expected. Now add `apply` to `concat` as `mapcat` does to
flatten the input lists:

,----
| ;; Draws one, and evaluates lazy sequence five times:
| (take 1
| (apply concat
| (map #(list %)
| (non-chunked-range 0 10))))
`----

Whoah! Where did the extra three evaluations of the lazy sequence
function come from? Note that this one calls on the function /five/
times.

Here is the mapping of the argument to `take` and the number of times
the function is called:

take calls
==== =====
0 5
1 5
2 5
3 5
4 6
5 7
...
n n + 2

I read the source for `concat`, but I don't see what it's doing to force
the extra evaluations both below four arguments and the extra one
(yielding n + 2) with four or more arguments. What's responsible for
this difference in behavior?

--
Steven E. Harris

ehanneken

unread,
Dec 31, 2010, 12:29:40 PM12/31/10
to Clojure
Chas,

Thanks for your help. However, modifying the code to use mapcat
instead of (map println) seems to cause some chunking:

(defn tenify [n]
(do
(println \" n \")
[n n n n n n n n n n]))

=> (->> (range 50)
(mapcat list)
(mapcat tenify)
first)
" 0 "
" 1 "
" 2 "
" 3 "
0

And indeed, when I modify my code I see four HTTP GETs being issued.
Four is better than 32, but still.

ehanneken

unread,
Jan 3, 2011, 11:30:37 AM1/3/11
to Clojure
> Four is better than 32, but still.

I found the explanation for this on stackoverflow.com. Stuart Sierra
wrote,

> This is due to the definition of =, which, when given a sequence of arguments, forces the first 4:

>> (defn =
>> ;; ... other arities ...
>> ([x y & more]
>> (if (= x y)
>> (if (next more)
>> (recur y (first more) (next more))
>> (= y (first more)))
>> false)))

http://stackoverflow.com/questions/3407876

He also defined an unchunk function to prevent chunking. For
completeness, I should also mention that Michael Fogus has written a
different function for the same purpose:
http://blog.fogus.me/2010/01/22/de-chunkifying-sequences-in-clojure/
Reply all
Reply to author
Forward
0 new messages