Head retention example

tyaakow

unread,

Apr 13, 2013, 8:58:55 PM4/13/13

to clo...@googlegroups.com

I'm reading Clojure Programming book by O'Reilly..

I came over an example of head retention. First example retains reference to d (I presume), so it doesnt get garbage collected:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d) (count t)])
;= #<OutOfMemoryError java.lang.OutOfMemoryError: Java heap space>

While second example doesnt retain it, so it goes with no problem:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count t) (count d)])
;= [12 99999988]

What I don't get here is what exactly is retained in which case and why. If I try to return just [(count d)], like this:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d)])

- it seems to create same memory problem. Why is that?

Further, I recall reading that count in every case realizes/evaluates a sequence. So, i need that clarified.

If I try to return (count t) first, how is that faster/more memory efficient then if I dont return it at all? And what & why gets retained in which case?

Marko Topolnik

unread,

Apr 14, 2013, 2:13:51 PM4/14/13

to clo...@googlegroups.com

On Sunday, April 14, 2013 2:58:55 AM UTC+2, tyaakow wrote:

I'm reading Clojure Programming book by O'Reilly..

I came over an example of head retention. First example retains reference to d (I presume), so it doesnt get garbage collected:
(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d) (count t)])
;= #<OutOfMemoryError java.lang.OutOfMemoryError: Java heap space>

split-with gives you two windows into the same original sequence, which is (range 1e8). You are first realizing the tail part d, then the head part t. So you are retaining the head while realizing most of the sequence.

While second example doesnt retain it, so it goes with no problem:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count t) (count d)])
;= [12 99999988]

Here the computation happens in the opposite order and you do not retain the head while realizing d.

If I try to return just [(count d)], like this:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d)])

- it seems to create same memory problem. Why is that?

This happens because t is bound to the head of the sequence, even if you are not using it. This is probably a bug because the compiler should realize that t is not used and not bind it at all, or at least unbind it before evaluating (count d).

Further, I recall reading that count in every case realizes/evaluates a sequence. So, i need that clarified.

As count realizes one element after another, it doesn't on its own retain a reference to the past elements. However, if you have another reference to the head of the sequence, then you'll transitively hold a reference to each and every member of the sequence, causing the complete sequence to stay in memory at the same time. This is what the "lose your head" maxim is about.

-marko

tyaakow

unread,

Apr 14, 2013, 7:50:37 PM4/14/13

to clo...@googlegroups.com

Thank you for your response, Marko.
I want to clarify one more thing:

(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d) (count t)])

does this mean that while (count d) is realizing (range 1e8) seq, it becomes (also) realized within t, therefore
it doubles (range 1e8) in memory causing OOME while (count d) is still not finished?

Also, you say "As count realizes one element after another, it doesn't on its own retain a reference to the past elements."

Does this mean that, eg. in repl, when I do some (count xyz) and it realizes xyz, It will later need to be reevaluated (realized again) if I require xyz within repl (I presume that if I require xyz later within file, it wont be GC due to it and clojure will know it shouldnt be GC)

gerardc

unread,

Apr 17, 2013, 11:28:11 AM4/17/13

to clo...@googlegroups.com

Great Q&A on this, thanks Marko!

Marko Topolnik

unread,

Apr 17, 2013, 4:53:26 PM4/17/13

to clo...@googlegroups.com

On Monday, April 15, 2013 1:50:37 AM UTC+2, tyaakow wrote:

Thank you for your response, Marko.
I want to clarify one more thing:
(let [[t d] (split-with #(< % 12) (range 1e8))]
    [(count d) (count t)])
does this mean that while (count d) is realizing (range 1e8) seq, it becomes (also) realized within t, therefore
it doubles (range 1e8) in memory causing OOME while (count d) is still not finished?

There is no doubling: t and d share the same underlying lazy sequence and will refer to the same objects. The trouble is only that you force the evaluation of (count d) while (count t) still waits to be evaluated, so t must definitely stay bound to the head of the shared sequence.

Also, you say "As count realizes one element after another, it doesn't on its own retain a reference to the past elements."

Does this mean that, eg. in repl, when I do some (count xyz) and it realizes xyz, It will later need to be reevaluated (realized again) if I require xyz within repl (I presume that if I require xyz later within file, it wont be GC due to it and clojure will know it shouldnt be GC)

Be careful to observe that I say "doesn't on its own retain a reference to the past elements". If you have xyz bound to the head of your sequence, it will force the entire sequence to stay in memory for as long as xyz is within scope (if it's a local) or indefinitely (if it's a global def'd var). Generally, a lazy sequence never gets un-realized once it got realized---the only option is for it to disappear entirely (turn into garbage).

-marko

Michał Marczyk

unread,

Apr 17, 2013, 6:01:30 PM4/17/13

to clo...@googlegroups.com

Note that the problem is not that t needs to hang around; it's that t
holds a lazy sequence which hangs around in unrealized state. That
lazy sequence internally holds a thunk -- a nullary function --
capable of producing the actual sequence elements on request. It is
this thunk that holds a reference to the underlying huge sequence.
Once t is realized, the actual sequence gets cached and the thunk
becomes eligible for GC (the field holding it is set to null). If it
then needs to stay around for some other purpose, that is no problem:

user=> (let [[t d] (split-with #(< % 12) (range 1e8))] [(count t)

(count d) (count t)])

[12 99999988 12]

(Or I suppose you could return [(count d) (count t)], but (dorun t)
before that.)

Also, just to be explicit about this, calling (let [x
(produce-huge-seq)] (count x)) is not a problem, because x gets
cleared prior to control being handed off to count.

I've also discussed the details of what's going on on SO, which is
where I first noticed this question:

http://stackoverflow.com/questions/15994316/clojure-head-retention

Cheers,
Michał

> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your
> first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+u...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Tonino Jankov

unread,

Apr 20, 2013, 5:33:37 PM4/20/13

to clo...@googlegroups.com

Marko, you say "There is no doubling: t and d share the same underlying lazy sequence and will refer to the same objects. The trouble is only that you force the evaluation of (count d) while (count t) still waits to be evaluated, so t must definitely stay bound to the head of the shared sequence.".

But if there is no doubling, and single lazy sequence is in the memory in both cases, how does then memory have problem with one case and not with the other?

If both t and d refer to the same (realized) object in memory.

In both cases, to spit out t or d, the program must have it at one point in its memory.

So what spends the EXTRA, critical, OOME memory in (count d) (count t) case?

Or does it get instantly garbaged the moment it gets realized in (count t) (count d) case?

Anyway, thanks for the exhaustive discussion, Marko & Michal

Tonino Jankov

unread,

Apr 20, 2013, 5:41:48 PM4/20/13

to clo...@googlegroups.com

I mean, I think that in both cases the original sequence at one point in time must be, entirely realized, in memory.

And if there is no doubling of it in critical case, what is critical?

If in (count t) (count d) - non.problematic- case original sequence also, at one point, is, actually, in its entirety present in memory, it means that memory can handle the whole collection.

Maybe my questions sound a bit dubious, but anyway, I'm a bit sold out on this lisp, so I want to get it right.

Tonino Jankov

unread,

Apr 20, 2013, 5:48:01 PM4/20/13

to clo...@googlegroups.com

I think what Michal is saying is that in "good" case, the original sequence is cleared instantly upon being realized and in OOME case it hangs around, so the issue is not the quantity of memory occupied by it, but also the length of time interval it occupies the memory (in OOME case it stays in memory for prolonged time, parallel to program running its thing).

Did I get it right?

Michał Marczyk

unread,

Apr 20, 2013, 6:51:50 PM4/20/13

to clo...@googlegroups.com

On 20 April 2013 23:41, Tonino Jankov <tya...@gmail.com> wrote:

> I mean, I think that in both cases the original sequence at one point in
> time must be, entirely realized, in memory.

Well no, it doesn't.

The original sequence is lazy and chunked, so it looks like a chain of
links holding 32 elements each. It so happens that here it is iterated
over in a chunk-oblivious manner, so it's not terribly inaccurate to
simply think about it as a singly linked list. Calculating the length
of such a list involves walking along it while keeping a running
counter; clearly storing a reference to the head of the list
throughout the process is not necessary, and indeed Clojure doesn't do
it.

Thus in the OOME-free case, the reference to the original sequence is
thrown out almost immediately, followed by a reference to its "rest"
part, followed by the reference to the "rest" of that etc. The
throwing out happens inside drop-while at first and then inside the
clojure.lang.RT.countFrom method.

The key detail here is the way in which all references to d held by
methods in the chain leading up to the call to countFrom are cleared
before control is handed of to countFrom. The trick involved is known
as "locals clearing"; I've hinted at how it works in the SO answer,
see also the methods relevant here -- clojure.lang.RT.count,
clojure.lang.Util.ret1, clojure.lang.RT.countFrom.

A further clarification: t and d refer to two different lazy
sequences, which are constructed by applying different transformations
to a third sequence, which we have been referring to as "the original
sequence". This is the huge sequence which doesn't fit in available
memory. As it happens, while d is not the same object as the original
sequence, it is a subsequence of the original sequence (from where the
split-with predicate fails to the end), so it does share structure
with it, so there is no "doubling".

So, as mentioned previously, the key difference between the working
and the non-working version is in when the reference to the original
sequence hiding inside t gets cleared, as (count d) by itself doesn't
require a live reference to either the original sequence or even d
itself.

Cheers,
Michał

Michał Marczyk

unread,

Apr 20, 2013, 7:11:55 PM4/20/13

to clo...@googlegroups.com

PS. Of course the other key detail is that lazy sequences are realized
little by little (1 element at a time or up to 32, depending on
whether they're chunked or not; the original sequence here is chunked,
so the arrays underlying the chunks will be filled 32 elements at a
time, but then the iteration happens a step at a time, with separate
wrappers around the arrays allocated at every step). Thus at any given
point during the execution of (count d), what's stored in memory is
the current value of the counter, a reference to some subsequence of d
and, in the subsequence, a handful of actual elements and a function
which can be called to generate more.

tyaakow

unread,

Apr 22, 2013, 6:50:51 PM4/22/13

to clo...@googlegroups.com

Thank you for the exhaustive explanation, Michal.

xumingmingv

unread,

Apr 23, 2013, 3:55:12 AM4/23/13

to clo...@googlegroups.com, xumingmingv

Thanks a lot Michal!

xumingmingv

unread,

Apr 23, 2013, 7:37:15 AM4/23/13

to clo...@googlegroups.com, xumingmingv

Michal, do you know any resources/links that introduce the internals of local-clearing? or The Clojure source files which implemented this technique?

Colin Fleming

unread,

Apr 23, 2013, 9:23:12 PM4/23/13

to clo...@googlegroups.com, xumingmingv

I'm pretty sure the implementation of locals clearing is all in Compiler.java, in the LocalsBinding class. See canBeCleared in that class.

xumingmingv

unread,

Apr 23, 2013, 9:37:00 PM4/23/13

to Colin Fleming, xumingmingv, clo...@googlegroups.com

Thanks Colin!

Reply all

Reply to author

Forward