Memory Consumption of Large Sequences

Memory Consumption of Large Sequences Keith Bennett 2/2/09 10:06 AM
All -

I'm curious to know how to minimize memory consumption in Clojure for
large lists.  I did the following test in repl:

(def d ())
(dotimes [_ 100000] (def d (cons "x" d)))

Then, I used VisualVM, an awesome free profiling tool
(https://visualvm.dev.java.net/), to examine the results.  It indicated
that the number of clojure.lang.PersistentList instances increased by 100,000.
Each instance appeared to consume 48 bytes (not including the actual
value, but only its reference, I presume).  I don't think any were
eligible for garbage collection, because I initiated gc several times,
and they were not removed.  (I know that gc() is not deterministic,
but am pretty sure that they would have been removed. Feel free to
correct me.)

Thinking that maybe the special functions like map did some magic to
save memory, I tried this:

(def a (map #(* 2 %) (range 100000)))

The result was 100,000 clojure.lang.LazyCons objects, each consuming
52 bytes.

Are there alternate strategies for building a large sequence that
would consume less memory per element?

Thanks,
Keith
Re: Memory Consumption of Large Sequences Paul Barry 2/2/09 10:41 AM
Keith,

I think what you have done, at least at the JVM level, is create 100,000 lists, with basically each list containing an element and a pointer to the next element.  Because one list points to the next, none of them can be garbage collected.  It seems to me that this would be roughly equivalent to one list with 100,000 elements, in terms of memory usage.  In your map example, the memory usage is 52 * 100,000 bytes, so that's about 5MB.  How much memory is used by the equivalent Java code?

List l = new ArrayList();
for(int i = 0; i < 100000; i++) {
    l.add("x");
}
Re: Memory Consumption of Large Sequences Christian Vest Hansen 2/2/09 10:44 AM
The trick with these listish things is to not calculate the tail until
you need it, and to throw away the head when you're done with it.

On Mon, Feb 2, 2009 at 7:06 PM, Keith Bennett <keithr...@gmail.com> wrote:
>
> All -
>
> I'm curious to know how to minimize memory consumption in Clojure for
> large lists.  I did the following test in repl:
>
> (def d ())
> (dotimes [_ 100000] (def d (cons "x" d)))

Here, you are keeping the head of the list in "d" and doing nothing
but prepending to the "d" list and redefining "d" as the new, longer list.

Instead, try defining "d" like this:

(def d (take 100000 (repeat "x")))

And take a gander at how much memory that consumes.

However, the ideal thing here is not to define "d" as a list at all.
Instead, make a function that produces the value of "d" on demand - you
don't actually need "d", you need its value.  All "d" does is
reference the head of your list, thereby preventing it from being
garbage collected.
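For example (a sketch; the 100,000 and the counting consumer are just
illustrative):

```clojure
;; A function that produces the seq on demand; no var pins the head:
(defn xs []
  (take 100000 (repeat "x")))

;; A consumer can walk the entire seq while holding only the current
;; cell, so everything behind the cursor is eligible for collection:
(reduce (fn [n _] (inc n)) 0 (xs))
;; => 100000
```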

>
> Then, I used VisualVM, an awesome free profiling tool (https://
> visualvm.dev.java.net/) to examine the results.  It indicated that the
> number of clojure.lang.PersistentList instances increased by 100,000.
> Each instance appeared to consume 48 bytes (not including the actual
> value, but only its reference, I presume).  I don't think any were
> eligible for garbage collection, because I initiated gc several times,
> and they were not removed.  (I know that gc() is not deterministic,
> but am pretty sure that they would have been removed. Feel free to
> correct me.)
>
> Thinking that maybe the special functions like map did some magic to
> save memory, I tried this:
>
> (def a (map #(* 2 %) (range 100000)))
>
> The result was 100,000 clojure.lang.LazyCons objects, each of which
> consuming 52 bytes.

But not until the list was realized. However, your "a" reference will
prevent this list from being garbage collected as well.

>
> Are there alternate strategies for building a large sequence that
> would consume less memory per element?

Don't keep the head around on any list (well, seq really) unless you
really mean/need it.

>
> Thanks,
> Keith



--
Venlig hilsen / Kind regards,
Christian Vest Hansen.
Re: Memory Consumption of Large Sequences Keith Bennett 2/2/09 1:48 PM
Paul -

Clojure definitely has its benefits, but in terms of memory footprint,
Java appears to be *much* more economical, unless elements can be
discarded shortly after use as Christian describes, in which case Java
is much *less* economical.

In a Java ArrayList, only a single ArrayList object is used to store
all n elements added.  The objects are stored in an internal Object[].
In the test, the 100,000 elements resulted in an increase of
about 1,000,000 bytes for the object array, a cost of 10 bytes each.
(Actually, probably less than 10, since ArrayList.ensureCapacity()
increases the Object [] by 50% each time it needs increasing, so there
is probably a significant amount of unused capacity in that 1,000,000
bytes.)

In contrast, Clojure creates a new holder object for each item added,
at a cost of 48 bytes per item for the PersistentList objects, and 52
bytes per item for the lazy cons.

So the overhead of Clojure appears to be 5 times the overhead of Java,
for lists that need to stay around in memory for more than a single
operation.

I don't mean to imply that Java is better than Clojure; in the vast
majority of cases this extra memory consumption will not be a problem,
and I am really enjoying working with Clojure, not only for the way
functional programming is stretching my mind, but also for the
productivity improvements I anticipate getting from adding this tool
to my toolkit.  However, I want to be objective and scientific about
the relative costs of the different approaches so that decisions I
make are made in an informed and objective way.

The Java code I used to test this is at http://www.pastie.org/377716.

- Keith


On Feb 2, 1:41 pm, Paul Barry <pauljbar...@gmail.com> wrote:
> Keith,
>
> I think what you have done, at least at the JVM level, is create 100,000
> lists, with basically each list containing an element and a pointer to the
> next element.  Because one list points to the next, none of them can be
> garbage collected.  It seems to me that this would be roughly equivalent to
> one list with 100,000 elements, in terms of memory usage.  In your map
> example, the memory usage is 52 * 100,000 bytes, so that's about 5MB.  How
> much memory is used by the equivalent Java code?
>
> List l = new ArrayList();
> for(int i = 0; i < 100000; i++) {
>     l.add("x");
>
> }
>
> On Mon, Feb 2, 2009 at 1:06 PM, Keith Bennett <keithrbenn...@gmail.com>wrote:
Re: Memory Consumption of Large Sequences Chouser 2/2/09 3:11 PM
On Mon, Feb 2, 2009 at 4:48 PM, Keith Bennett <keithr...@gmail.com> wrote:
>
> Clojure definitely has its benefits, but in terms of memory footprint,
> Java appears to be *much* more economical

It's probably worth being careful to separate the different parts of
Java and Clojure.  Clojure code can use most Java data structures
quite comfortably, and Java code can use Clojure data structures
without too much inelegance.  Beyond that, each has benefits and
drawbacks apart from the data structures it can use.

> In a Java ArrayList, only a single ArrayList object is used to store
> all n elements added.  The objects are stored in an internal Object
> [].

If you really want a large mutable collection in your Clojure code,
you can use an ArrayList quite easily:

(def a1 (java.util.ArrayList.))
(dotimes [_ 100000] (.add a1 "x"))

user=> (take 10 a1)
("x" "x" "x" "x" "x" "x" "x" "x" "x" "x")

Here Clojure is adding essentially no memory overhead at all, beyond
what the ArrayList itself brings, and the resulting object plays quite
well with many common Clojure idioms.

But mutability has its own costs, and Clojure provides several
options to increase your chances of finding an immutable solution that
also fits your memory and CPU performance requirements.

For example, chances are you'll want to be accessing a large ordered
collection like this by index.  In this case the PersistentList you
used in your original example is the wrong choice anyway, since it's
O(n) for lookups.  Instead, you might like one of the vector types:

(def v1 (vec (replicate 100000 "x")))

Take a look at that with your profiling tool and you'll see that it's
taking very little memory.  In fact, the data is stored in a single
Java Object[] array.

user=> (class v1)
clojure.lang.LazilyPersistentVector

But unlike ArrayList, this is immutable, so you can get a new object
with one element changed:

(def v2 (assoc v1 5 :foo))

user=> (take 10 v2)
("x" "x" "x" "x" "x" :foo "x" "x" "x" "x")

When you do this, you'll see your memory jump a bit, as you now have
both the original LazilyPersistentVector and the new PersistentVector:

user=> (class v2)
clojure.lang.PersistentVector

PersistentVectors have structural sharing, so each subsequent "copy"
you make with elements changed will cost much less memory than a new
full copy of the whole vector would.

I feel like I'm rambling now, so let me tie this off.  Depending on
your actual use case, vectors may work well.  If you don't need the
whole collection at once, perhaps a lazy seq that simply promises the
values as they're demanded would work.  One of the Map types might be
good if your collection is likely to be sparse.  Or if none of these
immutable options will do, there are always ArrayList or even raw Java
arrays at hand.

You don't need to give up macros and the REPL just because a
100000-element cons-list feels bloated. :-)

--Chouser

Re: Memory Consumption of Large Sequences Mark H. 2/2/09 3:23 PM
On Feb 2, 10:06 am, Keith Bennett <keithrbenn...@gmail.com> wrote:
> I'm curious to know how to minimize memory consumption in Clojure for
> large lists.  I did the following test in repl:
>
> (def d ())
> (dotimes [_ 100000] (def d (cons "x" d)))

Let me translate this into pseudocode for you:

Make d into a global reference wrapped with a mutex
Set d to point to an empty list
for i = 1 to 100000:
    If the symbol d has not yet been defined previously:
        Create it
    Within a critical section protecting d:
        Replace the value of d with (cons "x" d)

Note in particular that

1. You're probably doing it wrong if you call "def" more than twice
for the same symbol.  "def" is NOT a replacement for SETF (in Common
Lisp) or "set!" (in Scheme).  The point of Clojure is not to use
destructive updates whenever possible.
2. If you keep the head of a sequence, as in "(def a (map ...))", you
have to store all elements of the sequence in memory.
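For instance, the repeated-def loop can be written as a single
expression that builds the list once (a sketch):

```clojure
;; Build the whole list in one expression instead of re-def'ing d
;; 100,000 times:
(def d (reduce (fn [acc _] (cons "x" acc)) () (range 100000)))

;; Or, since every element is the same, don't build it eagerly at all:
(def d2 (take 100000 (repeat "x")))
```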

> Are there alternate strategies for building a large sequence that
> would consume less memory per element?

If you _build_ the sequence, you'll have to store the whole sequence.
If you construct the sequence lazily, only use one element of it at a
time, and don't keep that element anywhere, then you'll never need to
store the whole sequence.  But if you store the head of the sequence
(e.g., with a def), then you'll have to store all elements of the
sequence that you visit.  If you print the whole sequence (as you
would at the REPL), then you visit all the elements of the sequence.
Therefore, in keeping the head of the sequence and printing it, you're
storing the whole sequence in memory.  That's where all the list
objects (a.k.a. conses) come from.  The GC doesn't clean them up
because you still have references to all of them; they are still
reachable.  You could do GC a million times and they would all stay.
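To make the contrast concrete (a sketch):

```clojure
;; Keeping the head: every realized cell stays reachable through a,
;; so nothing can be collected while a is defined.
(def a (map #(* 2 %) (range 100000)))
(dorun a)

;; Not keeping the head: dorun walks the seq without retaining it,
;; so each cell can be collected as soon as the cursor moves past it.
(dorun (map #(* 2 %) (range 100000)))
```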

That's how lazy seqs work.  It has nothing to do with Clojure and
everything to do with the language construct.  If you don't like it,
then check out "streams," which Rich is currently building.  Streams
are ephemeral as long as you don't put their elements into a seq.  Of
course that means you can only iterate over a stream once.  Again,
that has nothing to do with Clojure and everything to do with the
language construct.

mfh
Re: Memory Consumption of Large Sequences Luc Prefontaine 2/2/09 3:51 PM
Paul,

I can understand the concerns about memory footprint if you work in a restricted environment
(lack of physical memory or other system resources) or if your memory load is very high.

However, these concerns are less common these days, given the kind of hardware we can buy
compared to what we could get for the same money eight years ago.

Assembly languages or C appear much more economical than Java ... but nobody would
suggest using them as generic day-to-day tools.
It's more practical and efficient to use Java instead (even if sometimes I miss Cobol and Fortran :))).

I am not sure I would bother now with the "real cost" of a PersistentList versus an ArrayList.
If you look at the JVM and Java, Java 1.2 is a lot different internally from Java 5 or 6.
Performance-wise it's night and day.

In the near future Java, the JVM, and the hardware can all improve, so there are still performance gains
to expect in the bottom layers that will eventually benefit Clojure.

Just as we found out that keeping references to objects is bad in Java, we will learn the things
we should avoid when coding in Clojure (and maybe eventually write a cookbook about them :)))).
In the short term that should be enough to avoid most problems.

We are now facing some visible hardware limits. We need to change the way we design applications to adapt
to the twists the hardware industry is using to pack more power in the same space.
Clojure brings into play concepts that may be our salvation but we need to rethink the way we create software.

There's much work to do in the design area... I think we can postpone the data structure footprint issue
until we can assert whether the trade-off (memory versus programming power) of these representations is paying off.

What do you think ?

Luc
Re: Memory Consumption of Large Sequences Keith Bennett 2/2/09 5:32 PM
Luc -

It is I (Keith) who posed the original question.

I am just now learning Clojure, and for me, understanding what's going
on underneath the surface helps me understand how to use the language
properly.  As I said previously, the amount of memory consumed by a
list will very rarely be an issue.  However, it may *sometimes* be an
issue, and I'd like to know the costs and benefits of the various
approaches.  Raising the question resulted in an impressive outpouring
of much high quality information that greatly illuminated the issue.

I have a lot of work to do to learn how to think in functional
programming.  These kinds of discussions are very helpful.

Regards,
Keith
Re: Memory Consumption of Large Sequences Luc Prefontaine 2/2/09 7:58 PM
Paul, sorry for the mistake; these emails are a pain to follow sometimes.

Keith,

It's up to you if you prefer to slice the Clojure features one by one down to the bone marrow;
as for myself, I used a different approach to ramp up with Clojure ASAP.

The need to get down to implementation details came after 5 months of intensive use and delivering
some production-ready components in Clojure, not caring about the implementation except when hitting a bug
(two, if I remember, but very minor ones).  My code may not be top notch (still, I did three code reviews to optimize it),
but it's been running smoothly for two months now.

These 5 months allowed me to focus on getting some 'readable' Clojure code out.

I am presently looking at the implementation mainly because I am trying to closely integrate some framework with the Clojure core...
and that's why I am digging in the code these days.  Without extensive knowledge of how the features of Clojure
can be used, this exercise (digging in the code) looks inefficient to me.  There are so many things to get acquainted with at a higher
level...

But again, if the slicing approach suits you better, why not?  Just don't get lost in the details along the way :)))

Regards,
Luc
Re: Memory Consumption of Large Sequences Mark H. 2/2/09 8:27 PM
On Feb 2, 5:32 pm, Keith Bennett <keithrbenn...@gmail.com> wrote:
> I am just now learning Clojure, and for me, understanding what's going
> on underneath the surface helps me understand how to use the language
> properly.  As I said previously, the amount of memory consumed by a
> list will very rarely be an issue.  However, it may *sometimes* be an
> issue, and I'd like to know the costs and benefits of the various
> approaches.  Raising the question resulted in an impressive outpouring
> of much high quality information that greatly illuminated the issue.
>
> I have a lot of work to do to learn how to think in functional
> programming.  These kinds of discussions are very helpful.

Hope I didn't offend with my rather sharp reply -- I meant to be clear
but I realized later that it could be taken as sounding annoyed!  The
other folks did a great job of explaining the situation with a more
friendly tone :-)

mfh
Re: Memory Consumption of Large Sequences Mark H. 2/2/09 8:32 PM
On Feb 2, 5:32 pm, Keith Bennett <keithrbenn...@gmail.com> wrote:
> I have a lot of work to do to learn how to think in functional
> programming.  These kinds of discussions are very helpful.

A picky point -- lazy sequences aren't really a functional programming
thing (although restricting side effects makes it easier to reason
about them).  Functional programming is about the language helping you
to restrict side effects.  Lazy sequences are one way of making that
efficient (so that you can avoid constructing temporary collections).
However, you don't need lazy sequences to have a functional language
(Scheme doesn't, and at least one modern Scheme implementation is
"functional" enough to make cons cells immutable by default), nor does
the language need to be functional in order to have lazy sequences
(e.g., Common Lisp's SERIES package, which is described in "Common
Lisp the Language" (2nd edition)).

mfh
Re: Memory Consumption of Large Sequences Keith Bennett 2/3/09 5:54 AM
Mark -

Not a problem.  I didn't take it that way at all.

- Keith