Problem with calling count on large lazy sequence?

202 views
Skip to first unread message

RZez...@gmail.com

unread,
Dec 7, 2008, 3:55:22 AM12/7/08
to Clojure
I already posted this in another thread, but I think maybe it should
have it's own now in light of new information.

If you try to call count on a large lazy sequence you will get an
OutOfMemoryError. For example I can evaluate the following
expressions on a million line CSV file to reproduce the error.

(use 'clojure.contrib.duck-streams)
(count (line-seq (reader "big.csv")))

Chouser and I spent some time looking at it. The problem is that not
only is count a member function (meaning the implicit variable "this"
will hold onto the head), but also the RT.count has local variable o
to hold onto the head.

Chouser came up with the idea that you could pass the seq in an array
allowing you to derefence the pointer to the head of the seq after
you've passed it along to a static count method in ASeq.

This solution worked for me and you can see my patch at the following
url.

http://paste.lisp.org/display/71744

Keep in mind I just whipped this up to test Chouser's idea. I'm
completely new to the Clojure code base so this is a naive fix at
best. I noticed that quite a few classes seem to fall under the
IPersistentCollection umbrella so I'm sure the real patch is bigger
than this. Also the array only needs to be of size 1, obviously.

OK, I'm falling asleep at the computer, I'm sure someone else can roll
with this.

Meikel Brandmeyer

unread,
Dec 7, 2008, 5:20:37 AM12/7/08
to clo...@googlegroups.com
Hi,

Am 07.12.2008 um 09:55 schrieb rzez...@gmail.com:
> This solution worked for me and you can see my patch at the following
> url.
>
> http://paste.lisp.org/display/71744

This is of course no solution to the problem, but if you don't want
to start patching Clojure you can have a workaround for seqs:
(reduce (fn [x _] (inc x)) 0 the-seq).

> Keep in mind I just whipped this up to test Chouser's idea. I'm
> completely new to the Clojure code base so this is a naive fix at
> best. I noticed that quite a few classes seem to fall under the
> IPersistentCollection umbrella so I'm sure the real patch is bigger
> than this. Also the array only needs to be of size 1, obviously.

I'm not sure, that this is a general problem. Vectors, maps, sets
and lists have all an O(1) count. So I presume the have this stored
in some instance variable somewhere.

For general seqs however, this is not true. Eg. a line seq doesn't
know the number of lines in advance. A seq originating from a
collection however, could provide also an O(1) count.

Or am I mistaken?

Sincerely
Meikel

RZez...@gmail.com

unread,
Dec 7, 2008, 3:41:06 PM12/7/08
to Clojure


On Dec 7, 5:20 am, Meikel Brandmeyer <m...@kotka.de> wrote:
> Hi,
>
>  smime.p7s
> 5KViewDownload

You are correct, this is not a general problem. Most things (ie non-
lazy things) have a member field that keeps their size count so that
the operation is always O(1). In fact, this is typically a public
field that can be accessed rather than a method. In lazy land this
obviously will not work because the only way to get the count is to
fully realize the sequence and the whole point of being lazy is to not
preemptively do things like tally the size. Obviously you understand
all this, but I thought I'd reiterate it for my own purposes.

However, when we do decide to fully realize the sequence I think we
should be able to do it without Clojure implicitly keeping a reference
(ie a Java reference) to the head. I would think I should be able to
call count, or map, reduce, or whatever without incurring the cost of
keeping the whole sequence in memory at one time. It should be just
like a read-only Iterator. Heck, I should be able to call count on a
centillion sized sequence if I want (except for the fact that the
result will be incorrect since count returns an int).

As for solutions, Chouser's first stab at it was to do the count of
ASeq right in the RT.count method. The more I think about it the more
I like that solution over the one I posted before. Why does the count
have to be a member function of the sequence? However, it seems that
ASeq count is called from other areas besides RT.count, and I'm not
sure what those are.

I posted another patch (below) that uses the ASeq's Iterator. I still
think this is a shallow fix but I think I like it better than what I
posted before. My ignorance of the code base, and the fact that I'm
not Joshua Block impede my ability to come up with anything better
just yet. I'm hoping some more people will read this thread and offer
their input.

http://paste.lisp.org/display/71773

-Ryan
Reply all
Reply to author
Forward
0 new messages