beginner clojure question: OutOfMemory error processing (slightly) large data file


Avram

unread,
Mar 22, 2011, 4:00:49 PM3/22/11
to Clojure
Hi,

I (still) consider myself new to Clojure. I am trying to read a 37 MB
file that will grow by 500 KB every 2 days. I don't consider this
input large enough to merit using Hadoop, and I'd like to process it
in Clojure in an efficient, speedy, and idiomatic way.

I simply want something akin to a transpose, where the input looks
like this:
( [ a1 b1 c1 d1 ] [ a2 b2 c2 d2 ] [ a3 b3 c3 d3 ])

…and the output looks like this:

[ [ a1 a2 a3 ] [ b1 b2 b3 ] [ c1 c2 c3 ] [ d1 d2 d3 ] ]

Gleaning what I can from various sources and cobbling them together, I
have the following below, which works for small input but not for the
intended file sizes (and larger) I'd like it to be able to handle.

(use 'clojure.contrib.io)
(require 'clojure.string)

(def tabfn "/Users/avram/data/testdata.tab")

(defn is-comment?
"Checks if argument is a comment (i.e. starts with a '#').
Returns: boolean."
[line]
(= \# (first line)))

(defn data-lines
"Returns data lines in file (i.e. all lines that do not start with
'#')
Returns: sequence containing data lines"
[filename]
(drop-while is-comment? (line-seq (reader filename))))

(defn parsed-data-lines
[filename]
(map #(clojure.string/split % #"\t") (data-lines filename)))

(def signals (vec (apply map vector (parsed-data-lines tabfn))))


user=> (def signals (vec (apply map vector (parsed-data-lines
tabfn))))
java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:68)


How can I avoid the OutOfMemoryError?

Is there a Leiningen setting where I can increase the memory or is
there a more efficient way to achieve this?

Also, I'd prefer to read in gzip'd tab-delimited files instead of
uncompressed tab-delimited files. What is the idiomatic Clojure way
to do this?


Comments on improvements and criticisms welcome :)

Thanks,
Avram

Ken Wesson

unread,
Mar 22, 2011, 7:14:37 PM3/22/11
to clo...@googlegroups.com
On Tue, Mar 22, 2011 at 4:00 PM, Avram <aav...@me.com> wrote:
> Hi,
>
> I (still) consider myself new to Clojure.  I am trying to read a 37 MB
> file that will grow by 500 KB every 2 days. I don't consider this
> input large enough to merit using Hadoop, and I'd like to process it
> in Clojure in an efficient, speedy, and idiomatic way.
>
> I simply want something akin to a transpose, where the input looks
> like this:
> ( [ a1 b1 c1 d1 ] [ a2 b2 c2 d2 ] [ a3 b3 c3 d3 ])
>
> …and the output looks like this:
>
> [ [ a1 a2 a3 ] [ b1 b2 b3 ] [ c1 c2 c3 ] [ d1 d2 d3 ] ]
>
> Gleaning what I can from various sources and cobbling them together, I
> have the following below, which works for small input but not for the
> intended file sizes (and larger) I'd like it to be able to handle.

You'll need to avoid holding onto the head of your line-seq, which
means you'll need to make multiple passes over the data: one for the
as, one for the bs, and so on, with the output a lazy seq of lazy seqs.
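A sketch of what I mean, parameterized on a function that produces a fresh seq of parsed rows each time it's called (so e.g. `#(parsed-data-lines tabfn)`; the helper names here are just for illustration):

```clojure
;; `rows-fn` is a no-arg fn returning a fresh lazy seq of parsed rows.
;; Calling it once per column means every column gets its own pass over
;; the file, and no pass retains the head of another pass's line-seq.
(defn column
  "Lazily returns the nth field of every row produced by rows-fn."
  [rows-fn n]
  (map #(nth % n) (rows-fn)))

(defn transpose-lazily
  "Returns a lazy seq of n-cols lazy seqs, one per column."
  [rows-fn n-cols]
  (map #(column rows-fn %) (range n-cols)))
```

Each inner seq is only realized when you consume it, so memory use depends on how much of each column you hold onto, not on the file size.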

> (defn data-lines
>    "Returns data lines in file (i.e. all lines that do not start with
> '#')
>      Returns: sequence containing data lines"
>    [filename]
>    (drop-while is-comment? (line-seq (reader filename))))

The description doesn't match the function, unless it's guaranteed
that no line will start with # after the first line that doesn't do
so. You may want remove instead of drop-while here, or to change the
doc string.
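To illustrate the difference on a toy input (made-up lines, not from your actual file):

```clojure
;; `drop-while` stops skipping at the first data line; `remove` filters
;; comment lines wherever they occur.
(def lines ["# header" "a1\tb1" "# stray comment" "a2\tb2"])

(drop-while #(= \# (first %)) lines)
;; => ("a1\tb1" "# stray comment" "a2\tb2")

(remove #(= \# (first %)) lines)
;; => ("a1\tb1" "a2\tb2")
```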

> Also, I'd prefer to read in gzip'd tab-delimited files instead of
> uncompressed tab-delimited files.  What is the idiomatic clojure way
> to do this?

There are zip functions in the Java standard library. I don't know if
they can handle gzip, or just pkzip. In the worst case, you'd have no
library you could use. Even then, it could be done in at least two
ways.

1. Use Runtime/exec to call shell tools to gunzip the file to a
temporary file for processing.

2. Read up on the format at Wikipedia and implement gunzip in Clojure,
using byte arrays and whatever other tools you'd need to work with
binary data at a low level, and/or Java's ByteBuffer and related
classes.

Stuart Sierra

unread,
Mar 22, 2011, 7:15:47 PM3/22/11
to clo...@googlegroups.com
Hi Avram,

Assuming you're using the Sun/Oracle JDK, you can increase the size of the Java heap with the -Xmx command-line option.  For example:

    java -Xmx512m -cp clojure.jar:your-source-dir clojure.main

will run Java with a 512 MB heap.  This increases the amount of memory available to your program.  Obviously, you don't want the heap to be larger than the available RAM.

With Leiningen, you can add the :jvm-opts option in project.clj, as shown here: https://github.com/technomancy/leiningen/blob/master/sample.project.clj#L142
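In project.clj it looks something like this (the project name, version, and 512m value are just placeholders):

```clojure
(defproject myproject "0.1.0"
  :dependencies [[org.clojure/clojure "1.2.0"]]
  ;; Raise the max heap for JVMs that Leiningen launches:
  :jvm-opts ["-Xmx512m"])
```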

More generally, this line:

    (def signals (vec ...))

says that you want the entire result, as a vector, stored as the value of the Var `signals`.  That means your entire result data must fit in the Java heap.  For a 37 MB file, that's not unreasonable, but as soon as your file gets larger than your available RAM, you'll have to come up with an alternate approach.
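If it ever does, one alternate approach, sketched here on the assumption that you can emit results a row at a time, is to walk the lazy seq with doseq (which doesn't retain the head) and write each row out instead of collecting everything into one vector:

```clojure
(require '[clojure.java.io :as io])

;; A sketch: memory use stays flat no matter how large the input grows,
;; because each row is written and then becomes garbage.
(defn write-rows!
  "Writes each row in `rows` to `out-file`, one printed row per line."
  [rows out-file]
  (with-open [w (io/writer out-file)]
    (doseq [fields rows]
      (.write w (pr-str fields))
      (.write w "\n"))))

;; e.g. (write-rows! (parsed-data-lines tabfn) "/tmp/parsed.out")
```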

-Stuart Sierra

Stuart Sierra

unread,
Mar 22, 2011, 7:17:26 PM3/22/11
to clo...@googlegroups.com
Oh, and the standard JDK class java.util.zip.GZIPInputStream implements gzip decompression.
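From Clojure that could look something like this sketch (the helper name `gzip-line-seq` is just for illustration; note that the returned lazy seq holds an open reader, so consume it fully or close the reader yourself):

```clojure
(import '(java.util.zip GZIPInputStream)
        '(java.io FileInputStream InputStreamReader BufferedReader))

;; The only change from line-seq over a plain-text file is the
;; GZIPInputStream wrapped around the raw FileInputStream.
(defn gzip-line-seq
  "Returns a lazy seq of the lines in a gzip'd text file."
  [filename]
  (-> (FileInputStream. filename)
      (GZIPInputStream.)
      (InputStreamReader. "UTF-8")
      (BufferedReader.)
      line-seq))
```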

-Stuart Sierra

Avram

unread,
Mar 22, 2011, 7:37:02 PM3/22/11
to Clojure
Thanks, Ken.

> You'll need to avoid holding onto the head of your line-seq, which
> means you'll need to make multiple passes over the data, one for the
> as, one for the bs, and etc., with the output a lazy seq of lazy seqs.

Actually, it would be great to make separate, asynchronous passes for
the a's, the b's, the c's, and the d's. Any suggestions on how to
accomplish this?
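One hypothetical way to get concurrent passes (building on the same idea of a fresh `rows-fn` per pass, e.g. `#(parsed-data-lines tabfn)`; the helper name is made up):

```clojure
;; Fire off one future per column; each future makes its own full pass
;; over the rows, so the passes run concurrently and independently.
(defn async-columns
  "Returns a vector of futures, one per column, each yielding a vector
  of that column's values."
  [rows-fn n-cols]
  (mapv (fn [n] (future (mapv #(nth % n) (rows-fn))))
        (range n-cols)))

;; Later, deref to get the realized columns:
;; (map deref (async-columns #(parsed-data-lines tabfn) 4))
```

Caveat: each future holds its whole column in memory once realized, so this trades laziness for concurrency.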

> There are zip functions in the Java standard library.

I was hoping to avoid direct calls to Java and stay in clojure, if
possible… (learning clojure is hard enough, without adding java to the
mix ;)
~A

Avram

unread,
Mar 22, 2011, 7:41:01 PM3/22/11
to Clojure
Thanks, Stuart.

> With Leiningen, you can add the :jvm-opts option in project.clj,

Cool, this is what I was looking for :)


>     (def signals (vec ...))
>
> says that you want the entire result, as a vector, stored as the value of
> the Var `signals`.  That means your entire result data must fit in the Java
> heap.  

Yes, this is what I want. I suppose that if it grows too large, I can
figure out a way to write it to a file somehow instead of creating a
vector. The end result will likely be a JSON representation for each
of the signals a, b, c, and d written to a file.

~A