Using clojure-csv with large files


Timothy Washington

Jul 8, 2012, 12:34:58 PM
to clo...@googlegroups.com
Hi there, 

I'm trying out the clojure-csv lib, but I run out of heap space when parsing a large CSV file. As far as I can tell, this call should return a lazy sequence:

(csv/parse-csv (io/reader "125Mfile.csv")) 


Instead, I get a "java.lang.OutOfMemoryError: Java heap space". Is there a way to get that lazy sequence without reading in the entire file first? I can't see one when looking at the code.


Thanks in advance 
Tim 

Denis Labaye

Jul 8, 2012, 1:39:15 PM
to clo...@googlegroups.com
Hi, 

I would try something like (untested): 

(map parse-csv
     (line-seq (clojure.java.io/reader "/tmp/foo.csv")))

But it will break for CSV cells with newlines like: 
a  ; b
foo;"bar
baz"
x  ; z

interesting ...
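
To make the failure mode above concrete, here is a small sketch that uses only clojure.core and java.io, with no CSV library involved. The `sample` string is made up for illustration and mirrors the example data: three logical records, one of which has a quoted cell containing a newline.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical sample: 3 logical CSV records; the second one has a
;; quoted cell with an embedded newline.
(def sample "a;b\nfoo;\"bar\nbaz\"\nx;z\n")

;; line-seq splits on physical line breaks and knows nothing about quoting.
(def physical-lines
  (with-open [rdr (BufferedReader. (StringReader. sample))]
    (doall (line-seq rdr))))

(count physical-lines) ;; => 4, one more than the number of records
(nth physical-lines 1) ;; => "foo;\"bar" (the quoted cell is cut in half)
```

Each of those four strings would then be handed to the parser on its own, so the record with the embedded newline can never be reassembled.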

Denis




David Santiago

Jul 8, 2012, 2:39:34 PM
to clo...@googlegroups.com
Yeah, CSV files can have embedded newlines, so you can't just split
the file on line breaks and expect it to work. You need to send the
whole input through a parser.

parse-csv *is* lazy, so my question is: are you doing this at the
REPL, exactly as you wrote? If so, it will lazily parse the file and
then print that sequence to the REPL output, which consumes the whole
sequence, so all of it ends up in memory at once and you get the
exception you saw. It's the same problem as evaluating (repeat 10) at
the REPL.
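
A minimal sketch of that point, using only clojure.core. The `rows` function and the `realized` counter are hypothetical stand-ins for a lazily parsed file; they are not part of clojure-csv.

```clojure
;; Count how many elements of a lazy seq actually get computed.
(def realized (atom 0))

(defn rows
  "Hypothetical stand-in for a lazily parsed file: an endless seq of rows."
  []
  (map (fn [n] (swap! realized inc) [n "row"])
       (iterate inc 0)))

(first (rows))            ;; realizes a single row
(def after-first @realized)

(doall (take 5 (rows)))   ;; realizes exactly five more
(def after-take @realized)

;; Capping *print-length* keeps a printer from walking the whole seq.
(def printed
  (binding [*print-length* 3]
    (pr-str (iterate inc 0))))
;; printed => "(0 1 2 ...)"
```

At a REPL, something like `(set! *print-length* 20)` should have the same effect on anything the REPL prints, which makes it harder to realize a huge sequence by accident.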

If you are instead consuming it lazily (say, by binding it to a var
and processing it in a way that only consumes as much as you need, or
by using a function that processes a lazy seq a piece at a time), then
there is a bug in the library, and I'd appreciate it if you would file
an issue for me on the repo so we can get it sorted out ASAP.
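
One way to consume the rows a piece at a time is sketched below. The `count-rows` helper is hypothetical, and `parse-fn` stands for any function that turns a reader into a lazy seq of rows. The demo passes `line-seq` as a stand-in so that the snippet runs without extra dependencies.

```clojure
(require '[clojure.java.io :as io])

(defn count-rows
  "Counts rows in the file at `path`. `parse-fn` is any function that
  turns a reader into a lazy seq of rows."
  [parse-fn path]
  (with-open [rdr (io/reader path)]
    ;; reduce walks the seq one row at a time and keeps no reference
    ;; to rows it has already seen. All work happens before the
    ;; reader is closed.
    (reduce (fn [n _row] (inc n)) 0 (parse-fn rdr))))

;; Demo with a stand-in parser (line-seq) and a throwaway temp file.
(def tmp (doto (java.io.File/createTempFile "rows" ".csv") (.deleteOnExit)))
(spit tmp "Time,Ask,Bid\n1,2,3\n4,5,6\n")

(def n-rows (count-rows line-seq tmp))
;; n-rows => 3
```

With clojure-csv required as `csv`, the call would presumably look like `(count-rows csv/parse-csv "125Mfile.csv")`, assuming parse-csv accepts a reader as in the calls shown in this thread.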

David

Sean Corfield

Jul 8, 2012, 5:15:28 PM
to clo...@googlegroups.com
On Sun, Jul 8, 2012 at 9:34 AM, Timothy Washington <twas...@gmail.com> wrote:
> I'm trying out the Clojure-csv lib. But I run out of heap space when trying
> to parse a large CSV file. So this call should return a lazy sequence.
>
> (csv/parse-csv (io/reader "125Mfile.csv"))

If you are consuming the CSV lazily, you should be able to parse very
large files. At World Singles, we use clojure-csv to parse PowerMTA
logfiles in excess of 400 MB every day (about 1 million rows).
--
Sean A Corfield -- (904) 302-SEAN
An Architect's View -- http://corfield.org/
World Singles, LLC. -- http://worldsingles.com/

"Perfection is the enemy of the good."
-- Gustave Flaubert, French realist novelist (1821-1880)

Dustin Getz

Jul 8, 2012, 7:50:48 PM
to clo...@googlegroups.com
what are you doing with the return value of csv/parse-csv?

Timothy Washington

Jul 8, 2012, 8:58:02 PM
to clo...@googlegroups.com
Hmm, these are all good points. My source file just returns the sequence (as in src.clj). I have a midje test file (test.clj) that just asserts that it exists. 

src.clj 

(ns my-ns ... )

(defn load-config []
  (let [config (load-file "etc/config.clj")
        dname (-> config :data :test)]
    (csv/parse-csv (io/reader dname))))


test.clj

(fact "load config training file; ensure first tick is as expected"
      (config/load-config) => truthy)


Sean and David, you are quite right that it was midje doing the consuming. If I lazily pull from the seq at the REPL, I get the desired effect.

user> (def thing (config/load-config))

user> (first thing) 
["Time" "Ask" "Bid" "AskVolume" "BidVolume"] 

user> (second thing) 
["01.05.2012 20:00:00.676" "1.32390" "1.32379" "3000000.00" "2250000.00"]


So the question seems to be which test functions let me deal with lazy sequences in this way. For now, I'm using this: 

(fact "load config training file; ensure first tick is as expected"
      (-> (config/load-config) nil? not) => true)
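
Another option is to assert only on a bounded prefix of the sequence, so the test never has to realize the whole file. The sketch below swaps in clojure.test (bundled with Clojure) instead of midje so that it runs without extra dependencies. `load-rows` is a hypothetical stand-in for config/load-config, and its tick rows are made up.

```clojure
(require '[clojure.test :as t :refer [deftest is]])

(defn load-rows
  "Hypothetical stand-in for config/load-config: a header row followed
  by an endless lazy seq of made-up tick rows."
  []
  (cons ["Time" "Ask" "Bid" "AskVolume" "BidVolume"]
        (map (fn [n] [(str n) "1.0" "1.0" "0" "0"]) (iterate inc 0))))

(deftest first-row-is-header
  ;; first realizes a single element; the rest of the seq is never touched.
  (is (= ["Time" "Ask" "Bid" "AskVolume" "BidVolume"]
         (first (load-rows))))
  ;; A bounded prefix is also safe to inspect.
  (is (= 3 (count (take 3 (load-rows))))))

(def summary (t/run-tests))
```

In midje the same idea would presumably be a fact whose left-hand side is `(first (config/load-config))`, checked against the expected header vector.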



Cheers all 

Tim Washington 



On Sun, Jul 8, 2012 at 7:50 PM, Dustin Getz <dusti...@gmail.com> wrote:
> what are you doing with the return value of csv/parse-csv?