[Large File Processing] What am I doing wrong?


Jarrod Swart

Jan 21, 2014, 1:55:00 AM1/21/14
to clo...@googlegroups.com
I'm processing a large CSV with Clojure; honestly it's not even that big (~18k rows, 11 MB).  I have a list of exported data from a client and I am de-duplicating URLs within the list.  My final output is a series of vectors: [url url-hash].

The odd thing is how slow it seems to be going.  I tried implementing this as a reduce, and then, to speed things up, I tried a with-open with a loop/recur.  It doesn't seem to have helped much in my case.  I know I am doing something wrong; I'm just not sure what yet.  The best I can do is about 4 seconds, which may only seem slow because I implemented it in Python first and that version finishes in half a second.  Still, this is one of the smaller files I will likely deal with, so I'm worried that as the files grow it may get too slow.

The code is here on ref-heap for easy viewing: https://www.refheap.com/26098
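
Roughly, the shape of it is this (a rough sketch for context, not the actual paste; the md5 helper and the tab-separated column layout here are assumptions):

    (require '[clojure.string :as str])
    (import 'java.security.MessageDigest)

    (defn md5-hex [^String s]
      (->> (.digest (MessageDigest/getInstance "MD5") (.getBytes s "UTF-8"))
           (map #(format "%02x" %))
           (apply str)))

    ;; Keep the first occurrence of each URL, producing [url hash] vectors.
    (defn dedupe-urls [lines]
      (:result
       (reduce (fn [{:keys [seen result] :as acc} line]
                 (let [url    (first (str/split line #"\t"))
                       hashed (md5-hex url)]
                   (if (some #{hashed} seen)
                     acc
                     {:seen   (conj seen hashed)
                      :result (conj result [url hashed])})))
               {:seen #{} :result []}
               lines)))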

Any advice is appreciated.

Rudi Engelbrecht

Jan 21, 2014, 3:11:16 AM1/21/14
to clo...@googlegroups.com
Hi Jarrod

I have had success with the clojure-csv [1] library, processing large files lazily (as opposed to slurping them in).


Here is a copy of my source code (disclaimer: this is my first Clojure program, so some things might not be idiomatic).

This code handles a 250 MB file with 315K rows (each row has 100 columns/fields) really well, and it scales in terms of memory usage because it reads the file lazily, parsing and processing one line at a time.

See snippets of code below

(ns scripts.core
  (:gen-class))

(require '[clojure.java.io :as io]
         '[clojure-csv.core :as csv]
         '[clojure.string :as str])

(def line-count 0) ;; running count of lines read

(defn parse-row [row]
  (first (csv/parse-csv row :delimiter \tab)))

(defn parse-file [filename]
  (with-open [file (io/reader filename)]
    (doseq [line (line-seq file)]
      (let [record (parse-row line)]
        (println record)) ;; replace println record with your own logic
      (def line-count (inc line-count)))))

(defn process-file [filename]
  (do
    (def line-count 0)
    (parse-file filename)
    (println line-count)))

(defn -main [& args]
  (process-file (first args)))

Feel free to ask questions if you need more info.

Kind regards

Rudi


Chris Perkins

Jan 21, 2014, 8:11:09 AM1/21/14
to clo...@googlegroups.com

This part: (some #{hashed} already-seen) is doing a linear lookup in `already-seen`. Try (contains? already-seen hashed) instead.
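
For a rough feel of the difference (illustrative only; a set membership test is effectively constant time, while `some` walks the whole collection element by element):

    (def already-seen (into #{} (map str (range 100000))))

    (time (dotimes [_ 1000] (some #{"99999"} already-seen)))   ;; linear scan on each call
    (time (dotimes [_ 1000] (contains? already-seen "99999"))) ;; hash lookup on each call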

- Chris

Jim - FooBar();

Jan 21, 2014, 8:43:40 AM1/21/14
to clo...@googlegroups.com
On 21/01/14 13:11, Chris Perkins wrote:
> This part: (some #{hashed} already-seen) is doing a linear lookup in
> `already-seen`. Try (contains? already-seen hashed) instead.

+1 to that as it will become faster...

I would also add the following, not so much related to performance:

(drop 1 (line-seq f))         ==>  (next (line-seq f))

(if seen? nil [url hashed])   ==>  (when-not seen? [url hashed])

(if seen? nil hashed)         ==>  (when-not seen? hashed)

(if (seq (rest lines)) ...    ==>  (if (seq lines) ...


I actually think the last one is a bug... it seems to me that you are skipping one row in the condition. You pass (rest lines) every time you recurse, yes?
Checking for more lines should be done against *all* the current lines, not (rest current-lines)... unless I've misunderstood something...
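
In other words, something shaped like this (a sketch of the recursion shape only, not the refheap code; url-of and md5-hex stand in for the real parsing and hashing):

    (defn dedupe [all-lines url-of md5-hex]
      (loop [lines (next all-lines)            ;; drop the header once, up front
             seen  #{}
             out   []]
        (if (seq lines)                        ;; test the lines you are about to consume,
          (let [url    (url-of (first lines))  ;; not (rest lines), or the last row is lost
                hashed (md5-hex url)]
            (recur (rest lines)
                   (conj seen hashed)
                   (if (contains? seen hashed) out (conj out [url hashed]))))
          out)))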


Jim


Michael Gardner

Jan 21, 2014, 9:08:16 AM1/21/14
to clo...@googlegroups.com
On Jan 21, 2014, at 07:11, Chris Perkins <chrispe...@gmail.com> wrote:

> This part: (some #{hashed} already-seen) is doing a linear lookup in `already-seen`. Try (contains? already-seen hashed) instead.

Or just (already-seen hashed), given that OP's not trying to store nil hashes.
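
A Clojure set is itself a lookup function: it returns the element when present and nil otherwise, so it works as a truth test as long as nil and false are never members. For example:

    (#{"a" "b"} "a")   ;; => "a"
    (#{"a" "b"} "z")   ;; => nil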

To OP: note that if you’re storing the hashes as strings (as it appears), you’re using 16 more bytes per hash than necessary. If you’re really going to be dealing with so many URLs that you’d use too much memory by storing the unique URLs directly, then you should probably be storing the hashes as byte arrays.
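
A minimal sketch of that (assuming MD5 via java.security.MessageDigest; note that Java arrays compare by identity, so raw byte arrays won't de-duplicate inside a Clojure set, but wrapping the 16 bytes in a vector gives them value equality):

    (import 'java.security.MessageDigest)

    (defn md5-key [^String url]
      ;; 16-byte digest, wrapped in a vector so it has value equality
      (vec (.digest (MessageDigest/getInstance "MD5") (.getBytes url "UTF-8"))))

    (contains? (conj #{} (md5-key "http://example.com"))
               (md5-key "http://example.com"))   ;; => true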

Alternatively, if you’re going to be dealing with REALLY large files and are running on Linux/BSD, consider dumping just the URLs to a file and running “sort -u” on it. UNIX sort can efficiently handle files that are too large to fit in memory, via external merge sort.

Jarrod Swart

Jan 21, 2014, 11:00:32 AM1/21/14
to clo...@googlegroups.com
Chris,

Thanks, this was in fact it.  I had read that sets have a near O(1) lookup, but apparently I was not achieving this properly with (some).  Thank you; the execution time is about 25x faster now!

Jarrod

Jarrod Swart

Jan 21, 2014, 11:02:03 AM1/21/14
to clo...@googlegroups.com
Jim,

Thanks for the idioms, I appreciate it!

And thanks everyone for the help!

danneu

Jan 27, 2014, 1:46:46 AM1/27/14
to clo...@googlegroups.com
I use line-seq, split, and destructuring to parse large CSVs.

Here's how I'd approach what I think you're trying to do:

    (with-open [rdr (io/reader (io/resource csv) :encoding "UTF-16")]
      (let [extract-url-hash (fn [line]
                               (let [[_ _ _ url & _] (str/split line #"\t")]
                                 [(m/md5 url) url]))]
        (->> (drop 1 (line-seq rdr))      ;; skip the header row
             (map extract-url-hash)
             (into {}))))                 ;; hash -> url; duplicate hashes collapse

Curtis Gagliardi

Jan 27, 2014, 4:50:04 PM1/27/14
to clo...@googlegroups.com
If ordering isn't important, I'd just dump them all into a set instead of manually checking whether or not you already put the url into a set.
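
For example (a minimal sketch; extract-url is a stand-in for however you pull the URL out of a row):

    (require '[clojure.java.io :as io])

    (defn unique-urls [filename extract-url]
      (with-open [rdr (io/reader filename)]
        (->> (line-seq rdr)
             (drop 1)              ;; skip the header row
             (map extract-url)
             (into #{}))))         ;; the set does the de-duplication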