Performant string concatenation (of many, large strings)

642 views
Skip to first unread message

ma...@datasnap.io

unread,
Feb 11, 2015, 8:25:12 PM2/11/15
to clo...@googlegroups.com
I'm looking for the most performant way to transform a huge seq (size 250000) of maps into a single CSV.

The data structure looks something like:

(def data-struct
(repeat 250000 {:one 1 :two 2 :three 3 :four 4}))

A naive implementation would be:

(let [f #(->> % (map (comp str val)) (clojure.string/join ","))]
(->> data-struct
(map f)
(clojure.string/join "\n")))

However, this takes far too long for my application (an the order of 10s of seconds).

Another attempt using reducers:

(require '[clojure.core.reducers :as r])
 
(let [f #(->> % (map (comp str val)) (clojure.string/join ","))
r-join (fn
([] nil)
([x y]
(if (and x y) (str x "\n" y)
(if x (str x)
(if y (str y))))))]
(->> data-struct
(r/map f)
(r/fold r-join)))

Still not great.

But, Looking at the sources of clojure.string/join and clojure.core/str, it becomes apparent that the both implementations create an instance of java.lang.StringBuilder for each element in the sequence. (I have to imagine this is the main issue, even though GC seems to only be ~5% of the runtime)

Would it make sense to instantiate one java.lang.StringBuilder for all of the concatenation (and call java.lang.StringBuilder append)?

What's the best way to do this with idiomatic Clojure?

Thanks a lot!

Herwig Hochleitner

unread,
Feb 11, 2015, 11:37:13 PM2/11/15
to clo...@googlegroups.com
The most versatile way to do this is a pattern, I like to call concatenated lists of strings.

You implement the parts as functions returning sequences of strings.
You combine parts with concat. And map parts over input data with mapcat.
Interpose and interleave have the role of str/join.
Another useful macro for this style is forcat here: https://github.com/webnf/webnf/blob/master/base/src/webnf/base/utils.clj#L32

At the very top level, you can either (apply str (toplevel-part data)) or write the string parts directly to a socket, like ring does.
In clojure, this is already pretty efficient, thanks to lazy sequences.

(do (time
     (->> data-struct
          (map #(interpose ", " (vals %)))
          (interpose [\newline])
          (apply concat)
          (apply str)))
    nil)
; => "Elapsed time: 1104.311277 msecs"

If you still want to save on the allocation of the lazy sequence, you could write it in reducer style with a StringBuilder as state and .append as a step function.

(defn interpose-xf
  "A transducer version of interpose"
  [sep]
  (fn [xf]
    (let [is-first (volatile! true)]
      (completing
       (fn [s v]
         (-> (if @is-first
               (do (vreset! is-first false) s)
               (xf s sep))
             (xf v)))))))

(defn sb-append [sb s] (.append ^StringBuilder sb (str s)))

(do (time (str (transduce
                (comp (map #(eduction (comp (map val) (interpose-xf ", ")) %))
                      (interpose-xf [\newline])
                      cat)
                sb-append
                (StringBuilder.) data-struct)))
    nil)
; => "Elapsed time: 578.444552 msecs"

Herwig Hochleitner

unread,
Feb 11, 2015, 11:52:53 PM2/11/15
to clo...@googlegroups.com
Also, since you probably want defined order of columns for CSV, replace

(map #(interpose ", " (vals %)))

with

(map #(interpose ", " (map % [:one :two :three :four])))

in the str - sequence version

or 

#(eduction (comp (map val) (interpose-xf ", ")) %)

with

#(eduction (comp (map %) (interpose-xf ", "))
           [:one :two :three :four])

respectively

Andy Chambers

unread,
Feb 13, 2015, 11:02:21 PM2/13/15
to clo...@googlegroups.com
Is there a reason you're collecting the result into a string rather than just writing out to a file?

Michael Blume

unread,
Feb 14, 2015, 3:17:33 AM2/14/15
to clo...@googlegroups.com
For minimal change to the presented code, what about

(defprotocol appendable (append-to [this ^StringBuilder sb]))

(extend-protocol appendable
  String
  (append-to [this ^StringBuilder sb] (.append sb this))
  clojure.lang.IFn
  (append-to [this ^StringBuilder sb] (this sb))
  Object
  (append-to [this ^StringBuilder sb] (.append sb (str this))))

(defn final-str [a]
  (let [sb (StringBuilder.)]
    (append-to a sb)
    (str sb)))

(defn par-join [sep xs]
  (fn [sb]
    (when-let [xs (seq xs)]
      (append-to (first xs) sb)
      (doseq [x (next xs)]
        (append-to sep sb)
        (append-to x sb)))))

(let [f #(->> % (map (comp val)) (par-join ","))]
  (->> data-struct
    (map f)
    (par-join "\n")
    final-str))

Which winds up taking about a second and only creates one StringBuilder.



On Fri Feb 13 2015 at 8:02:27 PM Andy Chambers <achambe...@gmail.com> wrote:
Is there a reason you're collecting the result into a string rather than just writing out to a file?

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+unsubscribe@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Blume

unread,
Feb 14, 2015, 3:21:18 AM2/14/15
to clo...@googlegroups.com
er, s/(comp val)/val

Michael Blume

unread,
Feb 14, 2015, 3:24:17 AM2/14/15
to clo...@googlegroups.com
...Annoyingly, almost all the time for my version is spent in protocol dispatch, so there's probably a much faster way to do that.

Ivan L

unread,
Feb 14, 2015, 2:23:08 PM2/14/15
to clo...@googlegroups.com
What is the best way to profile Clojure? I tried a reduce doto thing but it was way slowe than apply str. would love to know why.

Michael Blume

unread,
Feb 14, 2015, 3:25:37 PM2/14/15
to clo...@googlegroups.com
Basically same way you profile java, I usually use jvisualvm, if you feel like shelling out for yourkit that can be nicer.

On Sat Feb 14 2015 at 11:23:12 AM Ivan L <ivan.l...@gmail.com> wrote:
What is the best way to profile Clojure? I tried a reduce doto thing but it was way slowe than apply str.  would love to know why.

Mikera

unread,
Feb 14, 2015, 10:57:59 PM2/14/15
to clo...@googlegroups.com
If you really care about performance then I would use a macro for code generation and do something the following:

(defmacro appendfunc 
     ([m keys sb]
       `(do
          (.append ~sb (~m ~(first keys)))
          ~@(for [k (next keys)]
              `(do 
                 (.append ~sb \,)
                 (.append ~sb (~m ~k))))
          (.append ~sb \n))))

(let [sb (StringBuilder.)
         f (fn [^StringBuilder sb m] (appendfunc m [:one :two :three :four] sb))
         maps (repeat 250000 {:one 1 :two 2 :three 3 :four 4})]
     (time (let [res (str (reduce f sb maps))] (count res))))

=> "Elapsed time: 118.105355 msecs"

Could probably optimise a bit more but that's under 400ns per row... pretty decent I think.
Reply all
Reply to author
Forward
0 new messages