How to optimize code using core.matrix + vectorz

Alexey Cherkaev

unread,

Mar 4, 2016, 4:18:30 AM3/4/16

to Numerical Clojure

Hi all,

I've written conjugate gradient (CG) algorithm to iteratively solve symmetric linear equations (especially efficient for sparse systems). One beauty of CG is that it doesn't really need a matrix: all it needs is a product of the matrix by a supplied vector. Thus, "matrix" (or rather "linear operator") can be implemented as a function (essentially `mmul` "closed" over matrix). Here is an example of the matrix representing the discretization of the second derivative (for example, diffusive term in mass balance) with `d` being diffusion coefficient and `dx` is the step of the mesh.

(defn diff-matrix [^long n]
  (let [n-1 (long (dec n))]
    (fn [x b]
      (let [d (double 1e-9)
            dx (double 1e-2)
            coeff (double (/ d dx dx))]
        (m/mset! b 0 (m/mget x 0))
        (loop [i (long 1)]
          (cond (= i n-1) (m/mset! b i (m/mget x i))
                :else (do (m/mset! b i
                                   ^double
                                   (* coeff
                                      ^double
                                      (+ ^double (m/mget x (dec i))
                                         ^double (m/mget x (inc i))
                                         ^double (* -2.0 ^double (m/mget x i)))))
                          (recur ^long (inc i)))))))))

The function is closed over `n` - the size of the problem, the main part is within `fn` form. It produces the result of multiplication `A*x` by putting the result into `b`. Couple of points:

I am using mutation to avoid allocation. Allocation is not a problem on its own, but if there is too much allocation, GC needs to be fired too often and that may slow down the computation (especially if linear solver is a part of non-linear solver, which is a part of ODE solver...)
I am using direct `loop`-`recur`: I've tried `map-indexed!`, but it was quite slow.
I am using Vectorz as implementation; I set `*warn-on-reflection*` to true`.

The problem is that the solver is rather slow: on 1000-element system it takes ~240 ms to solve (for comparison, my Common Lisp implementation takes ~50 ms with vector operations not always optimised for double-float arithmetic). Profiling with `timbre` showed that the slowest step is the "matrix multiplication", i.e. application `(diff-matrix x b)` (takes about 90% of the total time).

Question is why it is so slow. In my attempt to optimize `diff-matrix` I've put type hints everywhere I could, but now I'm stuck. Is there a better way to do the loop? Is it arithmetic that is slow, or `mget`/`mset!`? I can avoid the loop, if, for example, I operate on vector slices (can I? considering the mutation of `b`) and set the first and last elements separately.

PS. My aim is to bring this to <100 ms to consider the method/implementation viable.

Cheers, Alexey

Mike Anderson

unread,

Mar 6, 2016, 1:06:41 AM3/6/16

to Numerical Clojure

Hi Alexey,

The way to make this stuff fast is to do operations on entire vectors / matrices at once rather than iterating over individual elements with mset and mget (both of which have the overhead of protocol dispatch, which is quite efficient but far from optimal).

In this case it looks like you are adding multiples of shifted versions of a vector.... so I suggest eliminating the loop entirely and do the computation with subvectors, e.g. using expressions like:

(add (subvector result 0 n-1) (subvector x 1 n))

If you do this, you should get to near-native speed, because subvector is very efficient in Vectorz (uses a lightweight view, no unnecessary array copying etc.)

Also, type hints probably become irrelevant with this approach. Do check that all your vectors are in Vectorz format however, if you mix in Clojure vectors it will be slower.

Alexey Cherkaev

unread,

Mar 6, 2016, 5:18:17 AM3/6/16

to Numerical Clojure

Hi Mike,

Thanks!

Here is an updated version of the function:

(defn diff-matrix [^long n]
  (let [n-1 (long (dec n))]
    (fn [x b]
      (let [d (double 1e-9)
            dx (double 1e-2)
            coeff (double (/ d dx dx))]

(m/assign! b 0.0)

(m/mset! b 0 (m/mget x 0))

        (m/mset! b n-1 (m/mget x n-1))
        (m/add! (m/subvector b 1 (- n 2))
                (m/subvector x 0 (- n 2))
                (m/subvector x 2 (- n 2)))
        (m/add-scaled! (m/subvector b 1 (- n 2)) (m/subvector x 1 (- n 2)) -2.0)
        (m/scale! (m/subvector b 1 (- n 2)) coeff)))))

And the solver overall time dropped from 240 ms to 70-100 ms! And it's not only faster, but clearer as well.

PS. SBCL does some wonderful things with floating arithmetic because the same Common Lisp program that was run using CCL (not bad and reasonably fast native compiler) took ~0.9 s compared to 50 ms! So, Clojure is not doing badly at all.

Thanks,

Alexey

Mike Anderson

unread,

Mar 6, 2016, 8:09:55 PM3/6/16

to numerica...@googlegroups.com

Looks good - that is definitely the way that core.matrix is intended to be used! Glad that you got a good speedup as well.

I have a couple more minor recommendations if you want to refactor:

- Currently you are passing a vector b that you are zeroing, mutating and copying to. You may find it easier to just make a new vector in the function - this is more idiomatic / functional and it is fairly cheap, probably no more expensive than zeroing memory which is what the JVM is doing anyway under the hood.

- Not sure why you need to provide n and create a closure for "n-1". Can't the function simply do (m/dimension-count x 0) to get n? This is also pretty cheap, and you avoid creating a whole bunch of different function objects if n changes frequently.

--
You received this message because you are subscribed to a topic in the Google Groups "Numerical Clojure" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/numerical-clojure/CZ0RJe_dlvc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to numerical-cloj...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alexey Cherkaev

unread,

Mar 7, 2016, 2:47:17 AM3/7/16

to Numerical Clojure, mi...@mikera.net

Starting from the last point: `n` is an artefact of my attempt to hand-optimize the code.

The first question is more interesting: I know that it would be more idiomatic to produce a new vector instead of passing and nullifying the existing one. My reasoning for not doing this is that a new vector will be short-lived (eventually) and considering many calls that `diff-matrix` would experience in the process, say, solving ODE with implicit method: non-linear equation needs to be solved at each time-step, for which the linear equation needs to be solved repeatedly, in the process of which `diff-matrix` will be called repeatedly. So, I am avoiding GC being fired up to clean up all the intermediate short-lived vectors. However, I don't know the internals of JVM that well. Perhaps, it has extremely efficient collection of garbage from "nursery". I guess, the only way to find out is to test different implementations.

Mike Anderson

unread,

Mar 7, 2016, 5:04:50 AM3/7/16

to Numerical Clojure, mi...@mikera.net

True, you will need to benchmark to get a proper understanding for your use case.

In my quick tests (using Google Caliper library):

- Creating new Vectorz Vectors costs about 1.3ns per element

- Creating new Java double[] arrays cost about 1.2ns per element

- Zeroing an existing Vector costs about 0.2ns per element

i.e. these are all very cheap operations - I doubt they will be your bottleneck. The vector additions and multiplications will certainly be more expensive.

Reply all

Reply to author

Forward