DGEMM Benchmark


Emily M

Jul 31, 2012, 11:11:28 AM
to juli...@googlegroups.com
Hi all, 

I'm wondering if anyone has done the DGEMM benchmark with Julia yet, as I would be curious to see what sort of times people are getting. I'm getting times of ~90 sec for a 1000x1000 matrix, which seems slow to me, but I have more or less no idea what I'm doing, so that could be perfectly normal. Here's the code I have; maybe there are things I can do to make it faster?

function matmult(n::Int)
    a = rand(n,n)              # random n-by-n operands
    b = rand(n,n)
    c = fill(0.0,n,n)          # result matrix, initialized to zero
    y, z = size(c)             # (unused)
    tic()                      # start the timer
    for j = 1:n
        for k = 1:n
            for i = 1:n        # innermost loop runs down columns (column-major friendly)
                c[i,j] = c[i,j] + b[k,j] * a[i,k]
            end
        end
    end
    toc()                      # stop the timer and print the elapsed time
end

matmult(1000)

Aron Ahmadia

Jul 31, 2012, 11:16:44 AM
to juli...@googlegroups.com
Obtaining fast DGEMM performance requires fairly sophisticated paneling techniques to maximize reuse of the caches and avoid TLB misses.  See Goto and van de Geijn for all of the details: http://dl.acm.org/citation.cfm?id=1356053

I believe that the Julia way here would be to use the Julia interfaces into the fast BLAS implementation on your computer.
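
For instance (just a sketch): since * on two dense Float64 matrices dispatches to the BLAS dgemm routine, a BLAS-backed version of the same benchmark is simply

    a = rand(1000,1000)
    b = rand(1000,1000)
    @time a*b    # the multiplication itself is a single optimized BLAS dgemm call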

A



Keno Fischer

Jul 31, 2012, 11:42:00 AM
to juli...@googlegroups.com
Right, there should never be a need to manually implement matmult, because we're using BLAS. However, it can still be educational to do so (which, if I understood correctly, was the original intent). On my machine, the difference in times between the BLAS and this trivial algorithm is about one order of magnitude (0.482s vs 5.263s), but the 90 seconds does seem a bit odd to me. Just for reference, what does `@time rand(1000,1000)*rand(1000,1000)` give you?
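
For reference, a side-by-side comparison along those lines could be as simple as (a sketch, reusing the matmult function from the first message):

    matmult(1000)                              # naive triple loop; prints its own elapsed time
    @time rand(1000,1000)*rand(1000,1000)      # generates two matrices and multiplies them via the BLAS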


Emily M

Jul 31, 2012, 12:58:48 PM
to juli...@googlegroups.com
Alright, so to clarify: I'm working on this with a year of extremely basic coding ability (in Java), so a lot of the technical terminology is lost on me. How would I use the BLAS implementation?

When I do @time rand(1000,1000)*rand(1000,1000) I get anywhere between ~1.04 seconds and ~7.78 seconds, depending on what else is going on on my system at the same time.

Stefan Karpinski

Jul 31, 2012, 1:07:36 PM
to juli...@googlegroups.com
Julia uses a BLAS (either OpenBLAS or your system BLAS depending on how it was configured when you installed it) automatically when doing a floating-point matrix multiply. All you have to do is write A*B where A and B are dense floating-point matrix objects (i.e. Array{Float64,2} or Array{Float32,2}). So writing

@time rand(1000,1000)*rand(1000,1000)

is already generating 2 million uniformly distributed random numbers and then multiplying the two 1000x1000 random matrices using a BLAS. It looks like that takes about 1s on your system (it takes about 0.25s on mine).
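
For a slightly cleaner measurement, it can also help to generate the matrices first and run the multiply twice, since the first call includes some one-time compilation overhead (a sketch):

    a = rand(1000,1000)
    b = rand(1000,1000)
    a*b          # warm-up run
    @time a*b    # this timing is essentially just the BLAS DGEMM call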


Emily M

Jul 31, 2012, 2:04:08 PM
to juli...@googlegroups.com
Okay, but then how does that fit in with DGEMM?

I'm running on a 64-bit, 800+ node Linux machine... so I really feel like it shouldn't be taking as long as it is. However, because I am also sharing said machine with some number of other users, I have no idea how that is affecting my timings.

I was given this:

do j=1,1000
  do k=1,1000
    TEMP=b(k,j)
    do i=1,1000
      c(i,j)=c(i,j)+TEMP*a(i,k)
    enddo
  enddo
enddo

I was then told to rewrite it in Julia, which seems to be what I did... but again, it seems very slow. I mean, does what I wrote correspond to that? That could be my problem...? Mainly I just really don't know what I'm doing, so any and all advice is welcome!

Aron Ahmadia

Jul 31, 2012, 2:18:18 PM
to juli...@googlegroups.com
Emily,

Do you mind giving us slightly more context? From the description of your problem it sounds like perhaps your mentor is not giving you enough guidance on this project. Is this a summer research project, or are you doing this for a class or a company?

Taking your comments at a very high level: when you submit jobs on a cluster, you usually receive dedicated access to some fraction of the nodes, depending on how you specify them in your job submission script. By default, you probably only use one node unless you explicitly ask for more. To take advantage of the parallelism of a multicore machine, all you need to do is use a BLAS. If you want to use more than one node, you'll need an MPI-enabled library of some sort. I strongly recommend Jack Poulson's Elemental library (https://code.google.com/p/elemental/) for fast parallel dense matrix-matrix linear algebra if you want to use the best freely available scientific software out there.

Warm Regards,
Aron


Stefan Karpinski

Jul 31, 2012, 2:41:49 PM
to juli...@googlegroups.com
DGEMM stands for Double-precision (i.e. 64-bit floating-point) General Matrix Multiply. The code you were given is a very naive triple-loop DGEMM implementation. Any BLAS library will provide a highly optimized, tuned implementation of DGEMM, along with various other operations on dense matrices and vectors. Any BLAS implementation will use a more sophisticated algorithm than such a triple loop, and will accordingly be much, much faster.

BLASes will typically take advantage of multiple cores on a single machine automatically, although they often require some BLAS-specific environment variables to be set to control the degree of parallelism. BLASes do not, however, do distributed computations. For that you need some higher-level system that handles all the communication required for distributed computations. Such systems generally use a non-distributed BLAS for local computations and build distributed operations on top of that.

In Julia, if you want to multiply two matrices, you just write A*B where A and B are the matrix objects. If they are dense matrices of doubles, then Julia uses a BLAS to do this DGEMM operation. This has the same effect as the triple-loop code you were given, but it's much faster. Julia also has significant support for distributed computation, but it does not happen automatically: you have to construct distributed array objects explicitly. If A and B are distributed array objects (DArray), spread across many nodes, then A*B means perform a distributed matrix multiply. I'm not sure what the status of that functionality is currently, but all the requisite infrastructure is in place. With some work, Julia could also be made to call MPI-based distributed computing libraries (like the Elemental library that Aron linked to).
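
To make that concrete (a sketch; the array names are just placeholders), the loop you were given corresponds to

    a = rand(1000,1000)
    b = rand(1000,1000)
    c = zeros(1000,1000)
    for j = 1:1000
        for k = 1:1000
            temp = b[k,j]              # TEMP in the Fortran version
            for i = 1:1000
                c[i,j] = c[i,j] + temp * a[i,k]
            end
        end
    end
    # at this point c agrees with a*b up to floating-point roundoff

whereas a*b computes the same product in a single BLAS call.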


Emily M

Jul 31, 2012, 2:52:58 PM
to juli...@googlegroups.com, ar...@ahmadia.net
@Aron, that basically sums it up. I'm working with Julia this summer because the people I'm working with don't know anything about it either, and they wanted someone to explore the language a little bit and report back to them. Once we got Julia built on my machine, I was told to start working with basic operations in Julia, and when that was "done" to start writing the DGEMM algorithm (what I was given) in Julia and run it. I've gotten it written now, but my mentor is on vacation for the week, and this is also the week that I have to prepare a presentation/poster for them for next week. So that's why I'm on the group right now: she has limited internet access and can't necessarily get back to me in a timely manner. Anyways, at this point I'm trying to get some accurate timings for it so that I can put them in my report.

@Stefan, thank you for the info, it really helps! I guess my main question right now is: does A*B produce the same result as accumulating the b(k,j) * a(i,k) products? Because that doesn't make sense in my head...

thank you all!

Stefan Karpinski

Jul 31, 2012, 3:20:32 PM
to juli...@googlegroups.com, ar...@ahmadia.net
Your example loop computes exactly the matrix product A*B (the j,k,i loop ordering changes the memory access pattern, not the result), so yes, that's just a matrix product. It's not a matter of it making sense, but rather a matter of definition. In Julia, A*B is defined to compute the matrix product, which happens to be implemented by calling a BLAS. You could also implement it by writing your own triple-loop code, but that's going to be really, really slow. Comparing triple-loop code in any language to a BLAS and finding that it's slow is completely unsurprising.

We already have a rand_mat_mul benchmark (it's a one-liner) and benchmark results on the Julia website comparing various high-level dynamic languages. It does exactly what your example is doing: generate two random matrices and then multiply them (the size is even the same). This isn't a terribly interesting benchmark because all of the high-level languages just call a BLAS that's written in Fortran (except for JavaScript, which can't call out to Fortran code). We mainly include it to show how Julia does on standard vectorized code like that.

One thing that Julia has over all the other languages in the benchmark (except maybe JavaScript, which does respectably with a naive triple loop that I wrote for it) is that you *could* write your own BLAS in Julia and have some hope of it competing with a Fortran BLAS (we are faster than calling BLAS for small matrices). But by "competing" I mean "not getting completely destroyed". Fortran is literally made for this kind of thing, and the BLASes that people use have been tuned and optimized over 30+ years to be as fast as possible.

For your talk, since it sounds like your mentor just wants you to find out more about Julia and present on it, I would recommend looking at a few of our videos and talk slides.
Like I said, the DGEMM comparison isn't a very interesting one since everyone just uses a BLAS. You might want to pick a different bit of code to compare across languages. Our benchmarks might save you some work since they already do exactly that.


Jeff Bezanson

Jul 31, 2012, 4:28:00 PM
to juli...@googlegroups.com
Stefan is exactly right, and I just want to echo that matrix multiply really is a special case: it is basically the only commonly used math kernel that is tuned differently *per CPU*. Not per architecture (e.g. x86), mind you, but for specific CPU models. The triple loop will make any language, including C, look bad. Fortran will probably do much better, since it avoids C's aliasing problems and is likely to do more loop transformations in the compiler.

Near the top of base/linalg_dense.jl there is a slightly more optimized matrix multiply in Julia that uses some manual hoisting and loop blocking, if you are curious.
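
For the curious, a toy illustration of loop blocking in Julia might look something like this (a sketch with an arbitrary block size; nothing like a real tuned kernel, which also packs panels and uses vectorized microkernels):

    function blocked_matmul!(c, a, b, bs)
        n = size(a, 1)                         # assumes square n-by-n inputs
        for jj = 1:bs:n, kk = 1:bs:n           # iterate over bs-by-bs blocks of (j,k)
            for j = jj:min(jj+bs-1, n)
                for k = kk:min(kk+bs-1, n)
                    temp = b[k,j]              # hoist the b element out of the innermost loop
                    for i = 1:n
                        c[i,j] += temp * a[i,k]
                    end
                end
            end
        end
        return c
    end

    # e.g. blocked_matmul!(zeros(1000,1000), rand(1000,1000), rand(1000,1000), 64)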

A good benchmark might be to find a MATLAB or Python/NumPy routine that somebody around the lab uses (of moderate size, maybe 100 lines or so), port it to Julia, and compare. That could make it more relevant and interesting to people too.

Viral Shah

Aug 1, 2012, 4:49:07 AM
to juli...@googlegroups.com
Emily,

Please look at these slides to understand what it takes to get DGEMM to run fast. OpenBLAS goes well beyond these optimizations, but this is a good starting point.


-viral