J matrix multiplication performance


Thomas McGuire

May 29, 2024, 2:16:36 AM
to fo...@jsoftware.com
I forked J9.6 and added the Apple M1/M2 patch Elijah Stone had posted in the old forum. I saw about a 33% decrease in the time it took to do the following multiply:

A =: 3000 4000?.@$0
B =: 4000 2000?.@$0
C =: A +/ . * B

I compiled with USE_OPENMP=1 and used brew to install libomp. I just thought I would see a more dramatic change. Does anyone have experience with this? The multiply takes about 4.7 seconds when I time it.

Also in the old forum archive there was a discussion about using threading to speed this up. But on my regular J9.6 beta 8 (downloaded from the install directory at the jsoftware site), I see no appreciable performance enhancement, though I do see that it engages more than one core.

The threading discussion indicated that using:

0&T.@''^:4 ''

should provide a noticeable speedup. But it takes about 6.3 seconds on my M2 Max MacBook, threads or no threads.

Any insight would be appreciated.

Elijah Stone

May 29, 2024, 2:29:19 AM
to fo...@jsoftware.com
OpenMP should not affect anything in this case.

33% was in the same ballpark as the speedup I saw from using the Apple matrix multiply.

Thomas McGuire

May 29, 2024, 2:56:35 AM
to fo...@jsoftware.com
My understanding of the Accelerate framework Apple provides is that this operation takes place on a matrix multiply coprocessor available to certain cores on the M2 chip. Not quite what I was hoping for.

In researching this, it looks as though NumPy has taken some dedicated shaders Apple made for the Metal API and is able to push some matrix multiplies onto the GPU. Do you know anything about doing that?

I'm not sure what I would do with it yet, but it seems pretty cool that my high-level J code could run on the GPU without me having to make separate API calls and go through the tedium of setting up a DLL interface.

Thanks for your late night reply, btw (at least in my timezone)

Tom McGuire

Thomas McGuire

May 29, 2024, 5:29:16 AM
to fo...@jsoftware.com
A correction: it is PyTorch that has incorporated access to the Apple silicon GPUs, not NumPy.

Tom McGuire

Ak

May 29, 2024, 10:00:04 AM
to fo...@jsoftware.com
How many threads have you spun up?  (1 T. '')
Are you using threadpools other than threadpool 0?
How many cores are available?  (8 T. '')
What instruction are you using to run the command?

Ak

Devon McCormick

May 29, 2024, 11:05:02 AM
to fo...@jsoftware.com
This is nice.  I get good speedups, with diminishing returns, as I add threads.  This is on an Intel i9-10900F CPU @ 2.80GHz desktop:

   A =: 3000 4000?.@$0
   B =: 4000 2000?.@$0
   6!:2 'C=. A +/ . * B'
1.28888

   0&T.@''^:4 ''
   1 T. ''
4
   6!:2 'C=. A +/ . * B'
0.265315
   
   0&T.@''^:8 ''
12
   1 T. ''
12
   6!:2 'C=. A +/ . * B'
0.182302

   8 T.''
20 63


--

Devon McCormick, CFA

Quantitative Consultant


Thomas McGuire

May 29, 2024, 11:24:41 AM
to fo...@jsoftware.com
So on my J9.6-beta8, downloaded from the jsoftware installation site, I am not getting any benefit from threading. Is there some configuration I am missing? This is on a MacBook with the M2 Max architecture.

Ak

May 29, 2024, 3:53:49 PM
to fo...@jsoftware.com
An idea.


(0 T. ])"0 (2# 1 2 3 4)    NB. Spin up 8 threads across 4 pools.

aa=: 30000 4000?.@$0       NB. Larger array
bb=: 4000 2000?.@$0

5 (6!:2,7!:2@])' cc=:      aa  +/ . * bb'
5 (6!:2,7!:2@])' cc=:     ; aa t.''''  +/ . * bb'
5 (6!:2,7!:2@])' cc=:     ; aa +/ . * t. (<''worker'';1) bb'

With the larger arrays I notice a reported time improvement (~5%) and a dramatic space improvement, from 5.5e8 bytes down to about 3 KB, over 5 runs each.


Ak


Henry Rich

May 29, 2024, 4:00:32 PM
to fo...@jsoftware.com
Interesting.  If you have N threads in threadpool 0, x +/ . * y automatically gives each thread (#x)%N rows of x to calculate the product for. This is pretty good until the number of rows per thread gets 'too small', probably ~128 or so.
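
As a rough editor's illustration of that division of labor (assuming a fresh J 9.x session with no threads yet, and the 3000-row A from earlier in the thread):

```j
NB. Sketch: with 4 threads in threadpool 0, x +/ . * y hands each
NB. thread roughly (#x)%N rows of x to work on.
   0&T.@''^:4 ''            NB. create 4 threads in pool 0
   A =: 3000 4000 ?.@$ 0
   (#A) % 1 T. ''           NB. rows of A per thread
750
```

At 750 rows per thread this is well above the ~128-row break-even point, so the multiply parallelizes usefully.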

Henry Rich


Thomas McGuire

Jun 7, 2024, 8:42:02 AM
to fo...@jsoftware.com
Taking Raul's statement to heart, I did some more digging to see if I could use Apple's Metal Performance Shaders (MPS) API within J. As a proof of concept I used the FFI available in the 'dll' library.

Not being an Apple developer, the first problem I hit with the Metal API is that Apple likes to steer you toward Swift. Some Metal APIs can be used directly from C/C++, but not the MPS API. It is available to Objective-C, and it turns out that if you write a typical C-style function call, it is accessible to J via the FFI facility.

There are some limitations. The MPS API only allows an 8000 x 8000 maximum for matrix multiplication, and it only uses 4-byte floats (float32). This differs from J, which uses double as its underlying floating-point representation. The Apple BLAS library also uses doubles, which makes it easier to integrate directly into the source code, as Elijah Stone had done.
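
For context, a minimal sketch of what binding such a float32 shim through J's FFI might look like. Everything here is illustrative: the library name, function name, and signature are assumptions, not the actual symbols in the package below. J's cd declarations use *f for pointers to 4-byte floats, which is where the precision loss enters:

```j
NB. Hypothetical binding to a C shim that multiplies float32 matrices.
NB. '*f' marshals data as 4-byte floats, so values round-trip through
NB. float32 and lose precision relative to J's native float64.
require 'dll'
mpsmul =: 'libmps.dylib mps_matmul i *f *f *f i i i' & cd
NB. call as: mpsmul cbuf ; abuf ; bbuf ; m ; k ; n
```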

I have a confession to make. Not being a Swift or Objective-C programmer, I took an example Swift program, pulled out the salient features I needed, and asked ChatGPT to translate that into Objective-C. It didn't take much hand modification from there to get the MPS version of the matrix product up and running, tested from a C calling program. I asked ChatGPT to write some J test code and the FFI as well, but it failed miserably and I had to do that myself.

If you have an M2 machine and are interested, I made it installable into J via the 'install' command from GitHub.

install 'github:tmcguirefl/JmpsMP@main'
load 'tmcguirefl/mpsMP/mpsmp.ijs'

Then run: testMP 0

Here are my results, with impressive timing for the MPS multiply.

Regular J matrix product

1018.25 1013.55 999.638 1000.76 1015.25 1032.3 1006.25 1022.34 1019.53 1014.35


Now with Apple Metal Performance Shader API

1018.25 1013.55 999.638 1000.76 1015.25 1032.3 1006.25 1022.33 1019.53 1014.35


Standard J MP : 6.08373 6.71108e7

MPS MP : 0.066262 5.45267e8


Yes, that appears to be about a 100x speedup. So I guess we know why GPU processing is favored for neural network / AI work.


All the code is available on GitHub so I won't reproduce it here, except for the links:







On May 29, 2024, at 12:01 PM, Raul Miller <rauld...@gmail.com> wrote:

You might be the best person to work out the details of that kind of
implementation.

--
Raul


Thomas McGuire

Jun 8, 2024, 3:41:39 AM
to fo...@jsoftware.com

LdBeth

Jun 8, 2024, 12:17:21 PM
to fo...@jsoftware.com
Actually, I think I have enough experience with ObjC to make this work.

I don't have an M2, only an Intel MacBook that runs macOS Monterey, but the Metal API is still available in the OS version I have installed. And I just managed to write an Objective-C matrix multiplication example that works.


> It also only uses 4 byte floats (aka float32).
Supported data types are listed at
https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype?language=objc

Yes, there are complex16, complex32, float16, float32, uint32, uint64, but no
float64.

ldbeth
mps.m

LdBeth

Jun 8, 2024, 9:16:26 PM
to fo...@jsoftware.com
Since the linked repo does not contain the source code of the shared lib for
some reason, I just rolled my own, and unsurprisingly the performance
gain is noticeable even on an Intel Mac.

Here is the full code; the command to compile the shared library is:

clang -framework Foundation -framework CoreGraphics -framework Metal -framework MetalPerformanceShaders -fobjc-arc -shared -o libmps.dylib libmps.m

mps.ijs
libmps.m

bill lam

Jun 9, 2024, 5:08:26 AM
to fo...@jsoftware.com
For USE_OPENMP binaries, you can test with BLAS enabled. Save this to a script gm1.ijs:

A=: 3000 4000?.@$0
B=: 4000 2000?.@$0
NP=: */ 3000 4000 2000

NB. using blas
0 (9!:58)"0 i.3        NB.  +/ .*  always use blas
t1=: 6!:2'c2=: A+/ .*B'
echo ' sec',~ ":t1
echo 'blas ',' GFlop ',~ 0j3": 2*NP%(t1)*1e9

Then start jconsole with the OMP_NUM_THREADS environment variable:

bill:~% OMP_NUM_THREADS=1 jc ~/gm1.ijs
13.1143 sec
blas 3.660 GFlop
bill:~% OMP_NUM_THREADS=4 jc ~/gm1.ijs
3.96833 sec
blas 12.096 GFlop
bill:~% OMP_NUM_THREADS=8 jc ~/gm1.ijs
2.85146 sec
blas 16.833 GFlop



Thomas McGuire

Jun 9, 2024, 7:23:23 AM
to fo...@jsoftware.com
Yeah, I had a brain freeze using my mail reader and ended up sending a good email to Raul personally instead of to the forum. When I fixed that, I forgot the links to the code. It took me a couple of days to figure that out and send the links in, sorry. But I like your Objective-C better than mine.

So the next question is how to integrate it under the covers. Elijah Stone's Apple libblas.dylib usage can be added at compile time by checking the architecture being built, which is nice in that the runtime only carries the code for the particular architecture.

Adding MPSMatrixMultiplication to the J source code means deciding when to use MPS vs libblas for a particular matrix product. I was thinking of some combination of matrix size and print precision: under a certain size the interpreter would use the libblas or legacy approach; above that size it would use the mpsMatrixMult code. Then if you wanted to force a libblas (or legacy) approach, to gain precision over speed, the interpreter could look for a certain threshold in the print precision. With something like that I can program in J and let the interpreter decide how to carry out the calculation.
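
An editor's sketch of that dispatch idea in plain J (all thresholds are illustrative, and mpsMP stands in for the MPS-backed product; 9!:10 '' queries the current print precision):

```j
NB. Sketch of a size/precision dispatch heuristic (illustrative only).
NB. mpsMP is a placeholder for the MPS-backed matrix product.
mpAuto =: 4 : 0
  big =. 1e6 < (#x) * {: $ y     NB. result would be 'large'
  wantPrec =. 10 < 9!:10 ''      NB. high print precision: keep float64
  if. big *. -. wantPrec do. x mpsMP y else. x +/ . * y end.
)
```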

I have written some J code to overcome the 8000x8000 limitation MPS has, by breaking the multiplication into pieces using an algorithm I found on Wikipedia (see the divide-and-conquer section):


My J implementation is attached. It's called rMP, for recursive Matrix Product. A 12000x14000 by 14000x13000 matrix product took 4.26 seconds with my version of mpsMP. When I get a chance I will use it with LdBeth's MPSMatrixMul code.
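
Since rmp.ijs is attached rather than shown, here is an independent editor's sketch of the divide-and-conquer idea (not the attached code): halve whichever dimension exceeds the limit and recurse, stitching row/column blocks or summing partial products over the shared dimension.

```j
NB. Sketch of a recursive block matrix product under a dimension cap.
NB. mp stands in for the MPS-backed multiply; 8000 is the MPS limit.
mp  =: +/ . *
rMP =: 4 : 0
  lim =. 8000
  'm k' =. $ x
  n =. {: $ y
  if. lim >: m >. k >. n do. x mp y return. end.
  if. m > lim do.                NB. split rows of x, stack results
    h =. >. m % 2
    ((h {. x) rMP y) , (h }. x) rMP y return.
  end.
  if. n > lim do.                NB. split columns of y, stitch results
    h =. >. n % 2
    (x rMP h {."1 y) ,. x rMP h }."1 y return.
  end.
  h =. >. k % 2                  NB. split shared dimension, sum parts
  ((h {."1 x) rMP h {. y) + (h }."1 x) rMP h }. y
)
```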

Tom McGuire
rmp.ijs