pico-gpt2-j: Ryan Huang's small GPT-2 written in J


Thomas McGuire

Mar 24, 2025, 5:08:18 AM
to fo...@jsoftware.com
I found this implementation via the ArrayPortal.com array search engine. Bob Therriault had announced that they moved this search tool to the web (it used to be a local app you ran in J). I have been studying LLMs off and on for the past 6 months, and on a whim I typed GPT2 into ArrayPortal and up popped Ryan Huang's picogpt-in-j, a J implementation of GPT-2. It is based on a "ridiculously small" GPT-2 written in C. There is a little more detail in the README at the GitHub site.



I haven't been able to test this implementation on an AVX2 version of J. After I pointed out a problem with J to Henry Rich, he was kind enough to give me a workaround, so I was set to try this out on my Apple MacBook Pro M2 laptop, with Elijah Stone's patch that routes matrix multiplication through Apple's Accelerate framework so the multiplication takes place using Apple's specialized AMX instructions.

On a non-optimized J installation, the 1558M model takes about 167 seconds to output 40 tokens. Using the Apple Accelerate framework, the same run takes only 28 seconds: a highly tolerable speed for working in J.
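To reproduce that comparison, here is a minimal timing sketch (my sketch, not from the repo: it assumes gpt2.ijs loads cleanly into a running session, and uses the time foreign 6!:2, which runs a sentence and returns elapsed seconds):

   load 'gpt2.ijs'
   model '1558M'
   6!:2 'gen ''Alan Turing theorized that computers would one day become'''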

The nice thing about this GPT-2 implementation is that it uses a model that already exists on Hugging Face. You just need to pull the files from Hugging Face and the J implementation will complete an open-ended prompt.

After cloning the 3 J files that comprise picogpt-in-j to a local directory, I followed the directions on obtaining the model data from Hugging Face using the download.sh script included with the J files.


-rw-r--r--  1 501  20  1067 Mar 16 01:24 LICENSE

-rw-r--r--  1 501  20  3041 Mar 16 01:24 README.md

-rwxr-xr-x  1 501  20  1023 Mar 16 01:24 download.sh

-rw-r--r--  1 501  20  1898 Mar 23 14:06 encoder.ijs

-rw-r--r--  1 501  20  2270 Mar 16 01:24 gpt2.ijs

drwxr-xr-x  6 501  20   192 Mar 16 01:36 models

-rw-r--r--  1 501  20  1757 Mar 16 01:24 utils.ijs


./models:

total 2936

drwxr-xr-x  4 501  20      128 Mar 16 01:25 124M

drwxr-xr-x  4 501  20      128 Mar 16 01:36 1558M

-rw-r--r--  1 501  20   456318 Mar 16 01:25 merges.txt

-rw-r--r--  1 501  20  1042301 Mar 16 01:25 vocab.json


./models/124M:

total 1070528

-rw-r--r--  1 501  20        665 Mar 16 01:25 config.json

-rw-r--r--  1 501  20  548105171 Mar 16 01:30 model.safetensors


./models/1558M:

total 12562176

-rw-r--r--  1 501  20         689 Mar 16 01:36 config.json

-rw-r--r--  1 501  20  6431829964 Mar 16 01:58 model.safetensors


Each model is stored in a subdirectory named for its size.





Now I can just follow the first example: 


j9.6src/jsource/jlibrary/bin/jconsole gpt2.ijs

Loading tokenizer...

  Reading merges.txt

  Reading vocab.json

  Processing vocab

  Building lookup verbs

Done.

Load (or switch) model with `model 'SIZE'` (124M, 355M, 774M, 1558M)

Then generate with [tokens to gen (default: 40)] gen 'PROMPT'

   model '1558M'

Loading model: 1558M

  Processing header

  Reading data

Done.

   gen 'Alan Turing theorized that computers would one day become'

 so powerful that they would be able to think like humans.


In the 1950s, he proposed a way to build a computer that could think like a human. He called it the "T





This was just an informational posting. I figured there might be others who would like to see a simple LLM in J.




Jan-Pieter Jacobs

Mar 24, 2025, 5:48:18 AM
to fo...@jsoftware.com

Thanks for pointing me to this interesting experiment, and for the write-up.

I'm very much interested in this, and will give it a spin on an AVX2 computer soon, and let you know how it goes.

Jan-Pieter.


picoGPT-in-j.png

Thomas McGuire

Mar 24, 2025, 1:11:53 PM
to fo...@jsoftware.com
I forgot to give you Henry Rich’s fix:

Replace the second-to-last line in encoder.ijs with:

encode =: {{ ; {{vocab_i bpe cs {~ (bs{a.)&i. y}} each pat rxall utf8 y }}
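For those following along in encoder.ijs, a rough reading of that line (my annotation, not from Henry's post; the names come from the script itself):

NB.   pat rxall utf8 y   - regex-split the UTF-8 prompt into pre-tokens
NB.   (bs{a.)&i.         - index each raw byte into the byte-encoder alphabet bs
NB.   cs {~ ...          - swap those indices for the encoder characters cs
NB.   vocab_i bpe ...    - run the BPE merges, then map the pieces to token ids
NB.   ; ... each ...     - encode every pre-token and raze the ids together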


I have attached the whole file if you prefer.


Tom McGuire


encoder.ijs

More Rice

Mar 24, 2025, 2:24:33 PM
to fo...@jsoftware.com
> But with Elijah Stone’s patch to run
> the matrix multiplication with the M2
> accelerate framework so the multiplication
> takes place with Apple’s specialized AMX
> instructions. 

How may I apply this patch? (I’m on M4. I wanted to take a look at J’s Apple silicon optimization in general.)

Thanks


Sent from mobile


Thomas McGuire

Mar 24, 2025, 3:05:58 PM
to fo...@jsoftware.com
I pulled the patch file from the old forum and have attached it to this email.
Place it in the jsrc directory of the jsource code cloned from the jsoftware GitHub, then run:

patch gemm.c gemm.c.patch

To get this to build, I added the following to the build script “build_libj.sh” (the orange highlighting from the original message does not survive here; the Accelerate hookup is the -DSYSTEM_BLAS=1 define and the -framework Accelerate linker option):

darwin/j64arm*) # darwin arm

  TARGET=libj.dylib

  CFLAGS="$common $macmin -march=armv8-a+crc -mno-outline-atomics -DC_CRC32C=1 -DSYSTEM_BLAS=1"

  LDFLAGS=" -dynamiclib -install_name libj.dylib -lm -ldl $LDOPENMP $LDTHREAD $macmin -framework Accelerate"

  OBJS_AESARM=" aes-arm.o "

  SRC_ASM="${SRC_ASM_IOS}" 

  GASM_FLAGS="$macmin"

  FLAGS_SLEEF=" -DENABLE_ADVSIMD "

  FLAGS_BASE64=" -DHAVE_NEON64=1 "

 ;;



Finally, after you build, don't forget to run "./cpbin.sh" to move the executables into their appropriate directory. I forgot, had an old binary there, and kept wondering why my jconsole wasn't picking up any speed improvement.
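Putting the steps together, a hypothetical end-to-end sequence (directory names assume a typical jsource checkout; adjust to your tree):

git clone https://github.com/jsoftware/jsource.git
cd jsource/jsrc
patch gemm.c gemm.c.patch      # Elijah Stone's patch, attached below
cd ../make2
./build_libj.sh                # with the darwin/j64arm* case edited as above
./cpbin.sh                     # copies the fresh binaries into place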

Tom McGuire

gemm.c.patch

More Rice

Mar 26, 2025, 1:13:17 AM
to fo...@jsoftware.com
Oh, an email thread from 3 years ago - the interface sounds fairly stable, then.

Thank you for the detailed steps, Tom.  I wanted to try Accelerate for a while, but didn't know where to poke in the jsource tree to see the effect with the least amount of effort.  This is perfect.

Next month, after I have time to experiment more, I want to revive that 3-year-old email thread to see what we can do to help upstream the acceleration. (I'm hopeful we can do more than matrix multiplication acceleration.)

Thank you.

Maurice


bill lam

Mar 26, 2025, 3:19:32 AM
to fo...@jsoftware.com
I believe both Mac Intel and Mac silicon have the accelerate framework.


More Rice

Mar 26, 2025, 3:16:14 PM
to fo...@jsoftware.com
Thanks Bill.

Are you hoping to pick up possible AVX-512 acceleration when adopting Accelerate? (From jwiki, I only see up to AVX2.)

Intel is on its way out though …

Maurice

Sent from mobile

bill lam

Mar 26, 2025, 6:31:05 PM
to fo...@jsoftware.com
I don't think so.

Regarding the performance of (+/ . *):
It should already be optimized for multithreading on both AVX2 and arm64, and its speed should be comparable to the Accelerate framework. But you need to create a thread pool yourself (using the T. conjunction) to enable multithreaded (+/ . *).

The Apple Accelerate framework is used in the lapack2 addon too.
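For anyone who has not used T. before, a minimal sketch of creating that pool (the same incantation Tom uses later in this thread):

   8 T. ''                      NB. cores, max threads available
   {{0 T.0}}^:] <: {. 8 T. ''   NB. spin up cores-1 worker threads in pool 0
   1 T. ''                      NB. number of worker threads now running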

Thomas McGuire

Mar 27, 2025, 2:45:13 PM
to fo...@jsoftware.com
So on my M2 MacBook Pro I can add threads to a J session with T.
I will see the number of threads in Activity Monitor increase to the number of threads created.

However, I see no speedup of the matrix multiply (+/ . *).

This is on a straight-up j9.7 without any specialized build, just the version that comes from the website.

Previously, on an AVX2 machine, I was able to see a speedup from using threads just by setting them with T.

So I'm not sure this really works on arm64, at least on the Apple M2 version.

Tom McGuire

More Rice

Mar 27, 2025, 5:52:38 PM
to fo...@jsoftware.com
Hello,

Journeyman here. I’m not a user of T. (Too many builtin verbs/adverbs/conjunctions to learn first.)

May I also have your multi-threaded J sentence, Tom?  For comparison it is good practice to minimize unnecessary variables to focus on what we’re trying to hunt down. 

Let me try it out tonight on my Intel (baseline) and then arm64 (additional silicon data sample) HW.  I think there is an instrumentation tool on macOS. I hope it can show SW serialization, if any is there.


Thank you.

Maurice

Sent from mobile


bill lam

Mar 27, 2025, 11:05:28 PM
to fo...@jsoftware.com
J has a BLAS routine for (+/ . *) which is disabled by default.
There is an OPENMP switch that is disabled by default when building binaries.
The attached is a script to benchmark the performance using different methods.

lapack uses the Accelerate framework on Mac;
Linux or Windows requires a lapack package from the distro or from the lapack2 addon.
The result of "always blas 84.219 GFlop" was obtained using JE binaries built with OPENMP enabled;
it is 19 GFlop if the J binary was built without OPENMP.

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T10:35:34/clang-16-0-0/SLEEF=1
threads 4
OMP_NUM_THREADS=
 never blas 5.461 GFlop
always blas 84.219 GFlop
     lapack 142.831 GFlop
gmat.ijs

bill lam

Mar 28, 2025, 2:19:10 AM
to fo...@jsoftware.com
I changed JE to use the accelerate framework cblas instead of its own blas, and it runs noticeably faster.

MacBook M1:
 never blas 5.718 GFlop
always blas 166.114 GFlop
     lapack 130.384 GFlop

MacBook M1 under Rosetta 2:
 never blas 20.477 GFlop
always blas 37.775 GFlop
     lapack 35.604 GFlop

Apparently J's (+/ . *) is not optimized for arm64; at the least, it should be as fast as it is under Rosetta 2.

Marcin Żołek

Mar 28, 2025, 2:45:34 AM
to fo...@jsoftware.com
Matrix multiplication with LAPACK in J is incredibly fast on macOS M1! I simplified your benchmark to print seconds instead of GFlops. I noticed that the verb 9!:58 is used to turn BLAS on/off, and I can't find its documentation on the J Wiki. Could you please put the description in NuVoc?

I also noticed the verb mm does not work properly for non-square matrices (changing lines 27 and 28 in the attached benchmark to their commented versions causes an assertion failure for the LAPACK result). Can this be fixed by changing the definition of mm to support rectangular matrices?

Simplified benchmark:

{{
if. UNAME-:'Linux' do.
 liblapack=: 'liblapack.so.3'
elseif. UNAME-:'Darwin' do.
 liblapack=: '/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/vecLib'
elseif. UNAME-:'Win' do.
 liblapack=: 'libopenblas.dll'
elseif. do.
 'not supported' assert 0
end.
dgemm=: (liblapack,' dgemm_ n *c *c *i *i *i *d *d *i *d *i *d *d *i')&cd
try. 16!:1[1 64 catch. end.
EMPTY
}} ''

mm=: {{)d
k=. ,{.$x
c=. (k,k)$17.2
_2 {:: dgemm (,'N');(,'N');k;k;k;(,2.5-1.5);y;k;x;k;(,1.5-1.5);c;k
}}

t =: {{
echo 9!:14 ''
echo 'threads ', ": 1 T. ''
echo 'OMP_NUM_THREADS=', (''"_)^:(0&-:) (2!:5)'OMP_NUM_THREADS'

A =: 4000 4000?.@$0  NB. A =: 3000 4000?.@$0
B =: 4000 4000?.@$0  NB. B =: 4000 2000?.@$0

echo 'never blas'
_1 (9!:58)"0 i.3
echo 6!:2 'c1 =. A +/ .* B'

echo 'always blas'
0 (9!:58)"0 i.3
echo 6!:2 'c2 =. A +/ .* B'
assert. c1 -: c2

echo 'lapack'
echo 6!:2 'c3 =. A mm B'
assert. c1 -: c3

EMPTY
}}

t ''


Thanks,
Marcin

bill lam

Mar 28, 2025, 3:39:01 AM
to fo...@jsoftware.com
x 9!:58 y is intended for debugging JE and is not documented. Basically it sets the threshold of matrix size above which BLAS is used instead of the default algorithm.
x: positive=threshold, 0=always blas, _1=never blas
y: 0=integer, 1=float, 2=complex
The monadic form queries the current threshold.
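A hedged usage sketch based on that description (the "0 over i.3 form is what gmat.ijs uses to set all three types at once; the monadic spelling is my guess from the description above):

   0 (9!:58)"0 i.3     NB. always use BLAS for integer, float, and complex
   _1 (9!:58) 1        NB. never use BLAS for float +/ . *
   1000 (9!:58) 1      NB. use BLAS for float only above a size threshold of 1000
   9!:58 ]1            NB. monadic: query the current float threshold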

The official reference for dgemm is available at netlib; you can change the single k in mm to m, n, and k accordingly.

Note that LAPACK is Fortran and assumes matrices use column-major storage, so in general one needs to transpose a matrix before and after calling a LAPACK routine.
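Putting the m n k change and the column-major point together, here is a hedged, untested sketch of a rectangular mm (my adaptation of the gmat.ijs verb; it relies on the fact that a row-major J array read column-major is its transpose, so passing y first yields x mp y with no explicit transposes):

mm=: {{)d
 'm k'=. $x                NB. x is m-by-k, y is k-by-n (row-major in J)
 n=. {: $y
 c=. (m,n)$17.2            NB. scratch output; beta=0 overwrites it
 NB. dgemm reads column-major, so it sees y as y^T (n-by-k) and x as x^T (k-by-m),
 NB. and computes y^T mp x^T = (x mp y)^T, whose memory read row-major is x mp y
 _2 {:: dgemm (,'N');(,'N');(,n);(,m);(,k);(,2.5-1.5);y;(,n);x;(,k);(,1.5-1.5);c;(,n)
}}

(The 2.5-1.5 and 1.5-1.5 spellings keep alpha and beta floating-point, as in the original.)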

GFlops are used instead of time in seconds in gemm benchmarks, so the size of the matrix is unimportant. GPU speed is measured in TFlops.

I have pushed the patch for mac blas to the JE repos, you can build from source on your own or download the binary snapshot at

Thomas McGuire

Mar 28, 2025, 7:41:41 AM
to fo...@jsoftware.com
So there is no explicit threaded code; I am talking about the threading that has been built into matrix multiply in J (+/ . *). The base code to test this is as follows (this is running j9.7.0-beta1):

   NB. The following is done after starting a fresh jqt window

a=. ?1e3 2e3$0

b=. ?2e3 3e3$0

100 timex 'a +/ . * b'

0.355397


NB. number of cores,maxthreads on my M2 macbook

8 T. ''

12 63

NB. spin up N-1 threads in threadpool 0

{{0 T.0}}^:] <: {. 8 T. ''

11



a=. ?1e3 2e3$0

b=. ?2e3 3e3$0

100 timex 'a +/ . * b'

0.361707


Just so you don’t think jqt is the problem here is a jconsole version:


$ ijconsole

   a=. ?1e3 2e3$0

   b=. ?2e3 3e3$0

   100 timex 'a +/ . * b'

0.360285   

   {{0 T.0}}^:] <: {. 8 T. ''

11

   100 timex 'a +/ . * b'

0.365402

I believe a while ago I tested this on an Intel Mac and saw a speedup from using threads in this example.



Just for completeness, here is Bill Lam's benchmark run on my machine:

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-12T13:42:25/clang-15-0-0/SLEEF=1

threads 0

OMP_NUM_THREADS=

never blas 7.598 GFlop

always blas 33.678 GFlop

lapack 504.122 GFlop

{{0 T.0}}^:] <: {. 8 T. ''

11

t1''

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-12T13:42:25/clang-15-0-0/SLEEF=1

threads 11

OMP_NUM_THREADS=

never blas 7.628 GFlop

always blas 32.778 GFlop

lapack 525.529 GFlop


Tom McGuire

bill lam

Mar 28, 2025, 9:07:12 AM
to fo...@jsoftware.com
Can you download the m64.zip from
extract the libj.dylib and replace your libj.dylib (back it up first)
and run the benchmark to check the speed of blas?

Henry Rich

Mar 28, 2025, 9:13:57 AM
to fo...@jsoftware.com
Multithreading of +/ . * happens in 2 ways:

1. On AVX2 platforms (i.e. x64), the internal JE code runs in multiple threads, which you must create with T. .

2. On platforms using BLAS, T. doesn't enter into it, and BLAS runs in multiple threads if it is configured to do so.

9!:58 is used to control this.  Bill Lam, who has access to non-x64 platforms, is the authority on it.

Henry Rich

Thomas McGuire

Mar 28, 2025, 9:21:42 AM
to fo...@jsoftware.com

m64.zip downloaded and libj.dylib replaced in my regular j9.7.0-beta1 installation, as requested:


j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T07:00:49/clang-15-0-0/SLEEF=1

threads 0

OMP_NUM_THREADS=

never blas 9.531 GFlop

always blas 581.905 GFlop

lapack 487.535 GFlop

{{0 T.0}}^:] <: {. 8 T. ''

11

t1''

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T07:00:49/clang-15-0-0/SLEEF=1

threads 11

OMP_NUM_THREADS=

never blas 9.640 GFlop

always blas 601.690 GFlop

lapack 540.344 GFlop

bill lam

Mar 28, 2025, 9:33:06 AM
to fo...@jsoftware.com
Henry, I'm confused. When BLAS is not used, AVX emulation on the arm64 architecture is enabled by EMU_AVX2, so the same JE AVX2 code should be run by both AVX2 and arm64 CPUs. Why is there no speedup on arm64 when running multithreaded?

There is a file cipfloatmm_t.h; is it redundant?

Henry Rich

Mar 28, 2025, 9:36:12 AM
to fo...@jsoftware.com
You are right, arm64 counts as x64 for this purpose.

cipfloatmm_t.h seems to be redundant.

Henry Rich


bill lam

Mar 28, 2025, 10:08:13 AM
to fo...@jsoftware.com
If your Mac Intel has an AVX2 CPU, then you can copy the libjavx2.dylib from the m64.zip to replace the libj.dylib there.

Thomas McGuire

Mar 28, 2025, 10:48:32 AM
to fo...@jsoftware.com
No, mine is an arm64 M2.

Tom McGuire

More Rice

Mar 29, 2025, 12:50:19 AM
to fo...@jsoftware.com
// Unmodified jsource build
j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T23:33:29/clang-16-0-0/SLEEF=1
threads 0
OMP_NUM_THREADS=
 never blas 11.931 GFlop
always blas 725.595 GFlop
     lapack 614.053 GFlop

// Same as above but only replaced the libj.dylib from Bill's m64.zip as instructed
j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T20:42:47/clang-15-0-0/SLEEF=1
threads 0
OMP_NUM_THREADS=
 never blas 12.037 GFlop
always blas 692.794 GFlop
     lapack 597.567 GFlop 

(We're all on macOS/arm64 in this thread.)

On my machine, the GFlop for "always blas" and "lapack" can fluctuate more than 40 GFlop when retried.  Can't say which one is faster/slower - looks the same to me.

Accelerate's BLAS, I read, like some BLAS implementations out there, has threading control. Why bother with OpenMP when the implementation can do asymmetric-core-aware optimization on its own? It should scale better across Apple silicon variants, no?
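(If I remember right, Accelerate's BLAS honors the VECLIB_MAXIMUM_THREADS environment variable, which would be the knob for that kind of A/B test, e.g. VECLIB_MAXIMUM_THREADS=4 ./jconsole gmat.ijs; I haven't verified this on current macOS, so treat it as a pointer rather than a recipe.)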


Maurice

bill lam

Mar 29, 2025, 1:33:03 AM
to fo...@jsoftware.com
Your "Unmodified jsource build" already incorporated the cblas patch.
Multithreading can not guarantee repeatability, you may run more times and take the average.

More Rice

Mar 29, 2025, 3:11:37 AM
to fo...@jsoftware.com
😮 ... so quick. (Yes, found your commit.)

No wonder the results are indistinguishable.

Thanks a lot Bill. (I'll profile it later.)

Maurice

Raul Miller

Mar 29, 2025, 4:54:28 PM
to fo...@jsoftware.com
Instead of transposing the matrices, can't you swap the argument order?

--
Raul

Henry Rich

Mar 29, 2025, 5:10:51 PM
to fo...@jsoftware.com
Good point.

Henry Rich

bill lam

Mar 29, 2025, 5:40:13 PM
to fo...@jsoftware.com
For the lapack mm inside gmat.ijs, it didn't transpose or swap, because both arguments are the same.

A mp A <=> |: ( |:A ) mp ( |:A )
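In the rectangular case the same identity reads x mp y = |: (|:y) mp (|:x), which is also why the swapped-argument dgemm call works; a quick sanity check in plain J (sizes small enough that no BLAS path should be involved):

   mp=. +/ . *
   A=. ? 3 4 $ 0
   B=. ? 4 2 $ 0
   (A mp B) -: |: (|:B) mp (|:A)
1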

bill lam

Mar 30, 2025, 11:46:54 AM
to fo...@jsoftware.com
If y'all have a Mac M4, please consider adding your computer to this page, to compare the performance differences between Apple M1 and Apple M4.