pico-gpt2-j: Ryan Huang's small GPT-2 written in J


Thomas McGuire

Mar 24, 2025, 5:08:18 AM
to fo...@jsoftware.com
I found this implementation via the ArrayPortal.com array search engine. Bob Therriault had announced that they moved this search tool to the web (it used to be a local app you ran in J). I have been studying LLMs off and on for the past 6 months, and on a whim I typed GPT2 into ArrayPortal and up popped Ryan Huang's picogpt-in-j, a J implementation of GPT-2. It is based on a "ridiculously small" GPT-2 written in C. There is a little more detail in the README at the GitHub site.



I haven't been able to test this implementation on an AVX2 version of J. After I pointed out a problem with J to Henry Rich, he was kind enough to give me a workaround, so I was set to try this out on my Apple MacBook Pro M2 laptop, with Elijah Stone's patch that routes matrix multiplication through Apple's Accelerate framework so the multiplication takes place using Apple's specialized AMX instructions.

On a non-optimized J installation, the 1558M model takes about 167 seconds to output 40 tokens. Using the Apple Accelerate framework, the same run takes only 28 seconds: a highly tolerable speed for working in J.
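To reproduce that comparison, here is a minimal timing sketch (my sketch, not from the repo: it assumes gpt2.ijs loads cleanly into a running session, and uses the time foreign 6!:2, which runs a sentence and returns elapsed seconds):

   load 'gpt2.ijs'
   model '1558M'
   6!:2 'gen ''Alan Turing theorized that computers would one day become'''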

The nice thing about this GPT-2 implementation is that it uses a model that already exists on Hugging Face. You just need to pull the files from Hugging Face and the J implementation will complete an open-ended prompt.

After cloning the 3 J files that comprise picogpt-in-j to a local directory, I followed the directions on obtaining the model data from Hugging Face using the download.sh script included with the J files.


-rw-r--r--  1 501  20  1067 Mar 16 01:24 LICENSE

-rw-r--r--  1 501  20  3041 Mar 16 01:24 README.md

-rwxr-xr-x  1 501  20  1023 Mar 16 01:24 download.sh

-rw-r--r--  1 501  20  1898 Mar 23 14:06 encoder.ijs

-rw-r--r--  1 501  20  2270 Mar 16 01:24 gpt2.ijs

drwxr-xr-x  6 501  20   192 Mar 16 01:36 models

-rw-r--r--  1 501  20  1757 Mar 16 01:24 utils.ijs


./models:

total 2936

drwxr-xr-x  4 501  20      128 Mar 16 01:25 124M

drwxr-xr-x  4 501  20      128 Mar 16 01:36 1558M

-rw-r--r--  1 501  20   456318 Mar 16 01:25 merges.txt

-rw-r--r--  1 501  20  1042301 Mar 16 01:25 vocab.json


./models/124M:

total 1070528

-rw-r--r--  1 501  20        665 Mar 16 01:25 config.json

-rw-r--r--  1 501  20  548105171 Mar 16 01:30 model.safetensors


./models/1558M:

total 12562176

-rw-r--r--  1 501  20         689 Mar 16 01:36 config.json

-rw-r--r--  1 501  20  6431829964 Mar 16 01:58 model.safetensors


Each model is stored in a subdirectory named for its size.





Now I can just follow the first example: 


j9.6src/jsource/jlibrary/bin/jconsole gpt2.ijs

Loading tokenizer...

  Reading merges.txt

  Reading vocab.json

  Processing vocab

  Building lookup verbs

Done.

Load (or switch) model with `model 'SIZE'` (124M, 355M, 774M, 1558M)

Then generate with [tokens to gen (default: 40)] gen 'PROMPT'

   model '1558M'

Loading model: 1558M

  Processing header

  Reading data

Done.

   gen 'Alan Turing theorized that computers would one day become'

 so powerful that they would be able to think like humans.


In the 1950s, he proposed a way to build a computer that could think like a human. He called it the "T





This was just an informational posting. I figured there might be others who would like to see a simple LLM in J.




Jan-Pieter Jacobs

Mar 24, 2025, 5:48:18 AM
to fo...@jsoftware.com

Thanks for pointing me to this interesting experiment, and for the write-up.

I'm very much interested in this, and will give it a spin on an AVX2 computer soon, and let you know how it goes.

Jan-Pieter.


picoGPT-in-j.png

Thomas McGuire

Mar 24, 2025, 1:11:53 PM
to fo...@jsoftware.com
I forgot to give you Henry Rich’s fix:

Replace the second-to-last line in encoder.ijs with:

encode =: {{ ; {{vocab_i bpe cs {~ (bs{a.)&i. y}} each pat rxall utf8 y }}
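For those following along in encoder.ijs, a rough reading of that line (my annotation, not from Henry's post; the names come from the script itself):

NB.   pat rxall utf8 y   - regex-split the UTF-8 prompt into pre-tokens
NB.   (bs{a.)&i.         - index each raw byte into the byte-encoder alphabet bs
NB.   cs {~ ...          - swap those indices for the encoder characters cs
NB.   vocab_i bpe ...    - run the BPE merges, then map the pieces to token ids
NB.   ; ... each ...     - encode every pre-token and raze the ids together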


I have attached the whole file if you prefer.


Tom McGuire


encoder.ijs

More Rice

Mar 24, 2025, 2:24:33 PM
to fo...@jsoftware.com
> But with Elijah Stone’s patch to run
> the matrix multiplication with the M2
> accelerate framework so the multiplication
> takes place with Apple’s specialized AMX
> instructions. 

How may I apply this patch? (I’m on M4. I wanted to take a look at J’s Apple silicon optimization in general.)

Thanks


Sent from mobile


Thomas McGuire

Mar 24, 2025, 3:05:58 PM
to fo...@jsoftware.com
I pulled the patch file from the old forum and have attached it to this email.
Place it in the jsrc directory of the jsource code cloned from the jsoftware GitHub, then run:

patch gemm.c gemm.c.patch

To get this to build, I added the following to the build script “build_libj.sh” (the orange highlighting from the original message does not survive here; the Accelerate hookup is the -DSYSTEM_BLAS=1 define and the -framework Accelerate linker option):

darwin/j64arm*) # darwin arm

  TARGET=libj.dylib

  CFLAGS="$common $macmin -march=armv8-a+crc -mno-outline-atomics -DC_CRC32C=1 -DSYSTEM_BLAS=1"

  LDFLAGS=" -dynamiclib -install_name libj.dylib -lm -ldl $LDOPENMP $LDTHREAD $macmin -framework Accelerate"

  OBJS_AESARM=" aes-arm.o "

  SRC_ASM="${SRC_ASM_IOS}" 

  GASM_FLAGS="$macmin"

  FLAGS_SLEEF=" -DENABLE_ADVSIMD "

  FLAGS_BASE64=" -DHAVE_NEON64=1 "

 ;;



Finally, after you build, don't forget to run "./cpbin.sh" to move the executables into their appropriate directory. I forgot, had an old binary there, and kept wondering why my jconsole wasn't picking up any speed improvement.
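Putting the steps together, a hypothetical end-to-end sequence (directory names assume a typical jsource checkout; adjust to your tree):

git clone https://github.com/jsoftware/jsource.git
cd jsource/jsrc
patch gemm.c gemm.c.patch      # Elijah Stone's patch, attached below
cd ../make2
./build_libj.sh                # with the darwin/j64arm* case edited as above
./cpbin.sh                     # copies the fresh binaries into place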

Tom McGuire

gemm.c.patch

More Rice

Mar 26, 2025, 1:13:17 AM
to fo...@jsoftware.com
Oh, an email thread from 3 years ago - the interface sounds fairly stable, then.

Thank you for the detailed steps, Tom.  I wanted to try Accelerate for a while, but didn't know where to poke in the jsource tree to see the effect with the least amount of effort.  This is perfect.

Next month, after I have time to experiment more, I want to revive that 3-year-old email thread to see what we can do to help upstream the acceleration. (I'm hopeful we can do more than matrix multiplication acceleration.)

Thank you.

Maurice


bill lam

Mar 26, 2025, 3:19:32 AM
to fo...@jsoftware.com
I believe both Mac Intel and Mac silicon have the accelerate framework.


More Rice

Mar 26, 2025, 3:16:14 PM
to fo...@jsoftware.com
Thanks Bill.

Are you hoping to pick up possible AVX-512 acceleration when adopting Accelerate? (From jwiki, I only see up to AVX2.)

Intel is on its way out though …

Maurice

Sent from mobile

bill lam

Mar 26, 2025, 6:31:05 PM
to fo...@jsoftware.com
I don't think so.

Regarding the performance of (+/ . *):
It should already be optimized for multithreading on both AVX2 and arm64, and its speed should be comparable to the Accelerate framework. But you need to create a thread pool yourself (using the T. conjunction) to enable multithreaded (+/ . *).

The Apple Accelerate framework is used in the lapack2 addon too.
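For anyone who has not used T. before, a minimal sketch of creating that pool (the same incantation Tom uses later in this thread):

   8 T. ''                      NB. cores, max threads available
   {{0 T.0}}^:] <: {. 8 T. ''   NB. spin up cores-1 worker threads in pool 0
   1 T. ''                      NB. number of worker threads now running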

Thomas McGuire

Mar 27, 2025, 2:45:13 PM
to fo...@jsoftware.com
So on my M2 MacBook Pro I can add threads to a J session with T.
I will see the number of threads in Activity Monitor increase to the number of threads created.

However, I see no speedup of the matrix multiply (+/ . *).

This is on a straight-up j9.7 without any specialized build, just the version that comes from the website.

Previously, on an AVX2 machine, I was able to see a speedup from using threads just by setting them with T.

So I'm not sure this really works on arm64, at least on the Apple M2 version.

Tom McGuire

More Rice

Mar 27, 2025, 5:52:38 PM
to fo...@jsoftware.com
Hello,

Journeyman here. I’m not a user of T. (Too many builtin verbs/adverbs/conjunctions to learn first.)

May I also have your multi-threaded J sentence, Tom?  For comparison it is good practice to minimize unnecessary variables to focus on what we’re trying to hunt down. 

Let me try it out tonight on my Intel (baseline) and then arm64 (additional silicon data sample) HW.  I think there is an instrumentation tool on macOS. I hope it can show SW serialization, if any is there.


Thank you.

Maurice

Sent from mobile


bill lam

Mar 27, 2025, 11:05:28 PM
to fo...@jsoftware.com
J has a BLAS routine for (+/ . *) which is disabled by default.
There is an OPENMP switch that is disabled by default when building binaries.
The attached is a script to benchmark the performance using different methods.

lapack uses the Accelerate framework on Mac;
Linux or Windows requires a lapack package from the distro or from the lapack2 addon.
The result of "always blas 84.219 GFlop" was obtained using JE binaries built with OPENMP enabled;
it is 19 GFlop if the J binary was built without OPENMP.

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T10:35:34/clang-16-0-0/SLEEF=1
threads 4
OMP_NUM_THREADS=
 never blas 5.461 GFlop
always blas 84.219 GFlop
     lapack 142.831 GFlop
gmat.ijs

bill lam

Mar 28, 2025, 2:19:10 AM
to fo...@jsoftware.com
I changed JE to use the accelerate framework cblas instead of its own blas, and it runs noticeably faster.

MacBook M1:
 never blas 5.718 GFlop
always blas 166.114 GFlop
     lapack 130.384 GFlop

MacBook M1 under Rosetta 2:
 never blas 20.477 GFlop
always blas 37.775 GFlop
     lapack 35.604 GFlop

Apparently J's (+/ . *) is not optimized for arm64; at the least, it should be as fast as it is under Rosetta 2.

Marcin Żołek

Mar 28, 2025, 2:45:34 AM
to fo...@jsoftware.com
Matrix multiplication with LAPACK in J is incredibly fast on macOS M1! I simplified your benchmark to print seconds instead of GFlops. I noticed that the verb 9!:58 is used to turn BLAS on/off, and I can't find its documentation on the J Wiki. Could you please put the description in NuVoc?

I also noticed the verb mm does not work properly for non-square matrices (changing lines 27 and 28 in the attached benchmark to their commented versions causes an assertion failure for the LAPACK result). Can this be fixed by changing the definition of mm to support rectangular matrices?

Simplified benchmark:

{{
if. UNAME-:'Linux' do.
 liblapack=: 'liblapack.so.3'
elseif. UNAME-:'Darwin' do.
 liblapack=: '/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/vecLib'
elseif. UNAME-:'Win' do.
 liblapack=: 'libopenblas.dll'
elseif. do.
 'not supported' assert 0
end.
dgemm=: (liblapack,' dgemm_ n *c *c *i *i *i *d *d *i *d *i *d *d *i')&cd
try. 16!:1[1 64 catch. end.
EMPTY
}} ''

mm=: {{)d
k=. ,{.$x
c=. (k,k)$17.2
_2 {:: dgemm (,'N');(,'N');k;k;k;(,2.5-1.5);y;k;x;k;(,1.5-1.5);c;k
}}

t =: {{
echo 9!:14 ''
echo 'threads ', ": 1 T. ''
echo 'OMP_NUM_THREADS=', (''"_)^:(0&-:) (2!:5)'OMP_NUM_THREADS'

A =: 4000 4000?.@$0  NB. A =: 3000 4000?.@$0
B =: 4000 4000?.@$0  NB. B =: 4000 2000?.@$0

echo 'never blas'
_1 (9!:58)"0 i.3
echo 6!:2 'c1 =. A +/ .* B'

echo 'always blas'
0 (9!:58)"0 i.3
echo 6!:2 'c2 =. A +/ .* B'
assert. c1 -: c2

echo 'lapack'
echo 6!:2 'c3 =. A mm B'
assert. c1 -: c3

EMPTY
}}

t ''


Thanks,
Marcin

bill lam

Mar 28, 2025, 3:39:01 AM
to fo...@jsoftware.com
x 9!:58 y is intended for debugging JE and is not documented. Basically it sets the threshold of matrix size above which BLAS is used instead of the default algorithm.
x: positive=threshold, 0=always blas, _1=never blas
y: 0=integer, 1=float, 2=complex
The monadic form queries the current threshold.
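A hedged usage sketch based on that description (the "0 over i.3 form is what gmat.ijs uses to set all three types at once; the monadic spelling is my guess from the description above):

   0 (9!:58)"0 i.3     NB. always use BLAS for integer, float, and complex
   _1 (9!:58) 1        NB. never use BLAS for float +/ . *
   1000 (9!:58) 1      NB. use BLAS for float only above a size threshold of 1000
   9!:58 ]1            NB. monadic: query the current float threshold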

The official reference for dgemm is available at netlib; you can change the single k in mm to m, n, and k accordingly.

Note that LAPACK is Fortran and assumes matrices use column-major storage, so in general one needs to transpose a matrix before and after calling a LAPACK routine.
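Putting the m n k change and the column-major point together, here is a hedged, untested sketch of a rectangular mm (my adaptation of the gmat.ijs verb; it relies on the fact that a row-major J array read column-major is its transpose, so passing y first yields x mp y with no explicit transposes):

mm=: {{)d
 'm k'=. $x                NB. x is m-by-k, y is k-by-n (row-major in J)
 n=. {: $y
 c=. (m,n)$17.2            NB. scratch output; beta=0 overwrites it
 NB. dgemm reads column-major, so it sees y as y^T (n-by-k) and x as x^T (k-by-m),
 NB. and computes y^T mp x^T = (x mp y)^T, whose memory read row-major is x mp y
 _2 {:: dgemm (,'N');(,'N');(,n);(,m);(,k);(,2.5-1.5);y;(,n);x;(,k);(,1.5-1.5);c;(,n)
}}

(The 2.5-1.5 and 1.5-1.5 spellings keep alpha and beta floating-point, as in the original.)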

GFlops are used instead of time in seconds in gemm benchmarks, so the size of the matrix is unimportant. GPU speed is measured in TFlops.

I have pushed the patch for mac blas to the JE repos, you can build from source on your own or download the binary snapshot at

Thomas McGuire

Mar 28, 2025, 7:41:41 AM
to fo...@jsoftware.com
So there is no explicit threaded code; I am talking about the threading that has been built into matrix multiply in J (+/ . *). The base code to test this is as follows (this is running j9.7.0-beta1):

   NB. The following is done after starting a fresh jqt window

a=. ?1e3 2e3$0

b=. ?2e3 3e3$0

100 timex 'a +/ . * b'

0.355397


NB. number of cores,maxthreads on my M2 macbook

8 T. ''

12 63

NB. spin up N-1 threads in threadpool 0

{{0 T.0}}^:] <: {. 8 T. ''

11



a=. ?1e3 2e3$0

b=. ?2e3 3e3$0

100 timex 'a +/ . * b'

0.361707


Just so you don’t think jqt is the problem here is a jconsole version:


$ ijconsole

   a=. ?1e3 2e3$0

   b=. ?2e3 3e3$0

   100 timex 'a +/ . * b'

0.360285   

   {{0 T.0}}^:] <: {. 8 T. ''

11

   100 timex 'a +/ . * b'

0.365402

I believe a while ago I tested this on an Intel Mac and saw a speedup from using threads in this example.



Just for completeness, here is Bill Lam's benchmark run on my machine:

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-12T13:42:25/clang-15-0-0/SLEEF=1

threads 0

OMP_NUM_THREADS=

never blas 7.598 GFlop

always blas 33.678 GFlop

lapack 504.122 GFlop

{{0 T.0}}^:] <: {. 8 T. ''

11

t1''

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-12T13:42:25/clang-15-0-0/SLEEF=1

threads 11

OMP_NUM_THREADS=

never blas 7.628 GFlop

always blas 32.778 GFlop

lapack 525.529 GFlop


Tom McGuire

bill lam

Mar 28, 2025, 9:07:12 AM
to fo...@jsoftware.com
Can you download the m64.zip from
extract the libj.dylib and replace your libj.dylib (back it up first)
and run the benchmark to check the speed of blas?

Henry Rich

Mar 28, 2025, 9:13:57 AM
to fo...@jsoftware.com
Multithreading of +/ . * happens in 2 ways:

1. On AVX2 platforms (i.e. x64), the internal JE code runs in multiple threads, which you must create with T. .

2. On platforms using BLAS, T. doesn't enter into it, and BLAS runs in multiple threads if it is configured to do so.

9!:58 is used to control this.  Bill Lam, who has access to non-x64 platforms, is the authority on it.

Henry Rich

Thomas McGuire

Mar 28, 2025, 9:21:42 AM
to fo...@jsoftware.com

m64.zip downloaded and libj.dylib replaced in my regular j9.7.0-beta1 installation, as requested:


j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T07:00:49/clang-15-0-0/SLEEF=1

threads 0

OMP_NUM_THREADS=

never blas 9.531 GFlop

always blas 581.905 GFlop

lapack 487.535 GFlop

{{0 T.0}}^:] <: {. 8 T. ''

11

t1''

j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T07:00:49/clang-15-0-0/SLEEF=1

threads 11

OMP_NUM_THREADS=

never blas 9.640 GFlop

always blas 601.690 GFlop

lapack 540.344 GFlop

bill lam

Mar 28, 2025, 9:33:06 AM
to fo...@jsoftware.com
Henry, I'm confused. When BLAS is not used, AVX emulation on the arm64 architecture is enabled by EMU_AVX2, so the same JE AVX2 code should be run by both AVX2 and arm64 CPUs. Why is there no speedup on arm64 when running multithreaded?

There is a file cipfloatmm_t.h; is it redundant?

Henry Rich

Mar 28, 2025, 9:36:12 AM
to fo...@jsoftware.com
You are right, arm64 counts as x64 for this purpose.

cipfloatmm_t.h seems to be redundant.

Henry Rich


bill lam

Mar 28, 2025, 10:08:13 AM
to fo...@jsoftware.com
If your Mac Intel has an AVX2 CPU, then you can copy the libjavx2.dylib from the m64.zip to replace the libj.dylib there.

Thomas McGuire

Mar 28, 2025, 10:48:32 AM
to fo...@jsoftware.com
No, mine is an arm64 M2.

Tom McGuire

More Rice

Mar 29, 2025, 12:50:19 AM
to fo...@jsoftware.com
// Unmodified jsource build
j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T23:33:29/clang-16-0-0/SLEEF=1
threads 0
OMP_NUM_THREADS=
 never blas 11.931 GFlop
always blas 725.595 GFlop
     lapack 614.053 GFlop

// Same as above but only replaced the libj.dylib from Bill's m64.zip as instructed
j9.7.0-beta1/j64arm/darwin/commercial/www.jsoftware.com/2025-03-28T20:42:47/clang-15-0-0/SLEEF=1
threads 0
OMP_NUM_THREADS=
 never blas 12.037 GFlop
always blas 692.794 GFlop
     lapack 597.567 GFlop 

(We're all on macOS/arm64 in this thread.)

On my machine, the GFlop for "always blas" and "lapack" can fluctuate more than 40 GFlop when retried.  Can't say which one is faster/slower - looks the same to me.

Accelerate's BLAS, I read, like some BLAS implementations out there, has threading control. Why bother with OpenMP when the implementation can do asymmetric-core-aware optimization on its own? It should scale better across Apple silicon variants, no?
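(If I remember right, Accelerate's BLAS honors the VECLIB_MAXIMUM_THREADS environment variable, which would be the knob for that kind of A/B test, e.g. VECLIB_MAXIMUM_THREADS=4 ./jconsole gmat.ijs; I haven't verified this on current macOS, so treat it as a pointer rather than a recipe.)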


Maurice

bill lam

Mar 29, 2025, 1:33:03 AM
to fo...@jsoftware.com
Your "Unmodified jsource build" already incorporated the cblas patch.
Multithreading can not guarantee repeatability, you may run more times and take the average.

More Rice

Mar 29, 2025, 3:11:37 AM
to fo...@jsoftware.com
😮 ... so quick. (Yes, found your commit.)

No wonder the results are indistinguishable.

Thanks a lot Bill. (I'll profile it later.)

Maurice

Raul Miller

Mar 29, 2025, 4:54:28 PM
to fo...@jsoftware.com
Instead of transposing the matrices, can't you swap the argument order?

--
Raul

Henry Rich

Mar 29, 2025, 5:10:51 PM
to fo...@jsoftware.com
Good point.

Henry Rich

bill lam

Mar 29, 2025, 5:40:13 PM
to fo...@jsoftware.com
For the lapack mm inside gmat.ijs, it didn't transpose or swap, because both arguments are the same.

A mp A <=> |: ( |:A ) mp ( |:A )
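In the rectangular case the same identity reads x mp y = |: (|:y) mp (|:x), which is also why the swapped-argument dgemm call works; a quick sanity check in plain J (sizes small enough that no BLAS path should be involved):

   mp=. +/ . *
   A=. ? 3 4 $ 0
   B=. ? 4 2 $ 0
   (A mp B) -: |: (|:B) mp (|:A)
1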

bill lam

Mar 30, 2025, 11:46:54 AM
to fo...@jsoftware.com
If y'all have a Mac M4, please consider adding your computer to this page, to compare the performance differences between Apple M1 and Apple M4.