testing LLM code generation


Thomas McGuire

May 6, 2025, 12:46:23 AM
to fo...@jsoftware.com
I have been lurking on Reddit's r/llamalocal, gauging the status of LLMs that can run locally and perform coding.
On my 32GB MacBook Pro (M2 Max) I can run many of the favorites mentioned on the subreddit.

This raises the question: what prompts would be useful in evaluating how well an LLM codes?

Many people keep their favorite prompts close to the vest. The fear is that, once published online, a prompt will be trained on in the next generation of LLMs and will no longer test capability.

- One that I use because it’s easy to specify is: code a function that will sum the integers from 1 to the parameter given to the function. 

- Another I saw mentioned was to sum the even integers in a list of integers. 

- Another was a test of code completion: take a medium-sized code base with good unit tests and gut portions of the function bodies. Have the LLM complete the code and then run the unit tests. The total number of passed tests is a measure of the LLM's code-completion capability.

I downloaded the following model from Hugging Face: qwen2.5-coder-32b-instruct-q4_k_m.gguf, and am running it under llama.cpp.
 
For the sum of integers from 1 to n, the qwen2.5-coder LLM had trouble with the positioning of the increment verb. Even when I told it twice what it should do, it still insisted on reversing its position. Here is the pertinent exchange:

---------

> your placement of the increment verb `>:` after the `i.` makes the `>:` appear dyadic instead of the proper use before `i.` in its monadic form as increment

You are correct, and I apologize for the mistake. The `>:` verb should be used before `i.` to increment the value, not after it. Let's correct the definitions.


### Using Tacit Definition


Here's the corrected tacit definition:


```j
sumToN =: 13 : '+/ i.>: y'
```


### Using Explicit Definition


Here's the corrected explicit definition:


```j
sumToN =: 3 : 0
  numbers =. i.>: y
  +/ numbers
)
```

---------
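For readers less familiar with J: the point the model kept missing is that `>:` is ambivalent. Monadic `>:` is increment, while dyadic `>:` is greater-than-or-equal, so its position relative to `i.` completely changes the meaning. A quick illustration (my example, not from the chat):

```j
   >: i. 5        NB. monadic: increment each of 0 1 2 3 4
1 2 3 4 5
   (i. 5) >: 3    NB. dyadic: greater-than-or-equal
0 0 0 1 1
```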

In the first exchange (which I didn't show) the LLM did suggest `1 + i.` as a possible alternative. So when I asked it to code with that alternative I got the following:

---------

> use the `1 +` version and don't use increment (>:)        

Certainly! Using the `1 + i. y` approach is a straightforward and clear way to generate the integers from 1 to `y`. Here are both the tacit and explicit definitions using this method.


### Using Tacit Definition


Here is the corrected tacit definition:


```j
sumToN =: 13 : '+/ 1 + i. y'
```


### Using Explicit Definition


Here is the corrected explicit definition:


```j
sumToN =: 3 : 0
  numbers =. 1 + i. y
  +/ numbers
)
```

---------

Here the LLM gets it right, likely because the `+` verb is used similarly in almost every programming language.
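Both corrected definitions can be checked against Gauss's closed form n(n+1)/2 (my check, run against the definitions above):

```j
   sumToN 100
5050
   0.5 * 100 * 101    NB. n(n+1)/2
5050
```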

I find it interesting that I am running a quantized model on my laptop and it's getting close on the J code. I can get considerable memory in a MacBook Pro, so the next one I buy may be able to run a llama-4-maverick model. Since it is a mixture-of-experts LLM, there is a way to split the MoE layers and the transformer across CPU and GPU to fit the whole thing IF you have enough VRAM.

For the dull languages like C++ and Java, this model had no trouble writing sumEven or sumIntegers.

If anyone has suggestions for tasks that are easy to specify but difficult for an LLM to program in J, please let me know. It would be nice to collect some of these tests so we can identify an LLM that may be truly useful to the J community.

Thomas McGuire

May 6, 2025, 6:11:31 AM
to forum, Thomas McGuire
Testing out Claude 3.7 Sonnet online (because I can't run it locally), it does a much better job.
The chatter on Reddit is that if you're willing to get the paid service it is even better.

sum integers 1 to N:

NB. Verb that sums integers from 1 to n
sumToN =: +/ @: >: @: i.

NB. Alternative implementations
sumToN2 =: [: +/ 1 + i.
sumToN3 =: [: +/ [: >: i.
sumToN4 =: [: +/ [: (1&+) i.

NB. Mathematical formula version (n * (n + 1) ÷ 2)
sumToNFormula =: [ * 0.5 * >:
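As a quick check (mine, not Claude's), all five variants agree on a sample argument:

   (sumToN , sumToN2 , sumToN3 , sumToN4 , sumToNFormula) 10
55 55 55 55 55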

sum even numbers:

NB. Verb that sums even integers in an array
sumEvens =: +/ @: (] #~ 0 = 2 | ])
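For reference (my check, not Claude's output), the filter keeps just the elements whose 2-residue is 0:

   sumEvens 1 2 3 4 5 6
12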



Thomas McGuire

May 7, 2025, 4:46:24 PM
to fo...@jsoftware.com

The article B. Johnson and T. Menzies, "AI Over-Hype: A Dangerous Threat (and How to Fix It)," IEEE Software, vol. 41, no. 6, pp. 131-138, Nov.-Dec. 2024, doi: 10.1109/MS.2024.3439138, looks at the shortcomings of AI LLMs, but their example is a good test to use on future LLMs: it is simple to pose and the human solution is easy to understand. In a nutshell, you want to put numbers into bins, fitted to an estimated Gaussian distribution. Their proposed distribution is as follows (transposed into J):


pecdf =: verb define
0 1 pecdf y
:
'mu sd' =. x
z =. sd %~ y - mu
1 - 0.5 * ^ (_0.717*z) - 0.416*z*z
)


This is only good for z >: 0
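(A quick sanity check of the approximation: at z = 0 the exponential term is ^ 0, i.e. 1, so the formula gives 0.5, matching the standard normal CDF at its mean.)

   1 - 0.5 * ^ (_0.717 * 0) - 0.416 * 0 * 0
0.5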


The bin function is defined in the prompt below.


It turns out that ChatGPT-4o struggled with reworking this problem; the authors had to stop the chatbot from thinking about it. In my case in J (they were using Python), Claude 3.7 Sonnet had trouble reworking the function even though it thought of the correct answer:

  • The current pecdf function provides an approximation of the CDF for z ≥ 0 using the formula:
    1 - 0.5 * ^(_0.717*z) - 0.416*z*z
  • For negative z values, we can use the symmetry property of the normal distribution:
    CDF(-z) = 1 - CDF(z)


The authors turned to StackExchange, and the human solution is the following (which I have transposed into J):


cdf1 =: monad define
1 - 0.5 * ^ (_0.717*y) - 0.416*y*y
)

ecdf =: verb define
0 1 ecdf y
:
'mu sd' =. x
z =. sd %~ y - mu
if. z < 0 do.
  1 - cdf1 - z
  return.
end.
cdf1 z
)
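The if.-based version above only handles a scalar y. A vectorized variant (my own sketch, not from the article or the chat) can use the same symmetry CDF(-z) = 1 - CDF(z) without branching:

ecdfv =: verb define
0 1 ecdfv y
:
'mu sd' =. x
z =. sd %~ y - mu
c =. cdf1 | z              NB. approximation applied to |z|
c + (1 - 2 * c) * z < 0    NB. reflect result where z is negative
)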



PROMPT:

the bin problem uses an estimate of the cumulative distribution function in order to decide what "bin" (for some number of bins N) a number belongs in. The definition of the "bin" function in the J programming language from jsoftware, and a proposed but not fully functional estimated cumulative distribution function, follow the text of this prompt delimited by "===". The proposed estimated cumulative distribution function (named pecdf in the code) is only valid for z greater than or equal to zero. Please rework "pecdf" to create the needed "ecdf" that works for any value of z, both above and below zero.

===

bin =: verb define
15 0 1 bin y
:
'N mu sd' =. x
<. N * (mu,sd) ecdf y
)



pecdf =: verb define
0 1 pecdf y
:
'mu sd' =. x
z =. sd %~ y - mu
1 - 0.5 * ^ (_0.717*z) - 0.416*z*z
)

===

Pascal Jasmin

Aug 6, 2025, 4:50:43 PM
to fo...@jsoftware.com


I haven't tried Claude yet.

A problem that is hard enough to test LLMs with is solving chess mate-in-X puzzles.

ninja.ai was pretty bad, but it did work for a Sudoku solver. Perhaps it had access to a J web solution for that. (The chess task was not done in J.)

Copilot is worse, but it is unlimited with Windows 11, so I can teach it J, and it tells me the world benefits from my corrections. Copilot did not understand the amend function, as an example, attempting to use {~ to implement it.
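For context, a minimal illustration of the distinction Copilot missed (my example): {~ only selects, while } amends in place:

   v =: 10 20 30 40
   99 (2) } v     NB. amend: new value 99 at index 2
10 20 99 40
   v {~ 2         NB. from (arguments flipped): merely selects
30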

Qwen3-235B ... https://chat.qwen.ai/ (Qwen 2.5 is said to be a good, approachable local LLM at 24GB)

It seems to provide a better starting point, and better improvements to code, though all of the LLMs understand the nature of the requests given.

Something relatively hard, where you have to help the AI along and correct it, is the best way to test them, IMO. You can also jump from limited free sessions to other sessions, giving them input from other AIs. Copilot will rule the world because it trains on massaged corrections.

All of the models seemed to organize the code well; it's just that the other two were far from working. There were enough problems with Qwen 3 too, it was just closer to working.

kimi.com (released last week) is hyped from leaderboard scores, and I will try that next.

Thomas McGuire

Aug 10, 2025, 6:03:12 AM
to fo...@jsoftware.com
Yes, I would be interested in anything you find out with these models in relation to J. Lately I have been focusing on models that I can run on my own hardware.
I have a GTEK AMD Max 395 with 128GB of shared GPU memory. It is able to run the newest open-source release from OpenAI: openai-gpt-oss-128b-experimental.

LM Studio worked with AMD to get this model to run. It is surprisingly quick, though I don't know where to find the stats in LM Studio to tell you how many tokens per second.

I have found that for J, a detailed prompt (as is the case for getting good answers from any LLM, I suppose) can often be the key to getting working code.

Most of these LLMs have no clue how to do complex tacit programming. They spit out an answer quickly, but it is just delusional J programming.

However, on my Kadane test case I tweaked the prompt as follows and got what appears to be a valid implementation:

PROMPT:
the J language is an array programming language from jsoftware. You are a helpful J language programming assistant. Write the code for a verb (function in normal programming parlance) that will find the maximum subsequence sum of an array of numbers. You don't have to keep track of the indices of the subsequence just return the maximum sum. Use explicit definition to define the verb.

THE ANSWER:

NB. maxSubSum y - maximum sum of any contiguous sub-array of y
maxSubSum =: 3 : 0                NB. explicit definition
if. 0 = #y do. 0 return. end.     NB. empty argument → 0
NB. If every element is negative the classic Kadane algorithm
NB. would return 0 (the empty subsequence). Often we want the
NB. largest (least-negative) element instead, so handle that case.
maxNeg =. >./ y                   NB. greatest element of y
if. maxNeg < 0 do. maxNeg         NB. all numbers ≤ 0 → return it
else.
  p =. 0 , +/\ y                  NB. prefix sums with a leading 0
  m =. <./\ p                     NB. running minimum of the prefixes
  >./ p - m                       NB. maximum (current prefix - min so far)
end.
)

   maxSubSum _2 1 _3 4 _1 2 1 _5 4
6
   maxSubSum 2 3 _8 7 _1 2 3
11
   maxSubSum 5 4 1 7 8
25
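For comparison, a tacit equivalent of the same prefix-sum trick (my own sketch, not LLM output; unlike the explicit version it returns 0 rather than the largest element for all-negative input, i.e. classic Kadane with the empty subsequence allowed):

   maxss =: (] >./@:- <./\) @ (0 , +/\)
   maxss _2 1 _3 4 _1 2 1 _5 4
6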


————————

An aside on all of this: I have been thinking that the JHS provided by J would be a good agentic platform. These coding LLMs can give a fairly concise JSON-encapsulated answer to run a function or perform shell commands, provided as a sequence of commands. Running JHS from a local account on your machine, you should be able to create a web interface that lets you ask the LLM to set up an entire project and, by creating a J agent to listen in, respond to the JSON that's returned. You can run straightforward J verbs to create directories, files, etc.
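A minimal sketch of what such an agent's action dispatcher might look like (the action names and argument shapes here are my assumptions, not an existing protocol, and the JSON decoding step is not shown):

NB. x: action name extracted from the LLM's reply; y: its argument(s)
doaction =: dyad define
select. x
case. 'shell' do. 2!:0 y                       NB. run a shell command
case. 'mkdir' do. 2!:0 'mkdir -p ' , y
case. 'write' do. (1 pick y) 1!:2 < 0 pick y   NB. y is filename ; text
end.
)

   'mkdir' doaction 'myproj'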

Tom McGuire

Devon McCormick

Aug 12, 2025, 8:22:30 PM
to fo...@jsoftware.com
We talked about this topic extensively at today's NYCJUG meeting.  Take a look at the notes - https://code.jsoftware.com/wiki/NYCJUG/2025-08-12.  Let me know what you think and let's keep the discussion going.



--

Devon McCormick

Flâneur


Jose Mario Quintana

Aug 13, 2025, 4:30:56 PM
to fo...@jsoftware.com

SuperGrok...

C
code a function that will sum the integers from 1 to the parameter given to the function in the J programming language

A
sum_to_n =: +/ @ (i. @ >:)

C
code a function that sums even integers in an array

A
sum_even =: +/ @ (2&| # ]) @ ]

This function:

Uses 2&| to test for even numbers (returns 0 for even, 1 for odd).
# filters the array to keep only elements where 2&| is 0 (even numbers).
+/ sums the filtered array.
@ ] applies the operation to the input array.

Example usage:

   sum_even 1 2 3 4 5 6
12  NB. Sums 2 + 4 + 6

C
sum_even 1 2 3 4 5 produces 9 (sums 1 3 5). Try again?

A

Apologies for the confusion. The previous function summed odd numbers instead of even numbers due to an error in the filtering logic. In J, 2&| returns 0 for even numbers and 1 for odd numbers, so we need to filter where 2&| is 0. Here's the corrected function to sum even integers in an array:

sum_even =: +/ @ (#~ 0 = 2&|)

This function:

2&| checks if numbers are even (returns 0 for even, 1 for odd).
0 = 2&| creates a mask where 1 indicates even numbers.
#~ filters the array to keep only even numbers.
+/ sums the filtered even numbers.

Example usage:

   sum_even 1 2 3 4 5
6  NB. Sums 2 + 4

Pascal Jasmin

Aug 13, 2025, 6:22:39 PM
to fo...@jsoftware.com
I read your notes, and was waiting to respond to Tom until more progress was made, but here's an update.

I asked LLMs to make a "solve mate in X" chess program. My successful workflow to get legal-move generation working fairly easily started with Kimi K2 (Claude Sonnet 4 had made a partially working previous file). Qwen3-235B has been useful in expanding the program, including a capability of understanding my kv.ijs library and generating good board-scoring functions. I could probably have fed Qwen 3's output into Claude's input to get a working program; I just tried K2 and it was good enough. My experience is based on models having some understanding of chess. Qwen 3 often makes suggestions by comparison with Stockfish, and so may have been specially trained on chess models.

Only the above models limited the hair-pulling experience of not understanding } or selecting cells with { from tables, and of not understanding calling dyads and decomposing arguments inside the dyad functions. Asking models about coding-style calling conventions is a hair-pulling experience. OpenAI's open-source 120B model and Copilot were bad for my task. Sonnet 4 is good and fast, partly because it separates code-file output into a separate pane and updates just parts of it based on requests, instead of regenerating the whole file.

Even bad LLMs were good at organizing code and understanding prompt requests; ignoring them once the general skeleton was in place would have completed basic functionality quicker. My interest in chess is tangential to creating a generic search function in J; I did not have the patience to generate a "legalmoves" verb for a chess position from scratch, and no one before me has ever published one. Part of my process is learning how to talk to LLMs, though all of them seem to understand my prompts even if they are bad at J.

Some workflow advice at this late stage of working with Qwen 3: instead of testing the LLM for how stupid its response could possibly be, just give it corrected code once you are done testing its stupidity. It is still not 100% at remembering 100% of the file, which is hair-pulling later on.

I find the "vibe coding" workflow useful now. After several days of fighting/testing stupidity, and moving away from models that were too frustrating, this suits my concentration commitment level. Qwen 3 can provide genuinely useful insights. Qwen 3 gets slower and slower the longer you use it, which is great, because it considers your full past history before opening its mouth. Thoughtfulness beats speed 100% of the time.

For local LLM sizes, AFAIK everything is written in Python with either PyTorch (no native C API, and hard to find) or TensorFlow 2.0 (has a C API). Though I haven't looked, modifying the models to call J with generated files, and then fix their output based on the bugs, before ever spitting out a token to the user, would be a big improvement.

Having a TDD framework (or just an extra J script to do the testing), and teaching another model to understand the framework and generate tests from function signatures, would be a big productivity boost.
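As a starting point, a minimal sketch of such a harness (hypothetical; the verb name and the example are mine):

NB. x: a J expression as a string; y: its expected value
assert_eq =: dyad define
r =. ". x                                NB. evaluate the expression
if. r -: y do. smoutput 'PASS: ' , x
else. smoutput 'FAIL: ' , x , ' gave ' , ": r end.
)

   'maxSubSum 2 3 _8 7 _1 2 3' assert_eq 11
PASS: maxSubSum 2 3 _8 7 _1 2 3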