threading T./t: help


Pascal Jasmin

Aug 27, 2025, 4:51:11 PM
to Forum
I'm using a 7940HS (AMD) processor, Win 11.

https://code.jsoftware.com/wiki/Vocabulary/tcapdot  thank you for updating this recently.

   8 T. ''  NB. it is considered an 8-core, 16-thread processor.

16 63

   timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

1.7172 6.71092e8

{{0 T.0}}^:8 ''

8

timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

0.368438 6.71092e8

{{0 T.0}}^:8 ''

16

timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

0.311748 6.71092e8

timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

0.315998 6.71092e8

Despite the advice in the wiki, 16 threads gives a 20% speed improvement over 8.  It seems that adding this line to startup or to any file is good, except that adding it to files would mean more threads created on every load.  If adding it to startup is smartest, why doesn't the J system do it automatically?

{{0 T.0}}^:8 ''

24

timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

0.331289 6.71092e8  NB. now performance goes down.

I do not understand coremask and when it would help.  I don't understand threadpools, or why I might want multiple ones.  If you expect to use threads very often, is a lingertime of 30s or 60s a good number?  15&T. seems like a good idea even if all threads are in a "lingertime state"?



   2 T. 0

24 0 24   NB. all 24 threads still active and completed.

  14 T. 0 120

0  NB. lingertime is initially 0.

   timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'  NB. repeated many times successively.

0.321233 6.71092e8

15 T. 0  NB. always add this before thread use?

timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

0.318415 6.71092e8  NB. improvement even at 24 threads.

Deleting threads doesn't seem to fit the description.  55 T. returns i.0 0 instead of 1 even when it works.  i.0 0 is an invalid argument to 55 T., so the following command has to be repeated 8 times to delete 8 threads:

55 T. '' NB. repeat a total of 8 times

2 T. 0

16 0 16

15 T. 0

timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

0.318592 6.71092e8  NB. 24 threads with 15 T. 0 was same speed.

56 threads is much slower, at 0.53s.

55 (i.0)"_@T.^:24 '' NB. workaround to delete many threads that shouldn't be needed.


    2 T. 0

32 0 32

15 T. 0

timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'

0.318143 6.71092e8  NB. 32 is same as 16 and 24!

For my machine anyway, why wouldn't my startup create 32 threads instead of 16?  Are there any hidden cost differences?

Though, for inversion, 16 threads seems to improve matters:

15 T. 0

timespacex '%. ? 3000 3000 $ 1e10'

0.977166 8.38864e8

55 (i.0)"_@T.^:16 ''  NB. 32 to 16 threads

  15 T. 0

timespacex '%. ? 3000 3000 $ 1e10'

0.848331 8.38864e8


The above was merely an introduction :(

I wish to develop a generic search function that could be called hybrid (depth- vs breadth-first) or "windowed depth" search, where for a window size (4, I think, for my processor) it searches the 4 most promising (highest-scoring) moves to search deeper on.  I'm using my nesting dictionary library https://github.com/Pascal-J/kv which permits recursive search to return a new dictionary up the move/ply chain, and where the main (and sub) threads can update the dictionary hierarchy below them without any (seemingly) possible lock conflicts, because every thread returns a complete new dictionary that is to be slotted into its parent.

Generic, though chess, as example pseudocode:

Starting state, for simplicity, is a board with all legal moves each having a simple score, generating all legal moves from the current board position if they have not been already.

For the top "window size" moves, go deeper, generating all legal moves and simple scores; for the remaining legal moves generate an "enhanced score", which in chess would consider positional factors that are more computationally intensive than the simple score formula.  Each of these is done in its own thread, with the latter process updating the root/current-level dictionary.  Does delegating the update of the root/current dictionary to a thread prevent race conditions with the return values of the other 4 threads?  Can/should the "enhanced score" simply be done without thread delegation, since it doesn't depend on the subthread results?


The reason for the 4/16 window size is that when the search is told to go deeper on the 4 top candidates, if those 4 moves have already been generated/explored, it is a command to go 4 wide deeper along their top 4 move scores, and so use of 16 threads is likely.  Now, if a 32-thread allocation performs the same as 16, then there is significant optionality: use an 8-wide window?  Add the top 1 to 4 unexplored moves to the threaded depth search (those previously unsearched do not spawn an extra 4 threads of depth search)?

While the above is the core search function, there are 2 other, higher functions.  The intermediate one returns the number of total new plies searched by the core threaded function.  The top function accumulates these and stops the iterations when a ply limit passed to it is exceeded.  The overall system can be tuned to "think" for 3-5 seconds or whatever time limit, based on the ply limit, and keeps a search state that can be resumed further, or make the greediest move after the ply limit has been breached.
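A minimal J sketch of that flow, with invented names (top_moves, expand, merge are placeholders standing in for the kv-based code), might look like this; it is a hedged skeleton, not a working implementation:

```j
NB. hypothetical skeleton; top_moves, expand, merge are placeholders
W =: 4                                  NB. window size
search =: {{                            NB. y is a position dictionary
  best =. W top_moves y                 NB. W most promising moves
  deep =. (search@expand t. '')"0 best  NB. one worker task per candidate
  y merge > deep                        NB. opening the pyxes waits, then
}}                                      NB. subtrees slot into the parent
```

Each worker returns a complete new sub-dictionary (a pyx until opened), so the only write to the parent happens in the dispatching thread after > has waited for all results.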

any thoughts on what "thread window size" I should be using, or other threading considerations?

BTW, if placing a thread-creation (0&T.) command in your .ijs files, instead of {{0 T.0}}^:] <: {. 8 T. '' use:


{{0 T.0}}^:] 0 >. (1 T. '') -~ <: {. 8 T. ''   NB. prevent adding more threads than current allocation (threads deleted to 0 sometimes crashes J before this command)

15  NB. slower matrix multiplication benchmarks than 16 threads, but better inversion benchmark.









Henry Rich

Aug 27, 2025, 10:06:55 PM
to fo...@jsoftware.com
So many questions, so many questions-posed-as-statements.  I do better
responding to simple questions.

The result of (8 T. '') tells you something about your machine (the
number of cores the OS thinks you have) and something about JE (the max
# threads allowed in a threadpool).

Unfortunately, Windows treats a core with hyperthreading as 2 cores,
which brings to mind Lincoln's favorite joke:

Q. How many legs does a dog have if you call the tail a leg?
A. Four.  Calling a tail a leg doesn't make it one.

Calling a second hyperthread a core doesn't make it one.  AFAICS the
second thread in a core doesn't add much, because JE doesn't leave much
spare time for it.

If you have 8 hyperthreaded cores you should use coremask to ensure that
your threads are running on different cores.  Alas, there is no
standardization for this.  The threading routines understand the concept
of a mask for cores, but it is up to the OS what those bits mean.

On Windows, you can use the utility

Windows> coreinfo64

to see what coremasks to use.  Define 1 thread per core, mapped to
different physical cores, for best performance on compute-bound jobs. 
(For jobs where the threads are mostly waiting, such as I/O
multiplexing, you can have as many threads as you like)

Henry Rich

bill lam

Aug 27, 2025, 10:40:48 PM
to fo...@jsoftware.com
IIRC hyperthreading can be disabled in the BIOS.

Pascal Jasmin

Aug 28, 2025, 10:44:32 AM
to fo...@jsoftware.com
I observed that on my system, a 16-to-32-thread allocation had minimal degradation at 32, and was faster than 7, 8 or 15 threads for (GMP assisted?) matrix benchmarks: a 25% speed improvement at 16 vs 7 threads.

Is the reason for the n-1 thread allocation to allow the main thread not to compete for resources, or to be "nicer" to the OS and other tasks?  Windows stays fairly responsive even if I try a ridiculously large matrix at 15 or 16 threads (though jbreak will not interrupt it, and this could be a hint of an advantage to giving the main thread some priority).  Does it make sense to use an n (instead of n-1) thread allocation if the main thread will mainly be waiting for other tasks to complete (GMP use included)?

Apparent bug: lingertime (14 T. 0, any number including _) always returns 0.1 for the previous setting.  For the matrix benchmark, repeating multiple timex calls does not seem to be impacted by any lingertime setting.

A Jqt crash can occur by creating threads, deleting all threads, then +/ t. 'worker' 1 2 3 (returns the result 6), but 2 T. 0 returns 3 0 0 instead of 0 0 0.  Creating 16 threads ({{0 T.0}}^:] 0 >. (1 T. '') -~ {. 8 T. '') crashes J, possibly in 1 T. '', because 2 T. 0 is inconsistent.  2 T. 0 is 0 0 0 after a t. call in a new session where no threads were created/deleted.

What use cases for threadpools have you imagined?  If threadpool 0 is used for J primitives, would tasks executed in threadpool 1 still be able to access/benefit from accelerated primitives?

Henry Rich

Aug 28, 2025, 11:09:32 AM
to fo...@jsoftware.com
Please supply a J segment that crashes, and I will fix it.

n-1 worker threads gives 1 thread per core (counting the master thread),
/PROVIDED/ the OS spreads the threads to different cores, which you can
ensure with coremask.  That should keep the cores pretty busy,
especially on matrix multiply which saturates the ALUs.

In my measurements of heavy floating-point loads, my poor little laptop
turns the clock rate down about as fast as I add threads, giving little
improvement.  A machine with a better cooling system would run several
times faster.  Tuning such a load will require careful analysis.

lingertime would not affect matrix-multiply performance. lingertime is
useful when you have a J task (perhaps in the master thread) that is
dispatching tasks that depend on the results of previous tasks.  In such
a case there might be a gap of a few microseconds where no task is ready
to run.  lingertime causes the thread to poll for a new task rather
than going into a wait state that might take a few more microseconds to
wake up out of.

Threads in any pool use threadpool 0 for the accelerated primitives. 
Since one thread per core suffices to keep the system busy, other
threadpools would be needed only for applications that are not
compute-bound.  An example would be a socket multiplexer where each
thread handles a few sockets.  The threads would normally be idle but
you might want a lot of them.

A single jbreak (ATTENTION) causes execution to stop at the end of the
current sentence.  It wouldn't interrupt matrix multiply.  A second
jbreak (BREAK) tries to interrupt what is running, but it is checked
only at certain places, notably whenever memory is allocated or during
some compute-bound primitives.  +/ . * does not check - I think it should.

Henry Rich

Pascal Jasmin

Aug 28, 2025, 2:34:19 PM
to fo...@jsoftware.com
Thank you, Henry.

Determining thread allocation for J seems impossible, even with deep core information from external tools not bundled with J, because it further depends on the BIOS power plan (quiet, balanced, performance), SMT enabling, and throttling effects from the previous two.  An external tool/database would be needed to know whether core 0 is OS cores 0 8 (AMD, I think), 0 1 (Intel, I think), or 0 4 (SMT turned off in BIOS, and so wrong because not SMT).  Even if the OS is not smart enough to spread among cores today, it could get smarter tomorrow.

Coremask seems like a nightmare to implement portably.  I don't believe there is a C OS function that permits setting a thread's affinity as soft (a suggestion) instead of hard.  Could coremask either allow a boolean vector, or could the documentation provide an example function to convert a boolean to the little-endian format?  And what happens if it is longer than the OS cores (can the function trim the boolean mask to the cores available to avoid a crash, though wraparound behaviour could seem OK)?

An alternative management system that is available to C, and would be far more intuitive for my imagined threading applications, is setting a group thread priority for an entire threadpool, with the option to set thread 0 in that pool 1 notch higher in priority than the rest.

I could then imagine creating 3 threadpools, with pools 0 1 2 having (for my system) 16 8 39 threads, where pool 0 has the best J primitives acceleration, pool 1 has automatically higher priority due to less competition for threads, and pool 2 has the lowest effective priority due to the highest competition.  On my own system, assuming I understand the core mapping correctly, I could choose to give thread 0 in each pool its own core, and give just 5 cores to the rest of threadpool 2 "to further starve them".  But this approach would stall cores compared to simply giving higher priority to the main thread (0) of a pool that also takes work when all threads are busy, but which may also just be waiting (most of the time?) for all of the other threads to finish.  In my imagined immediate (and in general) application, I would not set thread 0 of a pool to higher priority than the others, because it would simply be in "hurry up then wait" mode.

Maybe when a/the main task is waiting for pyxes to complete, it could boost the priority one level for that (waiting-on-pyxes) task?

If you have more threads than physical/virtual cores, and you start a task with t. 'worker', does it get assigned to a core even if all cores are busy?  And would it be unlikely to move even if another core became free before its assigned core is free?

Ak

Aug 28, 2025, 2:40:40 PM
to fo...@jsoftware.com
Hi Pascal

When you load a fresh session do you get a difference in performance when you load your threads with this line instead?
     {{0 T. 0}}^:] <: {: 8 T. ''

Are you running these tests in console or Qt?

Please also share the utilization differences as well (task manager captures).



Ak

Ak

Aug 28, 2025, 2:49:15 PM
to fo...@jsoftware.com
Hi Pascal,

I use this line to clear threads during a session. 
 
    clear_threads =: {{55 T. ''}} ^:(1 T. ])
    clear_threads''

Ak


Henry Rich

Aug 28, 2025, 3:06:09 PM
to fo...@jsoftware.com
If you quail at coremask, IIUC thread priority is even less standardized
across OSs.  That's why we don't have primitives for it.

Threads are assigned to cores by the OS, and are dispatched by the OS
whenever the OS feels like it.  Tasks are put onto a task queue for a
threadpool.  A thread that is running but has no task will try to take
one off the task queue.  If it finds no task, it waits until there is
one.  If multiple running threads are assigned to the same core, they
will share the CPU, the exiguous L1 caches, and the L2 cache, and will
probably get worse performance than a single thread would.

Don't imagine that multithreading is a quick performance boost. It can
be wonderful, but only for the right kind of task: high ratio of
processing to bytes in & out, and not much reference to L3$ or DRAM. 
Matrix multiplication is a good example.

Timing multithreading requires that you consider the cost of having
results in different cores.  The slowdown may not show up until a later
primitive refers to the result.

Henry Rich

Pascal Jasmin

Aug 28, 2025, 3:11:59 PM
to fo...@jsoftware.com
{{0 T.0}}^:] 0 >. (1 T. '') -~ {. 8 T. ''

is the best performance for me on core J functions (GMP), i.e. all 16 threads instead of 15.


The 0 >. (1 T. '') -~ part is there in case I already asked for 16 threads, such as when it is part of a load 'file' command.  If I used your version (whether 15 or 16), it would add an additional 16 threads.

This is in Qt.  The original comment included benchmarks at different thread numbers.  15 vs 16 threads are both very close; some 15s beat all of the 16s, but it has to be run "warm" (repeat the last command quickly).  It may be a good argument to use 15 instead of 16, just to always have half a core spare?

Henry Rich

Aug 28, 2025, 6:15:42 PM
to fo...@jsoftware.com
lingertime is limited to a maximum of 0.1 second.

Can you give me a J sequence that will produce the crash you describe?

Henry Rich

Pascal Jasmin

Aug 28, 2025, 7:43:46 PM
to fo...@jsoftware.com
The following console sequence will crash jqt 5.1

{{0 T.0}}^:] 0 >. (1 T. '') -~ {. 8 T. ''

16

55 (i.0)"_@T.^:16 ''




{{0 T.0}}^:] 0 >. (1 T. '') -~ {. 8 T. ''

16

14 T. 0 200000

0

14 T. 0 200000

0.1

timespacex '+/ t. ''worker''"1 i.3 3000000'

0.0350165 2.34885e8

55 (i.0)"_@T.^:16 ''  NB. executed longer than 0.1 seconds after last command finished




2 T. '' NB. out of sync after delete

1 0 0

{{0 T.0}}^:] 0 >. (1 T. '') -~ {. 8 T. ''  NB. crash here

Henry Rich

Aug 29, 2025, 11:21:53 AM
to fo...@jsoftware.com
This is fixed for the next beta.  The code to support lingertime had a
missing ().  Workaround: don't use 14&T.

Henry Rich

Ak

Aug 31, 2025, 8:57:34 PM
to fo...@jsoftware.com
But did you try it?

I don't know the structure of your particular algorithm, or how you have chosen to create threads to execute it. I was only considering what the performance delta between my line and your line might give in a fresh session.


Ak

Pascal Jasmin

Sep 2, 2025, 10:15:38 AM
to fo...@jsoftware.com
Any chance that the max lingertime could be extended to 30 seconds?  This might allow console interaction that expects to resume thread use.


A general question about what work J does to manage threads compared to OS work.

Assuming 16 threads have been created with 0&T., if I start 24 tasks with t., my understanding is 

t. 'worker' will wait for a thread to be idle before starting the task.
t. 0 may use main thread to complete task if all threads are busy.  

I notice that for matrix acceleration (GMP?) there is not 100% peak CPU utilization at either 16 or 32 threads allocated.  FYI, all my benchmarks are done with 80 browser tabs in the background, and idle CPU utilization ranges from 6 to 12%, averaging around 8%.  There are over 7000 active OS threads "at idle", and per hardware-monitor software, no core is truly idle.

From Nuvoc, "If you have threads that are often idle, consider allocating them in a different threadpool. That will avoid the thrashing that would result if you had too many threads in threadpool 0."

This suggests "high involvement from J engine" and a strategy for extra threadpools.

The application I wish to make is a data-parallel "tree search navigation" where tasks make other branching tasks, then wait (which should mean idle, to J?) for pyxes to be filled, update their data chunk, and return to their waiting parent task.  It's not important that any individual task be efficient, nor that the system be responsive during the delegation phase; just that the whole task tree completes quickly.

If J does a lot of management, then deciding to let J manage a small number of threads can be meaningful, compared to setting a maximal number of threads and letting the OS try to saturate all cores on the CPU.

Henry Rich

Sep 2, 2025, 10:38:57 AM
to fo...@jsoftware.com
You have lingertime all wrong.  When a thread is lingering, it is
refusing to go into a wait state, instead doing a low-power spin loop. 
The cost is that the thread/core is tied up during the lingertime; the
gain is that if another task comes along, the thread doesn't have to
wait for the OS to dispatch it.  The dispatch time is on the order of
milliseconds or less (the full processor state must be restored). 
Perhaps I should reduce the max lingertime, but certainly not increase it.
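For reference, the setting under discussion is the one exercised earlier in the thread; a minimal sketch (subject to the 0.1 s cap):

```j
14 T. 0 0.05   NB. set threadpool 0's lingertime to 50 ms
14 T. 0 0.05   NB. returns the previous setting, here 0.05
```

During those 50 ms after finishing a task, a worker spins cheaply watching the queue instead of sleeping, trading a tied-up core for microsecond-level pickup of the next task.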

If you have not stopped your browser you should not trust any timings
you get.  All those threads wake up from time to time. Not only do they
take a little CPU time; they also timeslice with running threads.  If
your matrix multiply has been split up into 16 blocks that each take 10
ms, and a browser thread comes along and preempts a single core for 10
ms, that will show up in task manager as an additional 6% load on the
CPU, but it will have delayed your matrix multiply, which is waiting for
the last block to finish, by 10ms: a 50% performance loss.

J's threading is quite simple and fast.  Each threadpool has a task
queue.  When a task is added to the queue, all threads are awakened by
the OS and they grab for the queue, which is implemented at the machine
level, with atomic instructions, not OS mutexes.  When a thread (usually
the master) has put more tasks on the queue than there are threads, each
thread will take the next as soon as it finishes one, with very little
wasted time.  By measurement, the arbitration time is negligible if the
task runs for more than (100*#cores) ns.
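The queue behaviour described here can be seen with the t. conjunction already used in this thread; a hedged sketch (thread count and array sizes are arbitrary):

```j
{{0 T. 0}}^:3 ''                    NB. ensure a few worker threads exist
ps =. (+/ t. '')"1 ? 8 1000000 $ 0  NB. queue 8 row-sum tasks for 3 workers
+/ > ps                             NB. opening the pyxes waits for all 8
```

With more tasks than threads, each worker takes the next task off the queue as soon as it finishes one; the master thread only blocks at the final >.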

Having more threads than cores achieves nothing if the tasks are all
computation, and adds OS task-level delay.  Whether a hyperthreaded core
benefits from having a thread is uncertain - my guess is no, but your
data suggests otherwise.

If your threads often go into an I/O wait, you might want to have more
threads so that there will be enough not waiting to give the cores work
to do.

Henry Rich