I'm using an AMD 7940HS processor on Windows 11.
Thank you for recently updating https://code.jsoftware.com/wiki/Vocabulary/tcapdot.
8 T. '' NB. the chip is considered an 8-core, 16-thread processor.
16 63
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
1.7172 6.71092e8
{{0 T.0}}^:8 ''
8
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
0.368438 6.71092e8
{{0 T.0}}^:8 ''
16
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
0.311748 6.71092e8
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
0.315998 6.71092e8
Despite the advice in the wiki, 16 threads takes about 15% less time than 8 (0.31 s vs 0.37 s). Adding this line to startup, or to any script file, seems worthwhile, except that putting it in script files would create more threads on every load. If adding it to startup is the smartest option, why doesn't the J system do it automatically?
{{0 T.0}}^:8 ''
24
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
0.331289 6.71092e8 NB. now performance goes down.
I do not understand coremask or when it would help. I also don't understand threadpools, or why I might want more than one. If you expect to use threads very often, is a lingertime of 30 s or 60 s a good number? 15&T. seems like a good idea even if all threads are in the "lingertime state"; is that right?
2 T. 0
24 0 24 NB. all 24 threads still exist and are idle; no unfinished tasks.
14 T. 0 120
0 NB. lingertime is initially 0.
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10' NB. repeated many times successively.
0.321233 6.71092e8
15 T. 0 NB. always add this before thread use?
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
0.318415 6.71092e8 NB. improvement even at 24 threads.
Deleting threads doesn't seem to match the description: 55 T. returns i.0 0 instead of 1, even when it works. Since i.0 0 is an invalid argument to 55 T., the following command has to be repeated 8 times to delete 8 threads:
55 T. '' NB. repeat a total of 8 times
2 T. 0
16 0 16
15 T. 0
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
0.318592 6.71092e8 NB. 24 threads with 15 T. 0 was same speed.
56 threads is much slower, at 0.53 s.
55 (i.0)"_@T.^:24 '' NB. workaround to delete many threads at once; it shouldn't be needed.
2 T. 0
32 0 32
15 T. 0
timespacex '(+/ .*) / ? 2 3000 3000 $ 1e10'
0.318143 6.71092e8 NB. 32 is same as 16 and 24!
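The repeated 55 T. '' calls and the ^:24 workaround above could be wrapped in one helper. This is only a sketch: setthreads is a hypothetical name, it uses only the calls already shown in this post (1 T. '', 0 T. 0 and 55 T. ''), and it assumes everything lives in threadpool 0 and that 1 T. '' reflects a deletion as soon as 55 T. returns.

```j
NB. setthreads n : bring the number of worker threads to n (hypothetical helper)
NB. assumes all workers are in threadpool 0
setthreads=: {{
  cur=. 1 T. ''                          NB. current number of worker threads
  if. y > cur do.
    for. i. y - cur do. 0 T. 0 end.      NB. create the missing threads
  elseif. y < cur do.
    for. i. cur - y do. 55 T. '' end.    NB. delete the surplus, one per call
  end.
  1 T. ''                                NB. report the resulting count
}}

setthreads 16
```

Because it compares against the current count, loading a script that calls it twice should not keep adding threads.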
On my machine anyway, why wouldn't my startup create 32 threads instead of 16? Are there any hidden cost differences?
For inversion, though, 16 threads seems to be an improvement over 32:
15 T. 0
timespacex '%. ? 3000 3000 $ 1e10'
0.977166 8.38864e8
55 (i.0)"_@T.^:16 '' NB. 32 to 16 threads
15 T. 0
timespacex '%. ? 3000 3000 $ 1e10'
0.848331 8.38864e8
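Several of the timings above differ by only a few milliseconds, which may be within run-to-run noise. One way to compare thread counts more steadily is to take the fastest of several runs. The verb below is a sketch: best is a hypothetical name, and it assumes the workers are in threadpool 0, waking them with 15 T. 0 as in the session above.

```j
NB. x best y : run the sentence y (a string) x times, return the fastest time in seconds
best=: {{
  15 T. 0                  NB. wake the pool-0 workers before timing
  <./ (6!:2)@> x # < y     NB. 6!:2 times one execution; keep the minimum
}}

5 best '(+/ .*) / ? 2 3000 3000 $ 1e10'
5 best '%. ? 3000 3000 $ 1e10'
```

If an average is preferred, the standard x timespacex y form already runs the sentence x times and reports the mean time.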
The above was merely an introduction :(
I wish to develop a generic search function that could be called a hybrid (depth-first vs. breadth-first) or "windowed depth" search: for a given window size (4, I think, for my processor), it searches deeper on the 4 most promising, highest-scoring moves. I'm using my nesting dictionary library,
https://github.com/Pascal-J/kv, which lets a recursive search return a new dictionary up the move/ply chain, and lets the main (and sub) thread update the dictionary hierarchy below it without any (seemingly) possible lock conflicts, because every thread returns a complete new dictionary that is slotted into its parent.
The function is generic, but here is pseudocode using chess as the example:
For simplicity, the starting state is a board with all legal moves, each having a simple score; if the legal moves from the current board position have not already been generated, generate them first.
For the top "window size" moves, go deeper, generating all legal moves and simple scores. For the remaining legal moves, generate an "enhanced score", which in chess would consider positional factors that are more computationally intensive than the simple score formula. Each of these is done in its own thread, with the latter process updating the root/current-level dictionary. Does delegating the update of the root/current dictionary to a thread prevent race conditions with the return values of the other 4 threads? Can/should the "enhanced score" simply be computed without thread delegation, since it doesn't depend on the subthread results?
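One possible shape for the step above, as a toy sketch rather than the real design: positions are plain integers, results are plain arrays instead of kv dictionaries, and simple, deep, enhanced and windowed are all hypothetical stand-ins. It uses the t. conjunction to start one task per windowed move; each task returns a pyx, and the main thread computes the enhanced scores itself before opening the pyxes. In this arrangement only the main thread assembles the result, so the tasks never write to shared state.

```j
simple=: 7&|                                NB. hypothetical cheap score of a position
deep=: {{ >./ simple y + i. 3 }}            NB. hypothetical deeper search: best cheap score among 3 child positions
enhanced=: {{ (simple y) + 0.5 * 2 | y }}   NB. hypothetical costlier score

NB. x windowed y : x is the window size, y is a list of candidate moves
windowed=: {{
  sorted=. (\: simple y) { y       NB. moves ordered best-first by cheap score
  n=. x <. # y
  top=. n {. sorted                NB. moves that get a deeper search
  rest=. n }. sorted               NB. moves that only get the enhanced score
  pyx=. (deep t. '')"0 top         NB. one task per windowed move, each returns a pyx
  slow=. enhanced rest             NB. main thread keeps working meanwhile
  (top ,. > pyx) ; rest ,. slow    NB. opening a pyx waits for that task to finish
}}

4 windowed i. 10
```

As far as I understand, if the pool has no worker threads the tasks simply run in the calling thread, so the sketch should give the same result either way.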
The reason for a 4/16 window size is that when the search is told to go deeper on the top 4 candidates, and those 4 moves have already been generated/explored, it is effectively a command to go 4 wide deeper along each of their top 4 move scores, so 16 threads are likely to be in use. Now, if a 32-thread allocation performs the same as 16, there is significant optionality. Use an 8-wide window? Add the top 1 to 4 unexplored moves to the threaded depth search (those previously unsearched do not spawn an extra 4 threads of depth search)?
While the above is the core search function, there are 2 higher-level functions. The intermediate one returns the total number of new plies searched by the core threaded function. The top function accumulates these and stops iterating once a ply limit passed to it is exceeded. The overall system can be tuned to "think" for 3-5 seconds, or whatever time limit, via the ply limit, and it keeps a search state that can be resumed later, or it makes the greediest move once the ply limit has been breached.
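The accumulate-and-stop loop of the top function could look roughly like this. It is a sketch with made-up names: searchstep stands in for the intermediate function, and the state is just a 2-item vector (total plies searched, rounds done) rather than a real search state.

```j
NB. hypothetical stand-in: each round pretends to search 16 new plies
searchstep=: {{ y + 16 1 }}

NB. x think y : x is the ply limit, y is the state (total plies, rounds)
NB. keep applying searchstep while the ply total is still below the limit
think=: {{ searchstep^:(x > {.)^:_ y }}

100 think 0 0     NB. 112 7 : stops after the round that takes the total past 100
```

Because the state is returned whole, calling think again with a larger limit resumes from where the previous call stopped.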
Any thoughts on what "thread window size" I should be using, or on other threading considerations?
By the way, if you place a thread-creation (0&T.) command in your .ijs files, then instead of {{0 T.0}}^:] <: {. 8 T. '' use:
{{0 T.0}}^:] 0 >. (1 T. '') -~ <: {. 8 T. '' NB. avoids adding threads beyond the target when some already exist (deleting threads down to 0 sometimes crashes J before this command)
15 NB. 15 threads: slower matrix multiplication benchmark than 16 threads, but better inversion benchmark.