encode =: {{
pat1 =. '(*UTF)(*UCP)''s|''t|''re|''ve|''m|''ll|''d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+'
11 T. mutex NB. acquire mutex
pretok=. pat1 rxall utf8 y
13 T. mutex NB. release mutex
; {{vocab_i bpe cs {~ (bs{a.)&i. y}} each pretok
}}
This works, but the mutex turns the 'pretok=. pat1 rxall utf8 y' line into a bottleneck, since only one task at a time can run it.
I moved pat1 inside encode because it was originally global and I thought the tasks might be stepping on shared memory that I had some control over. But whether the pattern
is defined inside or outside the function, I get the same error.
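The lock/unlock pattern used in encode can be sketched on its own. This is a minimal sketch, not the code from encoder.ijs: it assumes the mutex is created once with 10 T. 0 before any tasks start, and the adverb name locked is hypothetical.

```j
NB. assumption: one non-recursive mutex, created before any task runs
mutex =: 10 T. 0

NB. hypothetical adverb: run verb u on y while holding the mutex
locked =: {{
  11 T. mutex     NB. acquire mutex (waits until it is free)
  r =. u y        NB. critical section: only one task at a time gets here
  13 T. mutex     NB. release mutex
  r
}}

total =: +/ locked 1 2 3
```

With the names from this thread the call would presumably read pat&rxall locked peach storylist. Note that if u signals an error the mutex is never released in this sketch, and that serializing the regex call gives up the parallelism for that step.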
Without the mutex statements I get the following error when I run the encoder in parallel using parallel each ('peach'):
peach=: (t.'')(&>)
encode__encoder peach storylist
|index error in rxfrom, executing dyad <;.0
|starting index out of bounds (value=97303, axis len=648) in cell of x with path 7
| ;{{vocab_i bpe cs {~ (bs{a.)&i. y}}each pat1 rxall utf8 y
Press ENTER to inspect
Wondering if anyone has intimate knowledge of the RE implementation and knows whether this is fixable, or whether there is just too much shared memory that will get stepped on in a multithreaded environment.
rxall is shorthand for an rxmatches/rxfrom combination (hence the error mentioning rxfrom). The 'cut' dyad (<;.0) is part of the rxfrom definition, so I'm not sure why that would be a problem, and it's doing a reverse adverb.
Tom McGuire
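To make the rxmatches/rxfrom relationship concrete, here is a small single-threaded sketch. The names s, wordpat, tab, viaall and viafrom are made up for illustration; the composition mirrors the ptfrom and ptall definitions that appear later in this thread.

```j
require 'regex'

s       =: 'Now is the time'
wordpat =: '[A-Za-z]+'

NB. rxmatches gives one (start,length) table per match;
NB. {."2 keeps the row for the whole match in each table
tab     =: {."2 wordpat rxmatches s

NB. rxfrom hands each (start,length) pair to <;.0 to box that substring
viafrom =: tab rxfrom s
viaall  =: wordpat rxall s

NB. the same cut applied by hand: 2 characters starting at index 4
one     =: (,. 4 2) <;.0 s
```

Since <;.0 indexes straight into the string, a start value that does not belong to that string (for example one overwritten by another task) would be expected to surface as exactly this kind of index error in rxfrom.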
To unsubscribe from this group and stop receiving emails from it, send an email to forum+un...@jsoftware.com.
setwd=:1!:44
pwd=:1!:43
setwd jpath '~/devLLM/picoGPT-in-j'
load 'encoder.ijs' NB. from picoGPT-in-j directory
load 'utils.ijs' NB. from picoGPT-in-j directory
MODELS_DIR
models
encoder =: MODELS_DIR conew 'encoder'
Loading tokenizer...
Reading merges.txt
Reading vocab.json
Processing vocab
Building lookup verbs
Done.
{{0 T.0}}^:] <: {. 8 T. ''
11
peach=: (t.'')(&>)
storylist =: 10000$<556$'Now is the time for all good men to come to the aid of their country. Now is the time for all good men to come to the aid of their country.'
encode__encoder peach storylist
|index error in rxfrom, executing dyad <;.0
|starting index out of bounds (value=_6710525949, axis len=556) in cell of x with path 1
| ;{{vocab_i bpe cs {~ bs i. a. i. >y}}each pat rxall utf8 y
Press ENTER to inspect
On a MacBook Pro (M2 Max chip) using jconsole:
9!:14''
j9.7.0-beta3/j64arm/darwin/commercial/www.jsoftware.com/2025-04-03T02:17:42/clang-15-0-0/SLEEF=1
NB. I have distilled this down to a few lines of J code
require 'regex'
pat =: '(*UTF)(*UCP)''s|''t|''re|''ve|''m|''ll|''d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+'
storylist =: 10000$<556$'Now is the time for all good men to come to the aid of their country. Now is the time for all good men to come to the aid of their country.'
{{0 T.0}}^:] <: {. 8 T. ''
11
peach=: (t.'')(&>)
pat&rxall peach storylist
JE has crashed, likely due to an internal bug. Please report the code which caused the crash, as well as the following printout, to the J forum.
|index error in rxfrom, executing dyad <;.0
|starting index out of bounds (value=_136784836, axis len=556) in cell of x with path 15
| pat&rxall peach storylist
Could not generate stack trace: no debug info in Mach-O executable (-1)
[the line above appears 27 times in total]
-----------------------------------------------------------------------------
Abort trap: 6
NB. pretok.ijs - wraps regex in a semi-thread-safe manner
NB. use in place of regex for Byte Pair Encoding pretokenization
NB. it takes a regular expression and uses some of regex.ijs
NB. under the covers
require 'regex'
coclass 'jpretok'
create =: 3 : 0
NB. y holds the arguments used to create the class
NB. assignments made inside verbs are private to the instance
ptpattern=: y
msg=. ,2
off=. ,2
flg=. (PCRE2_UTF_jregex_*RX_OPTIONS_UTF8_jregex_)+PCRE2_MULTILINE_jregex_*RX_OPTIONS_MULTILINE_jregex_
ptcomp=: 0 pick rc=. jpcre2_compile_jregex_ (,y);(#y);flg;msg;off;<<0
'msg off'=. 4 5{rc
if. 0=ptcomp do.
regerror msg,off
return.
end.
pthandle=: 0
ptmatch=: 0 pick jpcre2_match_data_create_from_pattern_jregex_ (<ptcomp);<<0
ptnsub=: 0 pick jpcre2_get_ovector_count_jregex_ <<ptmatch
EMPTY
)
destroy =: 3 : 0
NB. release any resources acquired
codestroy'' NB. release the instance
)
NB. =========================================================
ptmatch1=: 3 : 0
ptmatchtab 0 pick jpcre2_match_jregex_ (<ptcomp);(,y);(#y);0;0;(<ptmatch);<<0
)
NB. =========================================================
ptmatch2=: 3 : 0
's p'=. y
ptmatchtab 0 pick jpcre2_match_jregex_ (<ptcomp);(,s);(#s);p;PCRE2_NOTBOL_jregex_;(<ptmatch);<<0
)
NB. =========================================================
NB. get match table
ptmatchtab=: 3 : 0
if. y >: 0 do.
p=. 0 pick jpcre2_get_ovector_pointer_jregex_ <<ptmatch
'b e'=. |:_2 [\ memr p,0,(2*ptnsub),4
_1 0 (I.b=_1) } b,.e-b
elseif. y=_1 do.
,:_1 0
elseif. do.
regerror y
end.
)
NB. =========================================================
ptmatches=: 4 : 0
echo x
'p n'=. 2 {. boxopen x
echo p
echo n
NB. regcomp p
NB. ACQUIRE MUTEX HERE
m=. ptmatch1 y
if. _1 = {.{.m do. i.0 1 2 return. end.
s=. 1 >. +/{.m
r=. ,: m
while. s <#y do.
if. _1 = {.{.m=. ptmatch2 y;s do. break. end.
s=. (s+1) >. +/ {.m
r=. r, m
end.
if. #n do. n{"2 r end.
NB. RELEASE MUTEX
)
ptfrom=: ,."1@[ <;.0 ]
ptall =: {."2@ptmatches ptfrom ]
mypretok =: 0&".@> pat conew 'jpretok'
pat& ptall__mypretok 'Now is the time for all good men to come to the aid of their country'
(*UTF)(*UCP)'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
(*UTF)(*UCP)'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
┌───┬───┬────┬─────┬────┬────┬─────┬────┬───┬─────┬───┬────┬────┬───┬──────┬────────┐
│Now│ is│ the│ time│ for│ all│ good│ men│ to│ come│ to│ the│ aid│ of│ their│ country│
└───┴───┴────┴─────┴────┴────┴─────┴────┴───┴─────┴───┴────┴────┴───┴──────┴────────┘
On Jun 24, 2025, at 9:42 AM, Henry Rich <henry...@gmail.com> wrote:
The problem is not just the compiled regexp, which could probably be shared. The regex package uses public names for internal communication. When you call rxmatches, it calls regcomp which creates 4 public names, to hold the pattern, the matches, and info about the matches. These public names are then filled by calls to pcre2 and further used by rxfrom to copy the data. This is all workable in a single-threaded system.
With multiple tasks, the tasks are writing to the same public names simultaneously. This is chaos: one task stores an array into a name and then passes the array into pcre2 by address. While pcre2 is running, another task assigns the name, which frees the block pcre2 is writing. That block gets reused while pcre2 is writing to it. Crash.
Numbered locales don't cost much. All we need here is a way for each task to keep its public names to itself.
Henry Rich
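The idea of each task keeping its names to itself can be illustrated without pcre2 at all. The following is only a sketch: the class scratchdemo and the names buf, t1, t2 and both are hypothetical, it runs single-threaded, and it says nothing about whether the regex package itself can be restructured this way.

```j
coclass 'scratchdemo'        NB. hypothetical class, not part of regex.ijs
create  =: 3 : 'buf =: y'    NB. =: inside an instance defines buf in that instance's numbered locale
destroy =: codestroy
cocurrent 'base'

t1 =: 'first'  conew 'scratchdemo'
t2 =: 'second' conew 'scratchdemo'

NB. each instance has its own buf, so assigning one cannot free the other
both =: buf__t1 ; buf__t2
distinct =: -. t1 -: t2

destroy__t1 ''
destroy__t2 ''
```

In the jpretok class earlier in this thread the same mechanism is what keeps ptcomp, ptmatch and ptnsub separate per instance, provided each task works with its own instance rather than a shared one.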
On 6/24/2025 9:20 AM, Raul Miller wrote:
I don't think the overhead of locales/objects is desirable here.
If I understand correctly, you need one instance of the compiled
regexp per thread. Presumably the compiled instance holds state
information during the evaluation of a regex match.
That said, it looks like rxcomp explicitly prevents this from
happening. So I am not sure I understand the difference between the
tests that succeeded and the tests that failed.