Is there an efficient way to catenate to an APL vector in a #CALLed assembly language routine?


/ Rav

unread,
May 30, 2022, 4:47:07 PM5/30/22
to APLWin
Background:  I'm using APL+Win version 15.1.01 under Windows 10 Professional.  I'm writing various assembly language routines to be called via #CALL from APL+Win.  I'm assembling using MASM, and am including the .486 and .XMM directives so that I can use SIMD extension instructions, in particular SSE 4.2 instructions.

In some of my routines I'm returning an explicit result which is a vector of integers.  The length of the result could be anything from 0 (an empty vector) up to whatever APL's vector length limit is.  The length of the result depends, of course, on the processing of the particular arguments being passed in.

I'm using ISS (Interpreter Support Service) 3 (Create Variable) to create the initial empty vector result.  The problem is that there is no (documented) ISS to catenate to an existing variable.  For now, at least, I only need to catenate one integer at a time as I go.  So what I am currently doing is creating TWO result variables within the ASM routine, and alternating between them.  When I need to catenate another element, I recreate the "other" result variable (ISS 3 erases the old version automatically) with a length of 1 plus that of the current result variable, copy the data elements of the current result variable to the "other" result variable, then finally insert the new value at the end.  So each time I need to catenate ONE more integer value, I have to completely recreate the result variable.

This is obviously very inefficient, and I'm wondering if anyone here has come up with a better way of doing this sort of catenation within the ASM routine.

Ideas I've already thought of include:

1.  Using ISS 6 (make unique copy of variable) to make a copy of the current result variable.  The problem is you can't specify a longer length.  And even if you could, it would still be almost as inefficient.

2.  Just having one result variable and keep recreating it but with an ever-increasing length.  The problem is that this destroys all the existing values in the data area because it first erases the current variable.

3.  Running the routine with two passes, the first pass to determine the total number of elements that will be needed in the result, then the second pass to create the result variable and fill it in.  This of course has its own inefficiency, namely having to process everything twice.

4.  Pre-allocate a temporary "longest possible" result variable, fill it in, then copy just the actual result values to a final result variable of the proper length.  Since "longest possible" could theoretically mean a vector of around 2*31 elements, this is impractical and grossly inefficient.

5.  Attempting to "fool" APL+Win and manipulate the m-entry's length, number of elements, shape word and data area under the covers.  I would be able to do that to a certain point, but it requires relocating the m-entry when something else is "in the way" of it getting any larger, which I do not know how to do, and is highly dependent on the internals of APL+Win which can and do change.

6.  Using undocumented ISS routines that do what I need, which of course may or may not exist.

APL+Win itself obviously "knows" how to do efficient catenation.  In particular, with in-place catenation (for example:  a←1 2 3  followed by  a,←4), APL+Win is obviously smart enough to do the catenation in place, if possible.

So:  Does anyone have any experience with this?  Thanks.  / Rav

Davin Church

unread,
May 30, 2022, 6:40:07 PM5/30/22
to APLWin
It's been decades since I've written ASM subroutines for APL, but you're going to have to use an ASM-based approach to this problem - the interpreter won't do it for you.

First of all, how are you determining the content (and therefore the length) of the result? WHY don't you know the final result size? For instance, if I catenate vectors A & B then the result size will be (⍴A)+⍴B. That's the sort of thing that APL uses to do its magic - it can usually predetermine the result size.

If you can't know for some reason, then you'll probably want to use basic tricks that have been around since the 50s & 60s and that we sometimes still use at the APL level when we have to.

Davin

/ Rav

unread,
May 30, 2022, 7:28:27 PM5/30/22
to APLWin
Yes, an ASM-based solution is what I'm looking for, but I was hoping that there was some sort of built-in support from the interpreter for catenation within a ⎕CALLable assembly language routine, as there is for certain other things (such as creating variables, erasing variables, copying variables, coercing datatypes). The reason I can't know in advance what the final result size is is that this is a routine that, among other things, finds all the occurrences of a substring within a larger string, and returns all the indices as integers where the substring is found.  APL+Win's built-in string searching function ⎕SS returns a boolean vector, and its final size CAN be known beforehand since it's the same length as the left argument.  But when returning the result as an integer vector of indices I can't know how many hits (and therefore elements in the result) there will be beforehand (without wasting time as outlined in my first post).  I suppose I could internally create that boolean vector and then do the equivalent of a /⍳ on it when done with the searching, but that would require creating the boolean vector itself (which could take quite a bit of memory), which would then have to be summed to determine the final result (integer vector) size, and THEN iterated over to produce the integer indices.  So that would be counterproductive, since what I'm after is as efficient an algorithm as I can get.

What are some of the tricks you're referring to?  Perhaps I would find them useful.  Thanks.  / Rav

Davin Church

unread,
May 30, 2022, 9:09:51 PM5/30/22
to APLWin
Well, first of all you know that a maximum size for your result is no bigger (in elements) than your large input string. You could preallocate that much space, fill up what you need, and then copy the used portion when you're done. There's no need to assume a worst case of 2*31.
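As a sketch of that preallocate-then-shrink idea, here is a hypothetical single-character search in plain C; malloc/free merely stand in for the ISS create/erase calls, and the function name is made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Preallocate the worst case (one hit per input element), fill in
   the hits found, then copy only the used portion into a final
   result of exactly the right length. */
size_t *find_char_indices(const char *text, size_t n, char needle,
                          size_t *count) {
    size_t *tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (!tmp) return NULL;
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (text[i] == needle)
            tmp[k++] = i + 1;        /* 1-based, as in APL origin 1 */
    size_t *result = malloc((k ? k : 1) * sizeof *result);
    if (result && k)
        memcpy(result, tmp, k * sizeof *result);
    free(tmp);
    *count = k;
    return result;
}
```

The temporary buffer is at most the size of the input string, so the worst case is bounded and the data is copied exactly once.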

Or, running the search twice shouldn't be THAT expensive, and I think SIMD instructions may be able to give you some help there (with things like finding the next occurrence of the first character of the search string).

For that matter, you could run a partial pre-search just for one of the less-common characters in the search string and see how many times that character appears in the input. This might well be easy to do with SIMD and it can give you a more-reasonable maximum size for your result.

Or, what do you think about using ⎕SS in APL (that's already ASM-optimized) and passing the result of that into your routine in the first place? Counting packed bits yourself can be done extremely quickly by having a predefined table of 256 numbers from 0-8 (the number of bits in that index-byte) and doing a lookup & sum by bytes. (Or you can do it by word, with a bigger table.) (That's basically how APL does a +/ of a boolean internally.) Or perhaps there is a SIMD instruction to help you there, either with the counting or the lookup/sum.
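That byte-wise lookup-table count can be sketched in a few lines of C (the table and function names here are made up for illustration):

```c
#include <stddef.h>

/* Precomputed table: number of set bits in each byte value 0..255. */
static unsigned char bitcount8[256];

static void init_table(void) {
    for (int i = 0; i < 256; i++) {
        int n = 0;
        for (int b = i; b; b >>= 1)
            n += b & 1;
        bitcount8[i] = (unsigned char)n;
    }
}

/* Sum of set bits in a packed boolean vector, one table lookup
   per byte -- roughly what a +/ of a boolean does internally. */
size_t count_bits(const unsigned char *bytes, size_t nbytes) {
    size_t total = 0;
    for (size_t i = 0; i < nbytes; i++)
        total += bitcount8[bytes[i]];
    return total;
}
```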

If you're doing ⎕SS in APL then you might also consider doing the /⍳ there as well - I think it may have been optimized in the interpreter, but you can do some timings if you want to consider that.

Or, if you've got to do things the hard way, try this algorithm...
  1. Start by guessing how big the result will be and create a result array that big.
  2. Start processing your search and filling in your result array by subscripting in the next integer position found.
  3. If the search ends without filling your array, now you know the result size. Allocate a new variable and copy the final data to be returned.
  4. If the result fills up before reaching the end, create a new result array that's much larger (e.g. twice as large) and copy your results to date to it. Dispose of the previous array.
  5. Write your previously-found search index to the now-available result subscript.
  6. Loop back to step 3.
This algorithm has the advantage that you only need to resize/copy data a few times. You can tune your algorithm by changing the size increase at each step - for instance, instead of making it 100% bigger each time make it only 25% bigger or 500% bigger. Or you can dynamically tune the resizing based upon how far along you are in the search string, so if you've only searched 10% of the input then resize it to 1000% as your next guess.
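The grow-and-copy loop above can be sketched in plain C; malloc/free stand in for the ISS create/erase calls, and IntVec/vec_append are hypothetical names:

```c
#include <stdlib.h>
#include <string.h>

/* Growable result buffer: when full, allocate a larger buffer,
   copy the results to date, and dispose of the previous one. */
typedef struct {
    int   *data;
    size_t count;
    size_t capacity;
} IntVec;

int vec_init(IntVec *v, size_t initial) {
    v->data = malloc((initial ? initial : 1) * sizeof *v->data);
    v->count = 0;
    v->capacity = initial ? initial : 1;
    return v->data != NULL;
}

int vec_append(IntVec *v, int value) {
    if (v->count == v->capacity) {
        size_t newcap = v->capacity * 2;   /* tune: 1.25x, 5x, ... */
        int *bigger = malloc(newcap * sizeof *bigger);
        if (!bigger) return 0;
        memcpy(bigger, v->data, v->count * sizeof *bigger);
        free(v->data);                     /* dispose of previous array */
        v->data = bigger;
        v->capacity = newcap;
    }
    v->data[v->count++] = value;
    return 1;
}
```

With doubling, each element is copied a constant number of times on average, so n appends cost O(n) total rather than the O(n*n) of growing by one element at a time.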

Might any of those ideas help?

Davin

Roy Sykes

unread,
May 30, 2022, 11:05:44 PM5/30/22
to APLWin
I have a package of over 250 machine code functions for APL+WIN
called MACfns.  Among them is STS (string search) and INDS (of
nonzeros in a vector).  INDS  string STS substring is much faster
than ⎕SS alone.  Thus, the two passes over the Boolean vector
from STS (one to determine the count, and hence size of the result,
and the other to fill in the indices) is the fastest way to do it using
the mechanisms of the ISS (Interpreter Support Services).
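A rough C illustration of that two-pass count-then-fill approach, assuming for simplicity that the boolean vector is stored one byte per element rather than bit-packed as APL actually stores it (the function name is made up):

```c
#include <stddef.h>

/* Two-pass index extraction: first count the 1s to size the
   result, then fill in the 1-based indices.  Returns the number
   of indices written; `out` must hold at least that many. */
size_t indices_of_ones(const unsigned char *bool_vec, size_t n,
                       size_t *out) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)          /* pass 1: count */
        hits += bool_vec[i] != 0;
    size_t k = 0;
    for (size_t i = 0; i < n; i++)          /* pass 2: fill */
        if (bool_vec[i])
            out[k++] = i + 1;               /* APL origin-1 index */
    return hits;
}
```

The data is scanned twice, but the result is allocated exactly once at exactly the right size, with no intermediate copies.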

Davin's suggested mechanism is overly complicated, and suffers
from the need to (perhaps repeatedly) copy data -- quite a slow
mechanism in today's processors.  It further uses more (and also
unpredictably more) workspace storage.

The same is particularly true of the each-derived functions in
MACfns (like SHAPEe).  It is much faster to scan the argument(s)
once to compute the size of the result, then compute and store the
actual results, than the way the APL interpreter does it (building
up the result by allocating one item at a time).

Roy

/ Rav

unread,
May 30, 2022, 11:11:02 PM5/30/22
to APLWin
Yes, a number of things you suggest are very helpful and I'll probably implement some of them.

I've already implemented the base string search using SSE 4.2 instructions and gotten speedups generally in the area of 10 times faster than ⎕SS (it varies widely, of course), so there's no point running ⎕SS first in APL and passing that result into my routine, summing it, etc.

I really like your idea of growing the result variable in "blocks" of multiple integers.  As you said, at its largest it only needs to have as many elements as the string being searched.  Even if I allocate, just for example, 100 potential result integers at a time, that would be an enormous savings over increasing the size by just one at a time, and a single block would probably cover the typical number of hits.  Then, as you said, finally create a result variable with the correct final length and copy the data over.  My thinking was fixated on catenating one index (one integer) at a time since I find them one at a time.  Thank you.  / Rav

/ Rav

unread,
May 30, 2022, 11:27:59 PM5/30/22
to APLWin
Thank you, Roy.  Great to hear from you!  I'll try it both ways -- one pass and increasing the size of the result var in blocks as necessary, and two passes as you suggest -- and compare the results.  Either way, I'm enjoying getting back into some assembly programming in combination with my recreational APL projects.  As much as I love APL because it's so high level I love assembly because it's so low-level.  A fantastic combination and a lot of fun.  / Rav (PJR in 666 BOX)

Davin Church

unread,
May 30, 2022, 11:33:20 PM5/30/22
to APLWin
Roy, can he call MACfns from within an assembly routine or is it only callable from APL? He said he's doing other things besides the searching and I don't know what else may be involved.

Roy Sykes

unread,
May 30, 2022, 11:47:48 PM5/30/22
to APLWin
MACfns can only be called via ⎕CALL from the APL+Win workspace.