[Sbcl-help] Expense of compiling vops


Andrew Sengul

Mar 17, 2023, 1:30:28 AM
to sbcl
My vop compilation system is coming along well but I find that compiling
the expressions is surprisingly expensive. Evaluating code for a fairly
simple vop causes nearly 5mb to be consed and takes 8ms. I tried evaling
some other code fragments and found that this isn't specific to SBCL's
vops, evaling a 180-line function from April led to consing of about
10mb and took 19ms. Evaluating the same code in CCL took 1.4mb/12ms and
in ECL it took 375k/2ms. Can anyone shed some light on the reasons for
this overhead? Are there any declarations or other parameters that can
be used to mitigate it? The ASM kernels I'm generating are proving to be
quite fast but the time taken to build them is now a high fraction of
the total time used in processing arrays with them. Thanks,

Andrew



_______________________________________________
Sbcl-help mailing list
Sbcl...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-help

Christophe Rhodes

Mar 17, 2023, 7:15:31 AM
to Andrew Sengul, sbcl
Andrew Sengul <m...@imagegardenphoto.com> writes:

> My vop compilation system is coming along well but I find that compiling
> the expressions is surprisingly expensive. Evaluating code for a fairly
> simple vop causes nearly 5mb to be consed and takes 8ms. I tried evaling
> some other code fragments and found that this isn't specific to SBCL's
> vops, evaling a 180-line function from April led to consing of about
> 10mb and took 19ms. Evaluating the same code in CCL took 1.4mb/12ms and
> in ECL it took 375k/2ms. Can anyone shed some light on the reasons for
> this overhead? Are there any declarations or other parameters that can
> be used to mitigate it? The ASM kernels I'm generating are proving to be
> quite fast but the time taken to build them is now a high fraction of
> the total time used in processing arrays with them. Thanks,

Exactly how are you evaluating the code?

For one-shot code, effectively just macroexpanding and then calling
functions that are already defined and optimized, you might find that
using a non-compiling evaluator will help. You could try setting
sb-ext:*evaluator-mode* to :interpret and seeing if that causes a
measurable change in behaviour. If it does, and it's not in the right
direction (or it's not enough of a change), you could try building sbcl
with the :sb-fasteval feature and repeating the experiment.
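Something like this, at the point where the generated form is evaluated (GENERATED-FORM here is a stand-in for whatever your system builds, not an existing variable):

```lisp
;; Sketch: interpret only the dynamically built form, leaving the rest
;; of the image in the default compiling mode.
(let ((sb-ext:*evaluator-mode* :interpret))
  (eval generated-form))
```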

If that doesn't help, you'll need to provide more details: measurements;
profile graphs; specific recipes.

Christophe

Douglas Katzman via Sbcl-help

Mar 17, 2023, 8:27:57 AM
to Andrew Sengul, sbcl
For context, SBCL is not particularly great in terms of compilation speed, nor do the developers tend to focus on sheer speed of the compiler itself.  Almost anything is considered fair game in terms of slowing down the compiler if it makes the resulting code better. By adding custom vops as you have, you have placed yourself in the category of compiler developer, and then asked why your code might have made the compiler slower. There are many users who have wished for (declare (optimize (compilation-speed))) to have any impact. I would guess you're one of them. So you should probably start investigating why the compiler is, in general, not super fast, and in particular why after using define-vop to the extent you have.

To back up though, I'll say that I can't quite parse the complaint itself. If "compiling vops" is taken to mean literally "it is slow to execute define-vop", well, yes, that's not even a thing that anybody has cared about, ever. Not users, and not developers, for the reason that vops are supposed to be static. It sounds like you might not be using them statically, just guessing from your prior emails.  (And you do realize by now that define-vop is not a "blessed" thing at all, right?)

But if I assume that you didn't actually mean "compiling vops" but rather that holistically your compilation is slower than desired after adding in vops (or even without adding them in?), my suggestion would be to run the statistical time profiler and the exact space profiler (which tends to be faster than and generally more useful than the statistical profiler in its allocation-profiling mode) on the entire compiler.  While it's possible to gather both profiles at once, you should not, because you want to exclude the time overhead of space profiling and vice-versa.
Example:
* (sb-sprof:with-profiling (:sample-interval .0001 :report :flat)
    (compile-file "src/compiler/srctran"))


For the cons profiler, add --with-cons-profiling to your build features. Then do anything that compiles. In this example, it's literally COMPILE-FILE but it can be whatever you want to profile.
* (sb-aprof:aprof-run 'compile-file :arguments '("src/compiler/srctran"))
       %        Bytes        Count    Function
 -------  -----------    ---------    --------
   6.2       16461328                 SB-C::MAKE-TN
      84.6     13928816      79141        SB-C:TN
      15.4      2532512      79141        SIMPLE-BIT-VECTOR

...

Based on those profiles there may be some insight such as where adding some memoization would help.

As to other lisps - I'm not sure it's a relevant comparison, unless you're saying that those other lisps also allow you to write assembly code, and do so in a way that works better.
Doug

Andrew Sengul

Mar 19, 2023, 11:31:21 AM
to Douglas Katzman, cs...@cantab.net, sbcl
On 3/17/23 5:27 AM, Douglas Katzman wrote:
> To back up though, I'll say that I can't quite parse the complaint
> itself. If "compiling vops" is taken to mean literally "it is slow to
> execute define-vop", well, yes, that's not even a thing that anybody
> has cared about, ever. Not users, and not developers, for the reason
> that vops are supposed to be static. It sounds like you might not be
> using them statically, just guessing from your prior emails.  (And you
> do realize by now that define-vop is not a "blessed" thing at all, right?)

That is correct, I am generating vops dynamically. I've attached some
example code at the end of this message. I build the define-vop forms
and (eval) them at runtime because they depend on information that's not
available until then. The functions I'm building are customized for the
specific dimensions and element width of the array being transformed and
they compose multiple array transformations into a single loop. This
means that, for example, you could crop, mirror, stretch and rotate a
.png image without needing to write the intermediate stages of that
transformation into memory. You start with the original image in memory,
the function takes it and writes the cropped, mirrored, stretched,
rotated output image into memory without needing to store the
intermediate stages of the transformation as would be done in a more
conventional image processing approach.

On 3/17/23 4:14 AM, Christophe Rhodes wrote:
>
> For one-shot code, effectively just macroexpanding and then calling
> functions that are already defined and optimized, you might find that
> using a non-compiling evaluator will help. You could try setting
> sb-ext:*evaluator-mode* to :interpret and seeing if that causes a
> measurable change in behaviour.

Thanks, that approach has an upside and a downside. I tried evaluating a
vop form normally and it took 9ms/5.5mb consed/36mil cycles. Evaluating
the same form in :interpret mode took 1ms/65kb consed/2.3mil cycles. So
that's an improvement. But then when I try running the function SBCL
freezes until I send an interrupt to stop it. Is something in the
:interpret mode preventing the vop from compiling correctly?

I was wondering whether it would be possible to directly engage with the
functions behind define-vop that convert the forms into binary opcodes,
thus skipping all the compiler's optimization heuristics. Those compiler
features shouldn't be necessary, since I'm specifying the exact assembly
instructions to generate.

I'll try some of the other profiling tips and see what I can find out.
Appreciate the feedback.

Andrew


Sample code:

(eval
 '(progn
    (sb-c:defknown vop-ph
        ((unsigned-byte 64))
        (unsigned-byte 64)
        (sb-c:foldable sb-c:flushable sb-c:movable)
      :overwrite-fndb-silently
      t)
    (unless (fboundp 'vop-ph)
      (proclaim '(special vop-ph))
      (setf (symbol-function 'vop-ph)
            (lambda (a) a)))
    (sb-c:define-vop (vop-ph)
      (:policy :fast-safe)
      (:translate vop-ph)
      (:args (st-arg :scs (sb-vm::unsigned-reg)))
      (:arg-types sb-vm::unsigned-num)
      (:temporary (:sc sb-vm::unsigned-reg :from :eval :offset sb-vm::rax-offset)
                  ra)
      (:temporary (:sc sb-vm::unsigned-reg :from :eval) other)
      ... more temporary specs...
      (:generator 1 ...instruction forms...))
    (setf (symbol-function 'vop-ph)
          (lambda (a)
            (declare (optimize (speed 3) (safety 0)))
            (vop-ph a)))))

Christophe Rhodes

Mar 19, 2023, 4:50:28 PM
to Andrew Sengul, sbcl
Andrew Sengul <m...@imagegardenphoto.com> writes:

> On 3/17/23 4:14 AM, Christophe Rhodes wrote:
>>
>> For one-shot code, effectively just macroexpanding and then calling
>> functions that are already defined and optimized, you might find that
>> using a non-compiling evaluator will help. You could try setting
>> sb-ext:*evaluator-mode* to :interpret and seeing if that causes a
>> measurable change in behaviour.
>
> Thanks, that approach has an upside and a downside. I tried evaluating a
> vop form normally and it took 9ms/5.5mb consed/36mil cycles. Evaluating
> the same form in :interpret mode took 1ms/65kb consed/2.3mil cycles. So
> that's an improvement. But then when I try running the function SBCL
> freezes until I send an interrupt to stop it. Is something in the
> :interpret mode preventing the vop from compiling correctly?

Well, if you evaluated the function definition in :interpret mode, then
SBCL will attempt to interpret it. Judging by your definition below

>     (setf (symbol-function 'vop-ph)
>           (lambda (a)
>             (declare (optimize (speed 3) (safety 0)))
>             (vop-ph a)))))

this won't end well. If you are in *evaluator-mode* = :compile, then
this lambda will be compiled, and the compiler will use the templates
(VOPs) it knows about. If you are in *evaluator-mode* = :interpret, then
this lambda won't be explicitly compiled, so the interpreter will
interpret it, and in order to evaluate a function call to VOP-PH it will
attempt to call the function... VOP-PH.

You can explicitly request compilation using the COMPILE function:

(setf (symbol-function 'vop-ph)
      (compile nil '(lambda (a)
                      (declare (optimize speed (safety 0)))
                      (vop-ph a))))

> I was wondering if it could be possible to directly engage with the
> functions behind define-vop that convert the forms into binary
> opcodes, thus skipping all the compiler's optimization
> heuristics. Those compiler features shouldn't be necessary since I'm
> specifying the exact assembly instructions to generate.

Defining VOPs for known functions *is* how to directly engage with the
functions behind it. No other "optimization heuristics" will be
applied. (With the usual caveat that this is not a supported interface
at all.)

> The functions I'm building are customized for the specific dimensions
> and element width of the array being transformed and they compose
> multiple array transformations into a single loop.

I wonder to what extent it might be possible to parameterize this some
other way, so that you compile a somewhat more generic kernel that takes
constant (:INFO, in VOP language) arguments for some of these.
Dimensions and element-width feel like things that could be :info
parameters, for example. Then you would not need a new VOP each time
you applied the same operation on different-sized or different-width
arrays.

Likewise, it might be that the loop over elements could be regular lisp
(as simple as nested DOTIMES) and the kernel could be custom. Again,
this might mean generating rather fewer distinct VOPs.
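Schematically, something like this, where KERNEL-OP and the array names are placeholders rather than an existing interface:

```lisp
;; Placeholder sketch: the traversal stays in ordinary Lisp; only
;; KERNEL-OP would be backed by a custom VOP.
(dotimes (i rows)
  (dotimes (j cols)
    (setf (aref dst i j)
          (kernel-op (aref src i j)))))
```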

VOPs were designed substantially before I started working on SBCL, but
judging by the apparent properties of their use they seem to me to be
intended as reusable components to implement generic-ish operations
expressed in regular Common Lisp, and the distance that the user strays
from that will tend to increase the friction in using them. I would
suggest that the "normal" interface to use SBCL as a compiler for
another language would be to translate that language into Common Lisp
source, applying all the language-specific optimizations that you might
want to (in this case, recognizing a sequence of array transformations
and deforesting -- detabling? -- the intermediate arrays), emitting
Common Lisp source, and then allowing the SBCL compiler to optimize that
Common Lisp source to machine code. (And then defining VOPs for
particular operations in the translated language that do not map well to
Common Lisp, or are not optimized well by the SBCL compiler).

Christophe

Andrew Sengul

Mar 20, 2023, 11:48:07 PM
to Christophe Rhodes, sbcl
On 3/19/23 1:49 PM, Christophe Rhodes wrote:
> You can explicitly request compilation using the COMPILE function:
>
> (setf (symbol-function 'vop-ph)
>       (compile nil '(lambda (a)
>                       (declare (optimize speed (safety 0)))
>                       (vop-ph a))))

Perfect, it works now. Cuts the eval time dramatically.

>
> I wonder to what extent it might be possible to parameterize this some
> other way, so that you compile a somewhat more generic kernel that takes
> constant (:INFO, in VOP language) arguments for some of these.
> Dimensions and element-width feel like things that could be :info
> parameters, for example. Then you would not need a new VOP each time
> you applied the same operation on different-sized or different-width
> arrays.

I expect that would involve adding more conditions in the vop, for
instance depending on element type I would need a condition that chooses
between doing MOV EAX, [addr] or MOV AX, [addr] and so on. That would
add to the number of instructions. Given that the intention for the vop
generation system is processing large arrays with millions or billions
of elements, it seems like spending 1ms compiling a custom vop is better
for the sake of minimizing the number of instructions run per array
element. The explicit vops also mean that I can inline numbers that I
would otherwise need to take as arguments and put in registers, adding
more instructions and using more register space.

Also, I was looking at the (:info) options for some of SBCL's vops, but
it's not clear to me how they work. For instance, in this vop
(https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64/array.lisp#L433)
there is an addend symbol in the :info form; it must be a number, since
it's used in arithmetic, but where does the value come from?

> Likewise, it might be that the loop over elements could be regular lisp
> (as simple as nested DOTIMES) and the kernel could be custom. Again,
> this might mean generating rather fewer distinct VOPs.

Doing the loop in a discrete ASM kernel should be faster in any case
since I can store values in registers during the loop instead of needing
any cache access. Looping in CL is how I've been doing things in April
up to this point. The discrete ASM kernel generation is a further
optimization that can be employed for combinations of a few supported
operations. I've seen a speed boost of 3-4x from the ASM kernels
compared to the CL implementation.

> I would
> suggest that the "normal" interface to use SBCL as a compiler for
> another language would be to translate that language into Common Lisp
> source, applying all the language-specific optimizations that you might
> want to (in this case, recognizing a sequence of array transformations
> and deforesting -- detabling? -- the intermediate arrays), emitting
> Common Lisp source, and then allowing the SBCL compiler to optimize that
> Common Lisp source to machine code.

That's how April has been implemented up to this point; to be more
specific it compiles APL to CL code which generates a tree of CLOS
objects whose methods are called to generate a function that implements
an array transformation. The array transformation has hitherto been
accomplished by chaining together functions that may fetch or change
data from an array or do transformations on array indices. These index
conversion functions are optimized for specific types of coordinate
encoding using declarations and specified integer types, but I find the
speed is still substantially slower than an ASM kernel; I suspect it's
because the system of chained functions, no matter how optimized those
functions are, involves a lot of movement of data between cache and CPU
between functions and the CL compiler doesn't always make the best
choices for generating ASM. The dynamic vops allow me to tailor the
function for precisely the transformations being done and keep the
computation within the registers as much as possible.

Andrew

Christophe Rhodes

Mar 21, 2023, 5:12:13 AM
to Andrew Sengul, sbcl
Andrew Sengul <m...@imagegardenphoto.com> writes:

>> I wonder to what extent it might be possible to parameterize this some
>> other way, so that you compile a somewhat more generic kernel that takes
>> constant (:INFO, in VOP language) arguments for some of these.
>> Dimensions and element-width feel like things that could be :info
>> parameters, for example. Then you would not need a new VOP each time
>> you applied the same operation on different-sized or different-width
>> arrays.
>
> I expect that would involve adding more conditions in the vop, for
> instance depending on element type I would need a condition that chooses
> between doing MOV EAX, [addr] or MOV AX, [addr] and so on. That would
> add to the number of instructions.

It need not add to the number of instructions emitted by the vop: you
can use the information in :INFO parameters to decide which instructions
(or which forms of instructions) to emit.

> Also, I was looking at the (:info) options for some of SBCL's vops but
> it's not clear to me how they work. For instance in this vop
> (https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64/array.lisp#L433)
> you have an addend symbol in the :info form and it must be a number
> since it's used in arithmetic but where does the value come from?

:INFO arguments are compile-time constants.

To take a perhaps simpler example, consider ARRAY-RANK= at
https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64/array.lisp#L114.
That is used as, for example, (ARRAY-RANK= array 2), where ARRAY is an
object passed at runtime but the value being tested against is not.

If the array rank were stored as a word in the array header, this would
not be very useful, because the generated code would be much the same as
code emitted for (= (ARRAY-RANK array) 2). But in fact we store
something slightly different to encode the rank. If we know nothing
about what we are using the array-rank for, then we have to decode the
rank from the stored information, which costs instructions; if we know
that we are comparing the array rank against a known-at-compile-time
constant, we can instead compare the encoded rank from the array header
with an encoding of the constant (that's the (ENCODE-ARRAY-RANK RANK) in
the ARRAY-RANK= VOP).

Importantly for this discussion, we don't need to define separate
ARRAY-RANK=0, ARRAY-RANK=1, ARRAY-RANK=2, ARRAY-RANK=3, ... VOPs: the
VOP receives the constant argument as a parameter and can perform
ordinary Lisp logic on that parameter.
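For reference, the relevant part of that VOP looks roughly like this (abridged from the linked file; see the source for the exact definition):

```lisp
(define-vop (array-rank=)
  (:translate array-rank=)
  (:policy :fast-safe)
  (:args (array :scs (descriptor-reg)))
  (:info rank)                  ; RANK is an ordinary Lisp value here
  (:arg-types * (:constant t))  ; the (:constant t) slot feeds :info
  (:conditional :e)
  (:generator 2
    ;; ENCODE-ARRAY-RANK runs at VOP-generation time, on the constant;
    ;; only the single byte comparison is emitted.
    (inst cmp :byte (ea (- 2 other-pointer-lowtag) array)
          (encode-array-rank rank))))
```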

>> I would
>> suggest that the "normal" interface to use SBCL as a compiler for
>> another language would be to translate that language into Common Lisp
>> source, applying all the language-specific optimizations that you might
>> want to (in this case, recognizing a sequence of array transformations
>> and deforesting -- detabling? -- the intermediate arrays), emitting
>> Common Lisp source, and then allowing the SBCL compiler to optimize that
>> Common Lisp source to machine code.
>
> That's how April has been implemented up to this point; to be more
> specific it compiles APL to CL code which generates a tree of CLOS
> objects whose methods are called to generate a function that implements
> an array transformation. The array transformation has hitherto been
> accomplished by chaining together functions that may fetch or change
> data from an array or do transformations on array indices.

OK. Without seeing concrete examples I can't be sure, but I think that
what you're describing is effectively compiling your April code to a set
of "out of line" functions. If that is the case, then you could
potentially do better by compiling your April code to (still Common
Lisp) code fragments and using regular Common Lisp COMPILE on it. This
might not be as good as (effectively) hand-tuned assembly, but I'd be
surprised if it weren't substantially better than chained function
calls, and it would have the advantage of portability (both across
architectures and across Common Lisp implementations).
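Concretely, the idea is something like the following sketch, where BUILD-KERNEL, the array types, and the body are all placeholders for whatever April would emit:

```lisp
;; Placeholder sketch: emit a specialized CL lambda for the composed
;; transformation and COMPILE it once, instead of chaining out-of-line
;; functions at runtime.
(defun build-kernel (rows cols)
  (compile nil
           `(lambda (src dst)
              (declare (optimize speed (safety 0))
                       (type (simple-array (unsigned-byte 8) (,rows ,cols))
                             src dst))
              (dotimes (i ,rows)
                (dotimes (j ,cols)
                  ;; the composed transformation would go here
                  (setf (aref dst i j) (aref src i j))))
              dst)))
```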

Christophe