[Sbcl-help] comparing simple-array and displaced array performance

Anticrisis

non lue,

13 sept. 2023, 22:58:3213/09/2023

à sbcl...@lists.sourceforge.net

Simple arrays are really fast and often generate SIMD operations. I am wondering if we can get similar code generation when displacing to simple arrays. This seems doable because the same analysis being done on loops over simple arrays should apply, while the overhead of a displaced array could happen outside the loop.

Adding two large simple-arrays is extremely fast with the appropriate type declarations. With a general type declaration, it is 5x slower. And with displaced arrays, it is 9x slower. Output is below, and I've attached the source file with benchmark code.

;; ; processing (GC :FULL ...)
;; ; processing (BENCH-SIMPLE)
;; - SAMPLES TOTAL MINIMUM MAXIMUM MEDIAN AVERAGE DEVIATION
;; REAL-TIME 100 0.96 0 0.02 0.01 0.0096 0.0028
;; RUN-TIME 100 0.961262 0.00883 0.019248 0.009154 0.009613 0.001402
;; USER-RUN-TIME 100 0.941652 0.000009 0.013764 0.009153 0.009417 0.00137
;; SYSTEM-RUN-TIME 100 0.019886 0 0.019248 0 0.000199 0.001915
;; PAGE-FAULTS 100 0 0 0 0 0 0.0
;; GC-RUN-TIME 100 0 0 0 0 0 0.0
;; BYTES-CONSED 100 32752 0 32752 0 327.52 3258.783
;; EVAL-CALLS 100 0 0 0 0 0 0.0
;; ; processing (GC :FULL ...)
;; ; processing (BENCH-SIMPLE-GENERAL)
;; - SAMPLES TOTAL MINIMUM MAXIMUM MEDIAN AVERAGE DEVIATION
;; REAL-TIME 100 5.340001 0.049999 0.060001 0.05 0.0534 0.004737
;; RUN-TIME 100 5.335809 0.051912 0.060029 0.052958 0.053358 0.001482
;; USER-RUN-TIME 100 5.306036 0.029028 0.060032 0.052958 0.05306 0.002777
;; SYSTEM-RUN-TIME 100 0.030087 0 0.02994 0 0.000301 0.002979
;; PAGE-FAULTS 100 0 0 0 0 0 0.0
;; GC-RUN-TIME 100 0 0 0 0 0 0.0
;; BYTES-CONSED 100 0 0 0 0 0 0.0
;; EVAL-CALLS 100 0 0 0 0 0 0.0
;; ; processing (GC :FULL ...)
;; ; processing (BENCH-DISPLACED)
;; - SAMPLES TOTAL MINIMUM MAXIMUM MEDIAN AVERAGE DEVIATION
;; REAL-TIME 100 9.12 0.089999 0.100001 0.09 0.0912 0.00325
;; RUN-TIME 100 9.126611 0.089089 0.102918 0.090797 0.091266 0.001971
;; USER-RUN-TIME 100 9.116704 0.089094 0.098197 0.0908 0.091167 0.001594
;; SYSTEM-RUN-TIME 100 0.010222 0 0.01008 0 0.000102 0.001003
;; PAGE-FAULTS 100 0 0 0 0 0 0.0
;; GC-RUN-TIME 100 0 0 0 0 0 0.0
;; BYTES-CONSED 100 0 0 0 0 0 0.0
;; EVAL-CALLS 100 0 0 0 0 0 0.0

disp-array.lisp

Michał "phoe" Herda via Sbcl-help

non lue,

14 sept. 2023, 02:36:4514/09/2023

à sbcl...@lists.sourceforge.net

Isn't that impossible in the general case because ADJUST-ARRAY can
change the array that the first array is displaced to? The second array
can then no longer be a simple-array (it can itself be displaced) or it
can have a different element type, both of which will break the
assumptions of your optimized code.

> _______________________________________________
> Sbcl-help mailing list
> Sbcl...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/sbcl-help

_______________________________________________
Sbcl-help mailing list
Sbcl...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-help

Anticrisis

non lue,

14 sept. 2023, 03:09:3114/09/2023

à sbcl...@lists.sourceforge.net

But these are not adjustable arrays? (make-array :adjustable nil) by default. Maybe I don't understand what you are saying. The base array is a simple-array, and you want to benefit from the optimisations applied to loops over simple-arrays, when using (non-adjustable) arrays displaced to simple-arrays. Or does the standard specify that displaced arrays must be adjustable?

Aha, so I'm now reading this: CLHS: Issue ADJUST-ARRAY-NOT-ADJUSTABLE Writeup (lispworks.com) it's a bit confusing, but it does seem like the implementation has some choices about what it considers a simple-array. For the purposes of loop optimisation, couldn't the implementation operate on the displaced-to array inside the loop, rather than checking at each iteration that it may have changed? Maybe only at (safety 0)?

With consequences undefined if someone decides to adjust it inside the loop? Which I don't want to do, otherwise I would have used (make-array :adjustable t). It seems like the standard isn't entirely clear on that point and there are options for implementers.

Michał "phoe" Herda via Sbcl-help

non lue,

14 sept. 2023, 03:28:1114/09/2023

à sbcl...@lists.sourceforge.net

First of all, apologies: the part "it can have a different element type" is actually forbidden by the spec.

Still, look at this example:

CL-USER> (defvar *a* (make-array 5 :initial-contents '(1 2 3 4 5)
                                   :element-type '(unsigned-byte 8)))
*A*

CL-USER> (defvar *b* (make-array 5 :displaced-to *a*
                                   :element-type '(unsigned-byte 8)))
*B*

CL-USER> (eq (array-displacement *b*) *a*)
T

CL-USER> (defvar *c* (make-array 5 :displaced-to *a*
                                   :element-type '(unsigned-byte 8)))
*C*

CL-USER> (adjust-array *b* 5 :displaced-to *c*)
#<(VECTOR (UNSIGNED-BYTE 8) 5) {1011F11FEF}>

CL-USER> (eq * *b*)
T

CL-USER> (eq (array-displacement *b*) *c*)
T

Here, ADJUST-ARRAY is used to destructively (see the EQ calls) mutate *B*, so that it is displaced to a displaced array *C*, rather than the simple array *A*. This means that we now have two layers of indirection (B -> C -> A) rather than just one (B -> A), at which point your code that was compiled with just one layer of indirection in mind is going to fail.

(The destructive behavior is not mandated by the spec, but it's made possible, and SBCL makes use of this to avoid copying.)

I assume that changing this behavior would require new specialized types for displaced arrays, including their expected indirection level, so that the compiler can get the information it needs to try and emit optimized code...

---------------------------

...but it gets worse, possibly preventing the above optimization. There is no guarantee that this indirection won't change during the course of a single function.

Think of the above example, but multithreaded: one thread is accessing *B* in a loop while another thread calls ADJUST-ARRAY on *B* (or *C*!) and effectively changes the storage accessed by the first thread. The proper way to deal with this would be to respect the indirection, which means that this indirection needs to find its way into the final code for each array access, and this is going to prevent optimizations.

The above can't easily get "fixed" by the compiler. However, if you want to get a "static" hold on the actual storage and not have it moved by other threads during your loop, your function can probably call ARRAY-DISPLACEMENT itself until it gets a non-displaced array, and then it can access that instead.

Anticrisis

non lue,

14 sept. 2023, 03:48:4614/09/2023

à sbcl...@lists.sourceforge.net

I see now, I didn't think about multiple levels of displacement and how that would make compiler optimisation possibly less appropriate. So the right solution is to repeatedly call ARRAY-DISPLACEMENT, adding up the offsets, until you get NIL, then do your tight loop on that last array, which we know is a simple-array because we made it that way... I think that's a reasonable burden, let me try it out.

Thank you!

Répondre à tous

Répondre à l'auteur

Transférer