Re: Optimizing the performance of TransformationMatrix


Chris Harrelson

unread,
Sep 11, 2020, 4:26:04 PM
to Richard Townsend, Andrew Beggs, pain...@chromium.org
I agree with you that almost all uses don't really need that level of precision...

Any thoughts on my question about why it's slower? If it's specific to particular ARM GPUs on mobile that is something to consider.

On Fri, Sep 11, 2020 at 12:47 PM Richard Townsend <Richard....@arm.com> wrote:
(+paint-dev on CC)

So, assuming that values-in must equal values-out for things like DOMMatrix, a couple of additional things come to mind:
  • It may be feasible to implement TransformationMatrix so that it supports both single and double precision (e.g. via templating) if we can find APIs which don’t require double precision.
  • Given that most of the things that go back and forth between JS and Blink are in screen space, there may be a sensible maximum useful precision (e.g. I would not expect a position of 0.01224px to produce a visible difference); I wonder if there’s a way to exploit that to keep things stable.
The code’s worked in double-precision in this way since at least 2016 (probably for much longer).

Best
Richard 


From: Chris Harrelson <chri...@chromium.org>
Sent: Friday, September 11, 2020 7:45 pm
To: Richard Townsend
Cc: Andrew Beggs
Subject: Re: Optimizing the performance of TransformationMatrix
 
Hi Richard,

Interesting investigation!

First question: do you mind if I cc pain...@chromium.org?

More comments inline.

On Fri, Sep 11, 2020 at 10:15 AM Richard Townsend <Richard....@arm.com> wrote:
Hi Chris,

Andrew Beggs (CC'd) is interning with the Arm Chromium team and has been looking into the performance of TransformationMatrix operations. We've found that reducing the internal precision of some operations (from double to float) gives a nice performance boost (up to 5% on blink_perf.svg, see table), with potentially more to come for 32-bit Arm systems and from the additional optimizations this enables (essentially, halving the precision means that Arm's floating-point SIMD extension can handle twice as many matrix elements at once).

Is there a reason to expect double always to lag float by a few percent, regardless of CPU? Is it slower because there are fewer double registers, or because of more data cache misses?
 

Here's a typical result on a Pixel 2, 64-bit:


Explanation of each column:
  • float is replacing some hot operations inside transformation_matrix.h (as well as the matrix storage format) with float instead of double
  • double is the current baseline
  • extra-floaty is the hot operations out of transformation_matrix.h, with affine_transform.cc converted to float too
  • really-floaty is extra-floaty with every operation (including constructors etc) replaced with floats.
It seems that even with very little optimization in place, reducing the precision could improve performance in this area (and potentially improve framerate / smoothness on things like MotionMark). Before we go implement all these optimizations, polish the patches for upstream etc., I'd like to quickly get your thoughts on a few things:
  • Has anybody previously tried this?
Not that I can recall right now.
 
  • Is there any particular reason why TransformationMatrix operations are computed with double precision?
I'm assuming it is because it leads to greater numerical stability, though the code is very old. Was it always double, have you checked the code history? 
    • There are around 218 rendering differences with these optimizations in place, but they're mostly indistinguishable to the naked eye (see below for a typical example).
    • Biggest issue: getClientRects etc returning very slightly different values (e.g. 0.0000019073486328125 instead of 0)
  • If we progress in a really-floaty direction, this might affect the current Intel optimizations in transformation_matrix.h. We hope that Intel will also be faster, but it might be slower until we can change the optimizations, and we lack the expertise within the team to reoptimize things for Intel. (Is there anybody we could loop in on the Google side that'd be interested in working on this?)
  • There's a specification issue with the interaction with JavaScript: double-precision values must be given out by the V8 bindings, but the specification does not currently state that anything must be stored/internally computed with double precision. Is it worth expanding / refining the spec in this area?
DOMMatrix also requires doubles, and is I think backed by TransformationMatrix. Changing to float for the values will make values not round-trip or lose precision, and therefore not be web compatible.
 

Looking forward to hearing your thoughts on this one.

Best
Richard
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Philip Rogers

unread,
Sep 11, 2020, 4:43:22 PM
to Chris Harrelson, Richard Townsend, Andrew Beggs, pain...@chromium.org
If it's possible to use float, can we switch blink::TransformationMatrix to one of the existing matrix classes? There is a lot of duplicate logic in this area, across SkMatrix44 (deprecated), SkM44 (new), gfx::Transform, and blink::TransformationMatrix.


Richard Townsend

unread,
Sep 11, 2020, 5:00:16 PM
to Chris Harrelson, Andrew Beggs, pain...@chromium.org
We haven’t (yet) got into the deep details of why floats are faster, but it’s probably a combination of:

- Less data read from / written to memory, better cache utilisation
- Scalar float operations (e.g. multiplying two floats versus two doubles) can be faster on some CPUs

For SIMD-accelerated operations that operate in 128-bit registers (e.g. matrix multiplication), I hope that dropping to single precision will let us process 2x the elements per cycle, so potentially a bit more performance to come (if this is feasible.)

Best
Richard


From: Chris Harrelson <chri...@chromium.org>
Sent: Friday, September 11, 2020 9:25:49 PM
To: Richard Townsend <Richard....@arm.com>
Cc: Andrew Beggs <Andrew...@arm.com>; pain...@chromium.org <pain...@chromium.org>

Xianzhu Wang

unread,
Sep 11, 2020, 5:17:12 PM
to Chris Harrelson, Richard Townsend, Andrew Beggs, pain...@chromium.org
I did a bit of investigation into the performance of TransformationMatrix/SkMatrix/SkMatrix44 some time ago, but haven't had time to dig further. The results might be out of date; they're available from the try results of https://chromium-review.googlesource.com/c/chromium/src/+/1771388

Here's what I found:
1. Performance of TransformationMatrix multiplication was much faster than SkMatrix44.
2. Performance of TransformationMatrix inversion was much slower than SkMatrix44.

The times were:
10*10000000 multiplications:
                                  arm32  arm64  x86_64
   TransformationMatrix assembly:  n/a   0.30s  0.15s
   TransformationMatrix C++:      0.75s  0.62s  0.19s
   SkMatrix(3x3):                 0.62s  0.40s  0.21s
   SkMatrix44:                    3.2s   1.43s  0.85s

10*10000000 inversions:
                                  arm32  arm64  x86_64
   TransformationMatrix assembly:  n/a   1.5s   n/a
   TransformationMatrix C++:      2.5s   1.8s   0.75s
   TransformationMatrix C++ loop manually unrolled:
                                  2.4s   1.6s   0.80s
   SkMatrix(3x3):                  [*]   0.41s  0.15s
   SkMatrix44:                     [*]   0.66s  0.28s

[*] not available because of incomplete log



On Fri, Sep 11, 2020 at 1:26 PM Chris Harrelson <chri...@chromium.org> wrote:

Richard Townsend

unread,
Oct 2, 2020, 7:20:09 AM
to Xianzhu Wang, Chris Harrelson, Andrew Beggs, pain...@chromium.org

Cool, so we dug into it a bit more on this end. Using Xianzhu's benchmarking code, Andrew managed to find a good optimization (5-10% faster on a Pixel 2, 64-bit) for TransformationMatrix::Inverse by using the adjoint matrix. Unfortunately it didn't produce a noticeable improvement on the blink_perf.css / blink_perf.paint etc. benchmarks we can run via Pinpoint, but I'll leave it here if anyone's interested:

 

https://chromium-review.googlesource.com/c/chromium/src/+/2440097

 

So IMO this is an interesting datapoint: there could be a few more optimizations in there for double precision, but maybe nothing too dramatic. From studying the problem closely, it does seem that the reason Inverse is unusually hard to optimize is that it's difficult to load, add and multiply enough 64-bit floats at the same time when you only have 128-bit SIMD registers to work with. Dropping to single precision like SkMatrix44 means you can potentially multiply 4 numbers at a time (rather than 2), and that could be a good part of the reason why SkMatrix44's inverse is so much faster, despite not using any SIMD that I can see. SkMatrix44's multiplication similarly drops back to some fairly naive scalar code in some cases:

 

https://source.chromium.org/chromium/chromium/src/+/master:third_party/skia/src/core/SkMatrix44.cpp;l=416;drc=fd81134e0f39ca711ee71ca951f74411c0bd793f;bpv=1;bpt=1

 

There’s some potentially inefficient float -> double conversion going on, memcpys etc which could probably be looked at further. Unfortunately we’re out of time on this one for now (Andrew’s internship is nearly over), but I think there’s plenty to pick up when we’re next doing some optimization, especially if a dual-precision solution (single-precision TransformationMatrix for CSS transforms, double-precision for DOMMatrix?) might work. Do you think that’d be practical?

 

Best

Richard

Mason Freed

unread,
Oct 2, 2020, 12:50:32 PM
to Richard Townsend, Xianzhu Wang, Chris Harrelson, Andrew Beggs, pain...@chromium.org
Thanks for looking into this, Richard. I just wanted to add one more suggestion (or at least data point). Several years ago, I looked into changing gfx::Transform::IsInvertible() to use the determinant instead of actually performing the inversion. (See the abandoned CL here.) We do a lot of checking for invertibility, so it would seem that this should have a perf impact. However, I ran into a problem with the check/change added here, which also checks the scaling of the inverted value, thus requiring full inversion. Perhaps someone with better linear algebra skills than mine has better ideas, and could eliminate the inversion while keeping the scaling checks.

Thanks,
Mason

