Skia CPU : Setting Rotation makes Image rendering 40 times slow

455 views
Skip to first unread message

Bharat Ahuja

unread,
Nov 15, 2018, 6:40:25 AM11/15/18
to skia-discuss
Hi,
    We have a scenario where we need to render an image onto a raster SkSurface. The image is rendered via an SkShader, i.e. we create a shader from the SkImage and attach the shader to the SkPaint. When we are creating Skshader with a rotation matrix we can see that rendering takes almost 40 times more time than normal scenarios which seems little odd. I expect that with rotation pixel processing time would increase but this increase seems too much if i consider other graphics libraries. I tried to use windows' D2D software back end to understand additional time in rendering and it seems like D2D can do the rendering in almost 2 * (time taken with out rotation). Are we doing something wrong here? If not, can someone please explain why skia software performs badly in comparison to others? Is there any pipe line change happening underneath?

FYI we are using Skia m64 build. 

    int width = 500;
    int height = 500;

    SkImageInfo imageInfo = SkImageInfo::Make(width, height, kRGBA_8888_SkColorType, kPremul_SkAlphaType);

    size_t stride = width * 4;
    SkBitmap bitmap;
    (void)bitmap.tryAllocPixels(imageInfo,stride);
    uint32_t* pixels = reinterpret_cast< uint32_t* >( bitmap.getPixels() );
    for( int i = 0; i < width; i++ )
    for( int j = 0; j < height; j++ )
    {
      *pixels = i % 8 ? ( (255 << 24) + 255 ) : ( (255 << 24) + (255 << 8) + 255 );
       pixels++;
    }

    sk_sp<SkImage> image = SkImage::MakeFromBitmap( bitmap );
    SkPaint paint;
    SkMatrix matrix;
    matrix.setRotate(45);
    paint.setFilterQuality( kLow_SkFilterQuality );
    paint.setShader(image->makeShader(SkShader::kClamp_TileMode,SkShader::kClamp_TileMode,
                                      &matrix));

   SkRect rect = { (float)0, (float)0, (float)width , (float)height };

   auto start = std::chrono::high_resolution_clock::now(); 
   canvas->drawRect(rect, paint);
   auto stop = std::chrono::high_resolution_clock::now();
   long long duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
   //LOGE( "total time microseconds = %llu", duration );

with rotation matrix  // matrix.setRotate(45);

total time microseconds = 272965
total time microseconds = 275342
total time microseconds = 276618
total time microseconds = 274913
total time microseconds = 276534

with identity matrix  //  matrix.setIdentity( );

total time microseconds = 7996
total time microseconds = 3954
total time microseconds = 7328
total time microseconds = 4252
total time microseconds = 4350

with scaling matrix // matrix.setScale( 14.2, 1.8 );

total time microseconds = 15604
total time microseconds = 25151
total time microseconds = 15506
total time microseconds = 23355
total time microseconds = 18398




Mike Klein

unread,
Nov 15, 2018, 8:58:58 AM11/15/18
to skia-d...@googlegroups.com
Are you building with the MSVC compiler (cl.exe)?  Some parts of Skia's software backend are optimized only when compiled by Clang (clang-cl.exe).

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To post to this group, send email to skia-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/skia-discuss.
For more options, visit https://groups.google.com/d/optout.

Bharat Ahuja

unread,
Nov 15, 2018, 9:13:14 AM11/15/18
to skia-discuss
Actually I *am* using clang.

I am building on a windows machine and using clang++ from androidndk.16.1.1
androidndk.16.1.1\toolchains\llvm\prebuilt\windows-x86_64\bin\clang++.exe

Andrew Ellerton

unread,
Nov 15, 2018, 9:20:05 AM11/15/18
to skia-d...@googlegroups.com
Hi Mike,
Although tangential to the OP's question, this has been on my mind for a while and now is a good time to ask: Is clang on windows a widely supported platform for skia? I ask because I tried (admittedly briefly) and then went back to vanilla msvc, in part because I found some old posts saying that clang wasn't widely supported.

Interested to get your view...

Regards 
Andrew 

Mike Klein

unread,
Nov 15, 2018, 9:21:46 AM11/15/18
to skia-d...@googlegroups.com
Yes, we're happy to support Clang across the board.  Chromium's using Clang exclusively now, even on Windows.

Andrew Ellerton

unread,
Nov 15, 2018, 9:23:15 AM11/15/18
to skia-d...@googlegroups.com
Wow that's great!

Are there any particular setup instructions you recommend? 

Mike Klein

unread,
Nov 15, 2018, 9:24:07 AM11/15/18
to skia-d...@googlegroups.com
Sorry, I saw the reference to Windows and D2D and assumed you were building for Windows.  You're targeting Android?  I'm not familiar with "androidndk.16.1.1"... is that somehow related to the r16 release of the Android NDK (https://developer.android.com/ndk/downloads/older_releases)?

Corey Lucier

unread,
Nov 15, 2018, 9:27:43 AM11/15/18
to skia-discuss
That particular build is very very old though, can you try with latest M70 or above if feasible?
Also are these profile times with a release build or debug? Debug Skia CPU can be very very slow in some code paths.

Bharat Ahuja

unread,
Nov 15, 2018, 9:29:14 AM11/15/18
to skia-discuss
yes sorry for the confusion yes i am using ndk r16 release. Actually i am building skia for android but why i mentioned windows and D2D is I wanted to understand whether other rendering engines also take such larger time on applying rotation matrix so i tried windows software D2D rendering engine on mobile platform with similar hardware capabilities as my android device.

Bharat Ahuja

unread,
Nov 15, 2018, 9:32:02 AM11/15/18
to skia-discuss
I 'll see if i can try the newer build. These are release build run time.

Corey Lucier

unread,
Nov 15, 2018, 9:33:25 AM11/15/18
to skia-discuss
Andrew,

In fact if you aren't using Clang on windows you are in for a much slower ride. The Skia team recently made some performance tradeoffs which essentially prefer Clang, resulting in builds with MSVC becoming much slower. 

Just FYI.


Andrew Ellerton

unread,
Nov 15, 2018, 9:47:33 AM11/15/18
to skia-d...@googlegroups.com
Thanks Corey. My current work is windows-focussed, and I definitely had a few raised eyebrow moments at the msvc compiled binary performance, and not in a good way ;)

I'll definitely give clang a go. Thanks for the heads up, to you and Mike. 


Mike Klein

unread,
Nov 15, 2018, 9:49:00 AM11/15/18
to skia-d...@googlegroups.com
Yep.  We definitely have intentionally made MSVC-based builds slower so that we can make Clang-based builds faster.  Or really, I should say, more capable and maintainable... most of the changes we've made restructure the software drawing backend so that it's not too much of a maintenance burden and help make it easy to keep correctness and feature parity with the other backends.

I should probably add that about 99% of our attention to Android and iOS is focused on performance when using GPU-accelerated drawing.  All those devices have GPUs, and if you want good performance you'll really want to use them.  Skia's GPU backend has been used by Chrome on Android for a long time, and is the backend for Flutter on Android and iOS, and lately draws the Android system UI itself.

Mike Reed

unread,
Nov 15, 2018, 11:39:25 AM11/15/18
to skia-d...@googlegroups.com
Some slow down is expected between identity bitmap draws and yours

1. rotated bitmaps means sampling memory(somewhat) out of order, so we get more cache misses
2. you've set kLow filter quality, which asks us to perform a blierp filter per pixel (which means read 4 pixels and compute a weighted average).
   - this setting is cheaper/ignored if the matrix is identity.

Bharat Ahuja

unread,
Nov 15, 2018, 11:59:14 AM11/15/18
to skia-discuss
yes i expect a slowdown and do understand theses details but 40 times seems too much 

Mike Klein

unread,
Nov 15, 2018, 12:32:59 PM11/15/18
to skia-d...@googlegroups.com
Want to post up your libskia.a / libskia.so somewhere and I can take a look at it to see what's up?  Also, can you post up as much detail as possible about how you're building Skia?

Bharat Ahuja

unread,
Nov 16, 2018, 6:38:30 AM11/16/18
to skia-discuss
seems like rotation scenario goes through RasterPipeline instead of Legacy Pipeline.

bool SkImageShader::IsRasterPipelineOnly(const SkMatrix& ctm, SkColorType ct, SkAlphaType at,
                                         SkShader::TileMode tx, SkShader::TileMode ty,
                                         const SkMatrix& localM) {
    if (ct != kN32_SkColorType) {
        return true;
    }
    if (at == kUnpremul_SkAlphaType) {
        return true;
    }
    if (tx != ty) {
        return true;
    }

    if (!ctm.isScaleTranslate()) {
        return true;
    }
    if (!localM.isScaleTranslate()) {
        return true;
    }
    return false;
}


location for stripped and unstripped so


we are building skia using  clang ndk r16 release with below flag settings

LOCAL_ARM_NEON = 1
LOCAL_CFLAGS = $(LOCAL_CFLAGS) -mfloat-abi=softfp -mfpu=neon
LOCAL_ASFLAGS = $(LOCAL_ASFLAGS) -mfpu=neon

CFLAGS output in console

  -target armv7-none-linux-androideabi -fdiagnostics-format=msvc -Werror=return-type  -Werror=unused-variable -Werror=address-of-packed-member -gcc-toolchain e:/androidndk.16.1.1/toolchains/arm-linux-androideabi-4.9/prebuilt/windows-x86_64 -D__stdcall= -D_stdcall= -D__cdecl= -D_cdecl= -D__thiscall= -D_thiscall= -D__vectorcall= -D_vectorcall= -D_fastcall=__fastcall  -fno-elide-type -fdiagnostics-show-template-tree -gdwarf-2 -fno-standalone-debug  -DNDEBUG -Os -fomit-frame-pointer  -Wno-c++1z-compat  -Wno-deprecated-declarations  -Wno-deprecated  -Wno-expansion-to-defined  -Wno-ignored-pragma-intrinsic  -Wno-multichar  -Wno-nonportable-include-path  -Wno-undefined-var-template  -Wno-write-strings  -mthumb  -march=armv7-a  -mfloat-abi=softfp  -D__ARM_ARCH_7__ -D__ARM_ARCH_7A__  -mfpu=vfp  -fmessage-length=0  -fasm-blocks  -fvisibility-inlines-hidden  -DUNICODE  -D_UNICODE  -DANDROID  -D__ANDROID_API__=21  -D_ANDROID  -DUNDER_ANDROID=1  -DUNDER_ANDROID2=1  -DANDROID_TEMP  -DUNDER_NONWIN=1  -DCOMPILER_GCCE  -DSAFECRT_VISIBLE  -DNSdk=  -DANDROID_NO_WARNING  -D_LCONV_DEFINED  -DNO_ENUM_FORWARD_DECLARATION  -DWIN32_LEAN_AND_MEAN  -D_WIN7_PLATFORM_UPDATE=1 -D_WCHAR_IS_ALWAYS_SIGNED  -DCRYPT_OID_INFO_HAS_EXTRA_FIELDS  -DWINVER=0x0A00  -D_WIN32_WINNT=0x0A00  -D_WIN32_IE=0x0800 -DDEVL=1 -DWIN32_LEAN_AND_MEAN=1 -DSTD_CALL -DSHIP -DNDEBUG -DDBG=0 -DFPO=1  -DSK_BUILD_FOR_APP -DSK_SUPPORT_LEGACY_TILED_BITMAPS -DSK_ARM_HAS_NEON  -DNOMINMAX    -DWINVER=0x0A00    -DISOLATION_AWARE_ENABLED=0  -D_XMMINTRIN_H_INCLUDED  -DARM  -D_ARM_  -D_M_ARM  -D_MoveToCoprocessor=_MoveToCoprocessorStub  -D__emit=__emitStub  -D__readfsdword=__readfsdwordStb  -mstackrealign -mstack-alignment=16 -fshort-wchar -fsigned-char  -ferror-limit=20  -fms-extensions  -fmsc-version=1700  -D__PRETTY_FUNCTION__=__FUNCTION__  -Wno-microsoft  -D MS_CLANG_MS_EXTENSIONS=1  -Wno-delete-non-virtual-dtor  -fshort-wchar  -D HAS_PRAGMA_MS_STRUCT=1  -frandom-seed=0  -fstack-protector  -Winvalid-pch  -D _FORTIFY_SOURCE=2  -fno-strict-overflow -Wno-attributes -Wno-comment -Wno-extra-tokens  -Wno-constant-logical-operand  -Wno-extern-initializer  -Wno-ignored-attributes  -Wno-invalid-token-paste  -Wno-microsoft  -Wno-missing-declarations  -Wno-null-dereference  -Wno-parentheses  -Wno-unused-value  -Wno-invalid-offsetof -ffunction-sections -fdata-sections  -mfloat-abi=softfp -mfpu=neon 

Mike Klein

unread,
Nov 16, 2018, 9:54:29 AM11/16/18
to skia-d...@googlegroups.com
That's funny, I'm able to disassemble your .so's, but they don't make a lot of sense inside: I'm seeing different instructions than I'd expect in known functions, and lots and lots of undecodeable instructions.  It's like objdump is not getting the format right.

Anyway, from what I can glean without looking at the disassembly per se, it would be worth using a newer version of Skia (say, m71).  That code in m64 is pretty old, and since then we've made some code generation improvements, and implemented the somewhat faster low-precision backend on more platforms like ARMv7.

What build system are you using to build Skia?  Is it something other than skia.org/user/build?  Some sort of Android.mk-based build system?  It looks like you have a funny mashup of Windows, x86, Android, and ARM build flags set, and I'm seeing a bunch of flags (like -Wno-c++1z-compat) that I'm pretty sure we've never had in our build configuration.  Between the two settings of -mfpu=vfp and -mfpu=neon, I'm not sure which wins.  I'm also not sure what the effect of -mfloat-abi=softfp will be here... I think nothing, but it's worth fiddling around with.

Bharat Ahuja

unread,
Nov 16, 2018, 11:35:41 AM11/16/18
to skia-discuss

yes, I know this is messy we have our own Android.mk-based build system and don't use skia build system. From my experience mfpu=neon wins.


anyways I'll try on the newer version and check if anything changes.

Corey Lucier

unread,
Dec 3, 2018, 8:18:03 AM12/3/18
to skia-discuss
Any update Bharat?

Abhishek Gupta

unread,
Dec 3, 2018, 8:39:16 AM12/3/18
to skia-discuss
Hi Corey,
We have currently paused that task (which resulted in this scenario) as upgrading to a newer version of Skia would take some time in our current setting.
We will post our results here once we have upgraded to a newer version.

Thanks,
Abhishek



On Monday, December 3, 2018 at 6:48:03 PM UTC+5:30, Corey Lucier wrote:
Any update Bharat?

Bharat Ahuja

unread,
Jul 31, 2019, 8:46:12 AM7/31/19
to skia-discuss
HI, Corey, we updated to M74 it seems like skia has improved but still not comparable to D2D or GDI Plus. I have seen in other threads people reporting the same behaviour( skia is lagging from other rendering libraries ).

Skia probably has the best performance I've seen for rendering, but one area that it seems to struggle is when drawing/processing images. I would really like if we can fix it. If you can help me figure out the bottleneck code, maybe I can help in fixing the issue?

Mike Klein

unread,
Jul 31, 2019, 10:29:34 AM7/31/19
to skia-d...@googlegroups.com
Want to post up your new binaries like you did last time?

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.

Bharat Ahuja

unread,
Jul 31, 2019, 10:59:02 AM7/31/19
to skia-discuss
Some performance numbers I measured

 {"Skia Running": false, "Bitmap Tranformed": false, "time in microseconds": 1303}
 {"Skia Running": false, "Bitmap Tranformed": true(45), "time in microseconds": 2462}
 {"Skia Running": true, "Bitmap Tranformed": false, "time in microseconds": 912}
 {"Skia Running": true, "Bitmap Tranformed": true(45), "time in microseconds": 7147}

Bharat Ahuja

unread,
Jul 31, 2019, 11:33:26 AM7/31/19
to skia-discuss

Mike Klein

unread,
Jul 31, 2019, 12:04:17 PM7/31/19
to skia-d...@googlegroups.com
Hmm, that looks a lot more sensible, and probably everything is now working as intended, though I'm still a little surprised by the quality of the code generation.  E.g. the function neon::lowp::clear(),

0006797c <neon::lowp::clear(neon::lowp::Params*, void**, unsigned short __vector(8), unsigned short __vector(8), unsigned short __vector(8), unsigned short __vector(8))>:
   6797c:       b5d0            push    {r4, r6, r7, lr}
   6797e:       af02            add     r7, sp, #8
   67980:       466c            mov     r4, sp
   67982:       f36f 0401       bfc     r4, #0, #2
   67986:       46a5            mov     sp, r4
   67988:       f1a7 0408       sub.w   r4, r7, #8
   6798c:       f851 2b04       ldr.w   r2, [r1], #4
   67990:       ef80 0050       vmov.i32        q0, #0  ; 0x00000000
   67994:       ef80 2050       vmov.i32        q1, #0  ; 0x00000000
   67998:       ef80 4050       vmov.i32        q2, #0  ; 0x00000000
   6799c:       ef80 6050       vmov.i32        q3, #0  ; 0x00000000
   679a0:       46a5            mov     sp, r4
   679a2:       e8bd 40d0       ldmia.w sp!, {r4, r6, r7, lr}
   679a6:       4710            bx      r2

That's fine, but I'd have expected something a little more streamlined,

00000000 <neon::lowp::clear(neon::lowp::Params*, void**, unsigned short __vector(8), unsigned short __vector(8), unsigned short __vector(8), unsigned short __vector(8))>:
   0:   f851 2b04       ldr.w   r2, [r1], #4
   4:   ef80 0050       vmov.i32        q0, #0  ; 0x00000000
   8:   ef80 2050       vmov.i32        q1, #0  ; 0x00000000
   c:   ef80 4050       vmov.i32        q2, #0  ; 0x00000000
  10:   ef80 6050       vmov.i32        q3, #0  ; 0x00000000
  14:   4710            bx      r2

Or for a function that's more complicated and more relevant to bicubic sampling,

00063dd8 <neon::bicubic_n3x(neon::Params*, void**, float __vector(4), float __vector(4), float __vector(4), float __vector(4))>:
   63dd8:       b5d0            push    {r4, r6, r7, lr}
   63dda:       af02            add     r7, sp, #8
   63ddc:       b084            sub     sp, #16
   63dde:       466c            mov     r4, sp
   63de0:       f36f 0403       bfc     r4, #0, #4
   63de4:       46a5            mov     sp, r4
   63de6:       680a            ldr     r2, [r1, #0]
   63de8:       efc7 0f50       vmov.f32        q8, #1  ; 0x3f800000
   63dec:       f102 0380       add.w   r3, r2, #128    ; 0x80
   63df0:       f963 2a0f       vld1.8  {d18-d19}, [r3]
   63df4:       a312            add     r3, pc, #72     ; (adr r3, 63e40 <neon::bicubic_n3x(neon::Params*, void**, float __vector(4), float __vector(4), float __vector(4), float __vector(4))+0x68>)
   63df6:       ef60 0de2       vsub.f32        q8, q8, q9
   63dfa:       f963 2acf       vld1.64 {d18-d19}, [r3]
   63dfe:       a314            add     r3, pc, #80     ; (adr r3, 63e50 <neon::bicubic_n3x(neon::Params*, void**, float __vector(4), float __vector(4), float __vector(4), float __vector(4))+0x78>)
   63e00:       f963 4acf       vld1.64 {d20-d21}, [r3]
   63e04:       466b            mov     r3, sp
   63e06:       ff40 2df2       vmul.f32        q9, q8, q9
   63e0a:       ff40 0df0       vmul.f32        q8, q8, q8
   63e0e:       ef42 2de4       vadd.f32        q9, q9, q10
   63e12:       f962 4a0f       vld1.8  {d20-d21}, [r2]
   63e16:       f502 7280       add.w   r2, r2, #256    ; 0x100
   63e1a:       ff40 0df2       vmul.f32        q8, q8, q9
   63e1e:       ffc7 2f58       vmov.f32        q9, #-1.5       ; 0xbfc00000
   63e22:       ef04 0de2       vadd.f32        q0, q10, q9
   63e26:       f943 0aef       vst1.64 {d16-d17}, [r3 :128]
   63e2a:       f942 0a0f       vst1.8  {d16-d17}, [r2]
   63e2e:       684a            ldr     r2, [r1, #4]
   63e30:       3108            adds    r1, #8
   63e32:       4790            blx     r2
   63e34:       f1a7 0408       sub.w   r4, r7, #8
   63e38:       46a5            mov     sp, r4
   63e3a:       bdd0            pop     {r4, r6, r7, pc}


But I'd have expected something a bit more like this,

00000000 <neon::bicubic_n3x(neon::Params*, void**, float __vector(4), float __vector(4), float __vector(4), float __vector(4))>:
   0:   680a            ldr     r2, [r1, #0]
   2:   efc7 0f50       vmov.f32        q8, #1  ; 0x3f800000
   6:   f102 0380       add.w   r3, r2, #128    ; 0x80
   a:   f963 2a8f       vld1.32 {d18-d19}, [r3]
   e:   a310            add     r3, pc, #64     ; (adr r3, 50 <neon::bicubic_n3x(neon::Params*, void**, float __vector(4), float __vector(4), float __vector(4), float __vector(4))+0x50>)
  10:   ef60 0de2       vsub.f32        q8, q8, q9
  14:   f963 2aef       vld1.64 {d18-d19}, [r3 :128]
  18:   a311            add     r3, pc, #68     ; (adr r3, 60 <neon::bicubic_n3x(neon::Params*, void**, float __vector(4), float __vector(4), float __vector(4), float __vector(4))+0x60>)
  1a:   f963 4aef       vld1.64 {d20-d21}, [r3 :128]
  1e:   f44f 7380       mov.w   r3, #256        ; 0x100
  22:   ef40 4df2       vmla.f32        q10, q8, q9
  26:   ffc7 2f58       vmov.f32        q9, #-1.5       ; 0xbfc00000
  2a:   ff40 0df0       vmul.f32        q8, q8, q8
  2e:   ff40 0df4       vmul.f32        q8, q8, q10
  32:   f962 4a83       vld1.32 {d20-d21}, [r2], r3
  36:   ef04 0de2       vadd.f32        q0, q10, q9
  3a:   f942 0a8f       vst1.32 {d16-d17}, [r2]
  3e:   f101 0208       add.w   r2, r1, #8
  42:   684b            ldr     r3, [r1, #4]
  44:   4611            mov     r1, r2
  46:   4718            bx      r3

I know that's a lot to digest, and I've not really worked through it myself, but it looks like mainly you've got,
   1) a bunch of extra bookkeeping that's sometimes breaking the ability of these code snippets to tail-call into each other;
   2) an extra memory write in bicubic_n3x that I can't really account for.

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.

Bharat Ahuja

unread,
Jul 31, 2019, 1:02:31 PM7/31/19
to skia-discuss
This might be related to some compiler flag? Do you see any setting we are missing? BTW since I have used  kLow_SkFilterQuality so won't it go into bilerp func?

Mike Klein

unread,
Jul 31, 2019, 1:05:15 PM7/31/19
to skia-d...@googlegroups.com
Oh, yes, kLow_SkFilterQuality will bilerp.

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.

Mike Reed

unread,
Aug 1, 2019, 9:09:58 AM8/1/19
to skia-d...@googlegroups.com

Bharat Ahuja

unread,
Aug 1, 2019, 10:45:51 AM8/1/19
to skia-discuss
Thanks mikel I'll try that and will update ...
To unsubscribe from this group and stop receiving emails from it, send an email to skia-d...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-d...@googlegroups.com.

Abhishek Gupta

unread,
Aug 5, 2019, 11:09:35 AM8/5/19
to skia-discuss
Hi Mike,

That change had no perf impact for our scenario ( which is expected as well since we are using low filter quality and the change was for bicubic ).


Regards,
Abhishek

Mike Reed

unread,
Aug 5, 2019, 11:23:56 AM8/5/19
to skia-d...@googlegroups.com
Hmmm, ok. I thought the issue was the bicubic was too slow.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/72957f54-b418-41f2-a5de-6e46e2d34219%40googlegroups.com.

Bharat Ahuja

unread,
Aug 8, 2019, 11:56:29 AM8/8/19
to skia-discuss
Hey Mike,

Please see the variance in numbers( not fiddle numbers, on my device ) when I am applying the transforms. it is getting almost around 8 times, whereas for other libs(GDI) variance is 2 -3 times.

 {"Skia Running": false, "Bitmap Tranformed": false, "time in microseconds": 1303}
 {"Skia Running": false, "Bitmap Tranformed": true(45), "time in microseconds": 2462}   
 {"Skia Running": true, "Bitmap Tranformed": false, "time in microseconds": 912}
 {"Skia Running": true, "Bitmap Tranformed": true(45), "time in microseconds": 7147}

Below is the simple skia fiddle repro that matches with our scenario. 

   int width = 500;
    int height = 500;

    SkImageInfo imageInfo = SkImageInfo::Make(width, height, kRGBA_8888_SkColorType, kPremul_SkAlphaType);

    size_t rowBytes = imageInfo .minRowBytes();
    size_t size = rowBytes * height;
    std::vector<char> pixelMemory(size);  // allocate memory
    sk_sp<SkSurface> surface =
            SkSurface::MakeRasterDirect(
                    imageInfo , &pixelMemory[0], rowBytes);
    SkCanvas* pCanvas = surface->getCanvas();


    size_t stride = width * 4;
    SkBitmap bitmap;
    (void)bitmap.tryAllocPixels(imageInfo,stride);
    uint32_t* pixels = reinterpret_cast< uint32_t* >( bitmap.getPixels() );
    for( int i = 0; i < width; i++ )
    for( int j = 0; j < height; j++ )
    {
      *pixels = i % 8 ? ( (255 << 24) + 255 ) : ( (255 << 24) + (255 << 8) + 255 );
       pixels++;
    }

    sk_sp<SkImage> image = SkImage::MakeFromBitmap( bitmap );

    SkPaint paint;
    SkMatrix matrix;
    matrix.setRotate(45);
    paint.setFilterQuality( kLow_SkFilterQuality );
    paint.setShader(image->makeShader(SkTileMode::kClamp,SkTileMode::kClamp,
                                      &matrix));

   SkRect rect = { (float)0, (float)0, (float)width , (float)height };

   auto start = std::chrono::high_resolution_clock::now(); 
   pCanvas->drawRect(rect, paint);
   auto stop = std::chrono::high_resolution_clock::now();
   long long duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
   //LOGE( "total time microseconds = %llu", duration );



On Monday, August 5, 2019 at 8:53:56 PM UTC+5:30, Mike Reed wrote:
Hmmm, ok. I thought the issue was the bicubic was too slow.

Bharat Ahuja

unread,
Aug 13, 2019, 10:03:17 AM8/13/19
to skia-discuss
@Mike Klien As you said our binary doesn't have optimal code and there are some additional instructions(extra book keeping?)as well. Is it becuase we are missing some flag skia depends on? Can you please suggest steps( compiler flags etc ) I can try?

Mike Klein

unread,
Aug 13, 2019, 10:32:20 AM8/13/19
to skia-discuss
Hmm. I'm not really sure.  Can you send the full command line that's used to compile src/core/SkOpts.cpp?

Also, just about everything about how you're drawing can affect performance... do you happen to know the SkColorType, SkAlphaType, and SkColorSpace of the surface you're drawing into?  Are you dithering?  Anything unusual like that?  The more accurately you can reproduce what you're doing on fiddle.skia.org, the more we can help.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/9032075a-7a7d-47ab-a4a1-cbb385ce1140%40googlegroups.com.

Bharat Ahuja

unread,
Aug 13, 2019, 11:22:13 AM8/13/19
to skia-discuss
Below is the command we run

androidndk.16.1.12\toolchains\llvm\prebuilt\windows-x86_64\bin\clang++.exe @src\core\android_clang_cpp_flags.rsp -c src\core\SkOpts.cpp -o src\core\obj\androidarm\clang-temp\SkOpts.cpp\compile.tmp

where android_clang_cpp_flags.rsp contains below compile time flags

  -target armv7-none-linux-androideabi -fdiagnostics-format=msvc -Werror=return-type  -Werror=unused-variable -Werror=address-of-packed-member -gcc-toolchain e:/nugetcache/androidndk.16.1.12/toolchains/arm-linux-androideabi-4.9/prebuilt/windows-x86_64 -D__stdcall= -D_stdcall= -D__cdecl= -D_cdecl= -D__thiscall= -D_thiscall= -D__vectorcall= -D_vectorcall= -D_fastcall=__fastcall  -fno-elide-type -fdiagnostics-show-template-tree -gdwarf-2 -fno-standalone-debug  -DNDEBUG -Os -fomit-frame-pointer  -Wno-c++1z-compat  -Wno-deprecated-declarations  -Wno-deprecated  -Wno-expansion-to-defined  -Wno-ignored-pragma-intrinsic  -Wno-multichar  -Wno-nonportable-include-path  -Wno-undefined-var-template  -Wno-write-strings  -mthumb  -march=armv7-a  -D__ARM_ARCH_7__ -D__ARM_ARCH_7A__  -mfpu=neon  -fmessage-length=0  -fasm-blocks  -fvisibility-inlines-hidden  -DUNICODE  -D_UNICODE  -D__ANDROID_API__=21  -DCOMPILER_GCCE  -DSAFECRT_VISIBLE  -DNSdk=  -DANDROID_NO_WARNING  -D_LCONV_DEFINED  -DNO_ENUM_FORWARD_DECLARATION  -D_WCHAR_IS_ALWAYS_SIGNED  -DCRYPT_OID_INFO_HAS_EXTRA_FIELDS -DSTD_CALL -DSHIP -DNDEBUG -DDBG=0 -DFPO=1  -DSK_BUILD_FOR_OFFICE -DSK_SUPPORT_LEGACY_TILED_BITMAPS -DSK_ARM_HAS_NEON  -DNOMINMAX    -DISOLATION_AWARE_ENABLED=0  -D_XMMINTRIN_H_INCLUDED  -DARM  -D_ARM_  -D_M_ARM -D__readfsdword=__readfsdwordStb  -mstackrealign -mstack-alignment=16 -fshort-wchar -fsigned-char  -ferror-limit=20  -fms-extensions  -fmsc-version=1700  -D__PRETTY_FUNCTION__=__FUNCTION__  -Wno-microsoft  -D MS_CLANG_MS_EXTENSIONS=1  -Wno-delete-non-virtual-dtor  -fshort-wchar  -D HAS_PRAGMA_MS_STRUCT=1  -frandom-seed=0  -fstack-protector  -D _FORTIFY_SOURCE=2  -fno-strict-overflow -Wno-attributes -Wno-comment -Wno-extra-tokens  -Wno-constant-logical-operand  -Wno-deprecated-dynamic-exception-spec  -Wno-extern-initializer  -Wno-ignored-attributes  -Wno-invalid-token-paste  -Wno-microsoft  -Wno-missing-declarations  -Wno-null-dereference  -Wno-parentheses  -Wno-unused-value  -Wno-invalid-offsetof -ffunction-sections -fdata-sections -fexceptions -frtti -std=c++1z  -DBOOST_HASH_CXX17=0  -DGSL_USE_STD_BYTE=0  -fno-aligned-allocation  -DBOOST_ASIO_DISABLE_STD_STRING_VIEW  -DBOOST_ASIO_HAS_STD_EXPERIMENTAL_STRING_VIEW  -DBOOST_THREAD_CHRONO_POSIX_API.


SkColorType = kRGBA_8888_SkColorType, SkAlphaType = kPremul_SkAlphaType, and SkColorSpace = nullptr

Below is the sample fiddle code. Please note that we have seen significant improvement since we have updated to M74, though it is still not comparable to other libraries we are using. Variance in numbers( not fiddle numbers, on my device ) when I am applying the transforms is getting almost around 8 times whereas for other libs variance is 2 -3 times. So, we expect there is still room for the improvement.

{"Skia Running": false, "Bitmap Tranformed": false, "time in microseconds": 1303}
 {"Skia Running": false, "Bitmap Tranformed": true(45), "time in microseconds": 2462}   
 {"Skia Running": true, "Bitmap Tranformed": false, "time in microseconds": 912}
 {"Skia Running": true, "Bitmap Tranformed": true(45), "time in microseconds": 7147}

Mike Klein

unread,
Aug 13, 2019, 11:33:56 AM8/13/19
to skia-discuss
Oh right!  You've got that crazy build with a mix of Clang and MSVC and Android and Windows settings.  I think I'm going to have to leave you to fend for yourself to work through that.  But it looks like you've got all the important stuff in there, like -march=armv7-a and -mfpu=neon.  AndwWith those surface settings, I think how SkOpts.cpp is mostly moot.  Let me take a quick scan through to make sure the code paths you're hitting are the one's I'd expect.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/46b328df-7c09-4f9d-823b-feaffaf5d593%40googlegroups.com.

Bharat Ahuja

unread,
Aug 13, 2019, 11:37:44 AM8/13/19
to skia-discuss
yeah, our build flags are weird. Isn't the surface and skimage setting I am using, is the optimal one? last time I had a discussion around this and I got the feedback premultiplied surfaces are supposed to be faster?

Mike Klein

unread,
Aug 13, 2019, 11:39:08 AM8/13/19
to skia-discuss
Yep, looks good.  Might want to throw a static_assert(kN32_SkColorType == kRGBA_8888_SkColorType, "") in there somewhere just to double check.  There's typically one byte order of 8888 that's fastest and the other is slower, but I think you've got the right one for your build.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/3b60ed35-d1ab-47dc-9657-14ee90a88a60%40googlegroups.com.

Mike Klein

unread,
Aug 13, 2019, 11:57:28 AM8/13/19
to skia-discuss
Ooh, how interesting.  I thought we had a faster code path than we'd be hitting here.  Turns out we're dumping this case into the slower general path (handled by that code in SkOpts.cpp) which explains the big step up in runtime.  This check is failing before we can get into the fast stuff:

static bool legacy_shader_can_handle(const SkMatrix& inv) {
    if (!inv.isScaleTranslate()) {
        return false;
    }
   ...

Mike Klein

unread,
Aug 13, 2019, 12:01:44 PM8/13/19
to skia-discuss

Mike Klein

unread,
Aug 13, 2019, 12:11:55 PM8/13/19
to skia-discuss
Without upstream changes, I think there's only one tweak to make here, which is to make sure to tag your source image as kOpaque_SkAlphaType if indeed it is opaque.  That'll cut a couple stages off the color pipeline we use here:

    seed_shader
    matrix_2x3
    bilerp_clamp_8888
    clamp_gamut
    load_8888_dst
    srcover
    store_8888

~~~>

    seed_shader
    matrix_2x3
    bilerp_clamp_8888
    clamp_gamut
    store_8888

On my laptop that's good for a 20% speedup, which isn't nothing.  I think anything else (restoring old fast paths, making this sort of general path faster, cutting possibly redundant work like clamp_gamut) will need some work on our end.

Bharat Ahuja

unread,
Aug 13, 2019, 12:51:55 PM8/13/19
to skia-discuss
Yeah, Thanks mike sorry it got lost due to the long thread but I had mentioned somewhere above that code goes through RasterPipeline instead of Legacy Pipeline. Is new code not neon optimized?

Thanks for the suggestion of kOpaque_SkAlphaType but our Bitmaps mostly has an alpha channel so I don't think we can use kOpaque_SkAlphaType.

I am not much of skia expert but If someone can guide me I can contribute and integrate those faster paths in general path code. 

Thanks
Bharat

Mike Klein

unread,
Aug 13, 2019, 12:56:50 PM8/13/19
to skia-discuss
Yeah, it uses NEON.  You're just hitting the more flexible, slower, general path.  Without rotation you're hitting common special case that's got its own optimized routines.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/11f1a2c2-df75-40a9-955a-cda69fe9047b%40googlegroups.com.

Bharat Ahuja

unread,
Aug 14, 2019, 12:04:29 PM8/14/19
to skia-discuss
Hey Mike,

I somehow hacked the faster path for rotated bitmaps and it seemed to work and step up in time seems to be gone. May I know why did we remove this path ? is it because of some bugs? normal rotation seems to be working fine though.

Mike Klein

unread,
Aug 14, 2019, 12:09:09 PM8/14/19
to skia-discuss
Oh that's great.  I'm not really sure I remember why we removed that code path.  Probably just an experiment to see if anyone cared?  It's a pain in the butt to maintain that sort of code.  We can't really care about all code paths being optimally fast... it's primarily driven by user needs.  The rough balance we try to maintain is to keep most pixels being drawn by fast code, and most features being drawn by code that's simpler, more combinatoric, and harder to get wrong.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/85b8299c-a3c2-4880-89ba-79eaf7f21a2c%40googlegroups.com.

Bharat Ahuja

unread,
Aug 14, 2019, 12:22:51 PM8/14/19
to skia-d...@googlegroups.com
Ohh I see. Thanks, then I 'll probably integrate into our skia build. BTW same goes for tiled bitmap also. Maybe we removed tiled optimization for the same reason?

Mike Klein

unread,
Aug 14, 2019, 12:26:05 PM8/14/19
to skia-discuss
Yep, as you see the only routines that are still there are the clamp-in-x/clamp-in-y ones.  Mirror, repeat, and decal all go through the general case.

Bharat Ahuja

unread,
Oct 10, 2019, 1:50:31 PM10/10/19
to skia-discuss
Hi Mike,

As we discussed last time I am bringing those fast bitmap rendering paths back for affine transformation case. Changes mainly bring back matrix procs and sampling procs( i.e S32_alpha_D32_filter_DXDY ).I am almost done with changes and have validated my changes on ARM(Neon), EmulatorX86(SSE2), EmulatorX86_64( SSE3 ). I had earlier done all these changes on M74. In the latest version of Skia, I can see we have also added AVX based implementation of sampling function S32_alpha_D32_filter_DX. Currently, I am missing AVX implementation of S32_alpha_D32_filter_DXDY( I do have SSE2 and SSE3 implementation ) but still, my changes seem to compile fine if I build it using ninja -C out/Shared. 

Is AVX implementation is not enabled by default? or am I just on compiling right platform?

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-d...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-d...@googlegroups.com.

Mike Klein

unread,
Oct 10, 2019, 2:18:32 PM10/10/19
to skia-discuss
It'd help to see your code, but you have to explicitly opt each function into the various better-than-default SkOpts_foo.cpp specializations.  Take a look at SkOpts_hsw.cpp?

I think we've noticed that for most use cases, hsw::S32_alpha_D32_filter_DX doesn't actually help much if at all.  Seems like once we got the pmaddubsw instructions in there with_mm_maddubs_epi16(), it's mostly memory limited.  In fact, with AVX2, using _mm256_maddubs_epi16() really only helped make the code size smaller... a more naive approach resembling the SSE2 code ran at the same speed.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/4b22351a-0bc1-49ad-9a1d-043d752b037c%40googlegroups.com.

Bharat Ahuja

unread,
Oct 10, 2019, 2:27:14 PM10/10/19
to skia-discuss
Cool thanks for explaining.I will take a look into this file.

My Changes are little unclean. I will send you a clean change in a day or two.

Thanks
Bharat

Mike Klein

unread,
Oct 10, 2019, 2:28:55 PM10/10/19
to skia-discuss
No prob, take your time.

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.

Bharat Ahuja

unread,
Oct 15, 2019, 10:56:58 AM10/15/19
to skia-discuss
Hi Mike,

I am finished with the change. Can you please take a look.



Thanks
Bharat


On Thursday, October 10, 2019 at 11:58:55 PM UTC+5:30, Mike Klein wrote:
No prob, take your time.

On Thu, Oct 10, 2019 at 1:27 PM Bharat Ahuja <ahujab...@gmail.com> wrote:
Cool thanks for explaining.I will take a look into this file.

My Changes are little unclean. I will send you a clean change in a day or two.

Thanks
Bharat

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-d...@googlegroups.com.

Bharat Ahuja

unread,
Oct 17, 2019, 10:56:21 AM10/17/19
to skia-discuss
Thanks so much for the valuable feedback mike. I have updated the iteration incorporating feedback. I have also removed change which enables legacy pipeline when tile modes are not equal. I 'll send it as another cl. Please take a look.

Thanks
Bharat

Bharat Ahuja

unread,
Oct 17, 2019, 3:11:00 PM10/17/19
to skia-discuss
Hi Mike,

I have tried this change on x86 and x86_64 with a bitmap of size 600 x 600 rendered on RGBAp888 SkCanvas of size 1000 x 1000 with a 45-degree transform.  I am noticing improvements for these platforms as well. Improvement is significantly large when extend mode is repeat or mirror. I have attached the benchmark results in the review. We primarily care about ARM devices so we don't mind limiting this CL to ARM platforms but I believe it can be useful for other platforms as well. Let me know what's your call on this?

Thanks
Bharat

Mike Klein

unread,
Oct 17, 2019, 3:14:29 PM10/17/19
to skia-discuss
You've split this in two, right, one part that would affect rotate, one part that would affect repeat and/or mirror?

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/e1b581d3-4532-4bf7-bc9e-689bd2f72b12%40googlegroups.com.

Bharat Ahuja

unread,
Oct 17, 2019, 3:33:45 PM10/17/19
to skia-discuss
Actually I have split it into two parts. One is rotate and other one is when tilex mode is not same as tiley mode( irrespective of rotate or scale ).

Thanks
Bharat

Bharat Ahuja

unread,
Oct 18, 2019, 7:41:32 AM10/18/19
to skia-discuss
Hi Mike,

As discussed, I have limited this change to ARM\ARM64 devices and published new iteration. Please take a look.

Thanks
Bharat

Bharat Ahuja

unread,
Oct 21, 2019, 9:56:13 AM10/21/19
to skia-discuss

Hi Guys,

 

I have incorporated all CR comments and published a new iteration. Since I am going for a vacation (~10 days), I would really appreciate if we can close on this, if things look okay, before Wednesday.


Thanks

Bharat

Mike Klein

unread,
Oct 21, 2019, 12:18:03 PM10/21/19
to skia-discuss
I'd prefer we wait until you get back to land this.  This isn't time critical and I don't want to artificially rush.

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.

Bharat Ahuja

unread,
Oct 21, 2019, 1:15:50 PM10/21/19
to skia-discuss
Sure. sounds good. 

Thanks so much for the feedback.


Thanks
Bharat

On Monday, October 21, 2019, at 9:48:03 PM UTC+5:30, Mike Klein wrote:
I'd prefer we wait until you get back to land this.  This isn't time critical and I don't want to artificially rush.

On Mon, Oct 21, 2019 at 8:56 AM Bharat Ahuja <ahujab...@gmail.com> wrote:

Hi Guys,

 

I have incorporated all CR comments and published a new iteration. Since I am going for a vacation (~10 days), I would really appreciate if we can close on this, if things look okay, before Wednesday.


Thanks

Bharat


On Friday, October 18, 2019 at 5:11:32 PM UTC+5:30, Bharat Ahuja wrote:
Hi Mike,

As discussed, I have limited this change to ARM\ARM64 devices and published new iteration. Please take a look.

Thanks
Bharat


On Friday, October 18, 2019 at 1:03:45 AM UTC+5:30, Bharat Ahuja wrote:
Actually I have split it into two parts. One is rotate and other one is when tilex mode is not same as tiley mode( irrespective of rotate or scale ).

Thanks
Bharat

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-d...@googlegroups.com.

Matthew Leibowitz

unread,
Nov 9, 2019, 9:38:46 PM11/9/19
to skia-discuss
I think I am following this discussion, but there seems to be talk of other variables.
I just did a test on fiddle and I see a 10x difference with 0 degrees vs 20 degrees. Is this what you folks are talking about?



On Monday, October 21, 2019 at 6:18:03 PM UTC+2, Mike Klein wrote:
I'd prefer we wait until you get back to land this.  This isn't time critical and I don't want to artificially rush.

On Mon, Oct 21, 2019 at 8:56 AM Bharat Ahuja <ahujab...@gmail.com> wrote:

Hi Guys,

 

I have incorporated all CR comments and published a new iteration. Since I am going for a vacation (~10 days), I would really appreciate if we can close on this, if things look okay, before Wednesday.


Thanks

Bharat


On Friday, October 18, 2019 at 5:11:32 PM UTC+5:30, Bharat Ahuja wrote:
Hi Mike,

As discussed, I have limited this change to ARM\ARM64 devices and published new iteration. Please take a look.

Thanks
Bharat


On Friday, October 18, 2019 at 1:03:45 AM UTC+5:30, Bharat Ahuja wrote:
Actually I have split it into two parts. One is rotate and other one is when tilex mode is not same as tiley mode( irrespective of rotate or scale ).

Thanks
Bharat

--
You received this message because you are subscribed to the Google Groups "skia-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skia-d...@googlegroups.com.

Mike Klein

unread,
Nov 10, 2019, 6:20:08 AM11/10/19
to skia-discuss
Not exactly, no.  Drawing at 0 degrees there is a very special case as far as speed goes, essentially as fast as memcpy(), while any rotation or scale that kicks us out of that special 1:1 case requires more work.

To unsubscribe from this group and stop receiving emails from it, send an email to skia-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/skia-discuss/05940091-793e-4e38-bc03-0acd0f783dd4%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages