Float32Array and the VM


John McCutchan

Jun 13, 2012, 12:47:48 AM
to General Dart Discussion
Hey Dart team,

I'm wondering how Float32Arrays work internally. Some questions:

1) What is the cost of constructing a Float32Array? In terms of both time and space. How does this compare to an array of nums?

1.5) What is the cost of constructing a Float32Array that aliases another Float32Array?

2) When operating on individual elements in a Float32Array are the operations done in single precision or are the elements promoted into doubles before work begins?

3) Is the overhead of indexing into a Float32Array better or worse than indexing into a regular array?

I appreciate the help.

--
John McCutchan <jo...@johnmccutchan.com>

Peter Jakobs

Jun 13, 2012, 8:41:19 AM
to mi...@dartlang.org
Also interesting: are there plans to merge Float32Array with Float32List?

A merge would be nice, so that I don't have to maintain, for example, two versions of my gl-matrix library (one using Float32Array on the client and one using Float32List on the server).

Seth Ladd

Jun 14, 2012, 3:49:16 PM
to Peter Jakobs, mi...@dartlang.org
For posterity, readers of this thread might also find this similar thread interesting: https://groups.google.com/a/dartlang.org/forum/#!topic/misc/wyXxrnF0na0

John McCutchan

Jun 16, 2012, 7:55:08 PM
to General Dart Discussion
Answering myself:

On Tue, Jun 12, 2012 at 9:47 PM, John McCutchan <jo...@johnmccutchan.com> wrote:
Hey Dart team,

I'm wondering how Float32Arrays work internally. Some questions:

1) What is the cost of constructing a Float32Array? In terms of both time and space. How does this compare to an array of nums?

1.5) What is the cost of constructing a Float32Array that aliases another Float32Array?


Based on the code in byte_array.cc, allocating a Float32Array seems cheap. I haven't done any serious testing yet, because I've found bigger performance issues with Float32Array (Float32List).
 
2) When operating on individual elements in a Float32Array are the operations done in single precision or are the elements promoted into doubles before work begins?


Floats are fetched and promoted to double precision. 
 
3) Is the overhead of indexing into a Float32Array better or worse than indexing into a regular array?


The native code for indexing into a Float32Array is simple enough (base pointer + index), but the VM doesn't seem to optimize the indexed-access calls. The VM does optimize regular member-variable access, making it much faster.

I have been playing with two versions of 4x4 matrix multiplication. One version reads the matrix elements from Dart object properties; the other accesses Float32List entries. The algorithm is the same in both; the only difference is how the elements are accessed. Here is the Float32List implementation:

void mat4mult(Float32List arg1, Float32List arg2) {
    final num m00 = arg1[0];
    final num m01 = arg1[4];
    final num m02 = arg1[8];
    final num m03 = arg1[12];
    final num m10 = arg1[1];
    final num m11 = arg1[5];
    final num m12 = arg1[9];
    final num m13 = arg1[13];
    final num m20 = arg1[2];
    final num m21 = arg1[6];
    final num m22 = arg1[10];
    final num m23 = arg1[14];
    final num m30 = arg1[3];
    final num m31 = arg1[7];
    final num m32 = arg1[11];
    final num m33 = arg1[15];
    final num n00 = arg2[0];
    final num n01 = arg2[4];
    final num n02 = arg2[8];
    final num n03 = arg2[12];
    final num n10 = arg2[1];
    final num n11 = arg2[5];
    final num n12 = arg2[9];
    final num n13 = arg2[13];
    final num n20 = arg2[2];
    final num n21 = arg2[6];
    final num n22 = arg2[10];
    final num n23 = arg2[14];
    final num n30 = arg2[3];
    final num n31 = arg2[7];
    final num n32 = arg2[11];
    final num n33 = arg2[15];
    // Write the result back into arg1, using the same column-major
    // layout as the reads above (element (i,j) is stored at index j*4+i).
    arg1[0]  =  (m00 * n00) + (m01 * n10) + (m02 * n20) + (m03 * n30);
    arg1[4]  =  (m00 * n01) + (m01 * n11) + (m02 * n21) + (m03 * n31);
    arg1[8]  =  (m00 * n02) + (m01 * n12) + (m02 * n22) + (m03 * n32);
    arg1[12] =  (m00 * n03) + (m01 * n13) + (m02 * n23) + (m03 * n33);
    arg1[1]  =  (m10 * n00) + (m11 * n10) + (m12 * n20) + (m13 * n30);
    arg1[5]  =  (m10 * n01) + (m11 * n11) + (m12 * n21) + (m13 * n31);
    arg1[9]  =  (m10 * n02) + (m11 * n12) + (m12 * n22) + (m13 * n32);
    arg1[13] =  (m10 * n03) + (m11 * n13) + (m12 * n23) + (m13 * n33);
    arg1[2]  =  (m20 * n00) + (m21 * n10) + (m22 * n20) + (m23 * n30);
    arg1[6]  =  (m20 * n01) + (m21 * n11) + (m22 * n21) + (m23 * n31);
    arg1[10] =  (m20 * n02) + (m21 * n12) + (m22 * n22) + (m23 * n32);
    arg1[14] =  (m20 * n03) + (m21 * n13) + (m22 * n23) + (m23 * n33);
    arg1[3]  =  (m30 * n00) + (m31 * n10) + (m32 * n20) + (m33 * n30);
    arg1[7]  =  (m30 * n01) + (m31 * n11) + (m32 * n21) + (m33 * n31);
    arg1[11] =  (m30 * n02) + (m31 * n12) + (m32 * n22) + (m33 * n32);
    arg1[15] =  (m30 * n03) + (m31 * n13) + (m32 * n23) + (m33 * n33);
}  
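A quick sanity check of the column-major layout used here: a JavaScript version with the same read/write convention (element (i,j) at index j*4+i, result overwriting arg1; the JS Float32Array builtin stands in for Dart's) must leave a matrix unchanged when multiplied by the identity.

```javascript
// Same column-major convention as the Dart routine above (a sketch).
function mat4mult(arg1, arg2) {
  const m00 = arg1[0], m01 = arg1[4], m02 = arg1[8],  m03 = arg1[12];
  const m10 = arg1[1], m11 = arg1[5], m12 = arg1[9],  m13 = arg1[13];
  const m20 = arg1[2], m21 = arg1[6], m22 = arg1[10], m23 = arg1[14];
  const m30 = arg1[3], m31 = arg1[7], m32 = arg1[11], m33 = arg1[15];
  const n00 = arg2[0], n01 = arg2[4], n02 = arg2[8],  n03 = arg2[12];
  const n10 = arg2[1], n11 = arg2[5], n12 = arg2[9],  n13 = arg2[13];
  const n20 = arg2[2], n21 = arg2[6], n22 = arg2[10], n23 = arg2[14];
  const n30 = arg2[3], n31 = arg2[7], n32 = arg2[11], n33 = arg2[15];
  arg1[0]  = m00*n00 + m01*n10 + m02*n20 + m03*n30;
  arg1[4]  = m00*n01 + m01*n11 + m02*n21 + m03*n31;
  arg1[8]  = m00*n02 + m01*n12 + m02*n22 + m03*n32;
  arg1[12] = m00*n03 + m01*n13 + m02*n23 + m03*n33;
  arg1[1]  = m10*n00 + m11*n10 + m12*n20 + m13*n30;
  arg1[5]  = m10*n01 + m11*n11 + m12*n21 + m13*n31;
  arg1[9]  = m10*n02 + m11*n12 + m12*n22 + m13*n32;
  arg1[13] = m10*n03 + m11*n13 + m12*n23 + m13*n33;
  arg1[2]  = m20*n00 + m21*n10 + m22*n20 + m23*n30;
  arg1[6]  = m20*n01 + m21*n11 + m22*n21 + m23*n31;
  arg1[10] = m20*n02 + m21*n12 + m22*n22 + m23*n32;
  arg1[14] = m20*n03 + m21*n13 + m22*n23 + m23*n33;
  arg1[3]  = m30*n00 + m31*n10 + m32*n20 + m33*n30;
  arg1[7]  = m30*n01 + m31*n11 + m32*n21 + m33*n31;
  arg1[11] = m30*n02 + m31*n12 + m32*n22 + m33*n32;
  arg1[15] = m30*n03 + m31*n13 + m32*n23 + m33*n33;
}

// Multiplying by the identity must be a no-op.
const a = Float32Array.from({length: 16}, (_, i) => i + 1);
const id = new Float32Array(16);
id[0] = id[5] = id[10] = id[15] = 1;
mat4mult(a, id);
console.log(a[1], a[4]); // unchanged: 2 5
```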

The performance difference is large:
=============================================
Matrix Multiplication
=============================================
Avg: 10.92 ms Min: 0.0 ms Max: 31.2 ms (Avg: 10920 Min: 0 Max: 31200)
=============================================
Matrix Multiplication Float32
=============================================
Avg: 196.56 ms Min: 187.2 ms Max: 202.801 ms (Avg: 196560 Min: 187200 Max: 202801)

Comparing the initial property read of both methods:

Float32List:

00D13A27    ff750c            push [ebp+0xc]
00D13A2A    6800000000        push 0x0
00D13A2F    b9c1c1ae05        mov ecx,05AEC1C1  'ICData target:[]'
00D13A34    ba19d5b100        mov edx,00B1D519  Array[2, 2, null]
00D13A39    e822c7feff        call 00D00160  [stub: OneArgCheckInlineCache] <-- call into Float32List operator []
00D13A3E    83c408            add esp,0x8
00D13A41    50                push eax
00D13A42    58                pop eax
00D13A43    8945f4            mov [ebp-0xc],eax

Native Dart object properties:

00D0F8DC    0fb74301          movzx_w eax,[ebx+0x1]
00D0F8E0    3d09020000        cmp eax, 00000209 
00D0F8E5    0f8594190000      jnz 00D1127F          <-- I _think_ the cmp, jnz pattern is testing if the object is what the function was optimized for because 00D1127F has a call to Deoptimize
00D0F8EB    8b4307            mov eax,[ebx+0x7]
00D0F8EE    8945f0            mov [ebp-0x10],eax
00D0F8F1    8b5d0c            mov ebx,[ebp+0xc]
00D0F8F4    0fb74301          movzx_w eax,[ebx+0x1]
00D0F8F8    3d20020000        cmp eax, 00000220
00D0F8FD    0f8587190000      jnz 00D1128A
00D0F903    8b430f            mov eax,[ebx+0xf]
00D0F906    50                push eax
00D0F907    5b                pop ebx
00D0F908    f6c301            test_b ebx,0x1
00D0F90B    0f8484190000      jz 00D11295

I would love to hear from the VM team on this. Are there plans to inline the Float32Array index operator? Is it even possible for the VM to inline "native" function calls?

Thanks,
--
John McCutchan <jo...@johnmccutchan.com>

John McCutchan

Jun 16, 2012, 8:16:31 PM
to General Dart Discussion
Accidentally pressed Send before I was finished. See below.

After reading all the properties, the guts of both functions look similar:

Float32List:

00D13E9A    f20f104a07        movsd xmm1,[edx+0x7]
00D13E9F    f20f59c1          mulsd xmm0,xmm1
00D13EA3    f20f114107        movsd [ecx+0x7],xmm0
00D13EA8    51                push ecx
00D13EA9    5a                pop edx
00D13EAA    58                pop eax
00D13EAB    f20f104007        movsd xmm0,[eax+0x7]
00D13EB0    f20f104a07        movsd xmm1,[edx+0x7]
00D13EB5    f20f58c1          addsd xmm0,xmm1
00D13EB9    f20f114007        movsd [eax+0x7],xmm0

Native Dart objects:

00D10222    f20f104a07        movsd xmm1,[edx+0x7]
00D10227    f20f59c1          mulsd xmm0,xmm1 <--- multiply
00D1022B    f20f114107        movsd [ecx+0x7],xmm0
00D10230    51                push ecx
00D10231    5a                pop edx
00D10232    58                pop eax
00D10233    f20f104007        movsd xmm0,[eax+0x7]
00D10238    f20f104a07        movsd xmm1,[edx+0x7]
00D1023D    f20f58c1          addsd xmm0,xmm1 <-- add
00D10241    f20f114007        movsd [eax+0x7],xmm0

The above code is repeated many times.

The end of the Float32List function where the values are written is full of calls:

00D13F8A    b901cbae05        mov ecx,05AECB01  'ICData target:[]='
00D13F8F    ba1121ca00        mov edx,00CA2111  Array[3, 3, null]
00D13F94    e8c7c1feff        call 00D00160  [stub: OneArgCheckInlineCache]

In this specific example, working with a Float32List is roughly 18x slower (10.92 ms vs. 196.56 ms average), and I believe the slowdown is caused by all the calls to operator[] and operator[]=.

--
John McCutchan <jo...@johnmccutchan.com>

Srdjan Mitrovic

Jun 18, 2012, 1:56:25 AM
to John McCutchan, General Dart Discussion, Carl Shapiro
Yes, we will inline Float32Array indexed accesses. We already inline some native function calls, e.g., for growable and fixed-size lists, as well as some Uint array indexed accesses. All floating-point values are doubles and are represented as objects in Dart. We will also add optimizations that allow direct floating-point operations without allocating double objects on the heap.

Cheers,

- Srdjan