Float32Array and the VM


John McCutchan

Jun 13, 2012, 12:47:48 AM
to General Dart Discussion
Hey Dart team,

I'm wondering how Float32Arrays work internally. Some questions:

1) What is the cost of constructing a Float32Array? In terms of both time and space. How does this compare to an array of nums?

1.5) What is the cost of constructing a Float32Array that aliases another Float32Array?

2) When operating on individual elements in a Float32Array are the operations done in single precision or are the elements promoted into doubles before work begins?

3) Is the overhead of indexing into a Float32Array better or worse than indexing into a regular array?

I appreciate the help.

--
John McCutchan <jo...@johnmccutchan.com>

Peter Jakobs

Jun 13, 2012, 8:41:19 AM
to mi...@dartlang.org
Also interesting: are there plans to merge Float32Array with Float32List?

A merge would be nice, so that I don't have to maintain, for example, two versions of my gl-matrix library (one using Float32Array on the client and one using Float32List on the server).

Seth Ladd

Jun 14, 2012, 3:49:16 PM
to Peter Jakobs, mi...@dartlang.org
For posterity, readers of this thread might also find this similar thread interesting: https://groups.google.com/a/dartlang.org/forum/#!topic/misc/wyXxrnF0na0

John McCutchan

Jun 16, 2012, 7:55:08 PM
to General Dart Discussion
Answering myself:

On Tue, Jun 12, 2012 at 9:47 PM, John McCutchan <jo...@johnmccutchan.com> wrote:
Hey Dart team,

I'm wondering how Float32Arrays work internally. Some questions:

1) What is the cost of constructing a Float32Array? In terms of both time and space. How does this compare to an array of nums?

1.5) What is the cost of constructing a Float32Array that aliases another Float32Array?


Based on the code in byte_array.cc, allocating a Float32Array seems cheap. I haven't done any serious testing yet, because I've found bigger performance issues with Float32Array (Float32List).
 
2) When operating on individual elements in a Float32Array are the operations done in single precision or are the elements promoted into doubles before work begins?


Floats are fetched and promoted to double precision. 
 
3) Is the overhead of indexing into a Float32Array better or worse than indexing into a regular array?


The native code for indexing into a Float32Array is simple enough (base pointer + index), but the VM doesn't seem to optimize the indexed-access calls. The VM does optimize regular member-variable access, making it much faster.

I have been playing with two versions of 4x4 matrix multiplication. One version reads the matrix elements from Dart object properties; the other accesses Float32List entries. The algorithm is the same in both; the only difference is how the elements are accessed. Here is the Float32List implementation:

void mat4mult(Float32List arg1, Float32List arg2) {
    final num m00 = arg1[0];
    final num m01 = arg1[4];
    final num m02 = arg1[8];
    final num m03 = arg1[12];
    final num m10 = arg1[1];
    final num m11 = arg1[5];
    final num m12 = arg1[9];
    final num m13 = arg1[13];
    final num m20 = arg1[2];
    final num m21 = arg1[6];
    final num m22 = arg1[10];
    final num m23 = arg1[14];
    final num m30 = arg1[3];
    final num m31 = arg1[7];
    final num m32 = arg1[11];
    final num m33 = arg1[15];
    final num n00 = arg2[0];
    final num n01 = arg2[4];
    final num n02 = arg2[8];
    final num n03 = arg2[12];
    final num n10 = arg2[1];
    final num n11 = arg2[5];
    final num n12 = arg2[9];
    final num n13 = arg2[13];
    final num n20 = arg2[2];
    final num n21 = arg2[6];
    final num n22 = arg2[10];
    final num n23 = arg2[14];
    final num n30 = arg2[3];
    final num n31 = arg2[7];
    final num n32 = arg2[11];
    final num n33 = arg2[15];
    // Write the result back into arg1, using the same column-major
    // layout as the reads above (element (i,j) is stored at index j*4+i).
    arg1[0]  =  (m00 * n00) + (m01 * n10) + (m02 * n20) + (m03 * n30);
    arg1[4]  =  (m00 * n01) + (m01 * n11) + (m02 * n21) + (m03 * n31);
    arg1[8]  =  (m00 * n02) + (m01 * n12) + (m02 * n22) + (m03 * n32);
    arg1[12] =  (m00 * n03) + (m01 * n13) + (m02 * n23) + (m03 * n33);
    arg1[1]  =  (m10 * n00) + (m11 * n10) + (m12 * n20) + (m13 * n30);
    arg1[5]  =  (m10 * n01) + (m11 * n11) + (m12 * n21) + (m13 * n31);
    arg1[9]  =  (m10 * n02) + (m11 * n12) + (m12 * n22) + (m13 * n32);
    arg1[13] =  (m10 * n03) + (m11 * n13) + (m12 * n23) + (m13 * n33);
    arg1[2]  =  (m20 * n00) + (m21 * n10) + (m22 * n20) + (m23 * n30);
    arg1[6]  =  (m20 * n01) + (m21 * n11) + (m22 * n21) + (m23 * n31);
    arg1[10] =  (m20 * n02) + (m21 * n12) + (m22 * n22) + (m23 * n32);
    arg1[14] =  (m20 * n03) + (m21 * n13) + (m22 * n23) + (m23 * n33);
    arg1[3]  =  (m30 * n00) + (m31 * n10) + (m32 * n20) + (m33 * n30);
    arg1[7]  =  (m30 * n01) + (m31 * n11) + (m32 * n21) + (m33 * n31);
    arg1[11] =  (m30 * n02) + (m31 * n12) + (m32 * n22) + (m33 * n32);
    arg1[15] =  (m30 * n03) + (m31 * n13) + (m32 * n23) + (m33 * n33);
}  
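A quick sanity check of the column-major layout used here: a JavaScript version with the same read/write convention (element (i,j) at index j*4+i, result overwriting arg1; the JS Float32Array builtin stands in for Dart's) must leave a matrix unchanged when multiplied by the identity.

```javascript
// Same column-major convention as the Dart routine above (a sketch).
function mat4mult(arg1, arg2) {
  const m00 = arg1[0], m01 = arg1[4], m02 = arg1[8],  m03 = arg1[12];
  const m10 = arg1[1], m11 = arg1[5], m12 = arg1[9],  m13 = arg1[13];
  const m20 = arg1[2], m21 = arg1[6], m22 = arg1[10], m23 = arg1[14];
  const m30 = arg1[3], m31 = arg1[7], m32 = arg1[11], m33 = arg1[15];
  const n00 = arg2[0], n01 = arg2[4], n02 = arg2[8],  n03 = arg2[12];
  const n10 = arg2[1], n11 = arg2[5], n12 = arg2[9],  n13 = arg2[13];
  const n20 = arg2[2], n21 = arg2[6], n22 = arg2[10], n23 = arg2[14];
  const n30 = arg2[3], n31 = arg2[7], n32 = arg2[11], n33 = arg2[15];
  arg1[0]  = m00*n00 + m01*n10 + m02*n20 + m03*n30;
  arg1[4]  = m00*n01 + m01*n11 + m02*n21 + m03*n31;
  arg1[8]  = m00*n02 + m01*n12 + m02*n22 + m03*n32;
  arg1[12] = m00*n03 + m01*n13 + m02*n23 + m03*n33;
  arg1[1]  = m10*n00 + m11*n10 + m12*n20 + m13*n30;
  arg1[5]  = m10*n01 + m11*n11 + m12*n21 + m13*n31;
  arg1[9]  = m10*n02 + m11*n12 + m12*n22 + m13*n32;
  arg1[13] = m10*n03 + m11*n13 + m12*n23 + m13*n33;
  arg1[2]  = m20*n00 + m21*n10 + m22*n20 + m23*n30;
  arg1[6]  = m20*n01 + m21*n11 + m22*n21 + m23*n31;
  arg1[10] = m20*n02 + m21*n12 + m22*n22 + m23*n32;
  arg1[14] = m20*n03 + m21*n13 + m22*n23 + m23*n33;
  arg1[3]  = m30*n00 + m31*n10 + m32*n20 + m33*n30;
  arg1[7]  = m30*n01 + m31*n11 + m32*n21 + m33*n31;
  arg1[11] = m30*n02 + m31*n12 + m32*n22 + m33*n32;
  arg1[15] = m30*n03 + m31*n13 + m32*n23 + m33*n33;
}

// Multiplying by the identity must be a no-op.
const a = Float32Array.from({length: 16}, (_, i) => i + 1);
const id = new Float32Array(16);
id[0] = id[5] = id[10] = id[15] = 1;
mat4mult(a, id);
console.log(a[1], a[4]); // unchanged: 2 5
```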

The performance difference is large:
=============================================
Matrix Multiplication
=============================================
Avg: 10.92 ms Min: 0.0 ms Max: 31.2 ms (Avg: 10920 Min: 0 Max: 31200)
=============================================
Matrix Multiplication Float32
=============================================
Avg: 196.56 ms Min: 187.2 ms Max: 202.801 ms (Avg: 196560 Min: 187200 Max: 202801)

Comparing the initial property read of both methods:

Float32List:

00D13A27    ff750c            push [ebp+0xc]
00D13A2A    6800000000        push 0x0
00D13A2F    b9c1c1ae05        mov ecx,05AEC1C1  'ICData target:[]'
00D13A34    ba19d5b100        mov edx,00B1D519  Array[2, 2, null]
00D13A39    e822c7feff        call 00D00160  [stub: OneArgCheckInlineCache] <-- call into Float32List operator []
00D13A3E    83c408            add esp,0x8
00D13A41    50                push eax
00D13A42    58                pop eax
00D13A43    8945f4            mov [ebp-0xc],eax

Native Dart object properties:

00D0F8DC    0fb74301          movzx_w eax,[ebx+0x1]
00D0F8E0    3d09020000        cmp eax, 00000209 
00D0F8E5    0f8594190000      jnz 00D1127F          <-- I _think_ the cmp, jnz pattern is testing if the object is what the function was optimized for because 00D1127F has a call to Deoptimize
00D0F8EB    8b4307            mov eax,[ebx+0x7]
00D0F8EE    8945f0            mov [ebp-0x10],eax
00D0F8F1    8b5d0c            mov ebx,[ebp+0xc]
00D0F8F4    0fb74301          movzx_w eax,[ebx+0x1]
00D0F8F8    3d20020000        cmp eax, 00000220
00D0F8FD    0f8587190000      jnz 00D1128A
00D0F903    8b430f            mov eax,[ebx+0xf]
00D0F906    50                push eax
00D0F907    5b                pop ebx
00D0F908    f6c301            test_b ebx,0x1
00D0F90B    0f8484190000      jz 00D11295

I would love to hear from the VM team on this. Are there plans to inline the Float32Array index operator? Is it even possible for the VM to inline "native" function calls?

Thanks,
--
John McCutchan <jo...@johnmccutchan.com>

John McCutchan

Jun 16, 2012, 8:16:31 PM
to General Dart Discussion
Accidentally pressed Send before I was finished. See below.

After reading all the properties, the guts of both functions look similar:

Float32List:

00D13E9A    f20f104a07        movsd xmm1,[edx+0x7]
00D13E9F    f20f59c1          mulsd xmm0,xmm1
00D13EA3    f20f114107        movsd [ecx+0x7],xmm0
00D13EA8    51                push ecx
00D13EA9    5a                pop edx
00D13EAA    58                pop eax
00D13EAB    f20f104007        movsd xmm0,[eax+0x7]
00D13EB0    f20f104a07        movsd xmm1,[edx+0x7]
00D13EB5    f20f58c1          addsd xmm0,xmm1
00D13EB9    f20f114007        movsd [eax+0x7],xmm0

Native Dart objects:

00D10222    f20f104a07        movsd xmm1,[edx+0x7]
00D10227    f20f59c1          mulsd xmm0,xmm1 <--- multiply
00D1022B    f20f114107        movsd [ecx+0x7],xmm0
00D10230    51                push ecx
00D10231    5a                pop edx
00D10232    58                pop eax
00D10233    f20f104007        movsd xmm0,[eax+0x7]
00D10238    f20f104a07        movsd xmm1,[edx+0x7]
00D1023D    f20f58c1          addsd xmm0,xmm1 <-- add
00D10241    f20f114007        movsd [eax+0x7],xmm0

The above code is repeated many times.

The end of the Float32List function where the values are written is full of calls:

00D13F8A    b901cbae05        mov ecx,05AECB01  'ICData target:[]='
00D13F8F    ba1121ca00        mov edx,00CA2111  Array[3, 3, null]
00D13F94    e8c7c1feff        call 00D00160  [stub: OneArgCheckInlineCache]

In this specific example, working with a Float32List is roughly 18x slower (10.92 ms vs. 196.56 ms average), and I believe the slowdown is caused by all the calls to operator[] and operator[]=.

--
John McCutchan <jo...@johnmccutchan.com>

Srdjan Mitrovic

Jun 18, 2012, 1:56:25 AM
to John McCutchan, General Dart Discussion, Carl Shapiro
Yes, we will inline Float32Array indexed accesses. We already inline some native function calls, e.g., for growable and fixed-size lists, as well as some Uint array indexed accesses. All floating-point values are doubles and are represented as objects in Dart. We will also add optimizations that allow direct floating-point operations without allocating double objects on the heap.

Cheers,

- Srdjan