The native code for indexing into a Float32Array is simple enough (base pointer + index), but the VM doesn't seem to optimize the array index calls. It does optimize regular member variable access, which makes those much faster.
I have been playing with two versions of 4x4 matrix multiplication: one reads the matrix elements from Dart object properties, the other from Float32List entries. The algorithm is identical in both; the only difference is how the elements are accessed. Here is the Float32List implementation:
void mat4mult(Float32List arg1, Float32List arg2) {
  // Elements are stored column-major: (row r, column c) lives at index c * 4 + r.
  final double m00 = arg1[0];
  final double m01 = arg1[4];
  final double m02 = arg1[8];
  final double m03 = arg1[12];
  final double m10 = arg1[1];
  final double m11 = arg1[5];
  final double m12 = arg1[9];
  final double m13 = arg1[13];
  final double m20 = arg1[2];
  final double m21 = arg1[6];
  final double m22 = arg1[10];
  final double m23 = arg1[14];
  final double m30 = arg1[3];
  final double m31 = arg1[7];
  final double m32 = arg1[11];
  final double m33 = arg1[15];
  final double n00 = arg2[0];
  final double n01 = arg2[4];
  final double n02 = arg2[8];
  final double n03 = arg2[12];
  final double n10 = arg2[1];
  final double n11 = arg2[5];
  final double n12 = arg2[9];
  final double n13 = arg2[13];
  final double n20 = arg2[2];
  final double n21 = arg2[6];
  final double n22 = arg2[10];
  final double n23 = arg2[14];
  final double n30 = arg2[3];
  final double n31 = arg2[7];
  final double n32 = arg2[11];
  final double n33 = arg2[15];
  // Store the product back into arg1 in the same column-major layout the
  // loads above use (the earlier version wrote results at transposed indexes).
  arg1[0] = (m00 * n00) + (m01 * n10) + (m02 * n20) + (m03 * n30);
  arg1[1] = (m10 * n00) + (m11 * n10) + (m12 * n20) + (m13 * n30);
  arg1[2] = (m20 * n00) + (m21 * n10) + (m22 * n20) + (m23 * n30);
  arg1[3] = (m30 * n00) + (m31 * n10) + (m32 * n20) + (m33 * n30);
  arg1[4] = (m00 * n01) + (m01 * n11) + (m02 * n21) + (m03 * n31);
  arg1[5] = (m10 * n01) + (m11 * n11) + (m12 * n21) + (m13 * n31);
  arg1[6] = (m20 * n01) + (m21 * n11) + (m22 * n21) + (m23 * n31);
  arg1[7] = (m30 * n01) + (m31 * n11) + (m32 * n21) + (m33 * n31);
  arg1[8] = (m00 * n02) + (m01 * n12) + (m02 * n22) + (m03 * n32);
  arg1[9] = (m10 * n02) + (m11 * n12) + (m12 * n22) + (m13 * n32);
  arg1[10] = (m20 * n02) + (m21 * n12) + (m22 * n22) + (m23 * n32);
  arg1[11] = (m30 * n02) + (m31 * n12) + (m32 * n22) + (m33 * n32);
  arg1[12] = (m00 * n03) + (m01 * n13) + (m02 * n23) + (m03 * n33);
  arg1[13] = (m10 * n03) + (m11 * n13) + (m12 * n23) + (m13 * n33);
  arg1[14] = (m20 * n03) + (m21 * n13) + (m22 * n23) + (m23 * n33);
  arg1[15] = (m30 * n03) + (m31 * n13) + (m32 * n23) + (m33 * n33);
}
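For comparison, the property-based version reads plain fields on a Dart object instead of list entries. The class and member names below are my own illustration, not code from my benchmark; to keep the sketch short it only computes the first row of the product, and the other three rows follow the same pattern:

```dart
// Hypothetical object-based 4x4 matrix: sixteen plain double fields.
// Field loads/stores like a.m00 are what the VM optimizes well,
// unlike the Float32List index calls in mat4mult above.
class Matrix4 {
  double m00 = 0.0, m01 = 0.0, m02 = 0.0, m03 = 0.0;
  double m10 = 0.0, m11 = 0.0, m12 = 0.0, m13 = 0.0;
  double m20 = 0.0, m21 = 0.0, m22 = 0.0, m23 = 0.0;
  double m30 = 0.0, m31 = 0.0, m32 = 0.0, m33 = 0.0;
}

// Computes row 0 of (a * b) and stores it back into a; rows 1-3 of the
// full multiply would repeat this with a.m1x, a.m2x, and a.m3x.
void mat4multRow0(Matrix4 a, Matrix4 b) {
  final double r00 = a.m00 * b.m00 + a.m01 * b.m10 + a.m02 * b.m20 + a.m03 * b.m30;
  final double r01 = a.m00 * b.m01 + a.m01 * b.m11 + a.m02 * b.m21 + a.m03 * b.m31;
  final double r02 = a.m00 * b.m02 + a.m01 * b.m12 + a.m02 * b.m22 + a.m03 * b.m32;
  final double r03 = a.m00 * b.m03 + a.m01 * b.m13 + a.m02 * b.m23 + a.m03 * b.m33;
  a.m00 = r00;
  a.m01 = r01;
  a.m02 = r02;
  a.m03 = r03;
}
```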
The performance difference is large:
=============================================
Matrix Multiplication
=============================================
Avg: 10.92 ms Min: 0.0 ms Max: 31.2 ms (Avg: 10920 Min: 0 Max: 31200)
=============================================
Matrix Multiplication Float32
=============================================
Avg: 196.56 ms Min: 187.2 ms Max: 202.801 ms (Avg: 196560 Min: 187200 Max: 202801)
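The harness that produced those numbers isn't shown; a minimal way to reproduce this kind of timing (iteration count and setup values are my assumptions) is a Stopwatch loop around the mat4mult function above:

```dart
import 'dart:typed_data';

void main() {
  // Two arbitrary 4x4 matrices as flat 16-element Float32Lists.
  final Float32List a = Float32List(16);
  final Float32List b = Float32List(16);
  for (int i = 0; i < 16; i++) {
    a[i] = i.toDouble();
    b[i] = (i + 1).toDouble();
  }
  const int iterations = 100000;
  final Stopwatch sw = Stopwatch()..start();
  for (int i = 0; i < iterations; i++) {
    mat4mult(a, b); // the Float32List implementation from the post
  }
  sw.stop();
  print('Avg: ${sw.elapsedMicroseconds / iterations} us per call');
}
```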