ArrayFire vs custom CUDA kernel performance


Evan Hilscher

May 4, 2023, 3:59:12 PM
to ArrayFire Users

I have a custom CUDA kernel that does the following:

  • Take a 3xN matrix of uints, A
  • Sum along the first dimension to produce a 1xN vector of uints, V
  • Use V to index into a table of ushorts, producing a 1xN vector of uints V' (cast up from ushorts)
  • Multiply each of the three 1xN rows of A element-wise by V' to produce a 3xN matrix A' of uints
  • Cast A' to ushorts and write it to the output buffer

This CUDA kernel runs in about 3ms on a matrix of 3x3145728 uints.
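For clarity, here is the same pipeline as a scalar CPU reference in plain C++ (the function name and the tiny example values are mine, purely for illustration; the real kernel runs on the GPU):

```cpp
#include <cstdint>
#include <vector>

// Scalar CPU reference of the pipeline above: sum each 3-element column,
// look the sum up in the table, then scale the column by the table value.
std::vector<uint16_t> reference(const std::vector<uint16_t>& A,
                                const std::vector<uint16_t>& table,
                                std::size_t columns)
{
    std::vector<uint16_t> out(3 * columns);
    for (std::size_t c = 0; c < columns; ++c) {
        uint32_t V = uint32_t(A[3*c]) + A[3*c + 1] + A[3*c + 2]; // column sum
        uint32_t V_prime = table[V];                             // table lookup
        for (std::size_t r = 0; r < 3; ++r)
            out[3*c + r] = uint16_t(A[3*c + r] * V_prime);       // scale, cast
    }
    return out;
}
```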

I've written a function using ArrayFire that produces identical results:

void test()
{
    static std::vector<uint16_t> input(3 * 3145728);
    static std::vector<uint16_t> output(3 * 3145728);
    static std::vector<uint16_t> table(2048);

    af::array A = af::array(3, static_cast<dim_t>(input.size() / 3), input.data()).as(u32);

    af::array V = af::sum(A);

    af::array afTable(static_cast<dim_t>(table.size()), table.data());
    af::array V_prime = af::lookup(afTable, V).T().as(u32);

    af::array A_prime = af::batchFunc(A, V_prime, af::operator*).as(u16);

    A_prime.host(output.data());
}

Using af::timeit(test) I see that this function, after a few iterations, stabilizes at about 12ms.

Why does this function take 4x as long as the CUDA kernel? Is this a limitation of ArrayFire, or is my ArrayFire code not optimal for this problem? The difference between 3ms and 12ms is significant for the problem I'm working on.

I've also tried swapping batchFunc for gfor and manual tiling. gfor was much slower (~40ms), and manual tiling produced a negligible difference in timing.

To rule out host/device transfer overhead as the cause, I've observed that the following ArrayFire function executes in about 3ms:

void test()
{
    static std::vector<uint16_t> input(3 * 3145728);
    static std::vector<uint16_t> output(3 * 3145728);

    af::array A = af::array(3, static_cast<dim_t>(input.size() / 3), input.data());

    af::array A_prime = (A + 1).as(u16);

    A_prime.host(output.data());
}

Gallagher Pryor

May 17, 2023, 1:20:01 PM
to ArrayFire Users
The test() function you gave includes two cpu->gpu transfers (the constructions of A and afTable from host pointers) and one gpu->cpu transfer (the A_prime.host() call at the end). Did your timing of the CUDA kernel include these transfers?

Umar Arshad

May 17, 2023, 6:01:02 PM
to ArrayFire Users
It looks like you are performing some extra operations that you are probably not doing in your CUDA code. For example, you could perform all the casting operations and structure the input data before uploading to the GPU instead of transposing every time. This will give you a better comparison to what you are probably doing in your CUDA application.

Evan Hilscher

May 17, 2023, 7:28:07 PM
to ArrayFire Users
The timing for my CUDA kernel did include both host-to-device transfers and the device-to-host transfer. Additionally, all of the type casting you see in test() was also done inside the CUDA kernel. I will pare down my CUDA kernel and post it here, so you don't have to take my word for it.

Evan Hilscher

May 17, 2023, 8:19:19 PM
to ArrayFire Users
The average runtime of the following kernel with all allocation, copies, and casts is 4ms:

__global__ void testKernel(
        uint16_t* table,
        uint16_t* A,
        uint16_t* A_prime,
        uint32_t N)
{
    uint32_t column = blockDim.x * blockIdx.x + threadIdx.x;
    if (column >= N)
    {
        return;
    }

    column *= 3;  // offset of this column's first element

    // Sum the column, look up the scale factor, then scale and store
    uint32_t V = A[column] + A[column + 1] + A[column + 2];

    uint32_t V_prime = table[V];

    A_prime[column]     = A[column]     * V_prime;
    A_prime[column + 1] = A[column + 1] * V_prime;
    A_prime[column + 2] = A[column + 2] * V_prime;
}

void test()
{
    static constexpr size_t columns = 3145728;

    static std::vector<uint16_t> input(3 * columns);
    static std::vector<uint16_t> output(3 * columns);

    static std::vector<uint16_t> table(2048);

    uint16_t* d_input = nullptr;
    uint16_t* d_output = nullptr;
    uint16_t* d_table = nullptr;

    cudaMalloc(&d_input, input.size() * sizeof(uint16_t));
    cudaMalloc(&d_output, output.size() * sizeof(uint16_t));
    cudaMalloc(&d_table, table.size() * sizeof(uint16_t));

    cudaMemcpy(d_input, input.data(), input.size() * sizeof(uint16_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_table, table.data(), table.size() * sizeof(uint16_t), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (columns + threadsPerBlock - 1) / threadsPerBlock;  // round up
    testKernel<<<blocksPerGrid, threadsPerBlock>>>(d_table, d_input, d_output, columns);

    cudaError_t error = cudaDeviceSynchronize();
    if (error != cudaSuccess)
    {
        cudaFree(d_input);
        cudaFree(d_output);
        cudaFree(d_table);

        return;
    }

    cudaMemcpy(output.data(), d_output, output.size() * sizeof(uint16_t), cudaMemcpyDeviceToHost);

    cudaFree(d_input);
    cudaFree(d_output);
    cudaFree(d_table);
}

int main(int argc, char * argv[])
{
    for (int i = 0; i < 100; ++i)
    {
        QElapsedTimer timer;
        timer.start();
        test();
        std::cout << timer.elapsed() << std::endl;
    }

    return 0;
}
Evan Hilscher

May 17, 2023, 8:29:22 PM
to ArrayFire Users
I missed your comment about eliminating the transposition. I will have to think about how to structure my input to avoid that and see if it helps.

Evan Hilscher

May 17, 2023, 8:40:00 PM
to ArrayFire Users
I realized I could just change the dimensionality of my dummy data and not worry about actual reordering for now. This function stabilizes at about 5ms:

void test()
{
    static std::vector<uint16_t> input(3 * 3145728);
    static std::vector<uint16_t> output(3 * 3145728);
    static std::vector<uint16_t> table(2048);

    af::array A = af::array(static_cast<dim_t>(input.size() / 3), 3, input.data()).as(u32);

    af::array V = af::sum(A, 1);

    af::array afTable(static_cast<dim_t>(table.size()), table.data());
    af::array V_prime = af::lookup(afTable, V).as(u32);

    af::array A_prime = af::batchFunc(A, V_prime, af::operator*).as(u16);

    A_prime.host(output.data());
}

So the transpose is expensive. I'm not sure yet how to do this input/output reordering for my particular problem, but at least I know where to start.

Thank you so much!