Assertion `array_dev != nullptr' failed. in tileLayoutConvert

bcsj

Jun 19, 2024, 8:47:26 AM
to SLATE User
Hi, I'm running into the following error with some of my code.

main.app: /.../include/slate/BaseMatrix.hh:3524: void slate::BaseMatrix<double>::tileLayoutConvert(std::set<ij_tuple> &, int, Layout, bool) [scalar_t = double]: Assertion `array_dev != nullptr' failed.
main.app: /.../include/slate/BaseMatrix.hh:3524: void slate::BaseMatrix<double>::tileLayoutConvert(std::set<ij_tuple> &, int, Layout, bool) [scalar_t = double]: Assertion `array_dev != nullptr' failed.


Now, I believe the error is not due to SLATE but due to my own code that I'm using to populate the matrix.

As a note, my SLATE version is the 2023.11.05 release.

I have the following method that I use to populate the matrix with data from an HDF5 file:

template <typename scalar_t>
void read(
    slate::Matrix<scalar_t> A,
    const std::string filename,
    const std::string dsname)
{
    // hdf5 initialization
    H5::H5File file = impl::h5open(filename, dsname);
    H5::DataSet dataset = file.openDataSet(dsname);
    H5::DataSpace dataspace = dataset.getSpace();

    hsize_t start[2];  // Start of the hyperslab
    hsize_t count[2];  // Dimensions of the hyperslab

    // Read file
    int64_t mt = A.mt();
    int64_t nt = A.nt();

    int64_t i, j, ii, jj, kk;
    int64_t mb, nb;
    int64_t mm = 0, nn = 0;
    for (i = 0; i < mt; i++) {
        mb = A.tileMb(i);
        nn = 0;
        for (j = 0; j < nt; j++) {
            nb = A.tileNb(j);

            if (A.tileIsLocal(i, j)) {
                A.tileGetForWriting(i, j, slate::LayoutConvert::RowMajor);
                auto T = A(i, j);
                auto tiledata = T.data();
                int64_t stride = T.stride();

                // hdf5 get data
                dataspace.selectNone(); // Make sure dataspace selection is empty
                start[0] = mm; start[1] = nn;
                count[0] = mb; count[1] = nb;
                dataspace.selectHyperslab(H5S_SELECT_SET, count, start);

                hsize_t size = mb * nb;
                std::vector<scalar_t> data(size); // std::vector (needs <vector>) instead of a non-standard VLA
                H5::DataSpace memspace(1, &size);

                dataset.read(data.data(), hdf5::datatype<scalar_t>::value, memspace, dataspace);
               
                // populate tile
                kk = 0;
                for (ii = 0; ii < mb; ii++) {
                    for (jj = 0; jj < nb; jj++) {
                        tiledata[jj + ii * stride] = data[kk++];
                    }
                }

            }

            nn += nb;
        }
        mm += mb;
    }

    // Update
    A.tileUpdateAllOrigin();
    A.releaseWorkspace();

    // close stuff
    dataset.close();
    file.close();

}

The error occurs when I call

A.insertLocalTiles(slate::Target::Devices);
hdf5::read(A, matrix_file, "X");

const scalar_t one = 1.0;
const scalar_t zero = 0.0;
auto AT = slate::transpose(A);
slate::gemm(one, A, AT, zero, D, opts);

I have another custom method I've built based on the slate::gemm code. It uses the matrix A in RowMajor only (no transpose is involved there), and it runs fine after the above hdf5::read method if I replace the slate::gemm call with it.

But if I have another method that modifies A, requesting the tiles in ColMajor in the process, then my own method suddenly breaks with the same error too.

In essence, there seems to be something going wrong with the layout when I fetch a device tile to the host in a particular layout, update the host copy, and try to flush the changes back with tileUpdateAllOrigin().

I can only assume that I should be doing something more beyond simply calling tileUpdateAllOrigin() to get the changes properly back to the origin on the device, but it is not clear to me what that something is.

Best, bcsj

bcsj

Jun 19, 2024, 9:36:50 AM
to SLATE User, bcsj
Here is a minimal working example showing the bug in practice. Note in main() where I call populate().

#include <slate/slate.hh>
#include <cassert>  // for the assert() on MPI_Init_thread below

template <typename scalar_t>
using Matrix = slate::Matrix<scalar_t>;

const int64_t Hostnum = -1;
auto RM = slate::LayoutConvert::RowMajor;
auto CM = slate::LayoutConvert::ColMajor;

template <typename scalar_t>
void populate(Matrix<scalar_t> &A, slate::LayoutConvert layout)
{
    int64_t m = A.m();
    int64_t n = A.n();
    int64_t mt = A.mt();
    int64_t nt = A.nt();

    int64_t i, j, ii, jj;
    int64_t mm, nn, mb, nb;
    mm = 0;
    for (i = 0; i < mt; i++) {
        nn = 0;
        mb = A.tileMb(i);
        for (j = 0; j < nt; j++) {
            nb = A.tileNb(j);
            if (A.tileIsLocal(i, j)) {
                A.tileGetForWriting(i, j, layout);

                auto T = A(i, j);
                auto data = T.data();
                auto stride = T.stride();

                for (ii = 0; ii < mb; ii++) {
                    for (jj = 0; jj < nb; jj++) {
                        int64_t kk = layout == slate::LayoutConvert::RowMajor
                                    ? ii * stride + jj
                                    : ii + jj * stride;
                        data[kk] = (mm + ii) * n + (nn + jj);
                    }
                }
            }
            nn += nb;
        }
        mm += mb;
    }
    A.tileUpdateAllOrigin();
    A.releaseWorkspace();
}

int main(int argc, char** argv) {
    using scalar_t = double;

    int64_t p = 2, q = 2;
    int err=0, mpi_provided=0;
    err = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &mpi_provided );
    assert( err == 0 && mpi_provided == MPI_THREAD_MULTIPLE );

    slate::Options opts = {{slate::Option::Target, slate::Target::Devices}};

    const int64_t m = 128;
    const int64_t n = 128;
    const int64_t tilesize = 16;

    Matrix<scalar_t> A(m , n, tilesize, p, q, MPI_COMM_WORLD);
    Matrix<scalar_t> D(m , m, tilesize, p, q, MPI_COMM_WORLD);

    A.insertLocalTiles(slate::Target::Devices);
   
    // =========================================================
    populate(A, CM); // CM works, RM gives the error
    // =========================================================

    D.insertLocalTiles(slate::Target::Devices);

    const scalar_t one = 1.0;
    const scalar_t zero = 0.0;
    auto AT = slate::transpose(A);
    slate::gemm(one, A, AT, zero, D, opts);

    MPI_Finalize();
    return 0;
}

bcsj

Jun 19, 2024, 11:06:37 AM
to SLATE User, bcsj
I've played around a bit more, and it seems the issue disappears if I forcibly update the tile on the device with

if (!T.origin() && slate::Layout(layout) != A.layout())
    A.tileGetForReading(i, j, A.tileDevice(i, j), slate::LayoutConvert(A.layout()));

after the inner loops over ii and jj in populate().

I somehow thought that tileUpdateAllOrigin() would do this for me though?

bcsj

Jun 20, 2024, 5:16:16 AM
to SLATE User, bcsj
Okay, never mind; testing some more, this doesn't actually seem to fix the issue. I still get the error under some odd circumstances.

bcsj

Jun 20, 2024, 6:05:26 AM
to SLATE User, bcsj
So, I think I'm beginning to understand the issue better, though I still don't understand it fully.

To clarify: I have a function I've built based on the slate::gemm code. It uses the same "backbone" to transfer tiles around for the computation, and it works, but only conditionally. The difference from the slate::gemm code is that mine wants the tiles in RowMajor order, due to the structure of the GPU kernel, so in the part of the slate::gemm code that sets layout = slate::Layout::ColMajor I have simply changed ColMajor to RowMajor. This seems to create some sort of hiccup with the matrix being ColMajor by default.
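Concretely, the change is just that internal layout constant; a rough sketch of what I mean (the wrapper function is only for illustration, everything else in my routine follows the slate::gemm backbone):

#include <slate/slate.hh>

// Illustrative only: slate::gemm fixes its working layout as ColMajor;
// my routine, built on the same backbone, uses RowMajor instead and
// passes it on to listBcastMT and the tile fetches.
void layout_choice_sketch()
{
    // const slate::Layout layout = slate::Layout::ColMajor;  // slate::gemm
    const slate::Layout layout = slate::Layout::RowMajor;     // my routine
    (void) layout;  // rest of the gemm backbone omitted here
}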

Things were working fine with the hdf5::read method above, and that seems to be because it works on the data in RowMajor: the moment I put the two-line "fix" into the read method, my computation suddenly broke. Testing some more, however, I have found that running
for (int64_t i = 0; i < A.mt(); i++) {
    for (int64_t j = 0; j < A.nt(); j++) {
        if (A.tileIsLocal(i, j)) {
            A.tileGetForReading(i, j, 0, slate::LayoutConvert::RowMajor);
        }
    }
}
after the hdf5::read and before the computation makes the code work again.

So, it seems that the call to
A.template listBcastMT<target>(bcast_list_A0, layout);
doesn't deal well with device tiles of A being ColMajor when the layout input to listBcastMT is RowMajor. But if I do the A.tileGetForReading(..., slate::LayoutConvert::RowMajor) first, then somehow it is fine ... ?

Weirder still, when I look inside listBcastMT, it seems to call tileIbcastToSet with the layout, and that in turn calls
tileGetForReading(i, j, device, LayoutConvert(layout));
on line 2430 of BaseMatrix.hh, so it actually seems to make the exact same call as the one I just injected before the method ... I am so confused.

bcsj

Jun 20, 2024, 8:50:09 AM
to SLATE User, bcsj
I'm slowly uncovering more of the issue...

Here is the small working example again. The "fix" introduced above is now in the populate method, and that method no longer causes any issue.
Instead, I've created a small function create_issue() which reintroduces the issue.
I'm beginning to suspect there is some internal problem with converting the layout of a tile on the device in the tileLayoutConvert() method?

#include <slate/slate.hh>
#include <cassert>  // for the assert() on MPI_Init_thread below

template <typename scalar_t>
using Matrix = slate::Matrix<scalar_t>;

const int64_t Hostnum = -1;
auto RM = slate::LayoutConvert::RowMajor;
auto CM = slate::LayoutConvert::ColMajor;

template <typename scalar_t>
void populate(Matrix<scalar_t> &A, slate::LayoutConvert layout)
{
    int64_t m = A.m();
    int64_t n = A.n();
    int64_t mt = A.mt();
    int64_t nt = A.nt();

    int64_t i, j, ii, jj;
    int64_t mm, nn, mb, nb;
    mm = 0;
    for (i = 0; i < mt; i++) {
        nn = 0;
        mb = A.tileMb(i);
        for (j = 0; j < nt; j++) {
            nb = A.tileNb(j);
            if (A.tileIsLocal(i, j)) {
                auto org_layout = A.layout();
                A.tileGetForWriting(i, j, layout);

                auto T = A(i, j);
                auto data = T.data();
                auto stride = T.stride();

                for (ii = 0; ii < mb; ii++) {
                    for (jj = 0; jj < nb; jj++) {
                        int64_t kk = layout == slate::LayoutConvert::RowMajor
                                    ? ii * stride + jj
                                    : ii + jj * stride;
                        data[kk] = (mm + ii) * n + (nn + jj);
                    }
                }

                // This fixes the issue in the first place
                if (!T.origin() && slate::Layout(layout) != A.layout())
                    A.tileGetForReading(i, j, A.tileDevice(i, j), slate::LayoutConvert(A.layout()));
            }
            nn += nb;
        }
        mm += mb;
    }
    A.tileUpdateAllOrigin();
    A.releaseWorkspace();
}

// This function reintroduces the issue
template <typename scalar_t>
void create_issue(Matrix<scalar_t> &A, slate::LayoutConvert layout) {
    for (int64_t i = 0; i < A.mt(); i++) {
        for (int64_t j = 0; j < A.nt(); j++) {
            if (A.tileIsLocal(i, j)) {
                A.tileGetForReading(i, j, A.tileDevice(i, j), layout);
            }
        }
    }
}

int main(int argc, char** argv) {
    using scalar_t = double;

    int64_t p = 2, q = 2;
    int err=0, mpi_provided=0;
    err = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &mpi_provided );
    assert( err == 0 && mpi_provided == MPI_THREAD_MULTIPLE );

    slate::Options opts = {{slate::Option::Target, slate::Target::Devices}};

    const int64_t m = 128;
    const int64_t n = 128;
    const int64_t tilesize = 16;

    Matrix<scalar_t> A(m , n, tilesize, p, q, MPI_COMM_WORLD);
    Matrix<scalar_t> D(m , m, tilesize, p, q, MPI_COMM_WORLD);

    A.insertLocalTiles(slate::Target::Devices);
   
    // =========================================================
    populate(A, CM); // <- a fix was introduced here
    // =========================================================
    create_issue(A, RM); // <- the issue is reintroduced here
                         // CM works, RM gives the error
    // =========================================================

    D.insertLocalTiles(slate::Target::Devices);

    if (A.mpiRank() == 0) {
        if (A.layout() == slate::Layout::RowMajor) {
            std::cout << "A is RowMajor" << std::endl;
        } else {
            std::cout << "A is ColMajor" << std::endl;
        }
    }

    const scalar_t one = 1.0;
    const scalar_t zero = 0.0;
    auto AT = slate::transpose(A);
    slate::gemm(one, A, AT, zero, D, opts);

    MPI_Finalize();
    return 0;
}

bcsj

Jun 20, 2024, 9:03:14 AM
to SLATE User, bcsj
Okay, I think I've found a working solution: in the slate::gemm code on which I've based my method, there is a place where the batch arrays are allocated:

if (target == Target::Devices) {
    C.allocateBatchArrays();
    C.reserveDeviceWorkspace();
}

If I introduce the following change
if (target == Target::Devices) {
    C.allocateBatchArrays();
    C.reserveDeviceWorkspace();
    A.allocateBatchArrays();
    B.allocateBatchArrays();
}

then the tileLayoutConvert() method doesn't throw the assertion error anymore.

So it seems that for device tiles the BaseMatrix::tileLayoutConvert() method needs the batch arrays to do a conversion, but slate::gemm only allocates batch arrays for the C matrix, not for A and B.

bcsj

Jun 20, 2024, 9:04:45 AM
to SLATE User, bcsj
Now for some follow-up questions:
  1. Are there any unforeseen consequences of allocating batch arrays for those matrices?
  2. Can I clean up those batch arrays again somehow?

Mark Gates

Jun 20, 2024, 11:39:27 AM
to bcsj, SLATE User
Thanks for the example code, and glad that you seem to have worked out the issue. Indeed, row-major is not very well tested in SLATE. Most code is done in col-major.

1. Offhand, the only issue I know of is the overhead of allocating the batch arrays, which are a host vector and a device vector of pointers whose size is the number of local tiles.

2. You can use A.clearBatchArrays() to clear them (i.e., delete them).
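For example, a minimal sketch of the allocate/use/clear pairing (the commented call in the middle is a placeholder for your custom device-target routine, not a SLATE function):

A.allocateBatchArrays();   // host + device arrays of tile pointers
// my_custom_rowmajor_gemm( one, A, AT, zero, D, opts );  // placeholder routine
A.clearBatchArrays();      // deletes those arrays again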

Mark

bcsj

Jun 21, 2024, 4:18:42 AM
to SLATE User, mga...@icl.utk.edu, bcsj
Hey, thank you for the reply! :)

Good to know that row-major is something to keep an eye on and that I should perhaps try to stick to col-major.

I think my current code could be reworked into col-major, so I might try that, but for now it at least works as is.

1.  Good, that seems like a very minor thing. :)

2. Great!

Thanks again!
