Segmentation fault in slate::gels


bcsj

Sep 21, 2023, 8:28:30 AM
to SLATE User
So, I'm working on some code for solving a least squares problem and I'm running into a segmentation fault in the slate::gels call. I've already done a lot of debugging to make sure my input matrices don't contain faulty values, NaNs, or other odd stuff; the contents of the matrices are definitely error-free at this point. But Slurm breaks off exactly in the slate::gels call with the following error.

srun: error: nid005137: task 0: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=4576728.0
slurmstepd: error: *** STEP 4576728.0 ON nid005137 CANCELLED AT 2023-09-21T12:31:27 ***
srun: error: nid005137: tasks 3,7: Terminated
srun: error: nid005137: tasks 1-2,4-6: Terminated
srun: Force Terminated StepId=4576728.0


Interestingly enough, I saved the inputs and wrote a small program that reads them in and makes the same call, and that program throws no error:

Example code: ignore the // [X] comment for now
// TEST #5 //////////////////////////////////////////////////////////////////////////////
using scalar_t = double;
int64_t m = 4860, n = 2160, mb = 1024, nb = 1024, p = 2, q = 2;
slate::Matrix<scalar_t> A(m, n, mb, nb, p, q, MPI_COMM_WORLD);
A.insertLocalTiles();
slateUtils::matrix::fromFile(A, "/.../debug_dump_A0");

slate::Matrix<scalar_t> BX(m, 1, mb, nb, p, q, MPI_COMM_WORLD);
BX.insertLocalTiles();
slateUtils::matrix::fromFile(BX, "/.../debug_dump_BX0");

// [X]

slate::Options opts = {{slate::Option::Target, slate::Target::Devices}};
slate::gels(A, BX, opts);

slateUtils::matrix::toFile(A, "./export/test.A.1.csv");
slateUtils::matrix::toFile(BX, "./export/test.BX.1.csv");

As a sanity check, I've also compared the output against A\b in MATLAB, and the results matched.
Note: "slateUtils" is just a personal library of methods to simplify my life working with SLATE.

The call in the original code is the same as above:
slate::Options opts = {{slate::Option::Target, slate::Target::Devices}};
slate::gels(A, BX, opts);

I've found (through a lot of trial and error) that the segmentation fault appears only
when this slate::copy() call runs before slate::gels(); if I comment it out, the fault goes away.
slate::Matrix<scalar_t> BX_copy = BX.emptyLike();
BX_copy.insertLocalTiles();
slate::copy(BX, BX_copy); // <--
I wanted the copy because slate::gels() is supposed to modify BX to contain the solution, and I needed the original rhs for some later testing.
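
For context, one common use of a saved right-hand side is a residual check after the solve. The sketch below is plain C++ and does not use SLATE: residual_norm is a hypothetical helper, and it assumes small dense data stored column-major, with x holding the n solution entries and with unmodified copies of both A and b available.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of SLATE): returns ||A*x - b||_2 for a dense
// m-by-n matrix A stored column-major with leading dimension m.
double residual_norm(const std::vector<double>& A,
                     const std::vector<double>& x,
                     const std::vector<double>& b,
                     int64_t m, int64_t n)
{
    assert(static_cast<int64_t>(A.size()) == m * n);
    assert(static_cast<int64_t>(x.size()) == n);
    assert(static_cast<int64_t>(b.size()) == m);

    double sum_sq = 0.0;
    for (int64_t i = 0; i < m; ++i) {
        double r = -b[i];                 // start from -b(i)
        for (int64_t j = 0; j < n; ++j)
            r += A[i + j * m] * x[j];     // add A(i, j) * x(j)
        sum_sq += r * r;
    }
    return std::sqrt(sum_sq);
}
```

Since b would be overwritten by the solve, the check needs the copy taken beforehand.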

I've tried to check whether a race condition is going on by including a
MPI_Barrier(MPI_COMM_WORLD);
call between the slate::copy() and slate::gels() calls, but it doesn't appear to affect the situation.

In the example code above, I inserted the 3 BX_copy lines at the location I marked with "// [X]", and that code then started producing segmentation faults too. So it really seems to be a bad interaction between slate::copy() and slate::gels().

After more testing, this may be related to earlier calls to tileGetForWriting(i, j), which I use when populating the matrix. Reading through the tileGetForWriting() code and subsequently the tileGet() code on GitHub, it could be related to the MOSI state becoming Invalid(?). I've attempted to fix this with a tileUpdateAllOrigin() call after the modifications, but it doesn't seem to fix the issue.

Does anyone have any suggestions on what I might try to fix this? @MarkGates ... ideas?

Notably, I only seem to run into the error when I do the slate::copy() call.

I've included the slateUtils::matrix::fromFile code here; the last line was my attempt to sort out the MOSI state after the tileGetForWriting calls.
template <typename scalar_t>
Status fromFile(
    slate::Matrix<scalar_t> A,
    const std::string filename,
    const char delimiter)
{
 
    std::ifstream file(filename);
    if (!file.is_open())
        throw std::runtime_error("Failed to open the file: " + filename);

    std::stringstream ss("");

    std::string line, element;

    int64_t mt = A.mt();
    int64_t nt = A.nt();

    int64_t skips, skip_counter = 0;
    int64_t mb, nb, i, j, ii, jj;
    for (i = 0; i < mt; i++) {
        mb = A.tileMb(i);
        for (ii = 0; ii < mb; ii++) {

            std::getline(file, line);
            ss << line;

            skip_counter = 0;
            for (j = 0; j < nt; j++) {
                nb = A.tileNb(j);
                if (A.tileIsLocal(i, j)) {

                    // Skip elements for non-local tiles.
                    for (skips = 0; skips < skip_counter; skips++)
                        std::getline(ss, element, delimiter);
                    skip_counter = 0;

                    // First time accessing tile, getForWriting in RowMajor;
                    if (ii == 0) A.tileGetForWriting(i, j, slate::LayoutConvert::RowMajor);

                    // Get tile data
                    auto tile = A(i, j);
                    auto tiledata = tile.data();

                    // Insert line of elements
                    for (jj = 0; jj < nb; jj++) {
                        std::getline(ss, element, delimiter);
                        tiledata[jj + ii * tile.stride()] = stod(element); // string-to-double
                    }

                } else { skip_counter += nb; }
            }
           
            ss.str("");
            ss.clear();    
        }
    }

    file.close();
    return Status::Success;
    MPI_Barrier(A.mpiComm()); // Ensure no MPI process darts ahead
    A.tileUpdateAllOrigin();
}
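
The per-row parsing step in fromFile can be tried in isolation. This is a minimal sketch that does not use SLATE: parse_row is a hypothetical helper, it assumes a single process and one row-major buffer, and unlike the code above it throws when a line has too few elements.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of SLATE or slateUtils): parses one delimited
// text line and stores its first `ncols` values in row `row` of a row-major
// buffer whose rows are `stride` elements apart.
void parse_row(const std::string& line, char delimiter,
               double* data, int64_t row, int64_t ncols, int64_t stride)
{
    assert(stride >= ncols);
    std::istringstream ss(line);
    std::string element;
    for (int64_t jj = 0; jj < ncols; ++jj) {
        if (!std::getline(ss, element, delimiter))
            throw std::runtime_error("Too few elements in line: " + line);
        data[jj + row * stride] = std::stod(element);  // string-to-double
    }
}
```

For example, parse_row("1.5,2,3", ',', buf, 1, 3, 3) fills the second row of a 2x3 buffer and leaves the first row untouched.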

bcsj

Sep 21, 2023, 8:36:32 AM
to SLATE User, bcsj
Ah, whoops. That return statement somehow got moved above the last two lines, which is a bug. I've fixed it. I'm not sure how long it was in the wrong order, but fixing it didn't change any of the above.

Neil Lindquist

Sep 21, 2023, 9:05:35 AM
to SLATE User, bcsj
Hi Bjørn,

I also ran into this same segfault on Frontier and am looking into it.  When I added HIPCCFLAGS+=-g -ggdb to my make.inc, it no longer segfaulted, although it still hangs for larger problems.  It's not a proper fix, but it might let you keep working until I can fix it.

I'll let you know what I find.

Neil

bcsj

Sep 21, 2023, 9:08:57 AM
to SLATE User, bcsj
Adding to the above, I've tried to run the following small example too now:
template <typename matrix_type>
void random_matrix( matrix_type& A )
{
    // for each tile in the matrix
    for (int64_t j = 0; j < A.nt(); ++j) {
        for (int64_t i = 0; i < A.mt(); ++i) {
            if (A.tileIsLocal( i, j )) {
                A.tileGetForWriting(i, j, slate::LayoutConvert::ColMajor);
                // set data-values in the local tile
                auto tile = A( i, j );
                auto tiledata = tile.data();
                for (int64_t jj = 0; jj < tile.nb(); ++jj) {
                    for (int64_t ii = 0; ii < tile.mb(); ++ii) {
                        tiledata[ ii + jj*tile.stride() ] =
                            (1.0 - (rand() / double(RAND_MAX))) * 100;
                    }
                }
            }
        }
    }
    A.tileUpdateAllOrigin();
    A.releaseWorkspace();
}

using scalar_t = double;
int64_t m = 10, n = 10, mb = 4, nb = 4, p = 2, q = 2;
slate::Matrix<scalar_t> A(m, n, mb, nb, p, q, MPI_COMM_WORLD);
A.insertLocalTiles();
random_matrix(A);

slate::Matrix<scalar_t> BX(m, 1, mb, nb, p, q, MPI_COMM_WORLD);
BX.insertLocalTiles();
random_matrix(BX);

slate::Matrix<scalar_t> BX_copy = BX.emptyLike();
BX_copy.insertLocalTiles();
slate::copy(BX, BX_copy);

slate::Options opts = {{slate::Option::Target, slate::Target::Devices}};
slate::gels(A, BX, opts);

I seem to get a segmentation fault if I change slate::LayoutConvert::ColMajor to slate::LayoutConvert::RowMajor in the random_matrix() function, but I don't appear to get any error with ColMajor.
This seems odd to me. Do the methods assume a particular layout?
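
For reference, here is a sketch of the generic offset arithmetic behind the layout question. It does not use SLATE and is not a diagnosis of the segfault: offset_col_major and offset_row_major are hypothetical helpers, and it is an assumption that a tightly packed mb-by-nb tile has stride mb in column-major storage and stride nb in row-major storage.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of SLATE).

// Offset of element (ii, jj) when columns are contiguous and consecutive
// columns are `stride` elements apart.
inline int64_t offset_col_major(int64_t ii, int64_t jj, int64_t stride)
{
    assert(ii >= 0 && jj >= 0 && stride > 0);
    return ii + jj * stride;
}

// Offset of element (ii, jj) when rows are contiguous and consecutive
// rows are `stride` elements apart.
inline int64_t offset_row_major(int64_t ii, int64_t jj, int64_t stride)
{
    assert(ii >= 0 && jj >= 0 && stride > 0);
    return jj + ii * stride;
}
```

Under those assumptions, for a 2x4 tile (8 elements) offset_col_major(1, 3, 2) and offset_row_major(1, 3, 4) both give 7, the last valid offset, while the column-major formula combined with the row-major stride, offset_col_major(1, 3, 4), gives 13, which is past the end of the tile.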

bcsj

Sep 21, 2023, 9:18:27 AM
to SLATE User, nlin...@icl.utk.edu, bcsj
Hi Neil,

Thanks, I hope you figure it out. I did another experiment (see my other message here; you might not have been notified about it, since I posted it almost simultaneously with your reply).
Could it perhaps be related to tile layouts?

Neil Lindquist

Sep 21, 2023, 9:23:33 AM
to SLATE User, bcsj
Yeah, I received your message right after sending mine.  SLATE should be handling all of the required layout conversions internally, and even if it's not converting, there shouldn't be a segfault when using square tiles.  So, something is broken inside SLATE; I'll see what I can figure out.

Neil

bcsj

Sep 21, 2023, 9:43:19 AM
to SLATE User, nlin...@icl.utk.edu, bcsj
Well, in the examples above, not all tiles are square. In the last example, for instance, I had

-------------------
| 4x4 | 4x4 | 4x2 |
|-----------------|
| 4x4 | 4x4 | 4x2 |
|-----------------|
| 2x4 | 2x4 | 2x2 |
-------------------

So at least some were rectangular there.
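
The grid above follows from m = n = 10 and mb = nb = 4. A sketch of that arithmetic, using hypothetical helpers rather than SLATE's own methods:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of SLATE).

// Number of tiles needed to cover n rows (or columns) with block size nb.
inline int64_t num_tiles(int64_t n, int64_t nb)
{
    assert(n >= 0 && nb > 0);
    return (n + nb - 1) / nb;  // ceiling division
}

// Size of tile k along one dimension; only the last tile can be smaller.
inline int64_t tile_size(int64_t k, int64_t n, int64_t nb)
{
    assert(0 <= k && k < num_tiles(n, nb));
    return std::min(nb, n - k * nb);
}
```

With n = 10 and nb = 4, this gives 3 tiles of sizes 4, 4, and 2 in each dimension, so the last tile row and tile column are the rectangular ones.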

Neil Lindquist

Sep 21, 2023, 10:45:10 AM
to SLATE User, bcsj, Neil Lindquist
Ah, right.  But either way, that example is supposed to work.

Neil

Neil Lindquist

Sep 26, 2023, 1:17:30 PM
to SLATE User, Neil Lindquist, bcsj
I fixed what appears to be the bug you're running into.  You can try the neil/qr-tile-life branch in my fork (https://github.com/neil-lindquist), which includes that and some other improvements to QR.  Otherwise, I'm working on getting that into the master branch.

If that doesn't fix your issue, let me know and I can look at it more.

Best,
Neil

bcsj

Sep 27, 2023, 4:12:01 AM
to SLATE User, nlin...@icl.utk.edu, bcsj
Hi Neil,

Thanks for letting me know. I'll give it a try in the next few days and report back.

bcsj

Sep 29, 2023, 3:40:09 AM
to SLATE User, bcsj, nlin...@icl.utk.edu
It seems there is a larger reservation on the LUMI partition until Tuesday next week, so trying it will have to wait until then.