Multiply gives different answers for different block sizes


Robert Langefeld

Jul 5, 2024, 9:11:03 AM7/5/24
to SLATE User
I've been working with slate::multiply and slate::rank_k_update in my code. I am able to print my X matrix beforehand, and it appears to be correct (at least based on the printed SLATE output). I am attempting to perform the operation A = X X^T.

I notice that every time I run this, I get a different answer from the function calls depending on the block size I choose, regardless of whether I transpose manually and use multiply or use rank_k_update. Specifically, a block size of 1, or a block size greater than the matrix dimensions, both produce the same (incorrect) answer. Block sizes less than the matrix dimensions (so that more than one tile is stored, with more than one element per tile) each produce different (also incorrect) results.

For testing, I am working with relatively small matrices (5x5, 10x10, 15x10, etc.), and I've compared the results with those I obtain from NumPy by reading the same data in and performing the operation there. Also, I've been using the binary reading function, not the text one. Below is the code I've been using (note alpha = 1.0 and beta = 0.0; T and some other variables are determined by the call to the function this is nested within):
Multiply:
slate::Matrix<T> X;

if (input_x.substr(input_x.find_last_of(".") + 1) == "bin") {
    X = read_binary<T>(input_x, &n, nb, p, q, transpose);
} else {
    X = read_data<T>(input_x, &n, nb, p, q, transpose);
}

slate::SymmetricMatrix<T> A( slate::Uplo::Lower, n, nb, p, q, MPI_COMM_WORLD );
A.insertLocalTiles();

// A = X * X^T
auto XT = slate::transpose( X );
slate::Matrix<T> temp( n, n, nb, p, q, MPI_COMM_WORLD );
temp.insertLocalTiles();

// ISSUE: Different results for different nb
slate::multiply(alpha, X, XT, beta, temp);  // A = X X^T

print("A", temp);
Rank_K_Update:
slate::Matrix<T> X;

if (input_x.substr(input_x.find_last_of(".") + 1) == "bin") {
    X = read_binary<T>(input_x, &n, nb, p, q, transpose);
    //X = slate::transpose( X );
} else {
    X = read_data<T>(input_x, &n, nb, p, q, transpose);
}
slate::SymmetricMatrix<T> A( slate::Uplo::Lower, n, nb, p, q, MPI_COMM_WORLD );
A.insertLocalTiles();

// A = X * X^T
slate::rank_k_update(alpha, X, beta, A);  // A = X X^T

print("A", A);

Read Binary File:
template <typename T>
slate::Matrix<T> read_binary(std::string& input_file,
                            int64_t* n_ptr,
                            int64_t nb = 100,
                            int64_t p = 0,
                            int64_t q = 0,
                            bool transpose = false)
{
    // Input file is of form filename.bin
    // filename.bin stores a binary file of floats
    // Matrix dimensions stored as rows cols in filename.dim
    // Need to remove .bin and replace with .dim
    std::string dim_file = input_file.substr(0, input_file.find_last_of(".")) + ".dim";

    // Get the number of rows and columns to allocate memory
    std::ifstream file(dim_file);
    std::string line;
    int64_t rows = 0;
    int64_t cols = 0;

    std::getline(file, line);
    std::istringstream iss(line);
    iss >> rows >> cols;


    // Set n_ptr to final number of rows
    if (transpose) {
        *n_ptr = cols;
    } else {
        *n_ptr = rows;
    }

    // Close .dim file
    file.close();

    // Open binary file
    std::ifstream bin_file(input_file, std::ios::binary);

    // Allocate memory for the matrix
    slate::Matrix<T> X( rows, cols, nb, p, q, MPI_COMM_WORLD );
    X.insertLocalTiles();

    // Read in the data
    for (int64_t j = 0; j < X.nt(); ++j) {
        for (int64_t i = 0; i < X.mt(); ++i) {
            if (X.tileIsLocal( i, j )) {
                try {
                    auto Atile = X( i, j );
                    auto A = Atile.data();

                    int64_t start_row = i * nb;
                    int64_t start_col = j * nb;
                    int64_t end_row = std::min(start_row + nb, rows);
                    int64_t end_col = std::min(start_col + nb, cols);

                    auto n = end_row - start_row;
                    auto m = end_col - start_col;

                    for (int64_t k = 0; k < n; ++k) {
                        bin_file.seekg(( cols * (start_row + k) + start_col ) * sizeof(float));
                        bin_file.read((char*)&A[k*m], m * sizeof(float));
                    }
                   
                }
                catch (...) {
                    // ignore missing tiles
                }

                // Reset file
                bin_file.clear();
                bin_file.seekg(0, std::ios::beg);
            }
        }
    }

    // Close binary file
    bin_file.close();

    // If transpose, return the transpose
    if (transpose) {
        return slate::transpose( X );
    } else {
        return X;
    }

}

Any help would be appreciated. I'm also happy to open an issue on the GitHub repo if that is better.
Also, if there is a better way to read this data in, that advice would be helpful as well.
I read the data this way because, in production, the matrix is too large to store in memory on any one node.
The file is the output of NumPy writing to a binary file, supposedly in row-major order.
Best, Robert

Robert Langefeld

Jul 5, 2024, 9:16:36 AM7/5/24
to SLATE User, Robert Langefeld
I should probably note that I'm running on a server with Intel processors, and I'm using Spack for software management with the configuration below:
sl...@2023.11.05
bla...@2023.11.05
lapa...@2023.11.05
ope...@4.1.6
intel-on...@2024.0.0
g...@9.4.0
No CUDA

Robert

Mark Gates

Jul 6, 2024, 9:23:21 AM7/6/24
to Robert Langefeld, SLATE User
Hi Robert,

The code snippets are helpful, but without a complete working example, it's hard to replicate the issue.

I noticed you aren't initializing A or temp. Try adding that:
    scalar_t zero = 0;
    set( zero, A );
    set( zero, temp );
as needed after inserting tiles. You can also print out the A or temp matrix before doing the operation.

These lines seem to switch m, n. Usually m is rows, n is columns.
                    auto n = end_row - start_row;
                    auto m = end_col - start_col;
You can get the tile dimensions directly from the tile as Atile.mb() for rows and Atile.nb() for cols.

The code assumes the stride is m. This should be the case since it uses insertLocalTiles, but more robust code would treat that separately, e.g.:
    auto lda = Atile.stride();
    .... A[ k*lda ] ...  // instead of A[ k*m ]
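
Putting those together, zero-filling a local tile by hand (which is what set does for you) would look roughly like this. Within a tile the data is stored column-major, so element (ii, jj) is at A[ ii + jj*lda ]:
    auto Atile = X( i, j );
    auto A   = Atile.data();
    auto lda = Atile.stride();
    for (int64_t jj = 0; jj < Atile.nb(); ++jj)       // tile columns
        for (int64_t ii = 0; ii < Atile.mb(); ++ii)   // tile rows
            A[ ii + jj*lda ] = 0;                     // element (ii, jj) of tile (i, j)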

Mark

Robert Langefeld

Jul 6, 2024, 4:25:27 PM7/6/24
to SLATE User, mga...@icl.utk.edu, SLATE User, Robert Langefeld
I have made those changes in my local code. It seems the problem I was having was that I treated the data within each tile as row major when it is actually column major, so I had the indices mixed up when reading the data in. The matrix multiplication part of my code now appears to work; previously, parts of my matrix looked like they had been read correctly, but it turns out they weren't. Are there examples of code for populating these kinds of distributed matrices from files, since the I/O pattern makes this complicated? (A sketch of my corrected inner read loop is below.)
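
Roughly, the fix amounted to scattering each row of the row-major file into the tile's column-major storage instead of copying it in contiguously. A simplified sketch of the inner loop, using the tile accessors Mark pointed out (nb here is the block size passed to read_binary, and row_buf needs <vector>):
    auto Atile = X( i, j );
    auto A     = Atile.data();
    auto lda   = Atile.stride();
    int64_t mb      = Atile.mb();     // rows in this tile
    int64_t tile_nb = Atile.nb();     // cols in this tile
    int64_t start_row = i * nb;
    int64_t start_col = j * nb;

    std::vector<T> row_buf( tile_nb );
    for (int64_t r = 0; r < mb; ++r) {
        // read the part of file row (start_row + r) that overlaps this tile
        bin_file.seekg(( cols * (start_row + r) + start_col ) * sizeof(T));
        bin_file.read((char*) row_buf.data(), tile_nb * sizeof(T));

        // element (r, c) of the tile lives at A[ r + c*lda ]
        for (int64_t c = 0; c < tile_nb; ++c)
            A[ r + c*lda ] = row_buf[ c ];
    }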

I was also wondering what the standard way to write out data is for slate matrix objects, given that the matrices are distributed?

Robert

Mark Gates

Jul 16, 2024, 1:28:00 PM7/16/24
to Robert Langefeld, SLATE User
Hi Robert,

We haven't dealt with file I/O. For basic debugging, we have slate::print that outputs in a Matlab-compatible text format. We will certainly keep file I/O in mind for future expansion.

Are there particular I/O features you would propose?
Binary and text formats, in both row- and column-major order, would be priorities for future I/O work.

Mark

Robert Langefeld

Jul 16, 2024, 1:50:47 PM7/16/24
to SLATE User, mga...@icl.utk.edu, SLATE User, Robert Langefeld
Thus far, the two issues I ran into were reading data into memory and writing it back out. I spent at least 4x as long implementing that as I did the actual math. Much of that was because tile indices are (row, column), but within a tile the data is column major; that wasn't something I picked up from the documentation. Implementing writing was much harder than the reading functions, since I want to store the whole matrix but need to account for (1) tiles having more than one row and (2) each process potentially owning multiple tiles.

My guess is most applications will need to rely on these kinds of I/O features to use their results, especially when the output is expected to be large. slate::print seemed to take too long for my purposes except for small matrices during debugging; my full runs use matrices that are over 300k x 300k. My implementation may be crude (likely due to my lack of experience programming HPC linear algebra software and MPI), but I implemented matrix writing to binary with MPI I/O, and I'll include that below in case it is useful. I saw there are functions to regather the matrix locally, but that wouldn't work for my application, since the matrix is too large to ever store in memory on a single process. Right now, my functions can read from binary and text files, provided dimensions are also given for the binary file; they can only write to a binary file.

What I was getting at is that the SLATE Users' Guide does give an example of walking through each tile, but it doesn't seem to describe the memory layout used within each tile. Providing examples of I/O for a specified data format would be super helpful, likely more so than implementing it in SLATE itself, since each file format will likely be different and require different specifics. At the least, an example of reading and an example of writing a binary and/or text file storing the matrix in row, column order would be easily adaptable to most use cases. If any of my other code would be helpful, let me know.

Robert

template <typename T>
void write_matrix(slate::Matrix<T> X,
                    std::string& output_file)
{

    // Write X to file
    MPI_Status status;
    MPI_File fh;
    //MPI_File* fh = new MPI_File;
    MPI_Offset offset;
    try {
        int rc = MPI_File_open(MPI_COMM_WORLD,
                    output_file.c_str(),
                    MPI_MODE_CREATE | MPI_MODE_WRONLY,
                    MPI_INFO_NULL,
                    &fh);

        if (rc != MPI_SUCCESS) {
            if (mpi_rank == 0) {
                char error[MPI_MAX_ERROR_STRING+1];
                int errorlen;
                MPI_Error_string(rc, error, &errorlen);
                fprintf(stderr,"opening file: %s\n",error);
                fflush(stderr);
            }

            exit(1);
        }
    }
    catch (const std::exception& e) {
        if (mpi_rank == 0) {
            std::cout << "Exception caught: " << e.what() << std::endl;
        }
       
        return;
    }

    auto rows = X.m();
    auto cols = X.n();
    auto total_row_tiles = X.mt();
    auto total_col_tiles = X.nt();
    int64_t rows_per_tile = X.tileMb(0);
    int64_t tile_row;
    bool isTransposed = X.op() == slate::Op::Trans;

    for (int64_t curr_row = 0; curr_row < rows; ++curr_row){
        tile_row = curr_row / rows_per_tile;
       
        for (int64_t tile_col = 0; tile_col < total_col_tiles; ++tile_col){
            if (X.tileIsLocal( tile_row, tile_col )) {
                auto Atile = X( tile_row, tile_col );
                auto A = Atile.data();
                auto lda = Atile.stride();
                auto m = Atile.mb();
                auto n = Atile.nb();
               
                if (isTransposed) {  
                    MPI_File_write_shared(fh, &A[(curr_row % rows_per_tile)*lda], n, MPI_T, &status);
                } else {
                    // gather one row of the column-major tile into a contiguous buffer
                    auto row_array = new FLOAT_T[n];

                    for (int64_t l = 0; l < n; ++l) {
                        row_array[l] = A[(curr_row % rows_per_tile) + l*lda];
                    }

                    MPI_File_write_shared(fh, row_array, n, MPI_T, &status);
                    delete[] row_array;
                }
            }

            MPI_Barrier(MPI_COMM_WORLD);
        }
    }
   
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_File_close(&fh);

    MPI_Barrier(MPI_COMM_WORLD);
}
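
One refinement I've been considering but haven't tested at scale (a sketch, assuming X.op() == slate::Op::NoTrans and a uniform block size): since each element's global (row, column) determines its byte offset in a row-major file, every rank can write the rows of its own tiles with MPI_File_write_at and drop the shared file pointer and the per-row barriers:

template <typename T>
void write_matrix_direct(slate::Matrix<T>& X,
                         std::string& output_file,
                         MPI_Datatype mpi_t)   // MPI datatype matching T, e.g. MPI_FLOAT
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, output_file.c_str(),
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    int64_t cols = X.n();

    for (int64_t i = 0; i < X.mt(); ++i) {
        for (int64_t j = 0; j < X.nt(); ++j) {
            if (X.tileIsLocal( i, j )) {
                auto tile = X( i, j );
                auto data = tile.data();
                auto lda  = tile.stride();
                int64_t mb   = tile.mb();
                int64_t nb   = tile.nb();
                int64_t row0 = i * X.tileMb(0);   // global row of the tile's first row
                int64_t col0 = j * X.tileNb(0);   // global col of the tile's first col

                std::vector<T> row_buf( nb );
                for (int64_t r = 0; r < mb; ++r) {
                    // gather one row of the column-major tile
                    for (int64_t c = 0; c < nb; ++c)
                        row_buf[ c ] = data[ r + c*lda ];

                    // offset of element (row0 + r, col0) in the row-major file
                    MPI_Offset off = ( cols * (row0 + r) + col0 ) * (MPI_Offset) sizeof(T);
                    MPI_File_write_at(fh, off, row_buf.data(), (int) nb,
                                      mpi_t, MPI_STATUS_IGNORE);
                }
            }
        }
    }

    MPI_File_close(&fh);
}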
