Hi Yeonjun,
What is your final goal in extracting dB from dA? Are you using dB in a matrix-vector multiply or a factorization or some other operation? Depending on the use case, there may be simpler methods that don't ever extract dB from dA.
If you're just appending one column dx to matrix dB, then magma_zcopy can copy dx, assuming there's space in dB to hold it. See use below.
For extracting dB from dA, I don't think there's a ready-made routine in MAGMA, but one shouldn't be too difficult for you to write.
A _slow_ version would just use magma_zcopy to copy columns of A to B. Something like (untested!):
// UNTESTED!
// dA is m x n, dB is m x k, columns is k vector with indices to extract.
// T is any data type.
extract( int m, int n, int k, T* dA, int lda, T* dB, int ldb, int* columns, queue )
{
assert( k <= n );
for (int j = 0; j < k; ++j) {
int j2 = columns[ j ];
assert( 0 <= j2 && j2 < n );
magma_zcopy( m, dA[ j2*lda ], 1, dB[ j*ldb ], 1, queue );
}
}
A much faster version would be to make a modified copy of magmablas/
zlacpy.cu that does the same thing. Something like (again, untested!):
// UNTESTED!
// Divides matrix into ceil( m/BLK_X ) x ceil( n/BLK_Y ) blocks.
// Each block has BLK_X threads.
// Each thread loops across one row, updating BLK_Y entries.
// Cf. magmablas/zlacpy.cu zlacpy_full_device. static __device__
void extract_device(
int m, int n, int k,
const magmaDoubleComplex *dA, int ldda,
magmaDoubleComplex *dB, int lddb,
int* columns )
{
int i = blockIdx.x*BLK_X + threadIdx.x;
int jj = blockIdx.y*BLK_Y;
// do only rows inside matrix B.
if (i < m) {
// Thread copies row i, columns [ jj, ..., min( k, jj + BLK_Y ) ].
dB += i + jj*lddb;
columns += jj;
for (int j=0; j < BLK_Y && jj + j < k; ++j) {
int j2 = columns[ j ];
dB[ j*lddb ] = dA[ j2*ldda ];
}
}
}
Needs a __global__ function (cf. zlacpy_full_kernel) and the CPU driver (cf. magmablas_zlacpy) to call the above device routine.
Mark