I am playing around with SIMD SSE/SSE2 instructions in a mex file, and I am wondering whether MATLAB's allocation of data is aligned or not. I will probably write a test for that, but if anybody has a clue...
In case it is not, I am wondering:
You MATLAB guys have used SIMD instructions at least in imfilter of the Image Processing Toolbox. That means you handle data alignment somewhere. I am wondering: how?
1) natively when creating the array
2) by copy.. (ugly, isn't it ?)
Thx in advance....
Mathias
malloc, calloc, and realloc are required by the C standard to return a pointer that is correctly aligned for any data type. I can't imagine mxMalloc, mxCalloc, or mxRealloc doing anything else.
James Tursa
I think calloc and malloc provide allocation that is aligned for regular C use, but the point when using SIMD instructions is that we need 16-byte aligned data (apparently it makes memory access by SIMD instructions faster).
More specifically, the _aligned_malloc and _aligned_free functions are to be used (or their SIMD counterparts _mm_malloc and _mm_free). These functions are used
to control the alignment of data in memory - look here to see what's inside:
http://williamchan.ca/portfolio/c/alignedmalloc/
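For readers who cannot follow the link: a common way to build such a pair of functions on top of plain malloc is to over-allocate, round the address up, and stash the original pointer just below the aligned block so it can be freed later. The sketch below is only an illustration of that idea under those assumptions; the names sketch_aligned_malloc and sketch_aligned_free are hypothetical and are not the MSVC, Intel, or MATLAB implementations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers for illustration only. */

/* Allocate `size` bytes at an address that is a multiple of `alignment`
 * (which must be a power of two). Returns NULL on failure. */
void *sketch_aligned_malloc(size_t size, size_t alignment)
{
    size_t extra;
    void *raw;
    uintptr_t start, aligned;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Extra room: worst-case padding plus a slot for the raw pointer. */
    extra = alignment - 1 + sizeof(void *);
    if (size > SIZE_MAX - extra)
        return NULL;

    raw = malloc(size + extra);
    if (raw == NULL)
        return NULL;

    /* Skip the pointer slot, then round up to the next boundary. */
    start = (uintptr_t)raw + sizeof(void *);
    aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);

    /* Save the raw pointer in the bytes just below the aligned block. */
    memcpy((unsigned char *)aligned - sizeof(void *), &raw, sizeof(void *));
    return (void *)aligned;
}

/* Release a block obtained from sketch_aligned_malloc. */
void sketch_aligned_free(void *p)
{
    void *raw;

    if (p == NULL)
        return;
    memcpy(&raw, (unsigned char *)p - sizeof(void *), sizeof(void *));
    free(raw);
}
```

Note that the pointer handed to the caller is not the one malloc returned, which is why such a block must only ever be released by the matching free function.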
I was therefore wondering whether MATLAB's matrix allocation (mxCreateSomething) uses such an allocation and/or whether there is a hidden feature that allows this without the ugly way:
1) allocate Array (possibly a scalar)
2) allocate 16 byte aligned array
3) free the pointer of first array
4) put the pointer of the second in the array (possibly modify other tags relating to size)
Moreover, the problem I have with such an approach concerns what happens when I leave the mex function, perform operations in MATLAB and call the mex function again, especially regarding the lazy copy behavior.
X = mexAllocateAligned(n,m); % this is a mex file performing the allocation
X = rand(size(X)); % playing around in matlab workspace
X = fun(X); % calling an M file on the data where fun is :
function X=fun(X)
X = mexDoFastSIMDProcessing(X); % applying the fast processing.
"James Tursa" <aclassyguy...@hotmail.com> wrote in message <gs5dnd$kha$1...@fred.mathworks.com>...
Have you finally found a way to get 128-bit alignment in mex memory allocation?
regards,
Alex
"Mathias Ortner" wrote in message <gs46u5$a5g$1...@fred.mathworks.com>...
I can think of a way to do it, but it is very crude:
1) Create a variable using the standard API functions that is slightly larger than needed (e.g., 15 bytes larger).
2) Create a shared data copy of that and save it inside the mex routine in a global variable that is made permanent.
3) Create an address from the variable address that is aligned how you want it.
4) Create a new variable of the size you want, but manually attach the aligned address.
5) Create a shared data copy of that and save it inside the mex routine in a global variable that is made permanent.
6) Lock the mex routine.
7) Put in a mexAtExit function that cleans all of this up once all of the workspace variables that have these adjusted data pointers have been cleared from memory.
Very messy but it might work.
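The address arithmetic behind steps 1 and 3 can be sketched in plain C. This covers only the rounding, not the mxArray bookkeeping of the other steps, and the helper names align_up_16 and pad_to_16 are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes to skip from p to reach the
 * next 16-byte boundary. The result is always in the range 0..15,
 * which is why 15 spare bytes are enough in step 1. */
size_t pad_to_16(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    return (size_t)((16u - (addr & 15u)) & 15u);
}

/* Hypothetical helper: first address >= p that is a multiple of 16. */
void *align_up_16(void *p)
{
    return (unsigned char *)p + pad_to_16(p);
}
```

For data that already starts on an 8-byte boundary the padding is either 0 or 8 bytes, so one spare double would do in that case.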
James Tursa
> I can think of a way to do it, but it is very crude:
> ...
I share your opinion, that this is crude.
It will be hard to preserve the speed advantage of 128-bit aligned memory access after this complicated procedure.
It is 2011 now. 99% of the computers that run MATLAB understand SSE commands, and the acceleration would be very impressive. 128-bit aligned memory allocation is a really important feature - please, TMW, include this soon.
The OP asked about SSE acceleration in IMFILTER... I'm still very surprised that a naive C implementation needs just 40% of the time used by the built-in FILTER -- at least in MATLAB 2009a and for a vector. Now imagine an efficient SSE4 method operating on aligned memory...
Kind regards, Jan
Does your compiler support a similar function?
http://www.delorie.com/gnu/docs/glibc/libc_31.html
http://msdn.microsoft.com/en-us/library/8z34s9c6%28v=vs.71%29.aspx
Bruno
> Does your compiler supports similar function?
> http://www.delorie.com/gnu/docs/glibc/libc_31.html
> http://msdn.microsoft.com/en-us/library/8z34s9c6%28v=vs.71%29.aspx
Yes, e.g. the MSVC and Intel compilers support 128-bit aligned memory allocation. What is needed now is an efficient way to create a MATLAB variable inside a mex file whose mxGetData pointer is aligned as well. Even some MATLAB functions, such as a deep data copy, profit from 128-bit aligned arrays, but this matters more in the 32-bit version (to my surprise - perhaps the 64-bit versions already use the wider alignment).
For all non-experts: The SSE units of the processor perform operations on multiple data at the same time, e.g. addition or multiplication of 4 SINGLEs or 2 DOUBLEs. These operations are much faster if the memory address of the data is a multiple of 16, but MATLAB (most, some, all versions?) aligns e.g. DOUBLEs at multiples of 8 in memory. Therefore in about 50% of the cases (arrays that start on an 8-byte but not on a 16-byte boundary) remarkably faster processing would be possible. The 128-bit alignment has the drawback that memory is allocated in blocks of 16 bytes, such that e.g. a scalar DOUBLE wastes 8 bytes.
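To make the "2 DOUBLEs at a time" point concrete, here is a minimal sketch using the SSE2 intrinsics from emmintrin.h. The function name add_doubles and the macro SKETCH_HAVE_SSE2 are hypothetical. The aligned load/store intrinsics require 16-byte aligned addresses, so the sketch checks the pointers and otherwise falls back to the unaligned variants; on a build without SSE2 only the scalar loop runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Nonzero if p is a multiple of 16. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* out[i] = a[i] + b[i] for i = 0..n-1 (hypothetical example routine). */
void add_doubles(double *out, const double *a, const double *b, size_t n)
{
    size_t i = 0;

#if SKETCH_HAVE_SSE2
    if (is_aligned_16(out) && is_aligned_16(a) && is_aligned_16(b)) {
        /* All three arrays are aligned: use aligned loads and stores. */
        for (; i + 2 <= n; i += 2)
            _mm_store_pd(out + i,
                         _mm_add_pd(_mm_load_pd(a + i), _mm_load_pd(b + i)));
    } else {
        /* At least one array is not aligned: use the unaligned variants. */
        for (; i + 2 <= n; i += 2)
            _mm_storeu_pd(out + i,
                          _mm_add_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i)));
    }
#endif

    /* Scalar loop for the leftover element (or for everything without SSE2). */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

How large the gap between the two branches is depends on the processor, so it is worth timing both on the target machine.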
In a stand-alone application we can use mxSetAllocFcn to plug in the _aligned_malloc functions, but not in MATLAB directly, as far as I know.
Kind regards, Jan
Jan,
I couldn't find any reference of mxSetAllocFcn, could you provide a pointer?
Bruno
> zerosalign16(1)
It works in some other cases. Unless there is a trivial bug that escapes me, it seems that using _aligned_malloc is not a good idea after all.
Bruno
-----------------------
/* zerosalign16.c MATLAB mex file, for MSVS only:
 *
 * Purpose: similar to ZEROS(...'double')
 * but data Pr has its address aligned to 16 bytes
 *
 * mex zerosalign16.c on 32-bit platform
 * mex -largeArrayDims zerosalign16.c on 64-bit platform
 */
#include <malloc.h>
#include <string.h> /* memset */
#include "mex.h"
#include "matrix.h"
#define A plhs[0]
#define DIM prhs[0]
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    mwSize *dims;
    double *dimPr, *APr;
    mwSize ndim, i, n;
    size_t alignment;
    A = mxCreateDoubleMatrix(0, 0, mxREAL);
    if (nrhs > 1) {
        /* zerosalign16(m, n, ...): one scalar per dimension */
        ndim = nrhs;
        dims = (mwSize*)mxMalloc(sizeof(mwSize)*ndim);
        for (i=0; i<ndim; i++)
            dims[i] = (mwSize)mxGetScalar(prhs[i]);
    }
    else if (nrhs == 1) {
        ndim = (mwSize)(mxGetM(DIM)*mxGetN(DIM));
        dimPr = mxGetPr(DIM);
        if (ndim == 1) {
            /* zerosalign16(n): square n-by-n matrix */
            ndim = 2;
            dims = (mwSize*)mxMalloc(sizeof(mwSize)*2);
            dims[0] = dims[1] = (mwSize)dimPr[0];
        }
        else {
            /* zerosalign16([m n ...]): size vector */
            dims = (mwSize*)mxMalloc(sizeof(mwSize)*ndim);
            for (i=0; i<ndim; i++)
                dims[i] = (mwSize)dimPr[i];
        } /* ndim == 1 */
    } /* nrhs == 1 */
    else
    {
        mexErrMsgTxt("Missing size");
        return;
    }
    /* Total number of elements */
    n = 1;
    for (i=0; i<ndim; i++) n *= dims[i];
    /* Experimental parameter: Set to 0 for non-alignment,
     * 16 otherwise. WARNING: IT BOMBS */
    alignment = 16;
    if (alignment==0)
        APr = (double*)mxMalloc(sizeof(double)*n);
    else
    {
        /* THIS BOMBS SOMETIMES */
        /* http://msdn.microsoft.com/fr-fr/library/8z34s9c6%28v=vs.80%29.aspx */
        APr = (double*)_aligned_malloc(sizeof(double)*n, 16);
    }
    if (n>0 && !APr)
    {
        mxFree(dims);
        mexErrMsgTxt("Out of memory\n");
    }
    else
    {
        mxSetDimensions(A, dims, ndim);
        memset(APr, 0, sizeof(double)*n);
        mxSetPr(A, APr);
        mxFree(dims);
    }
}
> > In a stand-alone version we can use mxSetAllocFcn ...
> I couldn't find any reference of mxSetAllocFcn, could you provide a pointer?
Sorry, a typo. Better: mxSetAllocFcns *with* a trailing s.
Google's new engine is dull. The edit-distance between the terms is tiny, but Google does not find anything. I do not find the method for a more fuzzy search.
Kind regards, Jan
> A = mxCreateDoubleMatrix(0, 0, mxREAL);
> APr = (double*)_aligned_malloc(sizeof(double)*n, 16);
> mxSetDimensions(A, dims, ndim);
> mxSetPr(A, APr);
Bold. This gets memory from the compiler's memory manager (which means the Windows function for MSVC), but later MATLAB's own manager uses mxFree for releasing. This must have cruel effects.
James suggested allocating the memory persistently in the mex function and using mexLock to protect the pointer. Finally, another call to the function is responsible for releasing the memory. But unfortunately MATLAB can create a deep copy of the aligned array, and the copy need not be aligned, so the complicated strategy loses its power.
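Since a copy may land on an 8-byte boundary, a mex routine cannot simply assume that its input is 16-byte aligned. One workaround that needs no special allocation is to process a scalar "head" until the address is aligned and to use aligned SSE2 operations from there on. The sketch below assumes the data is at least 8-byte aligned (so the head is at most one double); the name scale_doubles and the macro SKETCH_HAVE_SSE2 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* x[i] *= s for i = 0..n-1, in place (hypothetical example routine). */
void scale_doubles(double *x, size_t n, double s)
{
    size_t i = 0;

    /* Head: if x is not on a 16-byte boundary, handle one element
     * with scalar code so that x + 1 is the next candidate. */
    if (n > 0 && ((uintptr_t)x & 15u) != 0) {
        x[0] *= s;
        i = 1;
    }

#if SKETCH_HAVE_SSE2
    /* Body: only taken if x + i really is 16-byte aligned, which
     * holds whenever x itself was at least 8-byte aligned. */
    if (((uintptr_t)(x + i) & 15u) == 0) {
        __m128d vs = _mm_set1_pd(s);
        for (; i + 2 <= n; i += 2)
            _mm_store_pd(x + i, _mm_mul_pd(_mm_load_pd(x + i), vs));
    }
#endif

    /* Tail: leftover element, or everything if the body was skipped. */
    for (; i < n; i++)
        x[i] *= s;
}
```

With several arrays that are misaligned relative to each other this trick only aligns one of them, so the others still need unaligned loads.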
Kind regards, Jan
Bruno
I've been using SSE with Matlab for a while. I've noticed that all memory in Matlab mex files seems to be 16 byte aligned by default, whether statically or dynamically allocated. I have no proof that this is the case, just empirical evidence:
For example, the following code in a mex file:
float *A = mxMalloc(100*sizeof(float));
if( (size_t)A&15 ) mexErrMsgTxt("Error: expected 16 byte aligned data.");
never seems to throw an error (meaning A is 16 byte aligned). Larger alignment values fail.
This doesn't seem to be just for mxMalloc or floats, also the following:
unsigned int A[1];
if( (size_t)A&15 ) mexErrMsgTxt("Error: expected 16 byte aligned data.");
again never seems to throw an error.
Since for many SSE purposes 16 byte alignment is sufficient, I've been able to use SSE with mex files so far without any problems. I've tested in both 32 and 64 bit Matlab on Windows (using R2010a) and 64 bit Linux (using R2010b). I'm not sure if this would work in older versions of Matlab...
Does this work for other people? Does anyone foresee any problems with this approach?
Piotr
>
> Does this work for other people? Does anyone foresee any problems with this approach?
>
You are right, it does work for me up to 1e7 random allocations.
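A stress test of this kind could look roughly like the sketch below. It is written against plain malloc and free as stand-ins so that it compiles outside MATLAB; inside a mex file one would swap in mxMalloc and mxFree. The function name count_misaligned and its parameters are hypothetical, and a count of zero is only empirical evidence, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_LIVE 64 /* number of blocks kept allocated at once */

/* Allocate `trials` blocks of random size (1..max_size bytes) and
 * return how many of them did not start on a 16-byte boundary. */
size_t count_misaligned(size_t trials, size_t max_size, unsigned seed)
{
    void *live[SKETCH_LIVE] = {0};
    size_t misaligned = 0, t, k;

    assert(max_size > 0);
    srand(seed);

    for (t = 0; t < trials; t++) {
        size_t slot = t % SKETCH_LIVE;
        size_t size = 1 + (size_t)rand() % max_size;

        /* Recycle the slot so the allocator sees a mix of live blocks. */
        free(live[slot]);
        live[slot] = malloc(size);
        if (live[slot] == NULL)
            continue; /* out of memory: skip this trial */

        if (((uintptr_t)live[slot] & 15u) != 0)
            misaligned++;
    }

    for (k = 0; k < SKETCH_LIVE; k++)
        free(live[k]);
    return misaligned;
}
```

Calling, for instance, count_misaligned(10000000, 4096, 1u) mimics the 1e7 random allocations mentioned above; the result depends on the allocator and platform.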
Bruno