Prroffessorr Fir Kenobi
Aug 18, 2015, 4:16:15 AM
Last night I ran my first GPU C code that actually does something, and did some tests.
This is simple Mandelbrot drawing code. First I
ran a scalar version:
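int mandelbrot_n( float cRe, float cIm, int max_iter )
{
    /* start from z = c, so the first step is already done */
    float re = cRe;
    float im = cIm;
    float rere = re*re;
    float imim = im*im;
    for(int n=1; n<=max_iter; n++)
    {
        /* z = z*z + c, reusing the squares from the previous step */
        im = (re+re)*im + cIm;
        re = rere - imim + cRe;
        rere = re*re;
        imim = im*im;
        if ( (rere + imim) > 4.0 )
            return n;   /* escaped after n steps */
    }
    return 0;   /* never escaped within max_iter steps */
}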
For 256 x 256 pixels with up to 1000 iterations each it takes 90 ms.
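For reference, the 256 x 256 loop around this function can be sketched roughly as below. The helper names `render_scalar` and `time_render_ms` are made up for illustration, and the pixel-to-plane mapping is assumed to be the same one the OpenCL kernel later in this post uses (real part from -2.0 to 1.0, imaginary part from -1.5 to 1.5). Timing with `clock()` is only a rough measure.

```c
#include <time.h>

enum { W = 256, H = 256 };

/* Scalar routine, same logic as in the post. */
int mandelbrot_n(float cRe, float cIm, int max_iter)
{
    float re = cRe, im = cIm;
    float rere = re * re, imim = im * im;
    for (int n = 1; n <= max_iter; n++) {
        im = (re + re) * im + cIm;
        re = rere - imim + cRe;
        rere = re * re;
        imim = im * im;
        if ((rere + imim) > 4.0)
            return n;
    }
    return 0;
}

/* Hypothetical helper: fill a W*H buffer with iteration counts.
   Assumption: same pixel mapping as the OpenCL kernel in this post. */
void render_scalar(int *out, int max_iter)
{
    const float step = 3.0f / W;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            out[y * W + x] = mandelbrot_n(-2.0f + x * step,
                                          -1.5f + y * step, max_iter);
}

/* Hypothetical helper: CPU time of one full render, in milliseconds. */
double time_render_ms(int *out, int max_iter)
{
    clock_t t0 = clock();
    render_scalar(out, max_iter);
    return 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_render_ms(buf, 1000)` on an `int buf[256*256]` should give a number comparable to the one quoted above, although the exact value depends on the compiler and flags.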
Then I made an SSE intrinsics version:
#include <emmintrin.h>   /* SSE2: __m128i, _mm_cvtps_epi32 (also pulls in the SSE header) */

__attribute__((force_align_arg_pointer))
__m128i mandelbrot_n_sse( __m128 cre, __m128 cim, int max_iter )
{
    __m128 re = _mm_setzero_ps();
    __m128 im = _mm_setzero_ps();
    __m128 _1 = _mm_set_ps1(1.);
    __m128 _4 = _mm_set_ps1(4.);
    __m128 iteration_counter = _mm_set_ps1(0.);
    for(int n=0; n<=max_iter; n++)
    {
        __m128 re2 = _mm_mul_ps(re, re);
        __m128 im2 = _mm_mul_ps(im, im);
        __m128 radius2 = _mm_add_ps(re2,im2);
        /* all-ones in every lane that is still inside radius 2 */
        __m128 compare_mask = _mm_cmplt_ps( radius2, _4);
        /* add 1 only to the lanes that have not escaped yet */
        iteration_counter = _mm_add_ps( iteration_counter, _mm_and_ps(compare_mask, _1) );
        /* stop as soon as all four points have escaped */
        if (_mm_movemask_ps(compare_mask)==0) break;
        __m128 ren = _mm_add_ps( _mm_sub_ps(re2, im2), cre);
        __m128 reim = _mm_mul_ps(re, im);
        __m128 imn = _mm_add_ps( _mm_add_ps(reim, reim), cim);
        re = ren;
        im = imn;
    }
    __m128i n = _mm_cvtps_epi32(iteration_counter);
    return n;
}
This runs in 20 ms (more than 4 times faster, I don't know why).
(The processor I ran it on is an old Core 2 E6550 at 2.33 GHz anyway. I have a better machine with AVX support but have not used it here yet.)
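To show how the four-wide routine gets fed, here is a rough sketch of a row loop that packs four neighbouring pixels into one call. `render_row_sse` is a hypothetical name, the pixel mapping is again assumed to match the OpenCL kernel in this post, and the GCC-specific `force_align_arg_pointer` attribute is left out to keep the sketch portable across x86 compilers.

```c
#include <emmintrin.h>

/* Four-wide routine, same logic as in the post. */
__m128i mandelbrot_n_sse(__m128 cre, __m128 cim, int max_iter)
{
    __m128 re = _mm_setzero_ps();
    __m128 im = _mm_setzero_ps();
    __m128 _1 = _mm_set_ps1(1.0f);
    __m128 _4 = _mm_set_ps1(4.0f);
    __m128 iteration_counter = _mm_setzero_ps();
    for (int n = 0; n <= max_iter; n++) {
        __m128 re2 = _mm_mul_ps(re, re);
        __m128 im2 = _mm_mul_ps(im, im);
        __m128 compare_mask = _mm_cmplt_ps(_mm_add_ps(re2, im2), _4);
        iteration_counter = _mm_add_ps(iteration_counter,
                                       _mm_and_ps(compare_mask, _1));
        if (_mm_movemask_ps(compare_mask) == 0) break;
        __m128 reim = _mm_mul_ps(re, im);
        re = _mm_add_ps(_mm_sub_ps(re2, im2), cre);
        im = _mm_add_ps(_mm_add_ps(reim, reim), cim);
    }
    return _mm_cvtps_epi32(iteration_counter);
}

/* Hypothetical helper: compute one 256-pixel row, four pixels per call.
   Assumption: same pixel mapping as the OpenCL kernel in this post.
   Note that a point that never escapes ends up with max_iter + 1 here,
   not 0 as in the scalar version. */
void render_row_sse(int *row, int y, int max_iter)
{
    const float step = 3.0f / 256;
    __m128 cim = _mm_set1_ps(-1.5f + y * step);
    for (int x = 0; x < 256; x += 4) {
        /* _mm_setr_ps puts its first argument in lane 0, i.e. pixel x */
        __m128 cre = _mm_setr_ps(-2.0f + (x + 0) * step,
                                 -2.0f + (x + 1) * step,
                                 -2.0f + (x + 2) * step,
                                 -2.0f + (x + 3) * step);
        /* unaligned store, so row does not need 16-byte alignment */
        _mm_storeu_si128((__m128i *)(row + x),
                         mandelbrot_n_sse(cre, cim, max_iter));
    }
}
```

Because the loop only breaks when all four lanes have escaped, each group of four pixels costs as much as its slowest pixel.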
Then I wrote the OpenCL code:
"__kernel void square( \n" \
"    __global int* input, \n" \
"    __global int* output, \n" \
"    const unsigned int count) \n" \
"{ \n" \
"    int i = get_global_id(0); \n" \
"    if(i < count) \n" \
"    { \n" \
"        int x = i%256; \n" \
"        // if(x>=256) return; \n" \
"        int y = i/256; \n" \
"        // if(y>=256) return; \n" \
"        float cRe = -0.5 + -1.5 + x/256.*3.; \n" \
"        float cIm = 0.0 + -1.5 + y/256.*3.; \n" \
"        float re = 0; \n" \
"        float im = 0; \n" \
"        int n = 0; \n" \
"        for( n=0; n<=1000; n++) { \n" \
"            if( re * re + im * im > 4.0 ) { output[256*y+x] = n + 256*n + 256*256*n; return;} \n" \
"            float re_n = re * re - im * im + cRe; \n" \
"            float im_n = 2 * re * im + cIm; \n" \
"            re = re_n; \n" \
"            im = im_n; \n" \
"        } \n" \
"        output[256*y+x] = 250<<8; \n" \
"    } \n" \
"} \n" \
"\n";
This works without a problem and runs in 7 ms
(I have a weak GPU, a GT 610).
How can I optimise this GPU version? Is it common to write such scalar code on a GPU, or is there some way of writing something like SSE intrinsics here? Or some other kind of optimisation?
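One small thing that might be worth trying first (an assumption, not something measured here): the kernel uses unsuffixed literals such as `4.0` and `256.`, which are double precision in C, so part of the arithmetic may be done in double, and that tends to be slow on low-end GPUs. Below is the per-pixel body rewritten with float literals only, as plain C so it can be checked on the CPU. `mandel_pixel` is a hypothetical name and the function returns the value the kernel would store in `output`.

```c
/* Per-pixel body of the kernel from the post, using float literals only.
   Assumption: 256 x 256 image, 1000 iterations, same colour encoding. */
int mandel_pixel(int i)
{
    int x = i % 256;
    int y = i / 256;
    float cRe = -2.0f + (float)x * (3.0f / 256.0f);
    float cIm = -1.5f + (float)y * (3.0f / 256.0f);
    float re = 0.0f;
    float im = 0.0f;
    for (int n = 0; n <= 1000; n++) {
        /* compute the squares once and reuse them */
        float re2 = re * re;
        float im2 = im * im;
        if (re2 + im2 > 4.0f)
            return n + 256 * n + 256 * 256 * n;   /* grey level from n */
        im = 2.0f * re * im + cIm;
        re = re2 - im2 + cRe;
    }
    return 250 << 8;   /* colour for points that never escaped */
}
```

The same spelling should work inside the kernel string. The `-cl-single-precision-constant` build option is another way to get a similar effect without touching the source. Whether either makes a visible difference on a GT 610 would have to be measured.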
(Anyway, I must say I don't fully agree with the critics of GPU/OpenCL coding: this works
easily and fine, at least for some cases. What is especially good is that there is not much slowdown in
launching the GPU work from the CPU and getting the results back;
it seems I can run it within a 1 millisecond
window, so that is very fine. I believe that with harder code it may get slower, but I also believe that with a better card I could go below 7 ms.)