Hello!
I had a bit of free time in Toukyou yesterday and this morning
and have implemented Takayuki's challenge, in MetaOCaml. I don't have
(yet) the github account. I put the code on my web site:
http://okmij.org/ftp/Computation/staging/stencil.ml
Please feel free to pull it to GitHub.
This is the complete code, requiring only MetaOCaml and the library of
delimited control delimcc (both are available from my web site). The
code develops a DSL for stencil computations from scratch, treating
a stencil as a cache. The size of the stencil is definable by the
user. Setting a stencil too big or too small will not affect the
correctness but may affect the performance. The right stencil sizes
could be found empirically or geometrically.
The file develops the method gradually and has lots of
generated sample code and numerical tests. Let me quote the
naive version of the double-time-step fused code generator
let t21 =
.<fun a b ->
.~(loop0 2 0 n .<a>. .<b>. (fun a ->
let w1 j = a j *@ a (j+1) in
let wm j = a j -@ w1 j +@ w1 (j-1) in
let w2 j = wm j *@ wm (j+1) in
wm 0 -@ w2 0 +@ w2 (-1)))
>.
;;
and the code with stencil optimizations
let t25 =
.<fun a b ->
.~(
with_stencil (-2) 1 (fun stU_w1 ->
with_stencil (-1) 1 (fun stU_wm ->
with_stencil (-1) 0 (fun stU_w2 ->
loop (-2) 2 0 n .<a>. .<b>. (fun a q p ->
let w1 = stencil_memo stU_w1 q p (fun j ->
a j *@ a (j+1)) in
let wm = stencil_memo stU_wm q p (fun j ->
a j -@ w1 j +@ w1 (j-1)) in
let w2 = stencil_memo stU_w2 q p (fun j ->
wm j *@ wm (j+1)) in
wm 0 -@ w2 0 +@ w2 (-1))))))
>.
;;
The difference is the introduction of stencil_memo statements, which
apply to real arrays and to `virtual arrays' (that is, any
int -> float code computations). Here is the body of the generated
loop (preceded and followed by long prologue and epilogue taking
care of boundary conditions)
let rec floop_29 =
fun i_30 ->
if (i_30 < 7) then begin
let t_31 = a_1.(i_30 + 3) in
let t_32 = ((! s_7) *. t_31) in
let t_33 = (((! s_7) -. t_32) +. (! s_12)) in
let t_34 = ((! s_16) *. t_33) in
b_2.(i_30) <- (((! s_15) -. (! s_19)) +. (! s_18));
(s_18 := (! s_19));
(s_19 := t_34);
(s_14 := (! s_15));
(s_15 := (! s_16));
(s_16 := t_33);
(s_9 := (! s_10));
(s_10 := (! s_11));
(s_11 := (! s_12));
(s_12 := t_32);
(s_3 := (! s_4));
(s_4 := (! s_5));
(s_5 := (! s_6));
(s_6 := (! s_7));
(s_7 := t_31);
(floop_29 (i_30 + 1))
end in
(floop_29 3);
As you can see, each iteration reads a global array once, writes a
global array once, and does 6 FLOPS (in OCaml, floating-point
operations are written *., +, and -.). So B/F is 16/6, which seems
optimal for two-time-steps. We can unroll three time-steps if desired.
BTW, I was wondering about one line in the code Takayuki sent earlier
(on May 20):
> void program3(vector<double> input, vector<double>& output) {
> int i = 2;
> double a0 = input[i-2];
> double a1 = input[i-1];
> double a2 = input[i];
> ...
> output[i+2] = e0;
I wonder should the last line read
output[i] = e0
The boundary conditions are really tough to get right by hand. The
code generator really helps here.
I like this example very much indeed. It indeed requires
code generation and effects, in particular, effects crossing the
binders. I did make a few mistakes that resulted in scope extrusion;
one of them was quite well hidden.
Cheers,
Oleg