Re: Overlapped tiling and split tiling

168 views
Skip to first unread message

ol...@okmij.org

unread,
May 30, 2012, 12:19:57 AM5/30/12
to mura...@gmail.com, Albert...@inria.fr, stag...@googlegroups.com

Hello!

I had a bit of free time in Toukyou yesterday and this morning
and have implemented Takayuki's challenge, in MetaOCaml. I don't have
(yet) the github account. I put the code on my web site:

http://okmij.org/ftp/Computation/staging/stencil.ml

Please feel free to pull it to GitHub.

This is the complete code, requiring only MetaOCaml and the library of
delimited control delimcc (both are available from my web site). The
code develops a DSL for stencil computations from scratch, treating
a stencil as a cache. The size of the stencil is definable by the
user. Setting a stencil too big or too small will not affect the
correctness but may affect the performance. The right stencil sizes
could be found empirically or geometrically.

The file develops the method gradually and has lots of
generated sample code and numerical tests. Let me quote the
naive version of the double-time-step fused code generator

let t21 =
.<fun a b ->
.~(loop0 2 0 n .<a>. .<b>. (fun a ->
let w1 j = a j *@ a (j+1) in
let wm j = a j -@ w1 j +@ w1 (j-1) in
let w2 j = wm j *@ wm (j+1) in
wm 0 -@ w2 0 +@ w2 (-1)))
>.
;;

and the code with stencil optimizations

let t25 =
.<fun a b ->
.~(
with_stencil (-2) 1 (fun stU_w1 ->
with_stencil (-1) 1 (fun stU_wm ->
with_stencil (-1) 0 (fun stU_w2 ->
loop (-2) 2 0 n .<a>. .<b>. (fun a q p ->
let w1 = stencil_memo stU_w1 q p (fun j ->
a j *@ a (j+1)) in
let wm = stencil_memo stU_wm q p (fun j ->
a j -@ w1 j +@ w1 (j-1)) in
let w2 = stencil_memo stU_w2 q p (fun j ->
wm j *@ wm (j+1)) in
wm 0 -@ w2 0 +@ w2 (-1))))))
>.
;;

The difference is the introduction of stencil_memo statements, which
apply to real arrays and to `virtual arrays' (that is, any
int -> float code computations). Here is the body of the generated
loop (preceded and followed by long prologue and epilogue taking
care of boundary conditions)

let rec floop_29 =
fun i_30 ->
if (i_30 < 7) then begin
let t_31 = a_1.(i_30 + 3) in
let t_32 = ((! s_7) *. t_31) in
let t_33 = (((! s_7) -. t_32) +. (! s_12)) in
let t_34 = ((! s_16) *. t_33) in
b_2.(i_30) <- (((! s_15) -. (! s_19)) +. (! s_18));
(s_18 := (! s_19));
(s_19 := t_34);
(s_14 := (! s_15));
(s_15 := (! s_16));
(s_16 := t_33);
(s_9 := (! s_10));
(s_10 := (! s_11));
(s_11 := (! s_12));
(s_12 := t_32);
(s_3 := (! s_4));
(s_4 := (! s_5));
(s_5 := (! s_6));
(s_6 := (! s_7));
(s_7 := t_31);
(floop_29 (i_30 + 1))
end in
(floop_29 3);

As you can see, each iteration reads a global array once, writes a
global array once, and does 6 FLOPS (in OCaml, floating-point
operations are written *., +, and -.). So B/F is 16/6, which seems
optimal for two-time-steps. We can unroll three time-steps if desired.

BTW, I was wondering about one line in the code Takayuki sent earlier
(on May 20):

> void program3(vector<double> input, vector<double>& output) {
> int i = 2;
> double a0 = input[i-2];
> double a1 = input[i-1];
> double a2 = input[i];
> ...
> output[i+2] = e0;

I wonder should the last line read
output[i] = e0

The boundary conditions are really tough to get right by hand. The
code generator really helps here.

I like this example very much indeed. It indeed requires
code generation and effects, in particular, effects crossing the
binders. I did make a few mistakes that resulted in scope extrusion;
one of them was quite well hidden.

Cheers,
Oleg

Takayuki Muranushi

unread,
Jun 16, 2012, 5:51:20 AM6/16/12
to ol...@okmij.org, stag...@googlegroups.com
Dear Oleg, and everyone,

I am sorry for the very lazy response. Due to other tasks I had to do,
and because I have never touched Ocaml before. I barely read the
concepts of virtual arrays and memoizing the array access, but I'd
definitely like to apply them into my research.

Okay, so I have installed MetaOCaml and delimcc, and tried to run
stencil.ml, then got some error:

$ ocamlc stencil.ml
File "stencil.ml", line 298, characters 41-208:
Warning P: this pattern-matching is not exhaustive.
Here is an example of a value that is not matched:
[]
Error while linking stencil.cmo: Reference to undefined global `Trx'
$ ocaml stencil.ml
Reference to undefined global `Trx'

I guess it's a trivial one and it's better to ask an advice of an expert...


P.S. your suggestion about c++ sample is right! The code had been so
correcteted on the github version.

Thank you in advance,


Takayuki


2012/5/30 <ol...@okmij.org>:

Baris Aktemur

unread,
Jun 16, 2012, 6:14:26 PM6/16/12
to stag...@googlegroups.com
Hi Takayuki,

MetaOCaml's bin folder contains ocaml and ocamlc as well as metaocaml and metaocamlc. The problem you faced most probably occurred because you used the ocaml executable instead of metaocaml.

-Baris Aktemur

ol...@okmij.org

unread,
Jun 16, 2012, 9:00:07 PM6/16/12
to mura...@gmail.com, stag...@googlegroups.com

Hello!

> I am sorry for the very lazy response.
Well, there has been no demand <sorry, I couldn't resist>.


> Okay, so I have installed MetaOCaml and delimcc, and tried to run
> stencil.ml, then got some error:
>
> $ ocamlc stencil.ml

If you installed MetaOCaml, you should also have executable
metaocaml (MetaOCaml top level). If you intalled delimcc, there should be
a directory caml-shift somewhere. Then you can do

/usr/local/src/ometa/bin $ LD_LIBRARY_PATH=../caml-shift ./metaocaml ../caml-shift/delimcc.cma
MetaOCaml version 3.09.1 alpha 030

# #directory "../caml-shift";;
# #use "/home/oleg/papers/MetaFX/meta-shift/stencil.ml";;

(the first has symbol is the ocaml prompt, but the second hash symbol
in #directory and #use are the part of the ocaml directive).

You can omit `#directory' directive if your current directory is
caml-shift.

If you have questions about MetaOCaml or delimcc or stencil, by
all means, please do ask.

Cheers,
Oleg

Takayuki Muranushi

unread,
Jun 21, 2012, 10:51:56 AM6/21/12
to ol...@okmij.org, stag...@googlegroups.com
Thank you Oleg & Baris, I have delimcc installed under Now the next
error message is like this:

[nushio@myhost ml]$ LD_LIBRARY_PATH=caml-shift/ metaocaml
./caml-shift/delimcc.cma
Cannot load required shared library dlldelimcc.
Reason: /usr/local/lib/ocaml/stublibs/dlldelimcc.so: undefined symbol:
caml_ref_table.
[nushio@myhost ml]$ LD_LIBRARY_PATH=/usr/local/lib/ocaml/stublibs/
metaocaml ./caml-shift/delimcc.cma
Cannot load required shared library dlldelimcc.
Reason: /usr/local/lib/ocaml/stublibs/dlldelimcc.so: undefined symbol:
caml_ref_table.
[nushio@myhost ml]$ cd caml-shift
[nushio@myhost caml-shift]$ metaocaml delimcc.cma
Cannot load required shared library dlldelimcc.
Reason: ./dlldelimcc.so: undefined symbol: caml_ref_table.

and even Prof. google doesn't know the message!
https://www.google.co.jp/search?q=undefined%20symbol%20caml_ref_table

Will you tell me what should I do next?


2012/6/17 <ol...@okmij.org>:
--
Takayuki MURANUSHI
The Hakubi Center for Advanced Research, Kyoto University
http://www.hakubi.kyoto-u.ac.jp/02_mem/h22/muranushi.html

ol...@okmij.org

unread,
Jun 22, 2012, 1:49:59 PM6/22/12
to mura...@gmail.com, stag...@googlegroups.com

Hello!

> [nushio@myhost caml-shift]$ metaocaml delimcc.cma
> Cannot load required shared library dlldelimcc.
> Reason: ./dlldelimcc.so: undefined symbol: caml_ref_table.

It is very likely that the problem is version mismatch. Almost every
version of OCaml introduces a number of changes to its internal APIs;
old functions are renamed, new functions are added. Therefore, almost
any library that has a C component (e.g., standard libraries num,
unix, etc., and also delimcc) requires recompilation. Chances are you
have OCaml 3.12 or similar on you computer. MetaOCaml is based on an
older version of OCaml. So, to use delimcc against MetaOCaml, it has
to be compiled against MetaOCaml. If you look inside
caml-shift/Makefile, you will see a set of definitions like

#OCAMLINCLUDES=../../byterun
#OCAMLINCLUDES=./ocaml-byterun-3.09
#OCAMLINCLUDES=./ocaml-byterun-3.10
OCAMLINCLUDES=./ocaml-byterun-3.11
# Directory of your OCAML executables...
#OCAMLBIN=../../bin
OCAMLBIN := $(shell dirname `which ocaml`)
OCAMLC=$(OCAMLBIN)/ocamlc
LIBDIR := $(shell ocamlc -where)

which will compile delimcc against OCaml installed on your system.
Chances are MetaOCaml is installed somewhere separately. So, you need
to set the paths to point to MetaOCaml. If MetaOCaml installation
directory is HOME/ometa/, it is enough to say

OCAMLINCLUDES=HOME/ometa/byterun
#OCAMLINCLUDES=./ocaml-byterun-3.09
#OCAMLINCLUDES=./ocaml-byterun-3.10
OCAMLBIN=HOME/ometa/bin
OCAMLC=$(OCAMLBIN)/ocamlc
LIBDIR=`$(OCAMLC) -where`

Then you need make clean; make; make testd0

Sorry for the trouble.
Oleg

Takayuki Muranushi

unread,
Jun 23, 2012, 5:22:50 AM6/23/12
to ol...@okmij.org, stag...@googlegroups.com
Thank you Oleg, my ocaml and metaocaml was 3.09, and I got the
stencil.ml working!

I wanted to inform you of this as soon as possible. Let me then
gradually understand the content.

Best,

Takayuki

2012/6/23 <ol...@okmij.org>:

Tiark Rompf

unread,
Sep 4, 2012, 11:56:45 AM9/4/12
to stag...@googlegroups.com, mura...@gmail.com, ol...@okmij.org
Hi all,

I finally found the time to implement Takayuki's stencil computation challenge using Scala and LMS.

The approach I took is quite general. The core idea is to perform common subexpression elimination across loop iterations and introduce a variable for each expression that iteration i+1 can reuse from iteration i.

We start by implementing the challenge example using a regular staged for loop:

def w1(j: Rep[Int]) = a(j) * a(j+1)

def wm(j: Rep[Int]) = a(j) - w1(j) + w1(j-1)

def w2(j: Rep[Int]) = wm(j) * wm(j+1)

def b(j: Rep[Int]) = wm(j) - w2(j) + w2(j-1)

// regular for loop

for (i <- (2 until n-2)) {
output(i) = b(i)
}

The generated code executes, for each loop iteration, 5 reads (positions i-2 ... i +2), 1 write and 14 double operations (b/f = 48/14).

Now all we need to change to reduce this to 1 read, 1 write and 6 flops (b/f = 16/6) is this:

... w1, wm, w2, b stays the same ...

// sliding for loop

for (i <- (2 until n-2).sliding) {
output(i) = b(i)
}

Apart from the little `.sliding`, everything else stays the same. Here is the generated loop:

while (x76 < x3) {
// variable reads
val x77 = x69
val x78 = x70
val x79 = x71
val x80 = x72
val x81 = x73
val x82 = x74
val x84 = x76 + 2
// computation
val x85 = x0(x84)
val x86 = x77 * x85
val x87 = x77 - x86
val x88 = x87 + x78
val x89 = x79 * x88
val x90 = x79 - x89
val x91 = x90 + x80
val x92 = x2(x81) = x91
// variable writes
x69 = x85
x70 = x86
x71 = x88
x72 = x89
x73 = x82
x74 = x84
x76 = x76 + 1
}



How does it work? We rely on the fact that LMS performs common subexpression elimination internally, and we just insert variables for values that are re-used from one loop iteration to the next. Let's consider a simpler example:

for (i <- (0 until n).sliding) {
output(i) = 2 * i + 2 * (i+1)
}

We compute the code for loop iterations i and i + 1:

// iteration i:
val x0 = 2 * i
val x1 = i + 1
val x2 = 2 * x1
val x3 = x0 + x2
output(i) = x3

// iteration i+1: substitute i+1 for i
// i+1 ---> already computed as x1, reuse
// x0' = 2 * i ---> 2 * x1 ---> already computed as x2, reuse
val x1' = x1 + 1
val x2' = 2 * x1'
val x3' = x2 + x2'
output(x1) = x3'

We notice that iteration i + 1 uses symbols x1 and x2 defined in iteration i. Thus, we add variables X1,X2 to propagate the values of x1 and x2 from one iteration to the next.

We peel the first loop iteration to initialize the variables:

if (0 < n) {
var X1 = 1
var X2 = 2
output(0) = 2

And then generate a modified loop that executes the remaining iterations. The loop body corresponds to the (i+1) iteration above plus the reads and write statements that maintain the variables:

while (X1 < n) {
val x1 = X1
val x2 = X2
val x1' = x1 + 1
val x2' = 2 * x1'
val x3' = x2 + x2'
output(x1) = x3'
X1 = x1'
X2 = x2'
}
}


So with a twist on our general cse facility, we implement cse across loop iterations. The approach also works for window sizes larger than 1, we just need to unroll more iterations of the loop (I haven't implemented that yet). In the challenge example, if we look at iteration i + 2, we find that it does not reuse any memory access operations from iteration i, so a larger window is not necessary in this case.

The full code is on GitHub:
https://github.com/TiarkRompf/virtualization-lms-core/blob/delite-develop2/test-src/epfl/test11-shonan/TestStencil.scala

Test case output (regular for loop):
https://github.com/TiarkRompf/virtualization-lms-core/blob/delite-develop2/test-out/epfl/test11-stencil3a.check

Test case output (sliding for loop):
https://github.com/TiarkRompf/virtualization-lms-core/blob/delite-develop2/test-out/epfl/test11-stencil3b.check


Cheers,
- Tiark

Takayuki Muranushi

unread,
Sep 6, 2012, 10:01:05 AM9/6/12
to Tiark Rompf, stag...@googlegroups.com, ol...@okmij.org
Thank you, Tiark!
by the way, are you coming to ICFP2012?

I've been stuffed with my PhD paper in physics, then in this summer
school http://www.iip.ist.i.kyoto-u.ac.jp/mlss12/doku.php?id=schedule
then ICFP!
I'm grateful to SDHPC team, and do my best find the time to interpret
Tiark's and Oleg's sources after that.

Best,

Takayuki

2012/9/5 Tiark Rompf <tiark...@epfl.ch>:

Tiark Rompf

unread,
Sep 6, 2012, 12:34:34 PM9/6/12
to Takayuki Muranushi, stag...@googlegroups.com, ol...@okmij.org
Yes, I'll be around at ICFP. Looking forward to meeting again!
- Tiark
Reply all
Reply to author
Forward
0 new messages