on explicit result tokens

Herhut, Stephan A

unread,

Apr 30, 2013, 4:56:03 PM4/30/13

to dev-tech-js-en...@lists.mozilla.org, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, pnkf...@mozilla.com, Niko Matsakis, David Herman, Sreeram, Jaswanth

Dear all.

After our discussion on an result (or out) token last week, I played a little bit with the idea and by now I actually quite like the concept. However, what I still do not like is the current proposal of an explicit token kind with a set method to write to the out token. If I recall our discussion, that interface was created to enforce a write only or even write once policy. I do not think such restriction is required, as long as each out token is completely separated from the other result tokens, i.e., as long as there is no means to abuse result tokens for communication between threads. I think it actually is quite handy to be able to use a result token like an ordinary read/write object. Otherwise, one might end up in allocation a temp object to compute the final result in, which is exactly what we want to avoid.

If I recall correctly, we only plan to support the out token for dense data, so we will already have a binary data descriptor:

new ParallelArray([10], f, ArrayType(uint8, 3))

this will create a [10,3] array of uint8 values. Furthermore, the binary data proposal already has the concept of StructViews (more commonly referred to as pointers in other languages). The key idea then is to use such a StructView as out token. A StructView in the above example would behave exactly like a 3 element array of uint8 values, so for the code, there is no API difference. However, we would need the setView method to be disabled when the StructView is being used as result value.

function f(i, out) {
out[0] = r;
out[1] = g;
out[2] = b;
out[3] = a;
}

One can even pass the out array to other functions, etc. If the programmer also supplied an explicit return that returns something other than undefined, that result will overwrite the result. I have decided to use undefined as a signal rather than presence of a return statement as the former allows for easy implementation of the API as a library.

While thinking about this new design, I also revisited the idea of multiple return values, which currently is not well reflected in the API proposal and the current solution, unzip, requires special machinery in the JIT to be efficient. With the above, it is simple to add multiple results:

new ParallelArray([10], f, ArrayType(uint8, 3), ArrayType(uint8,3))

By providing two binary data specifications, we can signal that we want to compute two results. This leads to two out tokens being passed:

function f(i, out1, out2) {
// some code to fill out1 and out2
}

I envision the constructor to return an array of two values in this case. One could argue that returning an array of objects is a strange behavior for a constructor but it is not something that one could not write in JavaScript today. There is precedence for this way of multiple results in regular expression matching, so it is not totally alien. The use of return will only affect the first result, though.

This API design is less clean but it is meant for the performance programmer only, which is willing to go to some length to gain performance.

One issue that remains is supporting multiple scalar results. StructViews only work as pointers to objects but not to scalar data. A workaround is to simply use a 1 element array, which in memory gives the same layout (fixed size arrays in binary data are stored as a packed data structure). So one needs to write

new ParallelArray([10], f, ArrayType(uint8, 3), ArrayType(uint8,1))

Note that for a single scalar result, there is no issue as this is not allocated on the heap anyway and can simply be returned using a return statement.

I have not fleshed out the API for all methods, as I do not quite know what it should look like, yet. So far, I think constructor, map and filter make sense. For the other ones, I am not convinced this is needed. What does reduction on multiple values even mean?

Some food for thought until this week's meeting. Comments welcome.

Stephan

Niko Matsakis

unread,

Apr 30, 2013, 9:41:04 PM4/30/13

to Herhut, Stephan A, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, pnkf...@mozilla.com, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

Hi Stephan,

I very much like the idea of reusing binary data StructViews for the
out pointer. One problem of course is that they are not currently
implemented. In the interim, it may be more practical for us to use
something simpler, so as not to gate the work on parallel processing
on the binary data work (which, so far as I know, has not yet begun
nor been 100% fully specified).

I'm a bit less clear on the zipping and unzipping proposal. Do you
mean that I might write something like:

var [pa1, pa2] = pa0.map(..., type1, type2, function(e, ..., out1, out2) {...});

and thus create two parallel arrays from one source array in one step?

Niko

Hudson, Rick

unread,

May 1, 2013, 8:32:30 AM5/1/13

to Niko Matsakis, Herhut, Stephan A, Shpeisman, Tatiana, Shu-yu Guo, pnkf...@mozilla.com, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

Are StructViews implemented as just pointers? I'm assuming they aren't since the GC is unlikely to play well with raw pointers into the middle of some object. Instead the GC will want to see a StructView that holds the base of the object and an index so it can do things like mark/move/flip the object.
The gains are from reuse of StructViews, not having to allocate each result, and from not having to do a copy to aggregate the results.

I worry about false sharing of cache lines and overestimating the cost of allocation of short lived objects using bump pointer allocation.

- Rick

Felix S. Klock II

unread,

May 1, 2013, 9:17:12 AM5/1/13

to Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

Rick and other PJS folk-

Regarding token representation:

I do not know about StructView, but I have assumed any result-token
representation would generally need to carry a base+index, as Rick
describes. (Some GC systems may be able to avoid that, if they support
tracing raw pointers to the interiors of objects, but we of course
cannot require nor presume that.)

(In the past, Shu, Niko and I have discussed trying to cut-down on the
overhead introduced by the result-tokens by putting in logic to reuse
the memory allocated for the tokens, when we can prove that they don't
escape and thus such re-use is sound, so that you would only allocate
O(num-processors) tokens, rather than O(num-values) tokens. I don't
know if this line of thought would assuage your concerns about their
representation, or just introduce new worries for you.)

----

Regarding overestimating allocation cost, two points:

1.) I don't think we can assume/require that JS Engines will have a
bump-pointer allocation scheme. The proposal is meant as a way for
library implementers (presumably experts) to write code that will avoid
temporary allocations regardless of a particular browser's GC strategy;
it is meant in part to make adoption more attractive for implementers,
not end-users.

2.) An important use-case for the API in my mind is when the
intermediate arrays are *large* (i.e. not just ARGB 4-arrays, but
rather O(n) for an n*m matrix, etc), and so details of the allocation
scheme are a distraction. Avoiding the copy, a gain Rick mentioned, is
the more important win for the result-token I anticipate for those
cases.

(Unfortunately the immediate instances of the latter that are on my
mind are for cases that we haven't really discussed and are not part of
the current API, such as a variant of scan() for doing row-by-row
wavefront computations, where the prefix computed so far is passed
along for each successive row. The point there is that you want to
parallelize the row-computation, but a row is presumably large enough
that you want to avoid copying it.)

----

Regarding false sharing of cache lines:

That is an interesting issue. I can imagine an implementation
attempting to work around it (e.g. by making cache-aware decisions
about how to break up the result-array and hand off its addresses), but
designing the API to allow for that (and/or encourage it) seems tricky,
at least in the general case.

I think I'll have to ponder this further; for now my gut tells me that
the copying is hurting our performance more than potential
false-sharing would. But I have no concrete numbers to back up that
claim.

Cheers,
-Felix

--
irc: pnkfelix on irc.mozilla.org
email: {fklock, pnkfelix}@mozilla.org

Norm Rubin

unread,

May 1, 2013, 10:02:55 AM5/1/13

to Felix S. Klock II, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

I thought I'd add some observations about how this would work on a gpu

Some background:

On today's gpu devices there are two memories
The usual cpu memory and a second separate gpu memory (the size of this memory may be limited)
Programs generally look like
Copy some data from cpu to gpu
Do the work on the gpu
Copy the result back

For almost all current gpu programs the two copies take most of the time (performance is bandwidth limited)
To get gpu performance out of a parallejs program the compiler/runtime is going to need to carefully figure out where to place the copies and identify cases where it can keep the data gpu resident

The gpu tends to work well when the data set is large, so the copies are often many megabytes

While future machines may reduce the need for copies, I expect they still will have bandwidth concerns.

False sharing is reversed on the gpu,
Code will run significantly faster is nearby threads write nearby addresses
That is if thread0 writes *x, it would be good for performance if thead1 writes (*x+1)

----

Now (at last) my questions

1) Given var [pa1, pa2] = pa0.map(..., array(int,2) , array(int,3), function(e, ..., out1, out2) {...});

So each thread writes a 5 ints split up as a block of 2(out1) and a block of 3 (out2)
How would this proposal put those values into pa1 and pa2,

Is pa[i] is an array of 2 elements, and each element of pa2 is an array of 3 elements or are the arrays flattened in some way?

2) the input side question corresponding to multiple outputs, we often want to write code that does stencils, e.g., we might what to write x[i] = a[i-1]+ a[i]+a[i+1], each thread reads three inputs centered around its thread id, no thread needs to access anyone else in a. What do you think of using something similar to let the jit know the amount of a the code will touch?
So the first argument of the function might change from a single element and an index, to small array?

Pa.map(function( in) ){ out[0] = in[-1]+ in[0]+ in[1];}

_______________________________________________
dev-tech-js-engine-rivertrail mailing list dev-tech-js-en...@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-tech-js-engine-rivertrail
-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------

Felix S. Klock II

unread,

May 1, 2013, 10:07:28 AM5/1/13

to Herhut, Stephan A, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

Stephan and other PJS folk-

On 30/04/2013 22:56, Herhut, Stephan A wrote:
> If the programmer also supplied an explicit return that returns something other than undefined, that result will overwrite the result. I have decided to use undefined as a signal rather than presence of a return statement as the former allows for easy implementation of the API as a library.

In my own prototype, since I do use the `set()` method, I've been using
the invocation of that method as the signal meaning "ignore the return
value from the callback." Obviously since you do not want the `set()`
method, that option is not immediately available, unless we treat
`out[i] = expr;` as having the signalling effect, which I infer you do
not want to do.

I don't have any strong allegiance to the `set()` method or the various
write-only/write-once constraints at this point; they were largely an
attempt to restrict the protocol as much as possible.

> I have not fleshed out the API for all methods, as I do not quite know what it should look like, yet. So far, I think constructor, map and filter make sense. For the other ones, I am not convinced this is needed. What does reduction on multiple values even mean?

Indeed, I have struggled with that question myself: Which methods should
be generalized for this token API.

The rule-of-thumb I've employed is: If one can predict the result shape,
and if it can be used to help avoid intermediate allocations, then a
token (oriented around that result shape) might be warranted.

So:

* constructor and map() make sense.

* filter() : does it make sense? You cannot predict the precise
length ahead of time. I guess the caller could guess the length (and
thus the whole shape); but what happens if they guess wrong? Do they
need to guess the exact length, or will it suffice for them to guess >=
the end length? If they guess wrong, do we freshly allocate a result
then, or throw an exception? (Nonetheless, there is probably some
variant here that is reasonable.)

* reduce() : my intuition is that an outptr does not make sense given
its current API. [1] But this is not a case I feel the need to optimize
right now; I've only been thinking about it in terms of making the API
"uniform" in some sense.

* scan() : seems analogous to reduce(). But then the fact that you
build up a result array/matrix leads me to wonder if I should not be so
quick to dismiss the case of reduce. (I worry in particular about the
row-computations on the image-resize benchmark, and if some sequential
variant of scan could allow us to compute the rows in-place.)

----

The most important thing to me is that the token itself have parallel
methods, so that a large sub-computation can itself be distributed.
(This is again related to what I'd like to see for the image-resize
benchmark.) I didn't see this addressed in Stephan's note, but maybe it
was implicit. (Maybe it would just be built-in to all StructViews in
engines with PJS.)

So, as a toy-demonstration of one potential API method:

var [A,B,C] = [2,3,2];
print(new ParallelMatrix([A], [B,C], (i, outptr) => outptr.gather((j,k) => 1000+100*i+10*j+k)))

prints:

[[[1000,1001], [1010,1011], [1020,1021]],
[[1130,1131], [1100,1101], [1110,1111]]]

The `outptr.gather()` is allowed to fork off parallel threads to compute
the [B,C] elements; so even though outptr is not an instance of
Array/ParallelArray/ParallelMatrix/whatever-we-call-it, it nonetheless
has parallel capabilities.

Cheers,
-Felix

[1] At times I've wondered if all we would need is the programmer to
provide the shape of the expected result from the reduction operator, so
that if the user is constructing an array/matrix from a reduce()
invocation, we could do some preallocation and get some reuse of the
associated intermediate results. But inevitably I run into the issue
that the original cell-shape and the intermediate values are not
necessarily of the same domain, and I wonder if this use-case even
actually exists. That's about where I give up on trying to shoe-horn
this into the API.

Hudson, Rick

unread,

May 1, 2013, 10:54:47 AM5/1/13

to Felix S. Klock II, Herhut, Stephan A, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

On 30/04/2013 22:56, Herhut, Stephan A wrote:
"If the programmer also supplied an explicit return that returns something other than undefined, that result will overwrite the result."

We know the signature of the kernel function and if it takes the out arguments then why can't we just spec that we ignore the returned result. This cleans up the foot gun of an arrow function without a block returning values unintentionally.

- Rick

From: Felix S. Klock II [mailto:pnkf...@mozilla.com]
Sent: Wednesday, May 01, 2013 10:07 AM
To: Herhut, Stephan A
Cc: dev-tech-js-en...@lists.mozilla.org; Hudson, Rick; Shpeisman, Tatiana; Sreeram, Jaswanth; Niko Matsakis; Shu-yu Guo; David Herman
Subject: Re: on explicit result tokens

email: {fklock, pnkfelix}@mozilla.org<mailto:pnkfelix%7...@mozilla.org>

Felix S. Klock II

unread,

May 1, 2013, 11:08:39 AM5/1/13

to Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

Rick-

Are you suggesting we have different behavior based on the number of
arguments declared by the kernel function?

If the kernel function uses ...varargs, what then?

(Not that I object to the idea; at one point in my prototype I did
something along these lines, with the policy of "tough luck" for anyone
using varargs (where I resolved the ambiguity by copying the returned
value). I've since abandoned that code for other reasons.)

Cheers,
-Felix

On Wed May 1 16:54:47 2013, Hudson, Rick wrote:
> On 30/04/2013 22:56, Herhut, Stephan A wrote:
>
> “If the programmer also supplied an explicit return that returns
> something other than undefined, that result will overwrite the result.”
>
> We know the signature of the kernel function and if it takes the out
> arguments then why can’t we just spec that we ignore the returned
> result. This cleans up the foot gun of an arrow function without a
> block returning values unintentionally.
>
> -Rick
>

> *From:*Felix S. Klock II [mailto:pnkf...@mozilla.com]
> *Sent:* Wednesday, May 01, 2013 10:07 AM
> *To:* Herhut, Stephan A
> *Cc:* dev-tech-js-en...@lists.mozilla.org; Hudson, Rick;

> Shpeisman, Tatiana; Sreeram, Jaswanth; Niko Matsakis; Shu-yu Guo;
> David Herman

> *Subject:* Re: on explicit result tokens

Hudson, Rick

unread,

May 1, 2013, 11:12:23 AM5/1/13

to Felix S. Klock II, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

Yes, if you have a ...varargs how would the compiler know to use the out protocol in the first place?

- Rick

-----Original Message-----
From: dev-tech-js-engine-rivertrail-bounces+rick.hudson=inte...@lists.mozilla.org [mailto:dev-tech-js-engine-rivertrail-bounces+rick.hudson=inte...@lists.mozilla.org] On Behalf Of Felix S. Klock II
Sent: Wednesday, May 01, 2013 11:09 AM
To: Hudson, Rick
Cc: Shpeisman, Tatiana; Shu-yu Guo; Niko Matsakis; Herhut, Stephan A; dev-tech-js-en...@lists.mozilla.org; David Herman; Sreeram, Jaswanth

Felix S. Klock II

unread,

May 1, 2013, 11:16:58 AM5/1/13

to Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

You could unconditionally pass a token, and then use one of the dynamic
signalling mechanisms mentioned by Stephan and myself to decide whether
it was used.

Not that I'm saying this is a *good* idea; I'm just noting it as a
point in the design space.

-Felix

Hudson, Rick

unread,

May 1, 2013, 11:23:14 AM5/1/13

to Norm Rubin, Felix S. Klock II, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

1) Given var [pa1, pa2] = pa0.map(..., array(int,2) , array(int,3), function(e, ..., out1, out2) {...});
So each thread writes a 5 ints split up as a block of 2(out1) and a block of 3 (out2) How would this proposal put those values into pa1 and pa2,
Is pa[i] is an array of 2 elements, and each element of pa2 is an array of 3 elements or are the arrays flattened in some way?

That is correct, the arrays are not flattened.

2) the input side question corresponding to multiple outputs, we often want to write code that does stencils, e.g., we might what to write x[i] = a[i-1]+ a[i]+a[i+1], each thread reads three inputs centered around its thread id, no thread needs to access anyone else in a. What do you think of using something similar to let the jit know the amount of a the code will touch?
So the first argument of the function might change from a single element and an index, to small array?

Pa.map(function( in) ){ out[0] = in[-1]+ in[0]+ in[1];}

This should be written on top of the spec as a library where the your sum function is passed to a makeStencil function that returns a function with the correct semantics and map kernel function signature.

Pa.map(buildStencil(function(east, me, west) {...}, -1, 0, 1, boundaryValue)); where -1, 0, and 1 are index offsets used to calculate east, me, and west and boundaryValue is what is used if the stencil crosses a boundary.

- Rick

-----Original Message-----
From: Norm Rubin [mailto:nru...@nvidia.com]
Sent: Wednesday, May 01, 2013 10:03 AM
To: Felix S. Klock II; Hudson, Rick
Cc: Shpeisman, Tatiana; Shu-yu Guo; Niko Matsakis; Herhut, Stephan A; dev-tech-js-en...@lists.mozilla.org; David Herman; Sreeram, Jaswanth

From: dev-tech-js-engine-rivertrail-bounces+nrubin=nvidi...@lists.mozilla.org [mailto:dev-tech-js-engine-rivertrail-bounces+nrubin=nvidi...@lists.mozilla.org] On Behalf Of Felix S. Klock II
Sent: Wednesday, May 01, 2013 9:17 AM
To: Hudson, Rick
Cc: Shpeisman, Tatiana; Shu-yu Guo; Niko Matsakis; Herhut, Stephan A; dev-tech-js-en...@lists.mozilla.org; David Herman; Sreeram, Jaswanth
Subject: Re: on explicit result tokens

Rick and other PJS folk-

Regarding token representation:

I do not know about StructView, but I have assumed any result-token representation would generally need to carry a base+index, as Rick describes. (Some GC systems may be able to avoid that, if they support tracing raw pointers to the interiors of objects, but we of course cannot require nor presume that.)

(In the past, Shu, Niko and I have discussed trying to cut-down on the overhead introduced by the result-tokens by putting in logic to reuse the memory allocated for the tokens, when we can prove that they don't escape and thus such re-use is sound, so that you would only allocate
O(num-processors) tokens, rather than O(num-values) tokens. I don't know if this line of thought would assuage your concerns about their representation, or just introduce new worries for you.)

----

Regarding overestimating allocation cost, two points:

1.) I don't think we can assume/require that JS Engines will have a bump-pointer allocation scheme. The proposal is meant as a way for library implementers (presumably experts) to write code that will avoid temporary allocations regardless of a particular browser's GC strategy; it is meant in part to make adoption more attractive for implementers, not end-users.

2.) An important use-case for the API in my mind is when the intermediate arrays are *large* (i.e. not just ARGB 4-arrays, but rather O(n) for an n*m matrix, etc), and so details of the allocation scheme are a distraction. Avoiding the copy, a gain Rick mentioned, is the more important win for the result-token I anticipate for those cases.

(Unfortunately the immediate instances of the latter that are on my mind are for cases that we haven't really discussed and are not part of the current API, such as a variant of scan() for doing row-by-row wavefront computations, where the prefix computed so far is passed along for each successive row. The point there is that you want to parallelize the row-computation, but a row is presumably large enough that you want to avoid copying it.)

----

Regarding false sharing of cache lines:

That is an interesting issue. I can imagine an implementation attempting to work around it (e.g. by making cache-aware decisions about how to break up the result-array and hand off its addresses), but designing the API to allow for that (and/or encourage it) seems tricky, at least in the general case.

I think I'll have to ponder this further; for now my gut tells me that the copying is hurting our performance more than potential false-sharing would. But I have no concrete numbers to back up that claim.

Cheers,
-Felix

On Wed May 1 14:32:30 2013, Hudson, Rick wrote:
> Are StructViews implemented as just pointers? I'm assuming they aren't since the GC is unlikely to play well with raw pointers into the middle of some object. Instead the GC will want to see a StructView that holds the base of the object and an index so it can do things like mark/move/flip the object.
> The gains are from reuse of StructViews, not having to allocate each result, and from not having to do a copy to aggregate the results.
>
> I worry about false sharing of cache lines and overestimating the cost of allocation of short lived objects using bump pointer allocation.
>

> - Rick

>
> -----Original Message-----
> From: Niko Matsakis [mailto:nmat...@mozilla.com]
> Sent: Tuesday, April 30, 2013 9:41 PM

> To: Herhut, Stephan A
> Cc: dev-tech-js-en...@lists.mozilla.org; Hudson, Rick;
> Shpeisman, Tatiana; Sreeram, Jaswanth; Shu-yu Guo;
> pnkf...@mozilla.com; David Herman

> Subject: Re: on explicit result tokens
>

> Hi Stephan,
>
> I very much like the idea of reusing binary data StructViews for the out pointer. One problem of course is that they are not currently implemented. In the interim, it may be more practical for us to use something simpler, so as not to gate the work on parallel processing on the binary data work (which, so far as I know, has not yet begun nor been 100% fully specified).
>
> I'm a bit less clear on the zipping and unzipping proposal. Do you mean that I might write something like:
>
> var [pa1, pa2] = pa0.map(..., type1, type2, function(e, ...,
> out1, out2) {...});
>
> and thus create two parallel arrays from one source array in one step?
>
>
>
> Niko
>

email: {fklock, pnkfelix}@mozilla.org

_______________________________________________
dev-tech-js-engine-rivertrail mailing list dev-tech-js-en...@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-tech-js-engine-rivertrail

Niko Matsakis

unread,

May 1, 2013, 11:35:16 AM5/1/13

to Felix S. Klock II, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

On Wed, May 01, 2013 at 04:07:28PM +0200, Felix S. Klock II wrote:
> The rule-of-thumb I've employed is: If one can predict the result
> shape, and if it can be used to help avoid intermediate allocations,
> then a token (oriented around that result shape) might be warranted.

Seems reasonable! Incidentally, One thing that we lose by not having
an out pointer unless the grain is of some aggregate type is that it
perhaps becomes a bit harder to write generic kernel functions that
work regardless of the output type. I am envisioning some sort of
generic, higher-order kernel function that delegates to another
functon to actually generate the result: this would have to be written
using return values, I think. But this loss is worth the conceptual
gain of re-using StructViews.

> * filter() : does it make sense? You cannot predict the precise
> length ahead of time. I guess the caller could guess the length
> (and thus the whole shape); but what happens if they guess wrong? Do
> they need to guess the exact length, or will it suffice for them to
> guess >= the end length? If they guess wrong, do we freshly
> allocate a result then, or throw an exception? (Nonetheless, there
> is probably some variant here that is reasonable.)

In the case of `filter()`, the result of the function is always a
boolean, no? So I don't think an out pointer makes any sense there.

> * reduce() : my intuition is that an outptr does not make sense
> given its current API. [1] But this is not a case I feel the need
> to optimize right now; I've only been thinking about it in terms of
> making the API "uniform" in some sense.
>

> * scan() : seems analogous to reduce(). [...]

It seems useful to me to support out pointers in `reduce` and
`scan`. As an example, consider a large 2-d matrix where you want to
output a vector containing the sum of every row. In short, the
reduce() function always has the type `(A,A) -> A` where `A` is the
grain type you are iterating over, and that grain may in fact be a
larger value. (When I say "has the type", I mean that logically it has
that type: clearly JS is dynamically typed and users may do all kinds
of tricks etc)

> The most important thing to me is that the token itself have
> parallel methods, so that a large sub-computation can itself be
> distributed. (This is again related to what I'd like to see for the
> image-resize benchmark.) I didn't see this addressed in Stephan's
> note, but maybe it was implicit. (Maybe it would just be built-in to
> all StructViews in engines with PJS.)

In general, I favor the idea of offering parallel "mutate in place"
methods uniformly, but there are some implications that must be
weighed. In particular, we need to prevent reads from within the tasks
spawned by `divide()`, which implies some kind of read-guard.

One specific concern I have with respect to supporting such methods in
the general case is that I believe you can have many StructViews all
examining the same underlying buffer, so we'd have to do the read
guards at the level of the buffer.

The other obvious concern is performance; if done naively, read-guards
could affect performance of all reads. However, I imagine we can
optimize these well. For one thing, they are never necessary outside
of parallel execution mode, so we don't need to worry about affecting
sequential baseline performance. TI could probably also be used to
tell us if the object being read has ever been divided and thus
whether a guard would be necessary at all.

Also, this certainly intersects some of the "buffer slicing" ideas
that dherman has been tossing around, so maybe we can address this use
case in another way. All in all this is an area that will require some
thought, I think.

> [1] ... But inevitably I run into the issue that the original

> cell-shape and the intermediate values are not necessarily of the
> same domain, and I wonder if this use-case even actually exists.
> That's about where I give up on trying to shoe-horn this into the
> API.

I am confused about this. I think that the return type of the reduce
and the values within the original input must logically be of the same
type, since the callback can be invoked with either the return value
of a prior reduction or an input from the array, and there is no way
to control which you get. Of course dynamic type tests can be used,
which is perhaps what you mean? But it seems like defaulting to a
grain that is taken from the array is reasonable, and the user is
always permitted to supply "any" if they prefer.

Overall though I feel like I may not be understanding your point,
though.

Niko

Niko Matsakis

unread,

May 1, 2013, 12:02:46 PM5/1/13

to Felix S. Klock II, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

On Wed, May 01, 2013 at 05:16:58PM +0200, Felix S. Klock II wrote:
> You could unconditionally pass a token, and then use one of the
> dynamic signalling mechanisms mentioned by Stephan and myself to
> decide whether it was used.
>
> Not that I'm saying this is a *good* idea; I'm just noting it as a
> point in the design space.

I personally favor this approach.

Niko

Herhut, Stephan A

unread,

May 1, 2013, 12:47:54 PM5/1/13

to Niko Matsakis, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, pnkf...@mozilla.com, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

>I very much like the idea of reusing binary data StructViews for the out
>pointer. One problem of course is that they are not currently implemented. In
>the interim, it may be more practical for us to use something simpler, so as not
>to gate the work on parallel processing on the binary data work (which, so far
>as I know, has not yet begun nor been 100% fully specified).

I agree. However, for the specification (and grand vision), we should tie to binary data. I have been thinking about using TypedArrays and ArrayBufferViews for now but unfortunately, ArrayBufferViews cannot be moved, i.e., their offset is a read only property. If that could be changed (if only for internal use in self-hosted code, that approach could reach quite far.

>I'm a bit less clear on the zipping and unzipping proposal. Do you mean that I
>might write something like:
>
> var [pa1, pa2] = pa0.map(..., type1, type2, function(e, ..., out1, out2) {...});
>
>and thus create two parallel arrays from one source array in one step?

Yes, that is the idea (except that in my proposal types trail the function argument to give a IMHO nicer overloading). This essentially allows for the equivalent of loop fusion, which can be hugely beneficial when two operations share some computation and also may help to drive down overheads.

Stephan

Herhut, Stephan A

unread,

May 1, 2013, 12:58:11 PM5/1/13

to Norm Rubin, Felix S. Klock II, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

>-----Original Message-----
>From: Norm Rubin [mailto:nru...@nvidia.com]
>

>1) Given var [pa1, pa2] = pa0.map(..., array(int,2) , array(int,3), function(e,

>..., out1, out2) {...});
>

>So each thread writes a 5 ints split up as a block of 2(out1) and a block of 3
>(out2) How would this proposal put those values into pa1 and pa2,
>
>Is pa[i] is an array of 2 elements, and each element of pa2 is an array of 3
>elements or are the arrays flattened in some way?

This depends on the type specification you provide. Let me provide some detail:

When you write something like

[A,B] = new ParallelArray([10,20], f, ArrayType(uint8, 3), ArrayType(uint8))

you get two ParallelArray object back with different backing stores. The on in A will have type ArrayType(ArrayType(ArrayType(uint8, 3), 20), 10) and will be represented as a dense, flat 600 element array in memory (as per definition in the binary data spec). For B, however, you get a different situation as the size of the inner array is unknown. The overall backing store type is ArrayType(ArrayType(ArrayType(uint8), 20), 10), which is stored as an array of 200 pointers to arrays of uint8 values.

So you can have both, depending on what you need. The outer ParallelArray structure, however, is always dense and flat.

>2) the input side question corresponding to multiple outputs, we often want
>to write code that does stencils, e.g., we might what to write x[i] = a[i-1]+
>a[i]+a[i+1], each thread reads three inputs centered around its thread id, no
>thread needs to access anyone else in a. What do you think of using
>something similar to let the jit know the amount of a the code will touch?
>So the first argument of the function might change from a single element and
>an index, to small array?
>
>Pa.map(function( in) ){ out[0] = in[-1]+ in[0]+ in[1];}

To benefit from this kind of information, the JIT would still need to do some analysis to compute the optimal iteration order and memory mapping. If one goes to that effort, inferring the access patterns, in particular for relatively simple stencil operations, should be easy enough to do, too.

Do you have a specification language in mind or is there some prior art? One would need to define border behavior, too.

Stephan

Herhut, Stephan A

unread,

May 1, 2013, 1:11:56 PM5/1/13

to Felix S. Klock II, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

> From: Felix S. Klock II [mailto:pnkf...@mozilla.com]

>
> * filter() : does it make sense? You cannot predict the precise length ahead of time. I guess the caller
> could guess the length (and thus the whole shape); but what happens if they guess wrong? Do they
> need to guess the exact length, or will it suffice for them to guess >= the end length? If they guess
> wrong, do we freshly allocate a result then, or throw an exception? (Nonetheless, there is probably
> some variant here that is reasonable.)

In my proposal, the type specification only gives the cell type (or grain type) and does not provide the size of the frame of the result. Thus it is up to the runtime to figure that piece of information out.

> * reduce() : my intuition is that an outptr does not make sense given its current API. [1] But this
> is not a case I feel the need to optimize right now; I've only been thinking about it in terms of
> making the API "uniform" in some sense.

I came to the same conclusion when drafting the binary data spec and have not added binary data support to reduce as it simply produces a single value that is passed along. If you want that to be a binary data type, you can implement your reduction function that way.

> * scan() : seems analogous to reduce(). But then the fact that you build up a result array/matrix

> leads me to wonder if I should not be so quick to dismiss the case of reduce. (I worry in particular
> about the row-computations on the image-resize benchmark, and if some sequential variant of
> scan could allow us to compute the rows in-place.)

For scan, an out token makes sense as one is not producing a single result but an array of results and may want that to be in a binary format. To avoid copying, there should be an out parameter. My previous comment was about providing multiple out parameters and I do not have a good use case or API design for that.

Stephan

Felix S. Klock II

unread,

May 1, 2013, 1:23:10 PM5/1/13

to Herhut, Stephan A, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

On 01/05/2013 19:11, Herhut, Stephan A wrote:
>> * reduce() : my intuition is that an outptr does not make sense given its current API. [1] But this
>> is not a case I feel the need to optimize right now; I've only been thinking about it in terms of
>> making the API "uniform" in some sense.
> I came to the same conclusion when drafting the binary data spec and have not added binary data support to reduce as it simply produces a single value that is passed along. If you want that to be a binary data type, you can implement your reduction function that way.

My current thinking was I was wrong and Niko is right: If you are doing
a reduce that produces an array/matrix, then an outptr can make sense.

The mental block I was experiencing in this case is how to make this
work on arbitrary arrays.

But Niko pointed out to me that this reduction, like any other for the
PJS reduce operator, needs to start with the proper inputs. If your
input array/matrix does not have the right grain to work with a reduce
operator (C,C) -> C, then you must first map it so that it *is* an
arrayof<C> (or can be decomposed into some reduction-space * grain where
the grain = C).

(I know that last paragraph is a bit jumbled; but it will be easier to
clear up any confusion directly via voice during tonight's meeting, I
think, rather than try to spend more time working out the right text.)

-Felix

Herhut, Stephan A

unread,

May 1, 2013, 1:37:40 PM5/1/13

to Felix S. Klock II, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

>-----Original Message-----
>From: Felix S. Klock II [mailto:pnkf...@mozilla.com]
>

>On 01/05/2013 19:11, Herhut, Stephan A wrote:
>>> * reduce() : my intuition is that an outptr does not make sense
>>> given its current API. [1] But this is not a case I feel the need to
>>> optimize right now; I've only been thinking about it in terms of making the
>API "uniform" in some sense.
>> I came to the same conclusion when drafting the binary data spec and have
>not added binary data support to reduce as it simply produces a single value
>that is passed along. If you want that to be a binary data type, you can
>implement your reduction function that way.
>
>My current thinking was I was wrong and Niko is right: If you are doing a
>reduce that produces an array/matrix, then an outptr can make sense.

I just found that mail in my spam filter :-)

>The mental block I was experiencing in this case is how to make this work on
>arbitrary arrays.
>
>But Niko pointed out to me that this reduction, like any other for the PJS
>reduce operator, needs to start with the proper inputs. If your input
>array/matrix does not have the right grain to work with a reduce operator
>(C,C) -> C, then you must first map it so that it *is* an arrayof<C> (or can be
>decomposed into some reduction-space * grain where the grain = C).

This conversion, however, can be performed in the reduction function itself if needed. The other use of type specifications, enabling pre-allocation of data, does not apply in the case of reduce, though, as there is no single inflight result or output to write.

Stephan

Felix S. Klock II

unread,

May 1, 2013, 1:40:48 PM5/1/13

to Herhut, Stephan A, Hudson, Rick, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

Stephan-

On 01/05/2013 19:37, Herhut, Stephan A wrote:
> This conversion, however, can be performed in the reduction function itself if needed. The other use of type specifications, enabling pre-allocation of data, does not apply in the case of reduce, though, as there is no single inflight result or output to write.

I'm not sure about that. I've been thinking you could pre-allocate the
backing storage for O(num-processors) intermediate values from the
subreductions, rather than forcing the GC to deal with
O(num-invocations) allocations of the intermediate values. (Of course
this gets into serious "must avoid aliasing bugs" territory.)

Cheers,
-Felix

Norm Rubin

unread,

May 1, 2013, 3:13:04 PM5/1/13

to Hudson, Rick, Felix S. Klock II, Shpeisman, Tatiana, Shu-yu Guo, Niko Matsakis, Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org, David Herman, Sreeram, Jaswanth

While it is true that stencil codes could be done in a library -

Pa.map(buildStencil(function(east, me, west) {...}, -1, 0, 1, boundaryValue)); where -1, 0, and 1 are index offsets used to calculate east, me, and west and boundaryValue is what is used if the stencil crosses a boundary.

I have some trouble seeing how the jit would understand what is going on.
On a gpu, an ideal jit would copy a block (enough of the input to cover a few threads) into a per thread group scratch pad memory so that the references would be very fast.

The buildStencil function has all the information, but if it is a separate library, how does the info flow along?

-----Original Message-----
From: Hudson, Rick [mailto:rick....@intel.com]
Sent: Wednesday, May 01, 2013 11:23 AM
To: Norm Rubin; Felix S. Klock II
Cc: Shpeisman, Tatiana; Shu-yu Guo; Niko Matsakis; Herhut, Stephan A; dev-tech-js-en...@lists.mozilla.org; David Herman; Sreeram, Jaswanth
Subject: RE: on explicit result tokens

1) Given var [pa1, pa2] = pa0.map(..., array(int,2) , array(int,3), function(e, ..., out1, out2) {...});
So each thread writes a 5 ints split up as a block of 2(out1) and a block of 3 (out2) How would this proposal put those values into pa1 and pa2,
Is pa[i] is an array of 2 elements, and each element of pa2 is an array of 3 elements or are the arrays flattened in some way?

That is correct, the arrays are not flattened.

2) the input side question corresponding to multiple outputs, we often want to write code that does stencils, e.g., we might what to write x[i] = a[i-1]+ a[i]+a[i+1], each thread reads three inputs centered around its thread id, no thread needs to access anyone else in a. What do you think of using something similar to let the jit know the amount of a the code will touch?
So the first argument of the function might change from a single element and an index, to small array?

Pa.map(function( in) ){ out[0] = in[-1]+ in[0]+ in[1];}

This should be written on top of the spec as a library where the your sum function is passed to a makeStencil function that returns a function with the correct semantics and map kernel function signature.

Pa.map(buildStencil(function(east, me, west) {...}, -1, 0, 1, boundaryValue)); where -1, 0, and 1 are index offsets used to calculate east, me, and west and boundaryValue is what is used if the stencil crosses a boundary.

- Rick

-----Original Message-----
From: Norm Rubin [mailto:nru...@nvidia.com]

Sent: Wednesday, May 01, 2013 10:03 AM
To: Felix S. Klock II; Hudson, Rick
Cc: Shpeisman, Tatiana; Shu-yu Guo; Niko Matsakis; Herhut, Stephan A; dev-tech-js-en...@lists.mozilla.org; David Herman; Sreeram, Jaswanth
Subject: RE: on explicit result tokens

I thought I'd add some observations about how this would work on a gpu

Some background:

On today's gpu devices there are two memories
The usual cpu memory and a second separate gpu memory (the size of this memory may be limited)
Programs generally look like
Copy some data from cpu to gpu
Do the work on the gpu
Copy the result back

For almost all current gpu programs the two copies take most of the time (performance is bandwidth limited) To get gpu performance out of a parallejs program the compiler/runtime is going to need to carefully figure out where to place the copies and identify cases where it can keep the data gpu resident

The gpu tends to work well when the data set is large, so the copies are often many megabytes

While future machines may reduce the need for copies, I expect they still will have bandwidth concerns.

False sharing is reversed on the gpu,
Code will run significantly faster is nearby threads write nearby addresses That is if thread0 writes *x, it would be good for performance if thead1 writes (*x+1)

----

Now (at last) my questions

1) Given var [pa1, pa2] = pa0.map(..., array(int,2) , array(int,3), function(e, ..., out1, out2) {...});

So each thread writes a 5 ints split up as a block of 2(out1) and a block of 3 (out2) How would this proposal put those values into pa1 and pa2,

Is pa[i] is an array of 2 elements, and each element of pa2 is an array of 3 elements or are the arrays flattened in some way?

2) the input side question corresponding to multiple outputs, we often want to write code that does stencils, e.g., we might what to write x[i] = a[i-1]+ a[i]+a[i+1], each thread reads three inputs centered around its thread id, no thread needs to access anyone else in a. What do you think of using something similar to let the jit know the amount of a the code will touch?
So the first argument of the function might change from a single element and an index, to small array?

Pa.map(function( in) ){ out[0] = in[-1]+ in[0]+ in[1];}

> I very much like the idea of reusing binary data StructViews for the out pointer. One problem of course is that they are not currently implemented. In the interim, it may be more practical for us to use something simpler, so as not to gate the work on parallel processing on the binary data work (which, so far as I know, has not yet begun nor been 100% fully specified).
>

> I'm a bit less clear on the zipping and unzipping proposal. Do you mean that I might write something like:
>

> var [pa1, pa2] = pa0.map(..., type1, type2, function(e, ...,
> out1, out2) {...});
>

> and thus create two parallel arrays from one source array in one step?
>
>
>

>> Stephan

--
irc: pnkfelix on irc.mozilla.org
email: {fklock, pnkfelix}@mozilla.org

Hudson, Rick

unread,

May 1, 2013, 4:03:03 PM5/1/13

to Norm Rubin, dev-tech-js-en...@lists.mozilla.org

>> The buildStencil function has all the information, but if it is a separate library, how does the info flow along?

The buildStencil function is not part of the kernel function, all it does is produce the kernel function. Nothing I'm saying here deals with how the kernels are scheduled or memory is moved about. It is just pointing out that stencil should be built as a helper function in JavaScript.

To simplify the example below let's just consider stencil's boundary problem. Below halo returns a function that only calls blur on values not within a boundary. To be general the size of the boundary is passed in as an argument to halo. Values within the boundary are returned unaltered. It is this function returned by halo that is the kernel function used within the map.

var pa = new ParallelArray(
new Int32Array([2,4,8,16,32]));
function blur(ele,ind,arr){return (ele+arr[i-1])/2;};
pa.map(blur); // -> throws error on arr[-1]
function halo(boundary, work) {
return function (element, indx, arr){
if (indx < boundary) { return element;}
else { return work(element, indx, arr);};
};
};
pa.map(halo(3, blur))) // ->[2 4 8 12 24]
pa.map(halo(1, blur))) // ->[2 3 6 12 24]
pa.map(halo(pa.length-1, blur))) //->[2 4 8 16 24]

Norm Rubin

unread,

May 2, 2013, 1:24:26 AM5/2/13

to Hudson, Rick, dev-tech-js-en...@lists.mozilla.org

Gpu's have very fast scratch pad memories . Each scratch pad is shared by a subset of the threads, so one way to do a stencil might be to figure out subsets of the input that both fit into the scratch pad and cover all the needs of a subset of the thread

Suppose that
The original parallel array held 100 elements. And stencil showed that each element needed to refer to its left and right neighbors
An implementation might want to cut the array up into 10 groups each holding 12 elements. If that was the size of the scratch pad. Some extra elements might get inserted for boundary conditions
Then the implementation could launch 100 threads grouped into 10 blocks of 10,
Each block could read in a chunk of data, each of the 10 threads reads one value, the first and last threads either read an extra value or setup the boundary.
Then every thread can read its input out of fast scratchpad memory.

Without some clues about the input access pattern, I don't think the jit is going to be able to split up data and threads efficiently.

Niko Matsakis

unread,

May 2, 2013, 5:58:30 AM5/2/13

to Norm Rubin, dev-tech-js-en...@lists.mozilla.org, Hudson, Rick

> Without some clues about the input access pattern, I don't think
> the jit is going to be able to split up data and threads
> efficiently.

This is an interesting point. Just to be sure I understand (because
there have been a lot of dense messages), you're saying that if we had
a stencil method, then the runtime would have a very strong hint of
what data will be accessed (i.e., the data that the stencil requests)
and thus it would be able to schedule more efficiently. This makes
sense to me. I also think that it makes sense to offer higher-level
methods, even if they *can* be composed from lower-level ones,
particularly if it makes it easier to generate more efficient code. I
know this is not a universal opinion.

This has an interesting overlap with pnkfelix's thoughts about need
for a wavefront method. Wavefronts and stencils are very similar,
except that in a wavefront, the inputs to the computation are the
outputs from previous iterations. Anyway, my strawman for a wavefront
API was something like this:

var pa1 = pa.wavefront(
// Dependence vector. This specifies that to compute
// element (x, y), you need the values that were computed
// for elements (x-1, y-1), (x-1, y), and (x-1, y+1):
[[-1, -1], [-1, 0], [-1, 1]],

// Default value. This specifies what value to use for
// a dependency if it cannot be satisfied. In this example,
// defaults would be used where x==0 || y==0. If no simple
// default is suitable, `undefined` or some other
// sentinel value can be used.
0,

// This function computes the result for an individual cell.
// Note that some of the dependencies may be default
// if this is a border cell.
function(e0, e1, e2, // (x-1, y-1), (x-1, y), (x-1, y+1)
x, y) { // the indices we are computing
});

I imagine a stencil API could have the same form, though with
different semantics (the inputs from `pa` and not the results of
previous iterations).

Niko

Norm Rubin

unread,

May 2, 2013, 9:28:12 AM5/2/13

to Niko Matsakis, dev-tech-js-en...@lists.mozilla.org, Hudson, Rick

This is exactly what I'm suggesting.

It seems to me that there is a wide spectrum for higher level operations, on one end we could pick a minimal set and rely on the compiler to find ways to get performance, and on the other we could pick a huge set to make the compilers job trivial.

I suspect everyone wants to be somewhere in the middle, but everyone has had different experiences so there are lots of views on what do you need for performance and how hard will it be to construct the jit.