Re: [stan-dev/math] Specialized gradients for cov_exp_quad (#353)


Bob Carpenter

unread,
Aug 21, 2016, 07:39:13
to stan...@googlegroups.com
[off GitHub, in order to keep issues on topic]


> > With all that being said, I'd like to talk through potentially adding a matrix var type, with values stored as a member double matrix and adjoints stored as a member double matrix, but whose values are allocated in our arena. As it stands, it seems like we're copying values and adjoints far too many times (think about a cov_exp_quad -> cholesky_decompose -> multiply -> sum) and should be able to store one pointer to a matrix vari rather than n^2 pointers to varis.

How far have you worked through this? I've tried many
times for just the reasons you mention. But I'd love to
get a fresh perspective.

In the current approach, we have both the cost of the copies
and the cost of allocating containers on the C++ heap. Both of
those cause unwanted, fine-grained memory contention. What I
don't know is how much overhead that is compared to the actual
matrix operations.

Have you worked through what the data type would look like
and how it'd fit in with the rest of our autodiff? I assume
you're talking about:

  mvar
    mvari* mvi_;

  mvari
    MatrixXd val_;
    MatrixXd adj_;

Or did you really mean storing pointers (or references)
rather than values, as in:

  mvari
    MatrixXd& val_;
    MatrixXd& adj_;
    void chain();

The biggest issue I kept running into is how (or even whether)
to support indexing (e.g., a[i, j]). To interface with the rest of
the language as it currently is, it'd have to support:

  matrix_var a(5, 5);
  var b = a(i, j);

Specifically, how is the vari inside of var b handled?

An alternative might be to disallow indexing on matrix_var
instances, which would limit both their construction and
their use, but might be enough to speed up the operations
you're worried about. I haven't thought this through as much
as the previous plans.

Then there's the issue of how the chain() method will work.
The challenge is that you either need to store pointers to other
scalar vari or matrix mvari.

There's also the issue of higher-order autodiff if we
ever want to get that going for real.

Also, Ben's always been talking about how to store decompositions
for reuse, which is where a lot of inefficiency in naive matrix
use comes from in user programs.

Anyway, I'd love to hear some fresh ideas on how to handle this.
I can also ask around at the autodiff conference in a couple weeks
in Oxford. I have a feeling I'm going to mostly be sitting there
slack-jawed while all the calculus and applied math notation
rolls over my head.

Whatever the answer to these questions is, it'd be a massive
undertaking at both the language and math lib level---akin to
taking on using Nomad.

- Bob

Rob Trangucci

unread,
Aug 21, 2016, 23:36:35
to stan...@googlegroups.com
On Sun, Aug 21, 2016 at 7:38 AM, Bob Carpenter <ca...@alias-i.com> wrote:
[off GitHub, in order to keep issues on topic]


> > With all that being said, I'd like to talk through potentially adding a matrix var type, with values stored as a member double matrix and adjoints stored as a member double matrix, but whose values are allocated in our arena. As it stands, it seems like we're copying values and adjoints far too many times (think about a cov_exp_quad -> cholesky_decompose -> multiply -> sum) and should be able to store one pointer to a matrix vari rather than n^2 pointers to varis.

How far have you worked through this?  I've tried many
times for just the reasons you mention.  But I'd love to
get a fresh perspective.

In the current approach, we have both the cost of the copies
and the cost of allocating containers on the C++ heap.  Both of
those cause unwanted, fine-grained memory contention.  What I
don't know is how much overhead that is compared to the actual
matrix operations.


My main worry is actually memory consumption rather than computational cost, because for most of the models that I have in mind, a cubic operation (matrix-matrix multiply, or decomposition) would dominate the computational cost. However, for something like a matrix-vector product, the copying will be on the same order as the function evaluation and gradient calc. 

multiply(Matrix<var, -1, -1> A, Matrix<var, -1, -1> B) has to allocate arena memory for the double values of A and B in order to compute Matrix<double, -1, -1> AB = A * B, and then in chain() we have to make copies of the adjoint matrix of AB, and A and B. It seems like we're allocating more memory than we need to here. I think we can cut down on one of these allocations by allowing Eigen to reuse and resize memory for an adjoint calculation. But as we start to chain matrix operations together, we're going to have to make seemingly unnecessary copies of matrices.
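The storage pattern described above can be sketched in miniature. Everything here is hypothetical and hand-rolled (not Stan's actual multiply implementation), with 2x2 row-major arrays standing in for Eigen matrices: the vari keeps its own double copies of A and B, because chain() needs them again to form adj(A) += adj(C) * B^T and adj(B) += A^T * adj(C).

```cpp
#include <array>
#include <cassert>

// 2x2 row-major matrix; keeps the sketch self-contained (no Eigen).
using Mat2 = std::array<double, 4>;

inline Mat2 matmul(const Mat2& a, const Mat2& b) {
  return {a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
          a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]};
}

inline Mat2 transpose(const Mat2& a) { return {a[0], a[2], a[1], a[3]}; }

// Hypothetical vari for C = A * B: it copies the operand values on
// construction because chain() needs them again for the adjoints.
struct multiply_vari {
  Mat2 A_, B_;            // copies of the operand values, held for chain()
  Mat2 adjA_{}, adjB_{};  // operand adjoints (stand-ins for pointers back)
  Mat2 val_, adj_{};      // value and adjoint of the result C

  multiply_vari(const Mat2& A, const Mat2& B)
      : A_(A), B_(B), val_(matmul(A, B)) {}

  void chain() {
    Mat2 dA = matmul(adj_, transpose(B_));  // adj(A) += adj(C) * B^T
    Mat2 dB = matmul(transpose(A_), adj_);  // adj(B) += A^T * adj(C)
    for (int i = 0; i < 4; ++i) {
      adjA_[i] += dA[i];
      adjB_[i] += dB[i];
    }
  }
};
```

The copies of A_ and B_ are exactly the allocations at issue: each chained matrix operation repeats them for its own operands.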

 
Have you worked through what the data type would look like
and how it'd fit in with the rest of our autodiff?  I assume
you're talking about:

  mvar
    mvari* mvi_;

  mvari
    MatrixXd val_;
    MatrixXd adj_;

Yes, this is exactly what I had in mind.
 

Or did you really mean storing pointers (or references)
rather than values, as in:

  mvari
    MatrixXd& val_;
    MatrixXd& adj_;
    void chain();

The biggest issue I kept running into is how (or even whether)
to support indexing (e.g., a[i, j]).  To interface with the rest of
the language as it currently is, it'd have to support:

  matrix_var a(5, 5);
  var b = a(i, j);

Specifically, how is the vari inside of var b handled?

An alternative might be to disallow indexing on matrix_var
instances, which would limit both their construction and
their use, but might be enough to speed up the operations
you're worried about.  I haven't thought this through as much
as the previous plans.

Yes, this is where I was stumbling yesterday. I don't have a good answer yet. I'm tempted to say that we don't allow indexing, but that should be a last resort. Is there a way to template matrix vars in such a way as to specialize for a class that allows indexing and one that doesn't? Then you could template the matrix functions to handle inputs of indexable matrix vars and inputs of non-indexable matrix vars. The class and chain method for the former would look similar to our matrix var functions now, but the class and chain method for the latter would use the MatrixXd members of the matrix vars passed in as arguments.
 

Then there's the issue of how the chain() method will work.
The challenge is that you either need to store pointers to other
scalar vari or matrix mvari.

The idea would be to store pointers to the mvari.
 

There's also the issue of higher-order autodiff if we
ever want to get that going for real.

As long as we've added the properly templated versions of all the functions, 
we should be fine, no? 
 

Also, Ben's always been talking about how to store decompositions
for reuse, which is where a lot of inefficiency in naive matrix
use comes from in user programs.

Anyway, I'd love to hear some fresh ideas on how to handle this.
I can also ask around at the autodiff conference in a couple weeks
in Oxford.  I have a feeling I'm going to mostly be sitting there
slack-jawed while all the calculus and applied math notation
rolls over my head.

Whatever the answer to these questions is, it'd be a massive
undertaking at both the language and math lib level---akin to
taking on using Nomad.

No doubt, but I think we're currently stuck with either slower matrix computation or faster but memory-inefficient matrix computation. Worth chatting about on Tuesday at the dev meeting?
 

- Bob

--
You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ben Goodrich

unread,
Aug 22, 2016, 00:40:01
to stan development mailing list
On Sunday, August 21, 2016 at 11:36:35 PM UTC-4, robert.trangucci wrote:
multiply(Matrix<var, -1, -1> A, Matrix<var, -1, -1> B) has to allocate arena memory for the double values of A and B in order to compute Matrix<double, -1, -1> AB = A * B, and then in chain() we have to make copies of the adjoint matrix of AB, and A and B. It seems like we're allocating more memory than we need to here. I think we can cut down on one of these allocations by allowing Eigen to reuse and resize memory for an adjoint calculation. But as we start to chain matrix operations together, we're going to have to make seemingly unnecessary copies of matrices.

Perhaps some of this could be prevented by utilizing Eigen Maps, if we could ever get the math library to template on Eigen::MatrixBase rather than Eigen::Matrix.
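The no-copy idea behind Eigen::Map can be shown with a hand-rolled, hypothetical view type (in real code Eigen::Map<Eigen::MatrixXd> over arena memory plays this role; nothing below is Stan's actual API):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for what Eigen::Map provides: a non-owning 2-D
// view over memory that already lives in the arena, so functions can
// read and write values/adjoints in place instead of copying them into
// a freshly allocated matrix.
struct mat_view {
  double* data;            // not owned; points into arena storage
  std::size_t rows, cols;
  double& operator()(std::size_t i, std::size_t j) {
    return data[i * cols + j];  // row-major indexing
  }
};
```

Templating the math functions on Eigen::MatrixBase, as suggested above, is what would let the same function accept both an Eigen::Matrix and an Eigen::Map without copying.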

Ben

Michael Betancourt

unread,
Aug 22, 2016, 01:51:50
to stan...@googlegroups.com
I think the only resolution would be slick template expressions, no?

Bob Carpenter

unread,
Aug 22, 2016, 08:15:15
to stan...@googlegroups.com

> On Aug 22, 2016, at 5:36 AM, Rob Trangucci <robert.t...@gmail.com> wrote:
>
>
>
> ...
> My main worry is actually memory consumption rather than computational cost, because for most of the models that I have in mind, a cubic operation (matrix-matrix multiply, or decomposition) would dominate the computational cost. However, for something like a matrix-vector product, the copying will be on the same order as the function evaluation and gradient calc.

Same order, but much lower constant given the overhead
of autodiff.

> multiply(Matrix<var, -1, -1> A, Matrix<var, -1, -1> B) has to allocate arena memory for the double values of A and B in order to compute Matrix<double, -1, -1> AB = A * B,

We shouldn't allocate a local Matrix<double, ...> in the arena.
We should just use Eigen so the memory doesn't persist beyond
the local scope.

> and then in chain() we have to make copies of the adjoint matrix of AB, and A and B. It seems like we're allocating more memory than we need to here.

Absolutely, but the dynamic overhead for the operations should
be freed as soon as the result's calculated.

> I think we can cut down on one of these allocations by allowing Eigen to reuse and resize memory for an adjoint calculation. But as we start to chain matrix operations together, we're going to have to make seemingly unnecessary copies of matrices.

Right.



> Have you worked through what the data type would look like
> and how it'd fit in with the rest of our autodiff? ...

Above paragraph from me, below from Rob---for some reason I'm not
getting nesting in his responses.

> Yes, this is where I was stumbling yesterday. I don't have a good answer yet. I'm tempted to say that we don't allow indexing, but this should be a last-resort. Is there a way to template matrix vars in such a way as to specialize for a class that allows for indexing and one that doesn't?

Not that I can figure out. The issue is that templating is resolved
statically (at compile time), but we don't know what kind of variable
we'll run into on the stack. And we have to keep things in stack order
to pass gradients. So the only way I know how to do this is with a
virtual function, which is much more expensive.

> Then you could template the matrix functions to handle inputs of indexable matrix vars and to handle inputs of non-indexable matrix vars. The class and chain method for the former would look similar to our matrix var functions now, but the class and chain methods for the latter would use the MatrixXd members of the matrix vars passed in as arguments.
>
>

(from bob)
> Then there's the issue of how the chain() method will work.
> The challenge is that you either need to store pointers to other
> scalar vari or matrix mvari.
>

(from rob)
The idea would be to store pointers to the mvari.

And then you'd need the indexes into mvari, too.

> There's also the issue of higher-order autodiff if we
> ever want to get that going for real.
>
> As long as we've added the properly templated versions of all the functions,
> we should be fine, no?

So you'd have

  template <typename T>
  mvari
    Matrix<T, -1, -1> val_;
    Matrix<T, -1, -1> adj_;

?

> Also, Ben's always been talking about how to store decompositions
> for reuse, which is where a lot of inefficiency in naive matrix
> use comes from in user programs.
>
> Anyway, I'd love to hear some fresh ideas on how to handle this.
> I can also ask around at the autodiff conference in a couple weeks
> in Oxford. I have a feeling I'm going to mostly be sitting there
> slack-jawed while all the calculus and applied math notation
> rolls over my head.
>
> Whatever the answer to these questions is, it'd be a massive
> undertaking at both the language and math lib level---akin to
> taking on using Nomad.
>
> No doubt, but I think we're currently stuck with either slower matrix computation or faster but memory inefficient matrix computation. Worth chatting about on Tuesday at the dev meeting?

If you have anything else to say about it, sure. I'd be gung-ho
to do this if we had a design prototype that worked and at least
a person-year's worth of programmer time.

- Bob

Rob Trangucci

unread,
Aug 22, 2016, 11:13:13
to stan...@googlegroups.com
On Mon, Aug 22, 2016 at 8:14 AM, Bob Carpenter <ca...@alias-i.com> wrote:

> On Aug 22, 2016, at 5:36 AM, Rob Trangucci <robert.t...@gmail.com> wrote:
>
>
>
> ...
> My main worry is actually memory consumption rather than computational cost, because for most of the models that I have in mind, a cubic operation (matrix-matrix multiply, or decomposition) would dominate the computational cost. However, for something like a matrix-vector product, the copying will be on the same order as the function evaluation and gradient calc.

Same order, but much lower constant given the overhead
of autodiff.

> multiply(Matrix<var, -1, -1> A, Matrix<var, -1, -1> B) has to allocate arena memory for the double values of A and B in order to compute Matrix<double, -1, -1> AB = A * B,

We shouldn't allocate a local Matrix<double, ...> in the arena.
We should just use Eigen so the memory doesn't persist beyond
the local scope.

We need the doubles for A and B again in chain. I can run some benchmarks to see how much of a speed hit (if any) we get from pulling the doubles out into local variables.
I can't come up with an elegant way of storing adjoints and values as matrices. We're always going to have complications when interfacing with the scalar vars. 

Here's another idea. What if an N by M matrix_var contains a pointer to N*M block_varis whose vals and adjoints are laid out contiguously in the arena? The matrix_var also stores pointers to the starting locations of the vals and adjoints. That way, when var operator()(int i, int j) is defined for the matrix_var, the operator creates a var whose vi_ is a block_vari, and whose val_ and adj_ members map to the correct locations in the block-allocated values and adjoints.
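A minimal sketch of that layout (all names hypothetical; the scalar handle is reduced to bare pointers, and std::vector stands in for the arena blocks):

```cpp
#include <cassert>
#include <vector>

// Hypothetical block_vari: its val/adj are pointers into the matrix's
// contiguous blocks, so indexing mints a handle without allocating a
// per-element vari or copying anything.
struct block_vari {
  double* val_;  // into the matrix's value block
  double* adj_;  // into the matrix's adjoint block
  double val() const { return *val_; }
  double& adj() { return *adj_; }
};

// Hypothetical matrix_var: owns two contiguous blocks, one for values
// and one for adjoints (stand-ins here for arena allocations).
struct matrix_var {
  std::vector<double> vals_, adjs_;
  int rows_, cols_;
  matrix_var(int n, int m)
      : vals_(n * m, 0.0), adjs_(n * m, 0.0), rows_(n), cols_(m) {}
  block_vari operator()(int i, int j) {
    int k = i * cols_ + j;  // row-major offset into both blocks
    return block_vari{&vals_[k], &adjs_[k]};
  }
};
```

Writes through the handle's adjoint land directly in the matrix's block, which is the point: one allocation for the whole matrix, N*M lightweight handles.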

Bob Carpenter

unread,
Aug 22, 2016, 11:30:19
to stan...@googlegroups.com

> On Aug 22, 2016, at 5:12 PM, Rob Trangucci <robert.t...@gmail.com> wrote:
>
>
> On Mon, Aug 22, 2016 at 8:14 AM, Bob Carpenter <ca...@alias-i.com> wrote:
>
> > On Aug 22, 2016, at 5:36 AM, Rob Trangucci <robert.t...@gmail.com> wrote:
> >
> >
> >
> > ...
> > My main worry is actually memory consumption rather than computational cost, because for most of the models that I have in mind, a cubic operation (matrix-matrix multiply, or decomposition) would dominate the computational cost. However, for something like a matrix-vector product, the copying will be on the same order as the function evaluation and gradient calc.
>
> Same order, but much lower constant given the overhead
> of autodiff.
>
> > multiply(Matrix<var, -1, -1> A, Matrix<var, -1, -1> B) has to allocate arena memory for the double values of A and B in order to compute Matrix<double, -1, -1> AB = A * B,
> ...
> We need the doubles for A and B again in chain.

Then I don't see how we can avoid saving them.

> I can run some benchmarks to see how much of a speed hit (if any) we get from pulling the doubles out into local variables.

You mean recomputing them? That would also save memory.

...

> I can't come up with an elegant way of storing adjoints and values as matrices. We're always going to have complications when interfacing with the scalar vars.
>
> Here's another idea. What if an N by M matrix_var contains a pointer to N*M block_varis whose vals and adjoints are laid out contiguously in the arena. The matrix_var also stores pointers to the starting locations of the vals and adjoints. That way, when the var operator()(int i, int j) is defined for the matrix_var, the operator creates a var, whose vi_ is a block_vari, whose val_ and adj_ members map to their correct locations in the block allocated val and block allocated adjoint.

That's an interesting proposal.

We can't just have block_vari extend vari, because vari allocates
the storage directly for the adjoint:

  class vari {
    const double val_;
    double adj_;
    ...
  };

whereas with this strategy we'd need a reference to the value and
adjoint. Then I don't see how we can avoid paying the virtual
function cost with two different base classes for vari.

I recently eliminated the chainable class that we used to have as
a base class for vari because it wasn't doing anything. But I
don't think that would've helped avoid the virtual function call or
the lack of ability to resolve the required class statically.

- Bob

P.S. I think the problem's in my mailer's interactions because your
mail scans OK.

Michael Betancourt

unread,
Aug 22, 2016, 12:53:17
to stan...@googlegroups.com
Wouldn’t this just take the flavor of an Eigen map,
which I believe is a template expression?  In other
words, provide a wrapper to the underlying matrix
vari that is addressable by index to return a scalar
vari?

Rob Trangucci

unread,
Aug 22, 2016, 13:23:07
to stan...@googlegroups.com
Right, but that new scalar vari's value and adjoint are pointers into the double arrays of values and adjoints that the matrix var allocated. So we'd need a way to create a base class that both vari and "block vari" can inherit from, each returning adj_ and val_ differently. I'm wondering if you could use CRTP, with adj() and val() functions defined on the base class that return the derived implementations: val_ and adj_ in the current vari scheme, and *val_ and *adj_ in the block_vari scheme.
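A sketch of that CRTP shape (hypothetical names; whether it can survive a heterogeneous stack is exactly the open question in this thread):

```cpp
#include <cassert>

// Hypothetical CRTP base: val()/adj() forward statically to the derived
// class, so there is no virtual dispatch.
template <typename Derived>
struct vari_base {
  double val() { return static_cast<Derived&>(*this).val_impl(); }
  double& adj() { return static_cast<Derived&>(*this).adj_impl(); }
};

// Current scheme: the vari stores its value and adjoint directly.
struct plain_vari : vari_base<plain_vari> {
  double val_;
  double adj_ = 0;
  explicit plain_vari(double v) : val_(v) {}
  double val_impl() { return val_; }
  double& adj_impl() { return adj_; }
};

// Block scheme: the vari stores pointers into block-allocated storage.
struct block_vari : vari_base<block_vari> {
  double* val_ = nullptr;  // into a block of values
  double* adj_ = nullptr;  // into a block of adjoints
  double val_impl() { return *val_; }
  double& adj_impl() { return *adj_; }
};
```

Calls through vari_base resolve at compile time, so each scheme reads its storage its own way with no runtime cost.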

To Bob's questions:

> We need the doubles for A and B again in chain.
Then I don't see how we can avoid saving them.
> I can run some benchmarks to see how much of a speed hit (if any) we get from pulling the doubles out into local variables.
You mean recomputing them?  That would also save memory

Yes, I mean that rather than store the doubles for A and B in multiply(Matrix<var, -1, -1> A, Matrix<var, -1, -1> B) in the arena, we could instead create local Eigen matrices on construction to compute A_local * B_local, and then in chain(), do the same in order to compute the adjoints for A and B. I'll do the comparisons and get back to everyone.


Rob

Bob Carpenter

unread,
Aug 22, 2016, 13:50:14
to stan...@googlegroups.com
No, you can't use the CRTP because that needs to be resolved
statically. It's not like regular OO programming with a base
class. So, for example, you can't make a container like a std::vector
to hold different extensions of a CRTP base class---the specific
extension needs to be completely resolved statically. This is the real
problem I keep running into every time I've tried to think about doing this
in the past. And then if you go to virtual functions, everything gets
very expensive (we could test that presupposition).
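The container problem can be made concrete: vari_base<plain_vari> and vari_base<block_vari> in a CRTP scheme are unrelated types, so one std::vector can't hold both, and the usual escape hatch is a dynamic base with a virtual chain(), which carries the per-node dispatch cost in question. A hypothetical sketch of that fallback:

```cpp
#include <cassert>
#include <vector>

// Hypothetical dynamic base: the reverse pass walks one stack of mixed
// node types, so the node type can't be a template parameter. A shared
// base with a virtual chain() works, but every node pays a virtual call.
struct chainable_base {
  virtual void chain() = 0;
  virtual ~chainable_base() = default;
};

struct scalar_node : chainable_base {
  double adj = 0;
  void chain() override { adj += 1; }      // stand-in for a scalar rule
};

struct matrix_node : chainable_base {
  double adj_sum = 0;
  void chain() override { adj_sum += 2; }  // stand-in for a matrix rule
};
```

The stack itself is then a container of base pointers, walked in reverse, and every chain() call goes through the vtable.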


Rob Trangucci

unread,
Aug 22, 2016, 23:31:08
to stan...@googlegroups.com
We could have one var stack that holds vars pointing to vanilla varis, and another var stack, which doesn't get chained, holding matrix vars that point to matrix varis (this is pretty much how matrix varis work today). I do see the encapsulation problem, though. Even if we had a VanillaVari and a MatrixVari that derived from a BaseVari class, the math functions would need to be written for BaseVari pointers, and then I'm not sure how you could interface with them from the language. I've attached some code for VanillaVari and MatrixVari, but I'm stumped now. At least I know how CRTP works now, so all is not lost.


Rob

crtp2_test.cpp

Bob Carpenter

unread,
Aug 23, 2016, 06:40:14
to stan...@googlegroups.com

> On Aug 23, 2016, at 5:30 AM, Rob Trangucci <robert.t...@gmail.com> wrote:
>
> We could have one var stack that holds vars that point to vanilla varis and another var stack that doesn't get chained with matrix vars that point to matrix varis (this is pretty much how matrix varis work today).

I get that part, but how do you do indexing of a matrix?

> I do see the encapsulation problem though. Even if we had a VanillaVari and a MatrixVari that derived from a BaseVari class, the math functions would need to be written for BaseVari pointers, and then I'm not sure how you could interface with them from the language.

That would be OK---it'd be the usual complex CRTP because
the function signatures being called will be resolvable at
compile time.

The problem is how we deal with the chaining. I'm not saying it's
impossible, just that I have never figured out how to do it. At
some point, all my work always devolves into not knowing the CRTP
instance statically, so having to revert to virtual function calls
in the deepest nested calls.

> I've attached some code for VanillaVari and MatrixVari, but I'm stumped now. At least I know how CRTP works now, so all is not lost.

:-) That's a tough pattern and it doesn't buy you a whole
lot, because you can't use it in the usual OO way of having
containers of the base type.

- Bob
