How to efficiently compute derivatives of a custom batch matrix-vector multiplication

Kirill Klimov

May 20, 2021, 12:05:54 PM
to CasADi
Hi! First of all, thank you for the awesome library.

I'm solving a large-scale optimization problem (non-linear least squares, hundreds of thousands of variables) via nlpsol. Initially I used SX to construct my loss. Unfortunately, computing the gradient and Hessian of my function takes a long time, even longer than the actual NLP solve. As far as I understand how CasADi's AD works, a few big MX operations should be easier to differentiate than thousands of SX operations, so I decided to rewrite my loss with MX.

While most of my loss can easily be expressed as a few (4-5) CasADi matrix operations, there is one that I struggle to write efficiently with MX: batch matrix-vector multiplication. I need to multiply several thousand small vectors (3x1) by the same number of small matrices (3x3).

So I have two questions:
1. Is there a way to implement batch matrix-vector multiplication via vanilla CasADi operations? I tried to implement it with map (a sketch of my attempt is below the questions), but it gives only a small speed-up for the symbolic Hessian (probably less than 2x compared to the fully SX version).

2. I can implement my own Callback that performs batch matrix-vector multiplication as one big matrix operation, with a Jacobian and Jacobian-of-Jacobian of appropriate sparsity. Are there any tricks to help CasADi take and compute the derivatives? Maybe I should implement forward/reverse operations for my Jacobian instead of a pure Jacobian-of-Jacobian. Maybe there are some useful flags, etc.
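
For reference, here is roughly what my map-based attempt looks like (a minimal sketch; the names and sizes are illustrative, not my real code):

import casadi as ca

N = 5000  # number of small products (illustrative)

# one 3x3 * 3x1 product
A = ca.MX.sym('A', 3, 3)
x = ca.MX.sym('x', 3, 1)
matvec = ca.Function('matvec', [A, x], [ca.mtimes(A, x)])

# map it over all N instances; the mapped inputs are horizontal concatenations,
# so A_all is 3 x 3N, x_all is 3 x N and the result is 3 x N
batched = matvec.map(N)
A_all = ca.MX.sym('A_all', 3, 3 * N)
x_all = ca.MX.sym('x_all', 3, N)
y_all = batched(A_all, x_all)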

Sorry for mistakes, English is not my native language 

Joel Andersson

May 21, 2021, 4:14:25 PM
to CasADi
Hi,

1. Yes, just create a function for performing say N=100 operations and then call this multiple times via a map operation. That might be a good trade-off for overall speed and memory use.

2. I wouldn't do this. Callback will have a lot of overhead by itself and complicates derivative calculations.

If you use MX instead of SX for these operations, you might get a good speed-up by doing just-in-time compilation, at least a factor 10.
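
Roughly something like this (an untested sketch; the names, sizes and unrolling factor are just for illustration):

import casadi as ca

N_total = 10000    # total number of 3x3 * 3x1 products (illustrative)
N_unroll = 100     # how many products to unroll into one base function

# base function: N_unroll matrix-vector products unrolled into a single call
A = ca.MX.sym('A', 3, 3 * N_unroll)
x = ca.MX.sym('x', 3, N_unroll)
y = ca.horzcat(*[ca.mtimes(A[:, 3*i:3*i+3], x[:, i]) for i in range(N_unroll)])
block = ca.Function('block', [A, x], [y])

# map the unrolled block over the remaining instances
batched = block.map(N_total // N_unroll)

A_all = ca.MX.sym('A_all', 3, 3 * N_total)
x_all = ca.MX.sym('x_all', 3, N_total)
y_all = batched(A_all, x_all)

# just-in-time compile the resulting function (needs a C compiler on the path)
f = ca.Function('f', [A_all, x_all], [y_all], dict(jit=True, compiler='shell'))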

Best regards,
Joel


Kirill Klimov

Jun 1, 2021, 4:16:04 AM
to CasADi
Hi! Thank you for the answer!

>1. Yes, just create a function for performing say N=100 operations and then call this multiple times via a map operation. That might be a good trade-off for overall speed and memory use.

Wow, I actually never thought that such "unrolling" could speed up computations. Can you explain why it's faster than a "plain" map over all operations? Which function do you mean?

>If you use MX instead of SX for these operations, you might get a good speed-up by doing just-in-time compilation, at least a factor 10.

I observe a ~3x speed-up for the gradient with MX vs SX. But my MX implementation used a custom Callback for the batched matrix multiplications. I haven't tried to take the Hessian yet.

On Friday, May 21, 2021 at 11:14:25 PM UTC+3, j.a.e.a...@gmail.com wrote:

Joel Andersson

Jun 16, 2021, 10:32:40 AM
to CasADi
Hi,
There will always be a certain overhead for a function call, say X, in addition to the cost of evaluating the actual function, say C. So calling something 10,000 times might cost 10000*(C + X). For a very cheap operation, X might dominate over C. If you unroll 100 calls into a new single function, each call of that function costs about 100*C + X, so the total cost becomes 100 * (100*C + X). You probably don't want to unroll all the calls, because then the memory overhead might be prohibitive (the function no longer fits in cache, etc.).
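
As a back-of-the-envelope illustration with made-up numbers:

X = 1.0      # per-call overhead (made-up units)
C = 0.01     # cost of one small matrix-vector product (made-up units)
n = 10000    # total number of products

cost_naive = n * (C + X)                   # one call per product:  10100.0
cost_unrolled = (n // 100) * (100*C + X)   # 100 products per call:   200.0
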
Joel