Hello there!
First of all, let me say thanks for the awesome library. It has already helped me a lot with my work several times.
I am now solving a difficult problem and trying to squeeze out maximum speed. I am running out of ideas, so I would like to ask the developers and more experienced users for help.
There are a few bottlenecks and a few thoughts to discuss. I will describe them point by point.
1. In my casadi graph there are many multiplications of large block-diagonal matrices by a vector. Maybe I was not working with them optimally, but computing derivatives was very slow. So I came up with the following trick (see attached picture): I shifted the blocks into one column to get rid of the sparsity, and also expanded the vectors, duplicating their entries. Element-wise multiplication followed by row-wise addition does the same math, but I got a noticeable speedup when working with large MX matrices.
But when the matrix is constant (DM), this trick gives the same time as ordinary sparse multiplication when multiplying by a vector (MX). I currently have one such multiplication in the graph, and it takes a lot of time when the Jacobian is computed. Could you advise something that would let me handle block-diagonal matrices more efficiently? Could callbacks help here? I often have to compute derivatives; maybe I could define part of the graph manually in such difficult places?
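To make the trick from point 1 concrete, here is a minimal plain-C++ sketch (no casadi, all names hypothetical) of the same math: the K diagonal blocks are stacked vertically into one dense column of blocks, the vector segments are duplicated across the rows of their block, and an element-wise product plus a row-wise sum reproduces the block-diagonal matrix-vector product.

```cpp
#include <vector>

// D holds the K diagonal blocks (each m x m) stacked vertically into one
// dense (K*m) x m matrix -- the "shift the blocks into one column" layout.

// Reference: ordinary block-diagonal multiplication y = blkdiag(B_k) * x.
std::vector<double> blockdiag_mv_ref(const std::vector<std::vector<double>>& D,
                                     const std::vector<double>& x, int K, int m) {
    std::vector<double> y(K * m, 0.0);
    for (int k = 0; k < K; ++k)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < m; ++j)
                y[k * m + i] += D[k * m + i][j] * x[k * m + j];
    return y;
}

// Trick: element-wise multiply D by the "expanded" vector (segment x_k
// duplicated across the m rows of block k), then sum along each row.
std::vector<double> blockdiag_mv_trick(const std::vector<std::vector<double>>& D,
                                       const std::vector<double>& x, int K, int m) {
    std::vector<double> y(K * m, 0.0);
    for (int r = 0; r < K * m; ++r) {
        int k = r / m;                       // block index of this row
        for (int j = 0; j < m; ++j)
            y[r] += D[r][j] * x[k * m + j];  // element-wise product, row sum
    }
    return y;
}
```

Both routines return the same vector; the second one only touches dense data, which is the shape of the computation the trick produces in the MX graph.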
2. Another bottleneck in my task is frequently taking subsets of vectors. I often have to form new structures based on old ones. I used to do this with slices and vertcats, but it took a long time:
casadi::MX input;
std::vector<casadi::MX> temp;
for ...
temp.push_back(input(casadi::Slice(...)));
casadi::MX output = casadi::MX::vertcat(temp);
Later I noticed that the same thing can be done in a single casadi operation using indices:
casadi::MX input;
std::vector<casadi_int> tempIndices;
for ...
tempIndices.push_back(...);
casadi::MX output = input(tempIndices);
This gave a significant acceleration in the calculation of the Jacobian, but there are still many places where such operations take up to half of the time spent taking derivatives.
My question is: can such operations be done even more efficiently? For example, the data in my vectors is often grouped, since it describes 2D and 3D points.
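Since the data is grouped into points, one idea (a plain-C++ sketch, not casadi; helper names are hypothetical) is to build the flat index vector once per structure, addressing whole points at a time, and then reuse it for every gather, rather than rebuilding slices on each call:

```cpp
#include <cstddef>
#include <vector>

// Expand a list of point ids into flat entry indices: point p occupies
// entries [p*dim, p*dim + dim), so whole 2D/3D points are picked at once.
std::vector<std::size_t> point_indices(const std::vector<std::size_t>& points,
                                       std::size_t dim) {
    std::vector<std::size_t> idx;
    idx.reserve(points.size() * dim);
    for (std::size_t p : points)
        for (std::size_t d = 0; d < dim; ++d)
            idx.push_back(p * dim + d);
    return idx;
}

// One-pass gather: apply a precomputed index vector directly, instead of
// slicing and concatenating piece by piece on every use.
std::vector<double> gather(const std::vector<double>& input,
                           const std::vector<std::size_t>& indices) {
    std::vector<double> out;
    out.reserve(indices.size());
    for (std::size_t i : indices)
        out.push_back(input[i]);
    return out;
}
```

The point is that the index construction happens once, outside the hot path; each evaluation is then a single contiguous gather, which mirrors the `input(tempIndices)` formulation above.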
3. There was a lot of advice on the forum and in the documentation about using JIT compilation. I tried it in two ways (via the "casadi::external" method and via the low-level C API). Unfortunately, in my case I got only a very small speed-up, while the compilation took so much time that in the end the program ran even slower. Are there any articles on how to use compilation to speed up the code?
4. I am also wondering whether it is possible to move the casadi computation from doubles to floats. For example, during JIT compilation it is possible to pass a casadi_real parameter, but the C API interface only works with doubles. Is it possible at all to set MX::Scalar to float myself, or to pass it to the heavy methods?
5. I am also wondering whether the heavy methods (jacobian(), hessian(), jtimes()) have any parameters (const Dict& opts) that can speed up the computation. I tried some of them but saw no improvement so far. I suspect that the default settings are already optimal, but perhaps I should try something else?
6. My last question is about efficiently combining the MX and SX types in a casadi graph. So far I have not been able to achieve anything. For example, I tried to switch to SX in the places where the subsetting described in point 2 takes place: I compute the large operations on MX and wrap them in an MX function; for the element-wise operations, I wrap the whole thing in a function and turn it into SX using the expand() method, then send the resulting function further along the MX graph. So far I only get a noticeable slowdown, but maybe I am doing something wrong, or using SX for the wrong operations.
Sorry for so many points. Maybe some of your answers will be useful not only for me but also for other users. I would be very glad for any help.
Thank you!
