>> a. parameters / residuals not being in cache
> The code in bundle_adjuster.cc keeps parameters in contiguous arrays. Residuals are internally stored in ceres in a scratch space, I believe (which is contiguous)
Residuals and parameters are stored contiguously, but parameters are accessed in a non-linear order (and everything might not fit into cache, taking into account simultaneous accesses to the jacobian).
If there is no local parameterization [and the requested output is a block-sparse matrix], the jacobians of residual blocks are written directly into the block-sparse matrix when CostFunction::Evaluate() is called; otherwise they are written to per-thread buffers first, and the result of multiplication by the local parameterization jacobian is then written into the block-sparse jacobian matrix.
>> b. evaluation preparation (needs to traverse some part of "logical structure" of the problem, that will probably not fit in the cache and might require pretty-non-linear memory access)
> The code in bundle_adjuster.cc has no PrepareEvaluation method.
Please take a look at the ceres::internal::ProgramEvaluator class ( ./internal/ceres/program_evaluator.h ); as I've mentioned above, ceres::Solver::Summary::jacobian_evaluation_time_in_seconds includes all time spent in the ceres::internal::ProgramEvaluator::Evaluate method when jacobians are requested.
Inside the loop over residual blocks it has the following logical sections:
* Preparation of all pointers to residuals-jacobians-etc
* Evaluation via ResidualBlock::Evaluate call
- Evaluating cost-function value
- Applying local parametrization jacobians
- Applying loss functions
* Gradient computation
>> d. local parametrization / jacobian write-back / etc
> The above code doesn't use local parameterization and uses a dense jacobian (which should again be cheap to write to)
Memory access is not cheap if the memory is not in cache; one still needs to check whether a particular parameter block referenced by the current residual block has a trivial parameterization.
> I am using bundle_adjuster.cc which comes with the ceres code. I have added only 5-10 lines of code to benchmark things, no logic changes. Therefore, the above numbers can be kind of reproduced.
1. In the code you've supplied, the typical time consumed by a single execution will be near the resolution of the timer being used (on most modern machines; even in your results the reported time is 407ns < 1us).
I would expect that it "averages" over a bunch of zeros and, occasionally, 1us values; this might skew the results in one direction or the other. I would recommend checking whether std::chrono::high_resolution_clock has nanosecond resolution on the target platform and using it instead (or some platform-specific hardware counter with the required resolution).
2. On my pc, the average time required for calling CostFunction::Evaluate is 15% slower from "inside the solver" than from the "creation point".
If I make sure to pollute the cpu caches before calling CostFunction::Evaluate, it becomes 3x slower at the "creation point" than from "inside the solver".
Thus, it seems reasonable to think that the 15% slower performance in the first case is due to not all memory being in cache when the CostFunction::Evaluate method is called from the solver.
3. The total work performed in jacobian evaluation inside ceres-solver (i.e. not only the CostFunction::Evaluate() call itself) for this particular problem on my pc costs 2.017x a cost-function invocation, and is distributed as follows (I will omit discussion of the standard errors of these estimates):
I. Common initialization: 0.063
II. Loop over residuals in ProgramEvaluator::Evaluate (0.18 is left uncategorized):
a. ResidualBlock::Evaluate() 1.412 (0.089 left uncategorized)
* Cost function evaluation: 1.000
* Invalidation/validation: 0.088 / 0.133
* Preparation: 0.052
* Local parameterization: 0.050 <-- keep in mind that even if you have trivial parameterization of parameter blocks, you still have to loop over all parameter blocks of the current residual block and check whether their parameterization is trivial; this is non-linear memory access, so even if the operation itself is a no-op, it still takes time to check whether there is anything to do
b. Gradient evaluation: 0.129
c. Evaluation preparation: 0.197
d. Output write-back: 0.038
My conclusion is that the cost of invoking CostFunction::Evaluate per se is roughly the same, taking cache effects into account; besides invoking CostFunction::Evaluate, ceres-solver performs some additional work, but most of it is easily identifiable.
Bundle adjustment is a memory-bandwidth-limited problem, so execution time is not expected to be consumed entirely by computational routines.
PS: if you're interested in reducing the total run time on problems of similar size & structure, I would suggest trying the DENSE_SCHUR linear solver first, instead of trying to reduce jacobian evaluation times; in this particular problem, the block of the Schur complement corresponding to camera parameters has a size of only 351x351.
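For reference, in bundle_adjuster.cc-style code that is a one-line change on the solver options (sketch, not a complete program; assumes an already-built ceres::Problem named `problem`):

```cpp
ceres::Solver::Options options;
options.linear_solver_type = ceres::DENSE_SCHUR;  // dense Schur complement
ceres::Solver::Summary summary;
ceres::Solve(options, &problem, &summary);
```

With a 351x351 camera block, the dense Schur complement factorization is cheap, so the linear-solver step should dominate far less than with a generic sparse solver.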