CostFunction::Evaluate() time vs Jacobian evaluation time


Vivek Bagaria

unread,
Mar 9, 2021, 4:29:40 AM3/9/21
to ceres-...@googlegroups.com, Navneet Dalal, Alexander Maiorella
Hi

I ran the following command `bundle_adjuster --input=problem-39-18060-pre.txt`.

Problem: After constructing the residual blocks, I ran `Evaluate()` on the cost_function objects and it took 407 nanoseconds on average.

Solver: Jacobian & residual evaluation takes 0.464 seconds in total for 6 runs with ~63k residual blocks, which works out to 1218 nanoseconds per evaluation. Any idea where the extra ~800 nanoseconds are spent? How can I reduce this?

PS: I have attached the full summary report and the modified bundle_adjuster.cc file.

Cheers :)
--Vivek
CostFunction evaluate vs Jacobian evaluate
bundle_adjuster.cc

Vivek Bagaria

unread,
Mar 9, 2021, 4:52:32 AM3/9/21
to ceres-...@googlegroups.com, Navneet Dalal, Alexander Maiorella
Also, can someone tell me how much time the above problem takes on their machine using the "best" BLAS?

Sameer Agarwal

unread,
Mar 9, 2021, 8:34:42 AM3/9/21
to ceres-...@googlegroups.com, Navneet Dalal, Alexander Maiorella
On Tue, Mar 9, 2021 at 1:29 AM Vivek Bagaria <vi...@matician.com> wrote:
Hi

I ran the following command `bundle_adjuster --input=problem-39-18060-pre.txt`.

Problem: After constructing the residual blocks, I ran `Evaluate()` on the cost_function objects and it took 407 nanoseconds on average.

Solver: Jacobian & residual evaluation takes 0.464 seconds in total for 6 runs with ~63k residual blocks, which works out to 1218 nanoseconds per evaluation. Any idea where the extra ~800 nanoseconds are spent? How can I reduce this?

Two things. If you have the same cost function and you are calling Evaluate on it over and over again, then for one thing you are benefiting from it sitting in the cache, so the time you are observing is a lower bound. (In your case, did you evaluate only the residual, or the Jacobian as well?)

That said, when Ceres evaluates cost functions it does two more things: first it applies the loss function to the residuals and Jacobians, then it writes the Jacobian of the cost function into the Jacobian matrix for the entire problem.

Sameer


PS: I have attached the full summary report and modified bundle_adjuster.cc file

Cheers :)
--Vivek

Vivek Bagaria

unread,
Mar 9, 2021, 12:17:47 PM3/9/21
to Ceres Solver
[Replied Inline]

On Tuesday, March 9, 2021 at 5:34:42 AM UTC-8 Sameer Agarwal wrote:
On Tue, Mar 9, 2021 at 1:29 AM Vivek Bagaria <vi...@matician.com> wrote:
Hi

I ran the following command `bundle_adjuster --input=problem-39-18060-pre.txt`.

Problem: After constructing the residual blocks, I ran `Evaluate()` on the cost_function objects and it took 407 nanoseconds on average.

Solver: Jacobian & residual evaluation takes 0.464 seconds in total for 6 runs with ~63k residual blocks, which works out to 1218 nanoseconds per evaluation. Any idea where the extra ~800 nanoseconds are spent? How can I reduce this?

Two things. If you have the same cost function and you are calling evaluate on it over and over again, then for one thing you are benefitting from it sitting in the cache. So the time you are observing is a lower bound. (In your case did you only evaluate the residual or did you evaluate the Jacobian also?)
The problem has one cost function for each "observation" between a camera and a 3D point, and I evaluate each such cost function only once. I evaluated both the residual and the Jacobian.

That said, when Ceres evaluates cost functions it does two more things: first it applies the loss function to the residuals and Jacobians, then it writes the Jacobian of the cost function into the Jacobian matrix for the entire problem.
Applying the loss function to the residuals should be cheap, I suppose. Just for clarification, does it apply the loss function to the Jacobians as well? Is the second step expensive?

Thanks!
Vivek

Dmitriy Korchemkin

unread,
Mar 9, 2021, 1:53:37 PM3/9/21
to Ceres Solver
> Applying the loss function to the residuals should be cheap, I suppose. Just for clarification, does it apply the loss function to the Jacobians as well? Is the second step expensive?
- Applying loss functions and computing the Jacobian in the tangent space happens in the ResidualBlock::Evaluate method.
  If the global size of a parameter block is N, the local size is M, and there are K residuals, then applying loss functions costs [note that the loss function is set to trivial by default in your code]:
  * O(K) for residuals
  * O(K*M) for Jacobians
- Multiplying the Jacobian by the Jacobian of the local parametrization will cost O(K*N*M) in this notation.
Note that loss-function-related operations need to be applied to all parts of the Jacobian of a residual block, but local-parametrization-related ones only to the parts that correspond to parameter blocks with a non-trivial local parametrization.



I would make an educated guess that in your case the additional computational cost is dominated by (ordered by decreasing impact):
a. parameters / residuals not being in cache
b. evaluation preparation (this needs to traverse part of the "logical structure" of the problem, which will probably not fit in the cache and may require fairly non-linear memory access)
c. gradient computation
d. local parametrization / jacobian write-back / etc.
Summary::jacobian_evaluation_time_in_seconds is filled using a ScopedExecutionTimer created in the ProgramEvaluator::Evaluate method; you can instrument this call further if you want exact timings for your particular problem.

Vivek Bagaria

unread,
Mar 9, 2021, 5:08:54 PM3/9/21
to Ceres Solver
Hi Dmitriy

Thanks for your response.

I am using the bundle_adjuster.cc that comes with the Ceres code. I have added only 5-10 lines of code to benchmark things, with no logic changes, so the numbers above should be roughly reproducible.
At a high level, I am running the plain vanilla bundle adjuster with no local parameterization, pre-evaluation callbacks, etc., and hence the wrapper around the evaluate function taking 2x the time of CostFunction::Evaluate is still puzzling :). Do you see similar numbers/trends on your end? My libraries might be messed up.
I have replied inline to your points in detail.


On Tuesday, March 9, 2021 at 10:53:37 AM UTC-8 Dmitriy Korchemkin wrote:
> Applying the loss function to the residuals should be cheap, I suppose. Just for clarification, does it apply the loss function to the Jacobians as well? Is the second step expensive?
- Applying loss functions and computing the Jacobian in the tangent space happens in the ResidualBlock::Evaluate method.
  If the global size of a parameter block is N, the local size is M, and there are K residuals, then applying loss functions costs [note that the loss function is set to trivial by default in your code]:
  * O(K) for residuals
  * O(K*M) for Jacobians
- Multiplying the Jacobian by the Jacobian of the local parametrization will cost O(K*N*M) in this notation.
Note that loss-function-related operations need to be applied to all parts of the Jacobian of a residual block, but local-parametrization-related ones only to the parts that correspond to parameter blocks with a non-trivial local parametrization.



I would make an educated guess that in your case the additional computational cost is dominated by (ordered by decreasing impact):
a. parameters / residuals not being in cache
In bundle_adjuster.cc the parameters are in contiguous arrays. Residuals are stored internally by Ceres in a scratch space, I believe (which is contiguous).
b. evaluation preparation (this needs to traverse part of the "logical structure" of the problem, which will probably not fit in the cache and may require fairly non-linear memory access)
The code in bundle_adjuster.cc has no PrepareEvaluation method.
c. gradient computation
This is a matrix-vector multiplication. It should be cheap.
d. local parametrization / jacobian write-back / etc
The above code doesn't use a local parametrization and uses a dense Jacobian (which should again be cheap to write to).

Sameer Agarwal

unread,
Mar 9, 2021, 5:10:49 PM3/9/21
to ceres-...@googlegroups.com
BLAS does not affect any of this.
Also, the cost function is evaluated twice by the algorithm: every successful step evaluates the residual once on its own, and the residual and Jacobian together once. That should explain some of the discrepancy.
Sameer


Vivek Bagaria

unread,
Mar 9, 2021, 9:24:21 PM3/9/21
to Ceres Solver
"Residual only evaluation" shows the time taken by the residual-only computation, right? The discrepancy above is for "Jacobian and Residual evaluation".

Dmitriy Korchemkin

unread,
Mar 10, 2021, 11:07:15 AM3/10/21
to Ceres Solver
>> a. parameters / residuals not being in cache
> In bundle_adjuster.cc the parameters are in contiguous arrays. Residuals are stored internally by Ceres in a scratch space, I believe (which is contiguous)
Residuals and parameters are stored contiguously, but the parameters are accessed in a non-linear order (and everything might not fit into the cache, taking into account the simultaneous accesses to the Jacobian).
If there is no local parametrization [and the requested output is a block-sparse matrix], the Jacobians of residual blocks are written directly into the block-sparse matrix (when CostFunction::Evaluate() is called); otherwise they go to per-thread buffers first, and the result of the multiplication by the local parametrization Jacobian is then written to the block-sparse Jacobian matrix.

>> b. evaluation preparation (needs to traverse some part of "logical structure" of the problem, that will probably not fit in the cache and might require pretty-non-linear memory access)
> The code in bundle_adjuster.cc has no PrepareEvaluation method.
Please take a look at the ceres::internal::ProgramEvaluator class ( ./internal/ceres/program_evaluator.h ); as I mentioned above, ceres::Solver::Summary::jacobian_evaluation_time_in_seconds includes all time spent in the ceres::internal::ProgramEvaluator::Evaluate method when Jacobians are requested.
Inside the loop over residual blocks it has the following logical sections:
* Preparation of all pointers to residuals-jacobians-etc
* Evaluation via ResidualBlock::Evaluate call
   - Evaluating cost-function value
   - Applying local parametrization jacobians
   - Applying loss functions
* Gradient computation

>> d. local parametrization / jacobian write-back / etc
> The above code doesn't use a local parametrization and uses a dense Jacobian (which should again be cheap to write to)
Memory access is not cheap if the memory is not in cache; and one still needs to check whether each parameter block referenced by the current residual block has a trivial parametrization.



> I am using the bundle_adjuster.cc that comes with the Ceres code. I have added only 5-10 lines of code to benchmark things, with no logic changes, so the numbers above should be roughly reproducible.
1. In the code you've supplied, the typical time consumed by a single execution will be near the resolution of the timer being used (on most modern machines; even in your results the reported time is 407 ns < 1 us).
   I would expect that it "averages" over a bunch of zeros and, occasionally, 1 us values; this may skew the results one way or the other. I would recommend checking whether std::chrono::high_resolution_clock has nanosecond resolution on the target platform and using it instead (or some platform-specific hardware counter with the required resolution).

2. On my PC, the average time required to call CostFunction::Evaluate is 15% slower from "inside the solver" than from the "creation point".
   If I make sure to pollute the CPU caches before calling CostFunction::Evaluate, it becomes 3x slower at the "creation point" than from inside the solver.
   Thus it seems reasonable to think that the 15% slower performance in the first case is due to not all of the memory being in cache when CostFunction::Evaluate is called from the solver.

3. The total work performed in Jacobian evaluation inside ceres-solver (i.e. not only the CostFunction::Evaluate() call itself) for this particular problem on my PC costs 2.017x a cost-function invocation, and is distributed as follows (I will omit discussion of the standard errors of these estimates):
    I. Common initialization: 0.063
    II. Loop over residuals in ProgramEvaluator::Evaluate (0.18 is left uncategorized):
       a. ResidualBlock::Evaluate(): 1.412 (0.089 left uncategorized)
          * Cost function evaluation: 1.000
          * Invalidation/validation: 0.088 / 0.133
          * Preparation: 0.052
          * Local parametrization: 0.050 <-- keep in mind that even with a trivial parametrization of the parameter blocks, you still have to loop over all parameter blocks of the current residual block and check whether their parametrization is trivial; this is non-linear memory access, so even though the operation itself is a no-op, it still takes time to check whether anything has to be done
       b. Gradient evaluation: 0.129
       c. Evaluation preparation: 0.197
       d. Output write-back: 0.038



My conclusion is that the cost of invoking CostFunction::Evaluate per se is roughly the same, taking cache effects into account; besides invoking CostFunction::Evaluate, ceres-solver performs some additional work, but most of it is easily identifiable.
Bundle adjustment is a memory-bandwidth-limited problem, so execution time is not expected to be completely consumed by computational routines.

PS: if you're interested in reducing the total run time on a problem of similar size and structure, I would suggest trying the DENSE_SCHUR linear solver first, instead of trying to reduce Jacobian evaluation times; in this particular problem the block of the Schur complement corresponding to the camera parameters has a size of only 351x351.
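Switching the solver is a one-line change (fragment only, requires Ceres to build):

```cpp
// With a 351x351 camera block in the Schur complement, a dense Schur
// solver is usually the fastest choice.
ceres::Solver::Options options;
options.linear_solver_type = ceres::DENSE_SCHUR;
```

With the stock bundle_adjuster example this should also be selectable via the --linear_solver=dense_schur flag.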