Hi all,
I recently ran into an issue that running Ceres on the same problem, single-threaded, produced slightly different results per run. This was so noticeable that the number of iterations to convergence could vary by 10-20 between runs, and the final minimum was also different. After checking through all residuals, and Jacobians, and ensuring they are identical at Evaluate, I found the problem is actually as a result of non-deterministic parameter re-ordering.
As background, the problem I am working on has a Jacobian with a large block-diagonal structure on the left (the "E" block) and then a dense section on the right (the "F" block). To setup the Problem, I first add the parameter blocks manually and then add the residuals in the correct order so that they are lexicographically ordered and this block structure is established. I also set the parameter blocks pertaining to "E" to be group 0, and the parameter blocks to "F" to be in group 1. Given this setup I did not expect any re-ordering to take place, however currently in Ceres this is not guaranteed.
In `Solve`->`SolverImpl::TrustRegionSolve`->`SolverImpl::CreateReducedProgram`, `original_problem` initially has `parameter_blocks_` where each parameter block has its correct (original) index. However, after the initial call to `RemoveFixedBlocksFromProgram` all `index` values are set to 1. Now, while `parameter_blocks_` still has its initial ordering, the actual recorded indices are lost. Going forward, in `ReorderProgramForSchurTypeLinearSolver`->`ApplyUserOrdering`, `parameter_blocks_` is cleared and the parameter blocks are then re-added based on the groups. This ensures that the user-ordering of "E" and "F" blocks is respected, but it doesn't guarantee deterministic ordering within each group. This is because parameter blocks are added to `parameter_blocks_` based on their ordering within the std::set<double*> for each group. Since this ordering is based off the address of each parameter block, this can be quite arbitrary especially if the variables are heap-allocated. As a result, even if column re-ordering takes place afterwards, the final permutations of columns can be different because of this different initialisation. Personally, I think this is a bug because the permutation shouldn't depend on the address of the parameter blocks.
I will try to put a small example together to illustrate this, but I am currently pretty busy and this has consumed a pretty substantial amount of time already. In saying that, I don't mean to sound unappreciative as I am enjoying all of the other benefits of this framework and I massively appreciate it being available at all!
Please let me know if you need any more information/clarification.
Kind Regards,
Richard