Dear Sameer,
I have set up a small side project to make it easier to tune the Ceres solver. I chose 2D heat transfer because it shares some characteristics with my real problem (sparsity, problem size).
Compilation of Ceres under Windows
- Visual Studio 2015
- latest git sources
- Intel Math Kernel Library
- miniglog
Compilation / linking issues (solved)
I modified PATH to allow MKL detection by cmake (FindBlas.cmake and FindLapack.cmake). I had to alter two source files:
/internal/ceres/covariance_impl.cc: the original line
#pragma omp parallel for num_threads(num_threads) schedule(dynamic) collapse(2)
was changed to
#pragma omp parallel for num_threads(num_threads) schedule(dynamic) // collapse(2) <--- unknown keyword
/include/ceres/jet.h:
jet.h(489): error C4996: 'j0': The POSIX name for this item is deprecated. Instead, use the ISO C and C++ conformant name: _j0. See online help for details.
j0 was replaced by _j0
j1 was replaced by _j1
jn was replaced by _jn
I chose miniglog because I had linking issues with gflags / glog.
Problem setup
m_Options.minimizer_progress_to_stdout = false;
m_Options.logging_type = ceres::SILENT;
m_Options.minimizer_type = ceres::MinimizerType::LINE_SEARCH;
m_Options.linear_solver_type = ceres::LinearSolverType::SPARSE_NORMAL_CHOLESKY;
m_Options.use_inner_iterations = false;
m_Options.use_nonmonotonic_steps = false;
m_Options.line_search_direction_type = ceres::LineSearchDirectionType::LBFGS;
m_Options.line_search_type = ceres::LineSearchType::WOLFE;
m_Options.line_search_interpolation_type = ceres::LineSearchInterpolationType::BISECTION;
m_Options.use_approximate_eigenvalue_bfgs_scaling = true;
m_Options.parameter_tolerance = 1e-6;
m_Options.function_tolerance = 1e-6;
m_Options.gradient_tolerance = 1e-6;
m_Options.max_num_iterations = 2000;
m_Options.num_threads = 1;
m_Options.num_linear_solver_threads = 1;
Residual block definition (1 block per unknown)
for (int j = 0; j < m_H; ++j)
{
    for (int i = 0; i < m_L; ++i)
    {
        m_Problem.AddResidualBlock(new CostFunctor(...), nullptr, parameters[0], parameters[1], parameters[2], parameters[3], parameters[4]);
    }
}
There are m_H*m_L unknowns / residual blocks. A residual only depends on 5 parameters.
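The five parameter blocks of one residual form a 5-point stencil around cell (i, j). A minimal sketch of how such pointers could be gathered is shown below. The `Grid` type, the row-major layout, the one-cell ghost border and the neighbor ordering (inferred from the residual expression) are all assumptions made for illustration, not the code of the actual project. The ghost border is used so that the five pointers of a block are always distinct; in Ceres the border values would then presumably be held fixed, for example with Problem::SetParameterBlockConstant.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical temperature field: L x H interior unknowns surrounded by a
// one-cell ghost border, stored row-major as (H + 2) rows of (L + 2) values.
struct Grid {
    int L;                  // interior cells per row (m_L in the post)
    int H;                  // interior rows (m_H in the post)
    std::vector<double> T;  // interior values plus ghost border

    Grid(int l, int h)
        : L(l), H(h), T(static_cast<std::size_t>(l + 2) * (h + 2), 0.0) {}

    // Address of cell (i, j). i == -1, i == L, j == -1 and j == H address
    // the ghost border.
    double* at(int i, int j) {
        assert(i >= -1 && i <= L && j >= -1 && j <= H);
        return &T[static_cast<std::size_t>(j + 1) * (L + 2) + (i + 1)];
    }

    // Assumed ordering: [0] = (i, j-1), [1] = (i-1, j), [2] = (i, j),
    // [3] = (i, j+1), [4] = (i+1, j).
    std::array<double*, 5> stencil(int i, int j) {
        return {{at(i, j - 1), at(i - 1, j), at(i, j), at(i, j + 1), at(i + 1, j)}};
    }
};
```

With this layout, `grid.stencil(i, j)` would supply the five pointers passed to AddResidualBlock in the loop above.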
CostFunctor constructor
The constructor is used to set some constants and to size the residual and parameter blocks:
set_num_residuals(1);
mutable_parameter_block_sizes()->resize(5, 1);
CostFunctor evaluate
The real problem is nonlinear; this one was set up just to experiment with the Ceres configuration.
if (jacobians != nullptr)
{
    if (jacobians[0] != nullptr) { jacobians[0][0] = -sz; }
    if (jacobians[1] != nullptr) { jacobians[1][0] = -sx; }
    if (jacobians[2] != nullptr) { jacobians[2][0] = (1.0 + 2.0*sz + 2.0*sx); }
    if (jacobians[3] != nullptr) { jacobians[3][0] = -sz; }
    if (jacobians[4] != nullptr) { jacobians[4][0] = -sx; }
}
residuals[0] = -sz*parameters[0][0] - sx*parameters[1][0] + (1.0 + 2.0*sz + 2.0*sx)*parameters[2][0] - sz*parameters[3][0] - sx*parameters[4][0] - T - Q*f;
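Since the Jacobian is hand-written, it can be checked against finite differences outside of Ceres. The sketch below restates the residual as a free function and compares the analytic derivatives with central differences. The `StencilConstants` struct and the function names are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical bundle of the constants set in the CostFunctor constructor.
struct StencilConstants {
    double sx, sz, T, Q, f;
};

// Same expression as residuals[0] above, with p[k] standing for parameters[k][0].
double Residual(const StencilConstants& c, const std::array<double, 5>& p) {
    return -c.sz * p[0] - c.sx * p[1] + (1.0 + 2.0 * c.sz + 2.0 * c.sx) * p[2]
           - c.sz * p[3] - c.sx * p[4] - c.T - c.Q * c.f;
}

// Same values as written into jacobians[k][0] above.
std::array<double, 5> AnalyticJacobian(const StencilConstants& c) {
    return {{-c.sz, -c.sx, 1.0 + 2.0 * c.sz + 2.0 * c.sx, -c.sz, -c.sx}};
}

// Largest absolute difference between the analytic Jacobian and a central
// finite-difference estimate with step h.
double MaxJacobianError(const StencilConstants& c, std::array<double, 5> p,
                        double h = 1e-6) {
    assert(h > 0.0);
    const std::array<double, 5> analytic = AnalyticJacobian(c);
    double worst = 0.0;
    for (std::size_t k = 0; k < p.size(); ++k) {
        const double saved = p[k];
        p[k] = saved + h;
        const double plus = Residual(c, p);
        p[k] = saved - h;
        const double minus = Residual(c, p);
        p[k] = saved;  // restore before perturbing the next parameter
        const double numeric = (plus - minus) / (2.0 * h);
        worst = std::fmax(worst, std::fabs(numeric - analytic[k]));
    }
    return worst;
}
```

Because this test residual is linear in the parameters, the error should stay at rounding level; for the real nonlinear problem the same comparison would only agree up to the truncation error of the step.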
Questions / Remarks:
OpenMP related:
With only one thread, the CPU load is between 10 and 20%, and the simulation (for a 200x200 grid) runs at ~5 fps. I can increase the number of threads, but that only increases the CPU load without increasing the fps. If I set num_threads to 4, for example, the CPU load is 50% but the fps remains constant. I have read somewhere on this Google group that the general nonlinear solver is not multithreaded; that must be the reason.
CostFunction related:
Is it efficient to declare one cost function per residual? I have read that I could use one global cost function, but how can the cost function then know which residual it is working on?
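To make the question concrete, here is a sketch of what I imagine a single global evaluation would look like: one loop over all cells, with the residual index derived from the cell coordinates. The function name, the ghost-border layout and the `rhs` vector (standing for T + Q*f of each cell) are assumptions made for this illustration and do not use any Ceres API. I also suspect that a single cost function depending on all parameters would hide the 5-point sparsity from the solver, but I may be wrong about that.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Evaluates all L*H residuals in one pass.
// field: (H + 2) x (L + 2) values, row-major, including a one-cell ghost border.
// rhs:   L*H values, rhs[k] = T + Q*f of the cell with residual index k.
std::vector<double> EvaluateAllResiduals(int L, int H, double sx, double sz,
                                         const std::vector<double>& field,
                                         const std::vector<double>& rhs) {
    assert(L > 0 && H > 0);
    assert(field.size() == static_cast<std::size_t>(L + 2) * (H + 2));
    assert(rhs.size() == static_cast<std::size_t>(L) * H);

    const std::size_t stride = static_cast<std::size_t>(L) + 2;
    std::vector<double> residuals(rhs.size());
    for (int j = 0; j < H; ++j) {
        for (int i = 0; i < L; ++i) {
            // Residual index of cell (i, j) and its position in the padded field.
            const std::size_t k = static_cast<std::size_t>(j) * L + i;
            const std::size_t c = static_cast<std::size_t>(j + 1) * stride + (i + 1);
            residuals[k] = -sz * field[c - stride] - sx * field[c - 1]
                           + (1.0 + 2.0 * sz + 2.0 * sx) * field[c]
                           - sz * field[c + stride] - sx * field[c + 1]
                           - rhs[k];
        }
    }
    return residuals;
}
```

Here the function "knows" the residual only through the loop counter k = j*L + i, which is why I am unsure how this maps onto the per-block interface.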
Performance:
I think that with only 10% CPU load there is still room for improvement, but honestly, I don't know how to speed up the computation. Do you have any tips?
Thank you again, working with Ceres is fun (head-scratching sometimes, but fun).
Best regards,
Jean-Pierre