Essential Matrix


james kain

Dec 10, 2012, 1:51:25 PM
to ceres-...@googlegroups.com
I am looking at use of the Essential Matrix as a way to reduce the order of the BA problem for aerial imagery.
The Essential Matrix (E) provides a scalar constraint on unit vectors to a common feature in a pair of frames. 
The E constraint can be expressed in terms of the relative position vector between the two frames and the Direction Cosine Matrix (DCM) relating the two frames:
E = [pX] DCM -- where [pX] is the cross-product matrix formed from the relative position.
There are many methods for estimating E from feature matches between image frames.
My thoughts are to use a local estimator for each overlapped frame pair to estimate its E matrix.
If the relative position is available for the frame pair, then the DCM relative attitude is easily obtained using only three well distributed features.
This enables the feature position vectors to be completely removed from the BA problem and reduces the number of estimated parameters to 3 per overlapped image pair.
Note that after the BA is solved in this manner the archived feature unit vectors can be used to recover the feature positions. 
 
Does anyone have any thoughts on this?
If this is practical, I would be interested in developing such a method for inclusion within ceres.

Sameer Agarwal

Dec 10, 2012, 2:33:00 PM
to ceres-...@googlegroups.com
James,

The objective function you are constructing has been proposed in a similar form by these guys

Reduced Epipolar Cost for Accelerated Incremental SfM

I have never heard the term DCM before, but I imagine you mean to say that

E = [t]_x R

where [R|t] is the relative pose between the two frames.
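
Written out, just to fix notation, with t = (t_1, t_2, t_3):

    [t]_x = [   0   -t_3    t_2 ]
            [  t_3    0    -t_1 ]
            [ -t_2   t_1     0  ]

so for a pair of matched, normalized image points (x, x') the constraint is the single scalar equation x'^T E x = 0 per correspondence.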

This method is straightforward to implement using Ceres. Nothing special is needed. Having said that, this is a purely algebraic objective function, which says nothing about the actual reprojection error.
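
As a sketch of what that looks like (the block sizes, names, and the angle-axis + translation parameterization of the relative pose are just choices for illustration, not a prescribed interface):

#include <ceres/ceres.h>
#include <ceres/rotation.h>

// Scalar algebraic epipolar residual x2^T E x1 for one matched ray pair,
// with E = [t]_x R and the relative pose parameterized as angle-axis (3)
// plus translation (3).
struct RelativePoseEpipolarResidual {
  RelativePoseEpipolarResidual(const double* x1, const double* x2) {
    for (int k = 0; k < 3; ++k) { x1_[k] = x1[k]; x2_[k] = x2[k]; }
  }

  template <typename T>
  bool operator()(const T* const angle_axis, const T* const t,
                  T* residual) const {
    T x1[3] = {T(x1_[0]), T(x1_[1]), T(x1_[2])};
    T x2[3] = {T(x2_[0]), T(x2_[1]), T(x2_[2])};
    T Rx1[3];
    ceres::AngleAxisRotatePoint(angle_axis, x1, Rx1);  // R * x1
    T Ex1[3];
    ceres::CrossProduct(t, Rx1, Ex1);                  // [t]_x R x1
    residual[0] = ceres::DotProduct(x2, Ex1);          // x2^T E x1
    return true;
  }

  double x1_[3], x2_[3];
};

// One AutoDiffCostFunction<RelativePoseEpipolarResidual, 1, 3, 3> per match.

In your setup, where the relative position of the pair is already known, the translation block could simply be held constant (SetParameterBlockConstant), leaving the 3 rotation parameters per pair you describe; otherwise t is only defined up to scale and needs a unit-norm constraint of some kind.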

As regards efficiency, there are two things here.

1. Reduction in the number of parameters being reconstructed. This is indeed interesting. 

2. Reduction in optimization time - I am very skeptical of this. Here is why. For most bundle adjustment problems of large size, it is the factorization of the Schur complement which dominates the time, and not the evaluation of the Jacobian and construction of the Schur complement. I do not mean to say they do not count for anything, but they are not a particularly large fraction of the time.

Interestingly, the matrix you will be factorizing will have the same sparsity structure as the Schur complement, so the time to factorize it should be comparable.

So, I would recommend first setting up the usual bundle adjustment problem with a high quality bundle adjuster (may I recommend Ceres ;) ), and then doing a head to head comparison in time.
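
For the record, once the reprojection residuals are in a ceres::Problem, switching on the Schur-based elimination of the 3D points is just a solver option -- roughly like this (the thread counts and iteration limit are placeholders):

#include <iostream>
#include <ceres/ceres.h>

// Minimal solve of a standard BA problem that has already been populated with
// reprojection residuals; SPARSE_SCHUR eliminates the points and factorizes
// the reduced camera matrix.
void SolveBundleAdjustment(ceres::Problem* problem) {
  ceres::Solver::Options options;
  options.linear_solver_type = ceres::SPARSE_SCHUR;
  options.num_threads = 8;                 // Jacobian evaluation
  options.num_linear_solver_threads = 8;   // Schur elimination / factorization
  options.max_num_iterations = 50;
  ceres::Solver::Summary summary;
  ceres::Solve(options, problem, &summary);
  std::cout << summary.BriefReport() << "\n";
}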

While you are looking at two view cost functions, a much better (but also much more complicated) two view cost function is described by John Oliensis in his work on exact two view structure from motion, where he is able to analytically eliminate the 3d points.


Equation 11 is what you are interested in.

Hope this helps,
Sameer





james kain

Dec 10, 2012, 7:30:16 PM
to ceres-...@googlegroups.com
Sameer
Thanks for the excellent papers. I reviewed both!
Do you have a reference for the recent Epipolar paper -- there was not one on the manuscript!
 
What is most interesting to me is that the Essential Matrix (E), solved in isolation using only image pairs, can actually be used as a sort of final solution for the camera locations.
The E provides an estimate of both the relative attitude and relative position in 3D between all successive camera sites.
Of course, with no ranges, there is a global scale ambiguity.
 
Consider an airborne trajectory that circles a target of interest with the camera staring towards the center.
This might be typical for a 3D scene recreation of a building site.
The images are taken in temporal order as a sequence and the pairs will be processed that way.
The result, using only pair-wise image processing, is a set of relative position vectors and relative attitudes (Direction Cosine Matrices, DCMs).
 
We could actually prepare the complete trajectory of the aircraft/camera using only this data -- up to an unknown initial condition
If we make an assumption that the aircraft/camera starts at a known position and attitude, then the future position for the circular trajectory comes from simply adding up the relative positions.
Similarly the attitude trajectory comes from multiplying together the DCMs.
 
Moreover, consider the case where we have an onboard GPS and IMU.
All modern airborne imaging systems will include this today -- and even handheld cameras tomorrow.
Using the IMU/GPS filter/smoother and the precise camera shutter timing, we can get the attitude and position (independent of the camera info) to 0.1m and 0.1 deg with reasonable covariance matrices.
 
The synergism of these independent datasets is explosive -- and there are NO features other than those archived in the initial E(i,j) computations.
However, this may be more a Bayesian recursive estimation problem than an optimization problem.
Does ceres have any ability to deal with this type of mixed sensor problem?

Sameer Agarwal

Dec 10, 2012, 8:52:38 PM
to ceres-...@googlegroups.com
Hi James,
Both the papers were included as links in my earlier post.

1. Reduced Epipolar Cost for Accelerated Incremental SfM


and 

Exact Two-Image Structure from Motion

What is most interesting to me is that the Essential Matrix (E), solved in isolation using only image pairs, can actually be used as a sort of final solution for the camera locations.

This is not correct. Besides the fact that you do not know anything about the global orientation + scale and the scale ambiguity in the relative position of the two views, quite often just two views at this height do not have enough parallax to estimate the relative pose reliably.
 
Consider an airborne trajectory that circles a target of interest with the camera staring towards the center.
This might be typical for a 3D scene recreation of a building site.
The images are taken in temporal order as a sequence and the pairs will be processed that way.
The result, using only pair-wise image processing, is a set of relative position vectors and relative attitudes (Direction Cosine Matrices, DCMs).

OK.
 
 We could actually prepare the complete trajectory of the aircraft/camera using only this data -- up to an unknown initial condition.

No, this is not true. The relative position between a pair of cameras is known only up to a scale.
 
If we make an assumption that the aircraft/camera starts at a known position and attitude, then the future position for the circular trajectory comes from simply adding up the relative positions.
Similarly the attitude trajectory comes from multiplying together the DCMs

No.

 
Moreover, consider the case where we have an onboard GPS and IMU.
All modern airborne imaging systems will include this today -- and even handheld cameras tomorrow.
Using the IMU/GPS filter/smoother and the precise camera shutter timing, we can get the attitude and position (independent of the camera info) to 0.1m and 0.1 deg with reasonable covariance matrices.

Yes this information would be useful to fix the global and relative scale ambiguity. 
 
The synergism of these independent datasets is explosive -- and there are NO features other than those archived in the initial E(i,j) computations.
However, this may be more a Bayesian recursive estimation problem than an optimization problem.
Does ceres have any ability to deal with this type of mixed sensor problem?

Recursive estimation is also an optimization problem. But this problem has nothing to do with Bayesian/frequentist, recursive or not.

The usual bundle adjustment problem is

[A]  \sum_{i,j} | x_{ij} - P_i(X_j) |^2

i.e., minimize the sum of reprojection errors of point j in image i. You can add some global constraints on the camera positions using various sensors and it becomes

[B]  \sum_{i,j} | x_{ij} - P_i(X_j) |^2 + \sum_i f_i^2(P_i)

where f_i contains whatever information you know from the sensors, GPS, etc.
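
For example, the simplest version of f_i, a GPS position prior weighted by its standard deviation, could look like this (the 3-parameter camera-center block and the single per-axis sigma are assumptions of the sketch):

#include <ceres/ceres.h>

// Penalize deviation of the estimated camera center from the GPS measurement,
// weighted by an assumed diagonal, per-axis standard deviation.
struct GpsPositionPrior {
  GpsPositionPrior(const double* gps_xyz, double sigma) : sigma_(sigma) {
    for (int k = 0; k < 3; ++k) gps_[k] = gps_xyz[k];
  }

  template <typename T>
  bool operator()(const T* const camera_center, T* residual) const {
    for (int k = 0; k < 3; ++k) {
      residual[k] = (camera_center[k] - T(gps_[k])) / T(sigma_);
    }
    return true;
  }

  double gps_[3];
  double sigma_;
};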

What you are proposing instead is doing

[C]  \sum_{i,j,k} | x_{ik}^T E_{ij} x_{jk} |^2 + \sum_i f_i^2(P_i)

where x_{ik} and x_{jk} are the images of the same point k in images i and j.

Stated like this, this is a batch optimization problem, which will simultaneously optimize all the cameras. It is straightforward to implement this, or the full bundle adjustment problem, in Ceres.
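
A rough sketch of what [C] looks like as a batch problem, with E_{ij} written implicitly in terms of the two absolute poses so that the same parameter blocks also carry the f_i terms (the 6-parameter pose layout, the container types, and the Huber loss are choices of the sketch, not a prescribed interface):

#include <array>
#include <vector>
#include <ceres/ceres.h>
#include <ceres/rotation.h>

// One correspondence: normalized rays to the same point k seen in cameras i, j.
struct Correspondence {
  int i, j;
  double x_i[3], x_j[3];
};

// Epipolar (coplanarity) residual written directly in the absolute poses.
// Pose layout (an assumption of this sketch): [angle-axis(3), camera center(3)],
// with the rotation taking camera coordinates to world coordinates.
struct PairwiseEpipolarResidual {
  PairwiseEpipolarResidual(const double* x_i, const double* x_j) {
    for (int k = 0; k < 3; ++k) { xi_[k] = x_i[k]; xj_[k] = x_j[k]; }
  }

  template <typename T>
  bool operator()(const T* const pose_i, const T* const pose_j,
                  T* residual) const {
    T ri[3] = {T(xi_[0]), T(xi_[1]), T(xi_[2])};
    T rj[3] = {T(xj_[0]), T(xj_[1]), T(xj_[2])};
    T wi[3], wj[3];
    ceres::AngleAxisRotatePoint(pose_i, ri, wi);  // ray i in the world frame
    ceres::AngleAxisRotatePoint(pose_j, rj, wj);  // ray j in the world frame
    T baseline[3] = {pose_j[3] - pose_i[3],
                     pose_j[4] - pose_i[4],
                     pose_j[5] - pose_i[5]};
    // Both rays and the baseline must be coplanar: wj . (baseline x wi) = 0.
    T c[3];
    ceres::CrossProduct(baseline, wi, c);
    residual[0] = ceres::DotProduct(wj, c);
    return true;
  }

  double xi_[3], xj_[3];
};

void SolveEpipolarBatch(const std::vector<Correspondence>& matches,
                        std::vector<std::array<double, 6> >* poses) {
  ceres::Problem problem;
  for (const Correspondence& m : matches) {
    problem.AddResidualBlock(
        new ceres::AutoDiffCostFunction<PairwiseEpipolarResidual, 1, 6, 6>(
            new PairwiseEpipolarResidual(m.x_i, m.x_j)),
        new ceres::HuberLoss(1.0),  // outliers are plentiful, as noted below
        (*poses)[m.i].data(), (*poses)[m.j].data());
  }
  // The f_i sensor priors would be added here as extra residual blocks on the
  // same pose parameters (e.g. the GPS prior sketched earlier).
  ceres::Solver::Options options;
  options.linear_solver_type = ceres::SPARSE_NORMAL_CHOLESKY;
  ceres::Solver::Summary summary;
  ceres::Solve(options, &problem, &summary);
}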

What you are proposing in your incremental/recursive formulation is to estimate E_{12}, E_{23}, ..., E_{n-1,n} using only pairwise information, and then stringing/integrating them together. This is a cheap, inaccurate and highly error-prone approximation to [C]. If you do not have GPS information it is also incorrect.

I understand your desire to do away with 3D points, but if you are going to do that, at least consider minimizing [C] as a batch optimization. It can easily be done using Ceres. Having said that, if time complexity is your only consideration --
and I am surprised that it is, since these computations are run in batch after capture -- I really do recommend that you try doing the full bundle adjustment to see how long it takes before going about using [C], especially since the bulk of the linear algebra required to do all of this is the same for [B] and for [C].

In all of this, we have not even talked about the role of outliers (which are plentiful), which need to be handled carefully in any SfM system.

Sameer


 
 


Pierre Moulon

Dec 11, 2012, 4:22:32 AM
to ceres-...@googlegroups.com
Hi,

The GEA paper also makes me think of this one:
Incremental Light Bundle Adjustment   http://borg.cc.gatech.edu/projects/ilba
Indelman V, Roberts R, Beall C, Dellaert F
British Machine Vision Conference (BMVC)
2012

Regards,
Pierre



james kain

Dec 11, 2012, 8:44:28 AM
to ceres-...@googlegroups.com
Thanks Pierre for the vector to the Ga Tech site -- I'll get access to that material today but haven't reviewed it at this time.
 
Sameer.
My motivation for this discussion is the desire to process multiple aerial datasets, each containing many 1000s of images very quickly, with little memory and using standard PC hardware, in the field, using unskilled staff -- with 99.9% reliability. Our current practice is to run a very thorough (and well proven) navigation KF smoother that processes data from a low cost MEMS IMU and GPS that provides exceptional per-image pose and its covariance. This GPS/IMU KF is 100% reliable and only  takes a minute or so. The massive BA problem that results is the bottleneck. Our standard pre-planned mission uses parallel image collection swaths with 60% downlap and 30% sidelap.
 
Returning to the circular trajectory as a simple (but practical) example:
Consider that you have preprocessed all the overlapped image pairs to extract their E Matrices.
Assume you have loop closure so that the nth image overlaps the n-1th image -- and the E(n,n-1) matrix is computed.
Can we form a cost function in the E matrices, which when optimized, causes a trajectory that is consistent with all the feature observations? 
If you draw a circle trajectory about a point with image points on a regular polygon, then it looks like this is possible, ending up with a single scale uncertainty.
I understand that the imagery alone will result in an overall scene scale uncertainty.
 
More to the point, do the pre-computed E matrices form an information set that summarizes all the information from the observed features we have used in our pre-processing?  The "reduced measurement matrix" from the GEA paper (near eq 7) suggests this. And they describe numerous examples of actual imagery processing with seemingly good results. They don't get a "better" result (based on reprojection error) -- just a faster result. 
 
In the example case, we might pre-process all the image pairs using "enough" local features for each pair so that we have a "good" E matrix.
This local computation is actually an estimation problem that can be solved recursively (locally) until the covariance of the E Matrix terms has "converged".
That is, we have extracted all the information from this image pair, along with the covariance information, that summarizes this local computation.
During this local estimation process, we can perform the "outlier" rejection process by oversampling the feature population from the overlapped region.
If this local estimation process is conditioned on the available pose from the IMU/GPS, then maybe we can do the above steps in a thorough statistical manner.
 
For our aerial problem, we use a forward projection of the image boundaries to a DEM (using the GPS/IMU pose estimates) and this enables the initial selection of the approximate projected overlap areas. We are careful to select (at our option) the location of features in the overlap region so that the matched features are well spaced about the overlap region. The features might be selected recursively with a stopping condition based on the E Matrix covariance information. This set of operations lends itself to excellent parallelism for a multiple-CPU machine -- and with low memory requirements.
 
Our current methods use a pretty standard BA solution to adjust the per-frame orientations based on the re-projections process; but the BA solution gets awkward when we have many thousands of images and many matched features per image.
 
I certainly agree that any attempts along the suggested lines will require validation against the more traditional BA methods.
 
All thoughts are appreciated!
 
regards
  Jim
 
 


james kain

Dec 11, 2012, 10:31:22 AM
to ceres-...@googlegroups.com
Pierre noted the excellent work at Ga Tech on Structureless BA.
Their most recent papers on "Light Bundle Adjustment" are not available on the Ga Tech site.
However, search for "Incremental Light Bundle Adjustment" and you can get the excellent recent paper.
But better -- check out this excellent summary lecture by Indelman on the incremental LBA approach.
 
Most notably, Indelman uses a BA cost based on the multi-view epipolar constraint that removes the "structure" -- the individual features.
He expressly notes the order of the optimization becomes proportional to the number of poses -- not the original features (structure).
In fact, they use the 3-view constraint that acts to constrain the uncertain scale of the overall problem between the various 2-view constraints.
Thus, in my circular trajectory example, the scale could be solved as a single unknown parameter (I think).
 
   --- Jim
 
 
 


Pierre Moulon

Dec 11, 2012, 2:25:51 PM
to james kain, ceres-...@googlegroups.com
Hi,

As long as you have GPS info, you could also be interested in this work: Full 6DOF Pose Estimation from Geo-Located Images, ACCV 2012.

Regards,
Pierre


james kain

Dec 11, 2012, 2:50:43 PM
to ceres-...@googlegroups.com
Pierre ---
I couldn't access the actual paper --- but the abstract is very pertinent...
Do you have a copy of the paper itself?
 
//////////////////////////  abstract //////////////
Estimating the external calibration -- the pose -- of a camera with respect to its environment is a fundamental task in Computer Vision (CV). For practical applications such as robot navigation or Augmented Reality, an approach delivering highly accurate results in real-time is required. In this paper, we propose a novel method for estimating the unknown 6DOF pose of a camera with known intrinsic parameters from epipolar geometry only. For a set of geo-located reference images, we assume the camera position - but not the orientation - to be known. We estimate epipolar geometry between the image of the query camera and the individual reference images using image features. Epipolar geometry inherently contains information about the relative positioning of the query camera with respect to each of the reference cameras, giving rise to a set of relative pose estimates. Combining the set of pose estimates and the positions of the reference cameras in a robust manner allows us to estimate a full 6DOF pose for the query camera. We evaluate our algorithm on different datasets of real imagery in indoor and outdoor environments. Since our pose estimation method does not rely on an explicit reconstruction of the scene, our approach exposes several significant advantages over existing algorithms from the area of pose estimation.
////////////////////////
 
This paper is doing exactly what I had hoped!! The position is "easy"; the attitude is hard ...
We can today get the geodetic positions to about 0.1m using Precise Point Positioning w/o a base station.
The referenced paper used RTK which is a pain for any airborne application.
Adding a $2000 IMU gives us attitude to 0.1 deg (actually about 0.05 deg).
 
I am convinced that pre-computing the relative pose for select image pairs will allow us to inject this observable to our current navigation filter as a new update.
The Ga Tech paper talks about culling the image pairs for appropriate use in the BA.
By doing the backward smoother, we should be able to get an optimal estimate of positions, poses conditioned on all the data.
Our current nav filter runs "in a few minutes" for several hours of collections and I would hope that, after the pre-computations, the new BA filter would take about the same time. 
 
My hope is I can get a good data set and try the BA multiple ways ..
 
Regards, and thanks for the papers.
 
 
 
 
 


Sameer Agarwal

Dec 12, 2012, 1:04:48 AM
to ceres-...@googlegroups.com
My motivation for this discussion is the desire to process multiple aerial datasets, each containing many 1000s of images very quickly, with little memory and using standard PC hardware, in the field, using unskilled staff -- with 99.9% reliability. Our current practice is to run a very thorough (and well proven) navigation KF smoother that processes data from a low cost MEMS IMU and GPS that provides exceptional per-image pose and its covariance. This GPS/IMU KF is 100% reliable and only  takes a minute or so. The massive BA problem that results is the bottleneck. Our standard pre-planned mission uses parallel image collection swaths with 60% downlap and 30% sidelap.

Can you quantify what the cost is here? memory? time? what kind of bundle adjustment software are you using? how many images, how many 3d points and how many features are we talking about here?

Given the sparsity structure of such captures, solving bundle adjustment problems with 1000s of images is no problem, since solving the reduced camera matrix ends up being fairly efficient.
 
Returning to the circular trajectory as a simple (but practical) example:
Consider that you have preprocessed all the overlapped image pairs to extract their E Matrices.
Assume you have loop closure so that the nth image overlaps the n-1th image -- and the E(n,n-1) matrix is computed.
Can we form a cost function in the E matrices, which when optimized, causes a trajectory that is consistent with all the feature observations? 
If you draw a circle trajectory about a point with image points on a regular polygon, then it looks like this is possible, ending up with a single scale uncertainty.
I understand that the imagery alone will result in an overall scene scale uncertainty.

Yes if you do a joint optimization of all the cameras together, then yes you can do something like this. The key being the loop closure. If you are unable to close the loop then all bets are off. That is a very delicate condition to depend on.
 
More to the point, do the pre-computed E matrices, form an information set that summarizes all the information from the observed features we have used in our pre-processing?  The "reduced measurement matrix" from the GEA paper (near eq 7) suggests this. And they describe numerous examples of actual imagery processing with seemingly good results. They dont get a "better" result (based on reprojection error) -- just a faster result. 

The reduced measurement matrix has the same sparsity structure as the Schur complement in normal bundle adjustment, and factorizing this matrix is the bulk of the cost in each step. If it is indeed the case that they can solve the Newton system for the reduced measurement matrix, then you can also factor the Schur complement.

There is some more overhead for eliminating the 3d points and substituting them back in each iteration, but it is not significant at all.

This is why I find it hard to buy claims of efficiency of these methods. 
  
Our current methods use a pretty standard BA solution to adjust the per-frame orientations based on the re-projections process; but the BA solution gets awkward when we have many thousands of images and many matched features per image.

What does standard here mean SBA/SSBA/g2o? do you use sparse linear algebra? are you able to use multithreading to compute the reduced camera matrix?
 
Sameer
 
 


james kain

Dec 12, 2012, 9:38:51 AM
to ceres-...@googlegroups.com
My motivation for this discussion is the desire to process multiple aerial datasets, each containing many 1000s of images very quickly, with little memory and using standard PC hardware, in the field, using unskilled staff -- with 99.9% reliability. Our current practice is to run a very thorough (and well proven) navigation KF smoother that processes data from a low cost MEMS IMU and GPS that provides exceptional per-image pose and its covariance. This GPS/IMU KF is 100% reliable and only  takes a minute or so. The massive BA problem that results is the bottleneck. Our standard pre-planned mission uses parallel image collection swaths with 60% downlap and 30% sidelap.

Can you quantify what the cost is here? memory? time? what kind of bundle adjustment software are you using? how many images, how many 3d points and how many features are we talking about here?
Given the sparsity structure of such captures, solving bundle adjustment problems with 1000s of images is no problem, since solving the reduced camera matrix (RCM) ends up being fairly efficient.
 
What we do today is pretty naive .. I'm really looking ahead at next camera designs. My thoughts are replacing multi-camera designs (that achieve large spatial coverage) with somewhat random/imprecise scanning single-boresight designs. Consider a single boresight that is nominally downlooking, but is caused to scan side to side. As an aerial system moves forward, this creates a cluttered patchwork of projected overlapping frames where each frame may be 24MP. Think about 1 frame/sec for ~2 hrs -- 14,000 frames.  The single scanning camera will have a physically attached IMU/GPS that provides "tactical grade" position/attitude at each shutter. Because it's physically attached to the camera, the scanning can be imprecise (so long as we have coverage) but the pose is precisely measured. So you have this massive patchwork of >10,000 frames that needs to be mosaicked quickly.  This system is available today.
The computer: I just bought the 16-core Parallella (www.adapteva.com) for $100.  So everything has to be massively parallel. I have some concern that classic BA, even with the RCM, doesn't fit well in that model. Big matrices need to have access to lots of stuff that, at a minimum, has the order of the number of cameras!  The final output product must produce a 3D image of some sort that is viewable on a 3D display. So we need to recover enough features to do a proper seamless drape of the 2D imagery.  I'm thinking along the lines of the paper Pierre provided: "6DOF pose from Geolocated imagery".
 
 
 
Returning to the circular trajectory as a simple (but practical) example:
Consider that you have preprocessed all the overlapped image pairs to extract their E Matrices.
Assume you have loop closure so that the nth image overlaps the n-1th image -- and the E(n,n-1) matrix is computed.
Can we form a cost function in the E matrices, which when optimized, causes a trajectory that is consistent with all the feature observations? 
If you draw a circle trajectory about a point with image points on a regular polygon, then it looks like this is possible, ending up with a single scale uncertainty.
I understand that the imagery alone will result in an overall scene scale uncertainty.

Yes if you do a joint optimization of all the cameras together, then yes you can do something like this. The key being the loop closure. If you are unable to close the loop then all bets are off. That is a very delicate condition to depend on.
 
For current airborne operations, flying a precision circle is trivial -- and we are talking about basic random pilots and aircraft that are leased. The nav display and cues from the IMU/GPS give good feedback for manual flight. If the features are projected using a reasonable terrain DEM (e.g., SRTM) and the initial IMU/GPS pose, then recapturing features from the earlier part of a circle is automatic. But of course, this was just a theoretical exercise to address the type of relative pose information available from the Essential matrix.  Our more typical coverage trajectory is the parallel swath method.  
 
More to the point, do the pre-computed E matrices, form an information set that summarizes all the information from the observed features we have used in our pre-processing?  The "reduced measurement matrix" from the GEA paper (near eq 7) suggests this. And they describe numerous examples of actual imagery processing with seemingly good results. They dont get a "better" result (based on reprojection error) -- just a faster result. 

The reduced measurement matrix has the same sparsity structure as the Schur complement in normal bundle adjustment, and factorizing this matrix is the bulk of the cost in each step. If it is indeed the case that they can solve the Newton system for the reduced measurement matrix, then you can also factor the Schur complement.
There is some more overhead for eliminating the 3d points and substituting them back in each iteration, but it is not significant at all.
This is why I find it hard to buy claims of efficiency of these methods. 
 
What you say above makes perfect sense. This is why the papers only mention improvements of 2-3x (Indelman's paper, Table 2). So removing the gigantic 3-element feature contribution to the estimated terms doesn't really help that much. And you still have to do all the matching and RCM computations. What we need are methods that blow the traditional methods away!  In a sense, the Ga Tech LBA method replaces the Schur breakout of the cameras with a more physical (at least to me) breakout in terms of the Essential Matrix.
  
Our current methods use a pretty standard BA solution to adjust the per-frame orientations based on the re-projections process; but the BA solution gets awkward when we have many thousands of images and many matched features per image.

What does standard here mean SBA/SSBA/g2o? do you use sparse linear algebra? are you able to use multithreading to compute the reduced camera matrix?
 
We have used commercial stuff but we have preferred our home-grown methods, mostly because we can capitalize on the exquisite pose from the IMU/GPS. By doing the matches in projected space, we don't need the scale/affine invariance of SIFT but do a more 100%-successful match approach using more traditional pyramided correlations over large feature templates. We have tried all the various feature selection/match/RANSAC methods, but even a single bad match destroys any attempts at automation. With our custom approach, we end up with only a handful of 100% reliable and well-placed matches per overlap -- quality over quantity! Then we can use collinearity to get the geodetic match locations back to the focal plane for the proper unit vectors. Note that the pre-computation of pair-wise feature matches is indeed massively parallel. Today this is done using the VS2010 concurrency capability for multi-threading -- now replaced with the VS2012 AMP to access the GPU.  We just spread all the image pairs around the various CPUs and get the matches out of the way with a speedup proportional to the number of CPUs. This is where I hope to use the Parallella when I get it in my hands.
 
Our post-match home-grown BA solutions have used recursive methods that have evolved over many years. Careful use of Krylov techniques allows replacing all HP (H = measurement matrix and P = covariance) computations with a single vector computation that automatically accounts for all the sparseness in H. We use a special formulation of the KF so that it is mechanized in fixed point (see Mohinder Grewal's work on the sigmaRho filter) -- this further allows removing all insignificant multiply-accumulate operations related to weakly correlated, non-contributing states. This unique technique bypasses the notion of "block partitioning" and secondary structure sorting used by SBA methods, replacing it with a more organic "trigger" for omitting non-contributing computations.  Using this stuff makes BA feature match processing manageable for 1000s of images --- but still exhibits the "curse of dimensionality".
 
Where we want to go is a whole new approach to the problem. All organized imaging platforms move along in time and snap images according to a pre-planned coverage approach. Tourist camera "happy snaps" seem to have driven a lot of the one-size-fits-all BA technologies! The Essential matrix actually provides real navigation information -- but its use within current BA is limited in the same way as the Schur decomposition -- resulting in a curse of dimensionality (as you point out above). From these discussions and the excellent papers I have now read, I am now convinced that we can "roll up" the feature matching into a massively parallel pre-process resulting in the E matrices. The E matrix form can be linearized using a small-angle (cross product) matrix. This enables a local (per pair) estimator that is initialized with an a priori covariance of the relative attitude and iteratively produces an estimate of the relative pose Euler angles. The relative position part of this yields the unit vector direction of travel between frames (like a dead-reckoning navigator).  So the E matrix now produces direct navigation information -- almost like integrating a gyro and accelerometer between image shutters. What this means is that we can treat the E matrix components just like Kalman filter updates to the traditional navigation filter. This classic filter already has the attitude and position being propagated as states, and the relative information from the image data can be simply inserted as an update to this filter. Because each image-derived update includes 6 correlated quantities (3 relative positions & attitudes and their covariance), the overall computation is really not increased much from the traditional navigation KF.  Our KFs already run with sigmaRho so the complete forward/backward filter runs in minutes.
 
Your thoughts on this rather unorthodox approach are appreciated.
 
   Jim
 


Sameer Agarwal

Dec 13, 2012, 4:08:39 AM
to ceres-...@googlegroups.com
We have used commercial stuff but we have preferred our home-grown methods. Mostly because we can capitalize on the exquiste pose from the IMU/GPS. By doing the matches in projected space, we dont need the scale/affine invariance of SIFT but do a more 100% successful match approach using more tradition pyramid-ed correlations over large feature templates. We have tried all the various feature selection/match/RANSAC methods, but even a single bad match destroys any attempts at automation.

A single bad match destroys the whole system?
 
 
Our post-match home-grown BA solutions have used recursive methods that have evolved over many years. Careful use of Krylov techniques allow replacing all HP (H=measurement matrix and P=covariance) computations with a single vector computation that automatically accounts for all the sparseness in the H. We use a special formulation of the KF so that it is mechanized in fixed point (see Mohinder Grewal's work on the sigmaRho filter) -- this further allows removing all insignificant multiply-accumulate operations related to weakly correlated non-contributing states. This unique technique bypasses the notion of "block partitioning" and secondary structure sorting used by SBA methods replaciing it with a more organic "trigger" for omitting non-contributing computations.  Using this stuff makes BA feature match processing manageable for 1000s of images --- but still exhibits the "curse of dimensionality".

I have absolutely no idea what you are talking about here.
 
 
Where we want to go is a whole new approach to the problem. All organized imaging platforms move along in time and snap images according a pre-planned coverage approach. Tourist camera "happy snaps" seems to have driven a lot of the one-size-fits-all BA technologies! The Essential matrix actually provides real navigation information -- but its use within current BA is limited in the same way as the Schur decomposition -- resulting in a curse of dimensionality (as you point out above). From these discussions and the excellent papers I have now read, I am now convinced that we can "roll-up" the feature matching into a massively parallel pre-process resulting in the E matrices. The E matrix form can be linearized using a small-angle (cross product) matrix. This enables a local (per pair) estimator that is initialized with an a priori covariance of the relative attitude and iteratively produces an estimate of the relative pose Euler angles. The relative position part of this yields the unit vector direction of travel between frames (like a dead reckoning navigator).  So the E matrix now produces direct navigation information -- almost like integrating a gyro and accelerometer between image shutters. What this means is that we can treat the E matrix components just like Kalman filter updates to the traditional navigation filter. This classic filter already has the attitude and position being propagated as states and the relative information from the image data can be simply inserted as an update to this filter. Because each image-derived update includes 6 correlated quantities (3 relative positions & atttudes and their covariance), the overall computation is really not increased much from the traditional navigation KF.  Our KFs already run with sigmaRho so the complete forward/backward filter runs in minutes.


So in summary, what you are proposing is that you run a Kalman filter which now not only uses the information from the sensors, but also uses the relative rotation and the direction of translation from the current and the last view? This seems fairly straightforward.

It won't give you any global correction, it risks drift, and you will not be able to close loops, but otherwise, sure, the relative pose between two images is just another input to the Kalman filter for integration. I do not think there is anything unorthodox or revolutionary here; visual SLAM folks have been doing stuff like this for a while, from what I can tell.

Sameer


james kain

Dec 13, 2012, 9:42:49 AM
to ceres-...@googlegroups.com
/////////  Bad matches /////////////////
 
"one bad match destroys automation ..."  was meant more as operational sarcasm. When we get a bad pose and this is used for the final image projection to a finished mosaic, the result is a bad seam that easily catches the eye -- even if its tiny. So edge blending is used at carefully chosen low-contrast seamlines placed between frames. This requires manually inspecting results to make sure that the seams have been properly made to "go away" to the eye. But the georegistration error is still there! Ideally, you would like for all poses to be perfect so we dont have to do this.
 
/////////////  New concepts for managing sparseness /////////////////////////
 
Sparseness relates to clustering and removing the zeros in matrix operations.
 
Krylov methods manage computations like HP, where H is a measurement matrix (e.g., Jacobian) and P is the nxn covariance matrix. H is very sparse and P tends to "fill in" over time as states get correlated with new data. This can occur in our BA when we use "control points" in imagery or want to do map-matching to prior data. I don't see too much on this "prior map-matching" in the BA literature. So for Krylov, you provide a procedure that does H*r where r is an arbitrary input vector. The procedure uses the sparseness in H to structurally do the necessary multiply operations without any unnecessary multiply-by-zero. Then the HP matrix operation just calls H*r n times. So the HP computations become proportional to n rather than n^2. 
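
As a concrete illustration of the H*r primitive (a generic compressed-row sparse product; the storage layout here is just one common choice, not our actual implementation):

#include <vector>

// Compressed-row sparse measurement matrix H: only the nonzeros are stored,
// so y = H * r costs one multiply-add per nonzero instead of rows*cols.
struct SparseMatrix {
  int rows = 0, cols = 0;
  std::vector<int> row_start;    // size rows + 1
  std::vector<int> col_index;    // size nnz
  std::vector<double> value;     // size nnz
};

std::vector<double> Multiply(const SparseMatrix& H, const std::vector<double>& r) {
  std::vector<double> y(H.rows, 0.0);
  for (int i = 0; i < H.rows; ++i) {
    for (int k = H.row_start[i]; k < H.row_start[i + 1]; ++k) {
      y[i] += H.value[k] * r[H.col_index[k]];
    }
  }
  return y;
}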
 
Next, consider the covariance matrix. This matrix carries the correlations that will provide insight into the quantitative connectivity between states. If we know how the covariance evolves for an estimation problem, then we could not only treat sparseness but also remove computations that do not contribute much. The binary concept of sparseness [it's either a number or not a number] might be replaced by a quantitative measure of the relative importance of terms to the outcome of a recursive estimation evolution.
 
Next, what if we could scale all the terms in the estimation problem so that all terms fall between +1 and -1? Use of the correlation matrix rather than the covariance matrix is an obvious step. Then we use a floating-point test (ugly) to skip all multiply-accumulate operations (c = c + a*b) where a OR b is less than a threshold. But is the test, done n*n times, any better than doing the multiply-accumulate?  BUT, what if we multiply all the scaled internal variables by 2^7 so that they fit in a signed 8-bit word?  Now, if we use ternary Boolean logic operators, we can force the CPU to skip the "unimportant" math operations without very much overhead. So, in a sense, the "sparseness" concept is replaced with a "relative importance" concept that is built into the DNA of the estimation problem.  That is, any variable represented as 1 LSB is never used in a multiply-accumulate operation because this won't contribute anything: it's being multiplied by a number less than 1. This is the concept (called the sigmaRho filter) that Dr. Grewal published in the paper below. His work is mostly related to running Kalman filters at MHz rates for communication problems, but it is generalized for all estimation problems.

"Kalman Filter Implementation With Improved Numerical Properties", IEEE Transactions, sept 2010 

 
Finally, consider the significant decomposition of the BA problem that has led to speed. We separate the feature estimation from the pose (cameras), using the Schur or Essential matrix. Now we have a problem on the order of the number of poses/cameras. Possibly we can further combine the poses into a platform trajectory where we carry some notion of "smoothness" between poses. The IMU measurements are ideal to capture this "transition density" between poses. The problem of estimating disconnected poses becomes a trajectory estimation problem. 
 
//// closure  /////
 
But we can also keep the very important benefits of feature closure along with the smoothness from the IMU. This "closure" comes, for a mapping trajectory, when we perform sequential parallel m-image flight lines in opposite directions and get overlap with the first and last frames (1 and 2m) on each flight line. We can also get another form of closure by "tying down" the complete mosaic corners or edges, either to a prior map or just by restricting their movements (tightening their a priori geolocation covariance). The "closure" is actually the observation of a relative pose between two frames that are widely separated in time or space. This is just another Essential matrix.
 
The closure is treated in a recursive filter by understanding where it occurs (from the planning stage), and keeping track of the necessary Essential matrices that will trap the closure. For example, in the mapping case, we will have an Essential matrix between the 1st and mth image pair at the ends of the two opposing flight lines.  This relative attitude between the current (i-th) and prior ((i-2m)-th) KF states can be treated with statistical exactness by the KF. By running the smoother (backward filter), we can recover all the poses conditioned on all the data. The overall goal of such a process is to achieve the BA pose estimation "in minutes" once we have completed the pairwise Essential matrix computations. 
 
As noted, there is really nothing very new here. Just piecing together estimation material that has already been applied.
 
////  future directions ///
 
I am suggesting that there may be some alternatives to the highly evolved concepts of SBA. The key game-changer may be the now-ubiquitous MEMS IMUs that enable the trajectory smoothness to correlate the poses rather than treating them as independent states in a massive NLSQ problem. The use of the Essential matrix provides an easy method to merge the imagery-derived relative pose with the highly accurate IMU/GPS-derived smoothness constraints on this relative pose. 
 
It would be exciting to push forward with experimentation along these lines in an open forum where researchers around the world could contribute. 
All thoughts are appreciated..
 
  JIm
 
 
 
 


james kain

Dec 29, 2012, 8:36:00 AM
to ceres-...@googlegroups.com
Consider a case where we have a set of images with each image overlapping the prior image.
Also, assume we have processed the common features using known methods to compute the Essential Matrix (E) to achieve structureless BA.
Using methods from Horn [1], we can recover the Direction Cosine Matrix, DCM(i+1,i), relating the 3D sequential attitude change between frame i+1 and frame i.
It appears from [1] that this DCM can be recovered independently from the relative position change between frames.
 
Also consider that we have occasional out-of-sequence frame overlap between frame j and frame k which provides the observation DCM(j,k).
We can form an independent estimate of DCM(j,k) by the products of the incremental DCMs between j and k. Call this DCM'(j,k).
       DCM'(j,k) = product{ DCM(i+1,i) }  for i=j to i=k
 
We can form a cost function from DCM and DCM' with unknowns treated as errors in the transitional DCMs between successive frames.
Thus there will be 3N attitude errors where N is the number of image events.
This seems to be a general formulation of the BA problem for the case where successive frames are overlapping.
This is typically the case for airborne mapping situations.
 
Does this reduce the BA problem for this case from 6N to 3N by separating out the position part of the pose from the attitude?
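
One way this could be set up in Ceres (reframing it slightly: parameterize each frame's absolute orientation as a unit quaternion, so the sequential DCM(i+1,i) terms and the out-of-sequence DCM(j,k) closures become the same kind of relative-rotation measurement and the chained product DCM'(j,k) never has to be formed explicitly; the names and block layouts are only a sketch):

#include <ceres/ceres.h>
#include <ceres/rotation.h>

// Relative-orientation residual for one measured DCM between frames a and b,
// recovered from the pairwise Essential Matrix per Horn [1] and stored here
// as a quaternion q_ab_measured.  Each frame's absolute orientation is an
// unknown unit quaternion (world from camera).
struct RelativeRotationResidual {
  explicit RelativeRotationResidual(const double q_ab_measured[4]) {
    for (int k = 0; k < 4; ++k) q_meas_[k] = q_ab_measured[k];
  }

  template <typename T>
  bool operator()(const T* const q_a, const T* const q_b, T* residual) const {
    // Predicted relative rotation: q_ab = q_b * conjugate(q_a).
    T q_a_conj[4] = {q_a[0], -q_a[1], -q_a[2], -q_a[3]};
    T q_ab[4];
    ceres::QuaternionProduct(q_b, q_a_conj, q_ab);
    // Discrepancy between measurement and prediction, as a small rotation.
    T q_meas_conj[4] = {T(q_meas_[0]), T(-q_meas_[1]),
                        T(-q_meas_[2]), T(-q_meas_[3])};
    T q_err[4];
    ceres::QuaternionProduct(q_meas_conj, q_ab, q_err);
    // Angle-axis of the error rotation: three residuals per measured pair.
    ceres::QuaternionToAngleAxis(q_err, residual);
    return true;
  }

  double q_meas_[4];
};

// Usage: one AutoDiffCostFunction<RelativeRotationResidual, 3, 4, 4> per
// measured pair (sequential or closing), with ceres::QuaternionParameterization
// attached to every orientation block.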
 
Comments on this idea would be appreciated.
 

[1]    Horn B. K. P. , Recovering Baseline and Orientation from ‘Essential’ Matrix; January 1990; people.csail.mit.edu/bkph/articles/Essential.pdf

 


Keir Mierle

Dec 29, 2012, 1:56:47 PM
to ceres-...@googlegroups.com
Hi James,

I still don't understand your resistance to using Ceres to do regular bundle adjustment. Your problem is classic BA, and Ceres already does the equivalent of eliminating the position variables internally; furthermore, a considerable amount of effort has gone into making that code both fast and memory-efficient. I think you would be well served by implementing the classic solution with a natural parameterization. There are people in the aerial imaging industry using Ceres to do exactly this today. If that is still not enough, you can always apply variable partition schemes and optimize isolated parts of the problem (and then iterate).

Keir



james kain

Dec 29, 2012, 2:54:27 PM
to ceres-...@googlegroups.com
Thanks Keir ---
 
We routinely use classic commercially-supplied BA solutions. However, we see an exploding frame count to 10s of thousands of 20MP images with side-to-side +/-45 deg oblique looks.  Most definitely we will always use the classic solutions to vet any new stuff. But as we operate globally today, we want to use inexpensive computational platforms and get the processing done via remote entry from 10,000 mi away.
 
GPU helps, multi-threading helps -- but you still get geometrical execution growth with frame count. We have the benefit for our collections that we can carefully plan the mission. We know exactly where all closure will occur. I'd be very happy to have a BA solution that grows linearly with camera events.
Ceres is a generalized LSQ solver --- I am hoping to find an alternative cost function that makes use of both apriori knowledge of the collection plan and really good initial pose for all frames.
 
   --- Jim

Sameer Agarwal

Dec 29, 2012, 3:31:56 PM
to ceres-...@googlegroups.com
James,
There are two ways in which one can approach making bundle adjustment faster.

1. Bigger/Faster hardware/numerical linear algebra/solver.
2. Smaller optimization problems.

You seem to be interested in speeding up bundle adjustment using approach #2. Before addressing that, I would like to understand one of your claims better
 
GPU helps, multi-threading helps -- but you still get geometrical execution growth with frame count.

I am not sure what you mean by geometrical growth here. From my experience, aerial/structured captures like this usually have an almost linear growth in problem complexity. So let's be more precise.

Are you saying that the number of image observations grows geometrically? 
Is the number of non-sparse entries in the reduced camera matrix growing geometrically?
Is the time taken to invert the reduced camera matrix growing geometrically?

Or are you talking about memory, in which case I would like to understand which part of the problem/solver shows geometric growth in memory.

Ceres is a generalized LSQ solver --- I am hoping to find an alternative cost function that makes use of both apriori knowledge of the collection plan and really good initial pose for all frames.

Okay, so let's go to #2.

Refining pose by constructing these pairwise constraints, without a 3D scene to enforce rigidity, is a fine heuristic. It is going to lead to geometric degeneracies, but let's assume you have enough sensor data for each camera not to be affected by that.

Then let's take a look at the structure of the problem. Ultimately you are going to put this problem into some sort of a solver, or write something custom of your own. If you are going down the recursive estimation route, then you are assuming there are no loops. If you have loops (I hope you do), then you will need to do a real Newton step with a sparse solver (iterative or direct) of some sort.

At this point we are more or less in the reduced camera matrix domain again, since the sparsity structure should more or less be the same.

So the question is, in your optimization is the bulk of your time being spent on linear algebra or not. 

If linear algebra is the big culprit in time and memory, then yes you should be simplifying the problem formulation to be more sparse.

If not, then you are spending time elsewhere, perhaps jacobian computation, or is it a memory thing?

In summary, I appreciate your attempts at speeding up bundle adjustment with various heuristics, but I think the proper approach depends on the bottleneck, and I still do not understand what that is in your case.

What is the processing you are willing to do, what is your hardware and what are you willing to let go in terms of quality... how accurate does the pose need to be.

Having recently seen the output of a pairwise bundle adjustment system, I can tell you this: please don't treat it as an actual advance in the state of the art. It is a heuristic and should be treated as such. 

Sameer