# «MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING Brian Sanderson Clipp A dissertation submitted to the faculty of the University of North Carolina ...»

Chli and Davison show that a mixture of Gaussians model can be used along with the information gain criteria developed in (Davison, 2005) to achieve matching in ”perviously unmanageable cases of jerky, rapid motion,” (Chli and Davison, 2008). While Davison’s previous work was limited to synthetic imagery, Chli and Davison applied their mixture of Gaussians model to a real-time camera tracking 3D SLAM system. However, Chli and Davison’s primary contribution was to show that the correlation between image features could be used to reduce the search area for successive features after initial matches are found. This signiﬁcantly sped up feature matching over Davison’s standard monoSLAM (Davison et al., 2007) matching implementation based on Joint Compatibility Branch and

on the efﬁciency of Chli and Davison’s algorithm by sparsifying the dense probabilistic relationships between feature correspondences. With sparsiﬁcation, Handa et al. can match many more features in real time than were possible with a dense measurement covariance matrix.

2.3 Relative Pose Estimation Relative pose estimation is the process of calculating the rotation and translation of a camera or multi-camera system from two sets of images taken at two different times. Relative pose estimation methods can be broken up into two primary classes: those that use sparse feature correspondences as their input, and those that rely on dense optical ﬂow. The former are much more commonly used in practice. An example of the latter, dense optical ﬂow based method, was introduced in (Yang et al., 2007).

Relative pose methods that use sparse feature correspondences can further be broken into two classes, those that use a predictive motion model, and those that do not. A predictive motion model is generally a very simple assumption on the camera motion such as, ”the camera moves with constant velocity.” Filtering approaches to visual SLAM (Azarbayejani and Pentland, 1995; Davison et al., 2007; Eade and Drummond, 2006) use a predictive motion model which, in combination with the expected 3D position of the sparse features based on a static world model and their uncertainties, gives a search region in the current frame for previously mapped features. Once these features are found, the pose estimate can be reﬁned starting from the predicted pose.

The main advantage of a predictive motion approach is that pose estimation does not use random sampling and so the correct relative pose can be found in a ﬁxed amount of time, a necessary property in any hard-real-time system. A primary disadvantage of predictive motion models is that the model used must match the expected motion, or the uncertainty of the predictive model must be accounted for in the ﬁlter’s process noise. Otherwise, the true feature matches would not be within their predicted uncertainty region and the matching would fail, leading to failure of the relative pose estimate.

Relative pose estimates that do not use a predictive motion model generally rely on random sampling of correspondences between features in the current image and either another image or 3D features in the map itself. This random sampling and consensus (RANSAC) approach was ﬁrst introduced by Bolles and Fischler in (Bolles and Fischler, 1981). The basic RANSAC algorithm for relative pose is given in Algorithm 1.

Data: Set of putative correspondences C Result: Relative pose Pi and inlier correspondences Cin conﬁdence in having seen a good solution = 0;

best sample support = 0;

while conﬁdence in having seen a good solution threshold do select a minimal set of correspondences;

use minimal set of correspondences to calculate relative pose;

test whether or not correspondences support solution;

if current sample support best sample support then

calculate conﬁdence in having seen a correct solution;

end Algorithm 1: The RANSAC algorithm applied to relative pose estimation Using a minimal set of correspondences to calculate the putative relative pose is desirable because it minimizes the chance that an incorrect correspondence or outlier will contaminate the putative solution. The conﬁdence in the current best solution can be cal

introduced in (Bolles and Fischler, 1981). This conﬁdence is dependent on the probability of a correspondence being an inlier pin, the solution set size and the number of samples taken so far. pin can be taken from the inlier ratio of the best solution found yet in the data.

The interplay of the inlier ratio pin and the number of samples determines the number of random samples that must be drawn to generate an output from the RANSAC algorithm with an acceptable level of conﬁdence c(solution). This means that with low inlier ratios, pin, a large number of samples may be required to get a solution with a desired level of conﬁdence. Unfortunately, this means that unless some a priori bound can be given on the inlier ratio, the sampling time is not bounded and so RANSAC cannot be used in a hard-real-time system.

2008) but none has overcome this fundamental ﬂaw in RANSAC. Given a ﬁxed time budget these methods will return the least bad solution they can ﬁnd. However, the conﬁdence in that solution may be too low for the solution to be used in practice if a lower bound on the inlier ratio is not given.

The RANSAC algorithm returns the best minimal sample solution found as well as the set of inliers. To get a more accurate solution, a reﬁnement of the solution should be performed using all of the inlier correspondences. Even the best solution after RANSAC is still based only on a small set of correspondences which themselves contain some amount of noise. Assuming the noise on the inliers is normally distributed, reﬁnement with a larger set of inlier correspondences will arrive at a more accurate result. This reﬁnement can be done with a simple linear method or a more complicated non-linear minimization may be required.

2.4 Direct Solution Methods Direct solution methods are used to calculate the relative pose of a camera from correspondences in each RANSAC sample. Many different direct solution methods exist to calculate the relative pose of a camera from a sequence of images. Each method is aimed at a speciﬁc problem such as homography estimation (Brown et al., 2007; Horn, 1987), motion

Each direct solution method is typically formulated as a problem of algebraic geometry and then solved using a unique method of the author’s choosing. Recently, Kukelova et al. (2008) have developed a method to automatically generate minimal problem solvers.

Careful consideration must be given to numerical performance in a direct solver, as this can greatly inﬂuence the reliability of the solver’s result. Additionally some solution methods are based on ﬁnding the roots of a polynomial function. Each of these roots must be tested in the RANSAC framework as a separate possible solution and so the degree of the polynomial also has a great impact on performance.

This dissertation focuses on visual SLAM for rigid multi-camera systems and so some

on visual odometry (2004) used the perspective three-point (P3P) method (Haralick et al.,

1994) to generate relative pose samples. The P3P method uses correspondence between an image from a calibrated camera and a set of 3D point features to ﬁnd the pose of the

as he moved the camera, enabling him to estimate the next camera pose with respect to the existing map. His veriﬁcation in RANSAC was then done with correspondences from both cameras in his system’s rigid stereo head.

Nist´ r also developed a generalized three-point method that could work with a rigid e

between the images and the 3D point features could be drawn from any combination of the cameras to generate the solution with each feature projection having its own center of projection. However, improved results were not demonstrated with respect to the P3P method where all feature projections share the same center of projection.

A particularly challenging problem is to develop a method to calculate the six degree of freedom motion of a non-overlapping multi-camera system. Weng and Huang (1992) presented one of the earliest works in this area. In their work they described how this motion could be calculated using a non-minimal linear equation and described motions that could lead to degenerate solutions algebraically. Their equations allow for the non-overlapping multi-camera system’s extrinsic calibration to change between images. However, this calibration must be known at each frame.

Dornaika and Chung (2003) extended the work of Weng and Huang (1992) by showing that a non-overlapping multi-camera system could be calibrated up to an unknown scale factor without stereo correspondence. Their method uses a three-stage process. First, the ego-motion of each camera is calculated separately. These ego-motions are then combined to calculate the relative orientation between the multi-camera system’s cameras. Finally, the relative scaling between the motions of the various cameras is found, placing each of the cameras in a single arbitrarily scaled coordinate frame.

Frahm et al. (2004) also developed a pose estimation method for multi-camera systems with non-overlapping ﬁelds of view. Theirs is a linear but non-minimal approach. Interestingly, when applying their equations to model a single camera system the equations reduce to the linear single camera pose estimation method with respect to a known set of homogenous world features given in (Hartley and Zisserman, 2004). Frahm et al. also develop a method to automatically calibrate a multi-camera system up to an unknown scale factor from correspondences. In contrast to Dornaika and Chung (2003), the calibration method of Frahm et al. requires overlap in the cameras’ ﬁelds of view.

Kim and Chung (2006) studied the problem of estimating the six degree of freedom motion (including scale) of multi-camera systems from only temporal correspondences.

They give proofs for all of the degenerate motions the camera system can take that do not allow for the scale of the motion to be calculated. These motions will later be introduced in Chapter 3. They then develop an Extended Kalman Filter for 6DOF structure and motion estimation using the known geometry of the multi-camera system to ﬁnd the absolute scale of the scene and motion without stereo correspondence. Kim and Chung also point out that adding a simple rotation of the multi-camera system about a single axis as the rig is moved can prevent degenerate motions in almost all practical cases.

Kim et al. (2007) have also solved the problem of 6DOF motion estimation by formulating it as a triangulation problem. They ﬁrst solve for the rotation of the multi-camera system by averaging the rotations measured by each of the cameras independently. They then triangulate the translation directions of the system’s cameras to arrive at a scaled measure of the camera system’s translation. They use a second order cone programming approach for triangulation, which is optimal in the L-Inﬁnity norm. Since it requires 5DOF motion estimates for each of the cameras to calculate the scaled motion, theirs is a non-minimal solution method.

Tariq and Dellaert (2004) have also developed a method for estimating the pose of a multi-camera system. Their method uses a non-linear minimization of the reprojection error of known 3D features. Using synthetic data, they show that increasing the number of cameras in a multi-camera system can improve the rotation and translation estimation accuracy, and prevent catastrophic failures due to a lack of tracked features.

Ni and Dellaert (2006) developed a method for six degree of freedom stereo camera motion estimation based on decomposing the problem into estimating the rotation from points at inﬁnity followed by the translation. Their method assumes an initial starting point close to the true motion solution, and then performs a non-linear minimization to ﬁnd the rotation, and then translation in a two-stage approach. Assuming an initialization as they do is appropriate for applications processing video sequences but cannot be used on general photo collections.

Another approach to solving the 6DOF motion of non-overlapping rigid multi-camera system is the generalized camera framework (Grossberg and Nayar, 2001; Pless, 2003). A generalized camera is an image formation system, which is made up of a set of rays. These rays can intersect at any number of optical centers with each ray possibly having its own optical center. A generalized camera can be used to model any camera or multi-camera system. Nist´ r’s (2004) generalized three-point solution method is based on a generalized e camera model, hence its name.

Stew´ nius et al. (2005) presented a minimal solution method for estimating the six e degree of freedom motion of a generalized camera from image correspondences. They showed that there are up to sixty-four solutions to the relative pose of two generalized cameras given six ray correspondences. One of the limitations of their approach is that it is degenerate for generalized cameras where the rays’ centers of projection are all on a line.

This naturally excludes all two-camera systems, as two camera centers always form a line.

Chapter 3 will describe one of the contributions of this dissertation, a novel six degree of freedom relative pose estimation method for a non-overlapping rigid two camera system using a minimal set of six correspondences. Using a system of two or more non-overlapping cameras, scene coverage can be maximized with this method while still measuring the true scale of the motion. In Chapter 4, another contribution of this dissertation is described. This is a method for estimating the scaled, six degree of freedom motion of a stereo camera with only a small overlap in the cameras’ ﬁelds-of-view. This novel method overcomes some of the limitations of the method introduced in Chapter 3 while still giving the multi-camera system a large total ﬁeld of view.