# «MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING Brian Sanderson Clipp A dissertation submitted to the faculty of the University of North Carolina ...»

Since their map only contains on the order of eighty features they have no need to use an inverted ﬁle data structure to ﬁnd likely matching images to the current image, but can simply do exhaustive matching to the features in the map. However, they do cull the possible feature matches based on prior common visibility (the features were all seen together before) and proximity. Features that are not close together in the map are not likely to be correctly matched to the features in the current image. Finally, Williams et al. use the perspective three-point method in a RANSAC procedure to geometrically verify their matches.

2.5.2 Loop Correction After a loop is detected, some process must be performed to correct the accumulated error in pose estimates and scene structure over the camera’s path. A given system’s approach to loop correction depends on the kind of map that the SLAM system builds. A system may create a Euclidean map, one in which all points in the map are known with respect to each other and the shortest distance between two points is a line, or a topological map which models the connections or transitions between various locations, but does not give an overall picture of the shape of the environment.

An atlas is an example of a Euclidean map we have probably all used at some point.

The map has a single common scale that can be used to ﬁnd the distance between any two points on the map. On the other hand, a subway map is a common example of a topological map. A subway layout shows the way to get between a set of stations along a prescribed set of paths. However, it does not show the exact geometry of the subway system.

Euclidean maps are more common in SLAM systems because they allow paths to be planned through the map which have not been traversed before and because they naturally represent the Euclidean structure of the world we inhabit. However, this universal knowledge of location comes at a high computational cost. Topological mapping systems, because they do not have a Euclidean structure, can be much less costly. A topological map at its simplest is a graph with nodes representing locations and edges representing known paths between locations. Loop correction in a topological map is as simple as adding another edge to the graph. This makes topological maps an attractive alternative in real-time systems. Recently, hybrid approaches have also been developed which are locally Euclidean but globally topological in nature (Sibley, 2009; Sibley et al., 2009).

2.5.3 Globally Euclidean Loop Correction Two opposite ends of the global loop correction spectrum are taken up by Extended Kalman Filter based approaches and bundle adjustment. In EKF approaches, the map and only the latest camera pose are corrected after a loop is detected. In contrast, bundle adjustment tries to create a maximum likelihood estimate of all camera poses and the map. Sub-map based approaches fall in-between these two ends of the spectrum. They generally break the map into a set of many sub-maps. Each of these maps is then corrected separately in their own coordinate frames. Lastly, the sub-maps are held internally ﬁxed and a correction process is performed which minimizes the error of measurements between sub-maps by changing the sub-maps’ relative poses.

Extended Kalman Filter approaches to VSLAM such as Davison’s (2007) only estimate the most recent camera pose. After detecting a loop, the current camera pose and map are updated to reﬂect the new measurements. If one were to plot the camera’s path based on the EKF estimates, one would see a discontinuity in the camera’s pose just after loop completion, since the previous poses are not updated when the loop is found. This will cause signiﬁcant problems for procedures that rely on an accurate camera path, including dense stereo matching and scene reconstruction.

Bundle adjustment (Triggs et al., 2000) represents the opposite end of loop correction spectrum. Using a non-linear minimization, bundle adjustment seeks to ﬁnd the globally optimal camera path and 3D scene structure given the images in a video sequence and feature correspondences between the frames. Many different parameterizations of bundle adjustment exist, but the basic form of the bundle adjustment error term is

where i is the camera index, j is the 3D feature index, Ri is the rotation of camera i, Ci is the center of camera i, Xj is the three-dimensional position of feature j, xij is the measured value of feature j in camera i, Π models the projection of the 3D point into the camera, and d() is a measure of the distance between the measured and expected projection of the feature and is generally given in image pixels. Typically, bundle adjustment procedures minimize this sum of squared errors using a Levenberg-Marquardt algorithm (Levenberg, 1944; Marquardt, 1963). However, others have shown promising results using preconditioned conjugate gradients (Byrod and Astrom, 2009; Dellaert et al., 2010).

Bundle adjustment is a highly sparse minimization problem. Each expected measurement is only affected by a single camera and 3D point. This gives the Jaccobian of the projection function Π its sparse structure. Lourakis and Argyros (2009) take advantage of this structure in their open source implementation sparse bundle adjustment, or. The bundle adjustment problem state can be partitioned into cameras and points. The Schur complement can be used to factor out the points, converting their non-zeros in the full Jaccobian matrix into constraints between cameras that both see the same feature. The features are generally factored out rather than the cameras because there are many more features than cameras in a typical bundle adjustment problem. However, in a situation where a camera moves in the same environment for a long period of time, it may be more efﬁcient to factor out the cameras if there are more camera poses than points. At each iteration of the minimization, the Schur complement is used to factor out the features, the camera portion of the state is updated based on the measurement residuals, and ﬁnally the updated 3D feature positions are calculated.

Bundle adjustment re-linearizes the measurement functions in each update step. In contrast, the EKF linearizes the measurements only once about the state estimate of the feature and camera at the time of the measurement. The inﬂuence of a given measurement is then folded into the EKF’s pose and map estimates as well as their uncertainties. The problem with this approach is that if the linearization point was far away from the true position the EKF will not be able to recover from this error. In contrast, starting from a state somewhat far from the global minimum of the bundle adjustment cost function (Equation 2.2), re-linearization can allow bundle adjustment to converge to the correct solution where an EKF will not. Of course, the error function has local minima in which the optimization may get stuck.

In addition to problems with local minima,due to its least squares formulation, bundle adjustment cannot deal directly with outlier measurements. However, a lack of robustness is a problem for any least-squares framework, including the EKF. Robust bundle adjustment methods try to deal with outliers by down-weighing their inﬂuence in the minimization. A simple method to do this is to assign a higher uncertainty to measurements that exhibit large residual errors (McGlone et al., 2004). Other approaches involve down-weighing the error term for measurements with large residuals according to some function (McGlone et al., 2004).

Bundle adjustment can be extended to other types of sensors besides cameras. Thrun et al. describe the GraphSLAM algorithm in their book Probabilistic Robotics (Thrun et al., 2005). The GraphSLAM algorithm generalizes bundle adjustment to include robotic control inputs, odometry information, and any other type of sensor information that can be used to map a pose of the robot or camera to another pose or a pose to a feature in the world.

2.5.4 Loop Completion by Subdivision While bundle adjustment is the gold standard for loop completion methods, its computational complexity scales with the cube of the lesser of the number of cameras or features in the map. This makes bundle adjustment impractical for large loop closures that have to be done in real time. Various methods exist to deal with this complexity problem. Most fall into the category of divide and conquer or hierarchical approaches.

Fitzgibbon and Zisserman (1998) developed one of the earliest hierarchical scene reconstruction methods for image sequences. Their technique breaks the sequence into sequential groups of three images. These image triplets are then reconstructed independently, ﬁnding image correspondences and a trifocal tensor between the images. This yields a large set of projective reconstructions. Each of these projective reconstructions is then bundleadjusted independently. They are then combined by estimating a homography of threespace between the two projective reconstructions and bundle-adjusting the homography.

This same homography estimation and reﬁnement process can then be applied hierarchically to join the entire sequence. A ﬁnal bundle adjustment on all views is then performed to create the resulting reconstruction of the total sequence.

Shum et al. (1999) take a different approach to hierarchical bundle adjustment. They split a video sequence into many small sections which they independently reconstruct using bundle adjustment. Then they select two virtual key frames for each section. One virtual key frame could be the ﬁrst frame in the sub-sequence and the other is selected at some other location. They calculate the 3D uncertainty of the features based on the measurements and cameras in a subsequence. This uncertainty is projected into the two virtual key-frames and stored. This gives two frames for each sub-sequence that contain all of the information constraining the other camera poses and the features up to linearization error.

The sub-sequences are then joined together into a single model with only the two virtual key-frames included in the ﬁnal reconstruction for each sub-sequence. In the ﬁnal reconstruction, the uncertainty of the 3D points projected into the virtual key-frames is modeled by non-isotropic measurement errors on the virtual measurements. The use of virtual keyframes dramatically reduces the number of images in the ﬁnal reconstruction, achieving faster processing speed in the ﬁnal bundle adjustment. At the same time, the virtual measurements with their associated non-isotropic uncertainties contain all of the information about the structure in the original images.

Nist´ r (2000) extended the work of Fitzgibbon and Zisserman (1998) and dealt with the e problem of view selection. While Fitzgibbon and Zisserman assume that all sequential sets

for selecting sets of views that were more likely to work well. In view selection, there is a tradeoff between the amount of parallax feature correspondences exhibit, which increases over time and the number of correspondences between images in a sequence, which tends

that results in the best reconstruction possible with general amateur videos, which may have varying camera separation over time. The view selection criterion measures both the number of feature correspondences and the ratio of correspondences that are inliers to the trifocal tensor but not to a homography. Effectively, this is a measure of the parallax in the scene. More parallax is desirable to arrive at an accurate 3D reconstruction. In addition

reconstruction accuracy. The guided line matching approach used can be found in (Schmid and Zisserman, 1997).

How to break up a collection of cameras and feature correspondences to speed up bundle adjustment was further studied by Steedly et al. (2003). Their approach was to break the graph of cameras and features into weakly connected sub-maps and then bundle-adjust these independently. The sub-maps are then ﬁxed internally and optimized with respect to each other to arrive at the ﬁnal reconstructed cameras and scene geometry. They used spectral graph partitioning on the Hessian matrix to ﬁnd the low-error modes of the reduced Hessian with the 3D features factored out using the Schur complement. Their partitioning took into account the fact that each camera has multiple corresponding rows and columns in the Hessian and forces the partitioning to keep each camera’s parameters together in a single partition. They compared the reprojection error after their spectral partitioning method against a simpler partitioning approach based only on the non-zero entries of the Hessian.

This simpler approach does not take into account the degree of connectedness between different sets of cameras but only whether or not two cameras measure any common features.

They show that by using a partitioning in the low-error modes of the camera system, they can achieve lower ﬁnal reprojection errors than the simpler visibility based approach.

Chli and Davision presented a method to split a map of camera poses and cameras hierarchically based on the mutual information between expected feature measurements in (Chli and Davison, 2009). Using the mutual information between measurements, features can be clustered into groups to split the map into sub-maps. These sub-maps can be further split in a hierarchical fashion and independently optimized, then held internally ﬁxed and optimized with respect to each other.

Ni et al. (2007a,2007b) developed an out-of-core bundle adjustment method for largescale 3D reconstruction. Their method starts by decomposing the graph of cameras and features in the bundle adjustment into a set of sub-maps. This partitioning is done with a graph cut that minimizes the number of edges (visibility connections between cameras and 3D features) that span the sub-maps. Each of these sub-maps is parameterized independently with one camera of each sub-map serving as the base node or origin of the sub-map’s local coordinate frame. This allows the sub-maps to be optimized separately.