# MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING

Brian Sanderson Clipp. A dissertation submitted to the faculty of the University of North Carolina ...

Their method partitions the measurements in the reconstruction into those that depend on multiple sub-maps and those that depend only on a single sub-map. Cameras or features whose measurements depend only on a single sub-map are internal variables, while those that span different sub-maps are separator variables. First, the internal variables are optimized for each sub-map; this can be performed in parallel. The linearizations of the internal variables are then cached and used in optimizing the separator variables while holding the internal variables fixed. Finally, the separator variables are held fixed and the internal variables are given a final polish. This separation reduces the total time required to reconstruct a sequence even without parallel processors, because the Cholesky factorization used in each iteration of bundle adjustment has super-linear complexity in the problem size, so solving several smaller systems is cheaper than solving one large one.
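A minimal sketch of the three-phase internal/separator scheme described above, using invented scalar variables and quadratic costs so that each phase has a closed-form minimizer (this is an illustration of the alternation pattern, not the authors' implementation):

```python
def optimize_internal(anchor, separator):
    # minimize (x - anchor)^2 + (x - separator)^2  ->  x = (anchor + separator) / 2
    return 0.5 * (anchor + separator)

def optimize_separator(internals):
    # minimize sum_i (x_i - s)^2 over s  ->  s = mean of the internals
    return sum(internals) / len(internals)

anchors = [1.0, 3.0]   # per-sub-map "measurements" (invented values)
s = 0.0                # initial separator estimate

for _ in range(50):
    # Phase 1: optimize internal variables with the separator fixed
    #          (in a real system this runs in parallel, one job per sub-map)
    xs = [optimize_internal(a, s) for a in anchors]
    # Phase 2: optimize the separator with the internals fixed
    s = optimize_separator(xs)

# Phase 3: final polish of the internals with the separator held fixed
xs = [optimize_internal(a, s) for a in anchors]
```

In a real reconstruction each "optimize" step is a non-linear least-squares solve over many camera and feature parameters, but the alternation structure is the same.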

Rather than taking a hierarchical approach, Snavely et al. (2008) developed a method that selects the most important cameras in a bundle adjustment and only bundle adjusts the cameras in this skeletal set. After bundling the skeletal set, they fix this set and the associated 3D geometry and use standard robust pose estimation techniques to find the poses of the cameras not in the skeletal set. Their algorithm increased bundle adjustment speed by more than an order of magnitude with "little or no loss in accuracy" (Snavely et al., 2008).

The skeletal set is built by first creating a graph with an edge between every pair of cameras that share features in common. Each edge is weighted by the trace of the covariance matrix representing the uncertainty of the relative pose between the two cameras. This graph is then pruned: an edge is removed only if its removal does not increase the relative uncertainty between any two nodes in the graph by more than a factor t, where the relative pose uncertainty between two nodes is measured by summing the covariance trace values along the shortest path between them. Snavely et al. developed an efficient pruning algorithm based on a maximum leaf spanning tree (Guha and Khuller, 1998) to arrive at the skeletal set. For typical Internet photo collections, the application of this work, the skeletal set contains considerably fewer cameras than the original graph, and although the relationship is not exact, having fewer cameras in a bundle adjustment is very likely to increase processing speed.
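The pruning criterion can be sketched as follows. This simplified greedy version (not Snavely et al.'s maximum-leaf-spanning-tree algorithm) drops an edge when the remaining shortest-path covariance-trace distance between its endpoints stays within a factor t of the edge's own weight; the graph and weights are illustrative:

```python
import heapq

def shortest_dist(adj, src, dst):
    """Dijkstra over edge weights (covariance traces)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

def prune(edges, t):
    """Greedily drop edges whose removal stretches the distance
    between their endpoints by at most a factor t."""
    kept = list(edges)
    for e in sorted(edges, key=lambda e: -e[2]):   # try heaviest edges first
        rest = [f for f in kept if f != e]
        adj = {}
        for u, v, w in rest:
            adj.setdefault(u, []).append((v, w))
            adj.setdefault(v, []).append((u, w))
        u, v, w = e
        if shortest_dist(adj, u, v) <= t * w:      # detour is cheap enough
            kept = rest
    return kept
```

For example, in a triangle with edge weights a-b: 1, b-c: 1, a-c: 3 and t = 2, only the heavy a-c edge is prunable, since the a-b-c detour (cost 2) is within the allowed stretch.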

Since image sequences were not the target of the skeletal set approach, there is still some question as to whether the skeletal set approach would dramatically improve loop correction speed in visual SLAM. However, this appears to be a promising possible approach since it would eliminate the many redundant cameras in a video sequence, increasing processing speed.

Frahm et al. (2010) developed an approach to efficient bundle adjustment that might be applied to loop correction. Their method clusters images based on appearance and selects a single image from each cluster, the iconic image, as the cluster's representative. Only the poses of the iconic images are corrected in the bundle adjustment rather than the entire set of camera poses.

Eade and Drummond presented a state-of-the-art monocular SLAM system (Eade and Drummond, 2008) which uses many local coordinate frames to store the map and camera poses. Their system will be described in detail so that their unified loop correction and tracking recovery mechanism can be introduced. A many-coordinate-frame approach was first put forward by Bosse et al. in their Atlas framework (Bosse et al., 2004). Whereas Bosse et al. used the number of features in a given sub-map to decide when to create a new sub-map, Eade and Drummond note the non-linearity of the projection of a feature into a camera and use it as the basis for deciding when to create a new local coordinate frame or sub-map. In Eade and Drummond's system, features can exist in many sub-maps and are stored in the inverse depth parametrization (Montiel et al., 2006) with an associated information matrix. The inverse depth parametrization has the advantage that it is nearly linear close to the coordinate system origin, which in this case is the local coordinate frame's origin. The projection of a feature $\mathbf{x} = (x, y, z)^\top$ with standard parametrization into a camera with no rotation and translation $T = (T_1, T_2, T_3)^\top$ is:

$$\pi(\mathbf{x}) = \left( \frac{x + T_1}{z + T_3},\; \frac{y + T_2}{z + T_3} \right)$$

Eade and Drummond use an iterative minimization to incorporate new measurements into a node and estimate the new camera's coordinates. The near-linearity of the inverse depth parametrization makes the minimization converge very quickly; the authors report convergence in a single iteration more than 99% of the time. This near-linearity depends on the current camera being close to the origin of the current node, so they use the trace of the Hessian of a synthetic measurement with coordinates (0, 0, 1) in the current node to measure the non-linearity of the current camera's position in its node. If the non-linearity is too high, they move the camera to a different, more linear node or create a new node.
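The near-linearity claim can be checked numerically. In inverse-depth coordinates (u, v, q) = (x/z, y/z, 1/z), a feature projects into a camera with identity rotation and translation t as ((u + q·t1)/(1 + q·t3), (v + q·t2)/(1 + q·t3)). The sketch below (illustrative values, not the authors' code) compares the exact projection to a first-order extrapolation in q for a camera near and far from the node origin:

```python
def project_inverse_depth(u, v, q, t):
    """Project an inverse-depth feature into a camera at translation t."""
    t1, t2, t3 = t
    s = 1.0 + q * t3
    return ((u + q * t1) / s, (v + q * t2) / s)

def linearization_error(u, v, q, t, dq=0.2):
    """Compare the exact projection at q + dq with a linear extrapolation
    from the (finite-difference) derivative at q."""
    eps = 1e-6
    p0 = project_inverse_depth(u, v, q, t)
    p1 = project_inverse_depth(u, v, q + eps, t)
    grad = tuple((b - a) / eps for a, b in zip(p0, p1))
    pred = tuple(p + g * dq for p, g in zip(p0, grad))
    true = project_inverse_depth(u, v, q + dq, t)
    return max(abs(a - b) for a, b in zip(pred, true))

u, v, q = 0.1, -0.2, 0.5          # a feature near the node origin (invented)
t_small = (0.05, 0.0, 0.05)       # camera close to the node origin
t_large = (2.0, 0.0, 2.0)         # camera far from the node origin

# The linearization error grows with the camera's distance from the origin,
# which is why a distant camera is moved to (or spawns) another node:
assert linearization_error(u, v, q, t_small) < linearization_error(u, v, q, t_large)
```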

In Eade and Drummond’s system each local node is connected to its neighbors (nodes it shares features with) using a similarity transform. Common features between any two nodes introduce constraints between those nodes. Eade and Drummond update the global map by ﬁnding loops in the graph of local nodes connected by similarity transforms. They then use a non-linear minimization to update the similarity transforms between nodes.

Since the composition of similarity transforms around any cycle in the graph should be the identity, they use a minimal spanning tree to find the cycles: any edge not in the spanning tree connects two nodes of the tree and so introduces a cycle in the graph. A cost function penalizes the divergence of any transform in a cycle from the transform that the inter-node feature constraints support; this cost, summed over all cycles, is minimized to update the global graph of local coordinate frames. Their method of converting inter-sub-map measurements into virtual measurements constraining the sub-maps is very similar to the FrameSLAM work of Konolige and Agrawal (2008), which will be discussed below.
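The identity-around-a-cycle constraint can be sketched with 2-D similarity transforms, encoded here as complex pairs (a, t) representing z ↦ a·z + t, where a carries rotation and scale. This illustrates the constraint itself, not Eade and Drummond's 3-D implementation or their cost function:

```python
def compose(T2, T1):
    """Composition of 2-D similarities: apply T1, then T2."""
    a2, t2 = T2
    a1, t1 = T1
    return (a2 * a1, a2 * t1 + t2)

def cycle_residual(cycle):
    """Compose the transforms around a cycle; 0 means a consistent loop."""
    a, t = 1 + 0j, 0 + 0j
    for T in cycle:
        a, t = compose(T, (a, t))
    return abs(a - 1) + abs(t)

# A consistent 3-edge cycle: T3 is the exact inverse of T2 ∘ T1.
T1 = (1.1 * 1j, 2 + 1j)              # scale 1.1, rotate 90 deg, translate
T2 = (0.9, -1 + 0j)
a12, t12 = compose(T2, T1)
T3 = (1 / a12, -t12 / a12)           # inverse transform closes the loop
assert cycle_residual([T1, T2, T3]) < 1e-12
```

In the actual system, a loop whose residual is non-zero contributes a penalty term, and minimizing the summed penalties over all cycles distributes the loop-closure error across the inter-node transforms.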

The use of cycles in the graph of poses or local coordinate frames for global map correction is becoming a topic of interest in VSLAM. Zach et al. (2010) use cycles in the graph of camera poses in bundle adjustment to detect incorrect feature matches between images. These incorrect correspondences might be due to repetitive structure in the scene, as Zach et al. suggest, but may have other causes as well. By considering cycles in the graph in a Bayesian framework, they can determine which relative pose estimates between pairs of cameras are likely to be incorrect and prune the measurements connecting those cameras from the bundle adjustment before starting the minimization. Removing these gross outliers dramatically improves the quality of the bundle adjustment results when repetitive structures are present in the scene.

One final approach to loop completion that will be covered here is FrameSLAM by Konolige and Agrawal (2008). In their approach, feature correspondences are found only between pairs of images rather than across many images. For each pair of images with correspondences in common, a transformation between the cameras is calculated, with an uncertainty found by marginalizing out the 3D features measured in both images. This camera-to-camera transformation with uncertainty is then used as a synthetic measurement in a non-linear minimization over all camera poses. The advantage of this approach is that it factors the 3D features out of the non-linear minimization once and for all, saving computation over standard sparse bundle adjustment. Unfortunately, because Konolige and Agrawal use only two-frame matches to estimate the 3D features, the linearization point used to calculate the relative pose and uncertainty between a pair of cameras can be inaccurate compared to methods using longer feature tracks. This can produce transformations that are not only inaccurate but also inconsistent, in the sense that linearization error gives them greater certainty than is warranted. Because these synthetic measurements are never re-linearized, as they would be in an update step of bundle adjustment, the result can be inconsistency in the entire map and inaccuracy in the camera path.

With measurements of features converted to synthetic measurements between poses, their approach can then further marginalize out camera poses using the Schur complement. This creates synthetic measurements between camera poses that did not share any features in common and reduces the size of the state in the non-linear minimization. This state reduction dramatically speeds up the minimization, but at the cost of some accuracy. By eliminating cameras from the map, the authors obtain a VSLAM system whose maps scale with the size of the environment rather than with the number of key-frames used to measure it, a very desirable property for any SLAM system.
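The marginalization step can be sketched with the Schur complement on a small information matrix. For H = [[A, B], [Bᵀ, C]] over (kept, removed) variable blocks, the reduced information over the kept block is A − B·C⁻¹·Bᵀ; the matrix below is illustrative, not data from FrameSLAM:

```python
import numpy as np

# Joint information matrix over three scalar variables (x1, x2, x3).
H = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])

A = H[:2, :2]     # block of the variables we keep (x1, x2)
B = H[:2, 2:]     # coupling between kept variables and x3
C = H[2:, 2:]     # block of the variable to marginalize out (x3)

# Schur complement: reduced information over (x1, x2) after removing x3.
H_reduced = A - B @ np.linalg.inv(C) @ B.T
```

Note the fill-in: after elimination, x1 and x2 are directly coupled through the removed variable even where they were only weakly coupled before. This is exactly why factoring out cameras that share synthetic measurements with many others can cause the number of synthetic measurements to grow.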

Unfortunately, factoring out a camera in their framework generates measurements between all of the other cameras that share a synthetic measurement with the removed camera. This is not a problem when synthetic measurements are only allowed between cameras that are temporal neighbors in the video sequence. However, when loops are completed, or if longer tracks are allowed, reducing the state size by factoring out cameras can cause an explosion in the number of synthetic measurements, slowing the minimization.

Performing loop correction in real time remains an open research problem in computer vision and robotics. Each of the approaches mentioned above has its advantages; however, none has been shown to scale to environments large enough to support loop correction for mobile robots operating at city scale. Since loop correction only needs to keep up with the exploration speed of a robotic platform, real-time loop correction in city-scale environments is probably as fast as loop correction ever needs to be for practical purposes. At the moment, solutions that could scale to cities do exist, but only for topological maps. These will be introduced next.

2.5.5 Hybrid Metric-Topological Loop Completion

Topological maps have been used in SLAM for many years (Angeli et al., 2008; Kuipers, 1978; Werner et al., 2009). Since they do not support path planning based on Euclidean constraints, there is some question about their usefulness in robotics applications. Additionally, purely topological maps cannot be used for augmented reality applications. An example of a purely topological map is a graph whose nodes represent locations, each of which may contain many images, and whose edges represent a temporal connection between areas (Angeli et al., 2008). This temporal connection implies that two locations are close to each other, since they have been traversed in close succession.

However, recent advances have been made in hybrid Euclidean-topological mapping that resolve some of the problems with both topological and Euclidean maps. In hybrid Euclidean-topological maps, the geometry at any one point in the map is known, and each point in the map (possibly a camera pose or key-frame) is connected to other nearby points by geometric transformations. Sibley et al.'s Relative Bundle Adjustment (Sibley, 2009) introduced this sort of map configuration to VSLAM. In contrast to global bundle adjustment approaches, which parameterize features with respect to the global coordinate frame, relative bundle adjustment parameterizes feature points with respect to the first camera that measures them. Cameras are parameterized in relative coordinates with respect to each other: each key-frame in a video sequence is parameterized with respect to a previous key-frame with which it shares the largest number of common landmarks.

The advantage of hybrid mapping is that by dropping the constraint that the map must be globally Euclidean, loop correction becomes very simple: loops are corrected by simply adding a new transformation to the graph of camera poses or local coordinate frames. A hybrid map contains all of the measurement information that a Euclidean map does, and so it can be upgraded to a Euclidean map at any point using standard bundle adjustment techniques. However, this upgrading does not have to be done on a global scale. If a path needs to be planned in the robot's local environment, only the local environment needs to be upgraded to a Euclidean map; the remainder of the map can remain topological.
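The relative parametrization and on-demand Euclidean upgrade can be sketched with 2-D poses (θ, x, y); the parent links, relative poses, and helper names below are invented for illustration and are not Sibley et al.'s implementation:

```python
import math

def compose(p, q):
    """Apply relative pose q expressed in the frame of pose p."""
    th, x, y = p
    dth, dx, dy = q
    return (th + dth,
            x + math.cos(th) * dx - math.sin(th) * dy,
            y + math.sin(th) * dx + math.cos(th) * dy)

# Each key-frame stores only a transform relative to a parent key-frame.
parent = {1: 0, 2: 1, 3: 2}                 # key-frame id -> parent id
relative = {1: (math.pi / 2, 1.0, 0.0),     # pose of child in parent frame
            2: (0.0, 1.0, 0.0),
            3: (0.0, 1.0, 0.0)}

def global_pose(kf):
    """Upgrade one key-frame to Euclidean coordinates on demand by
    composing relative transforms along the chain back to the root."""
    chain = []
    while kf != 0:
        chain.append(relative[kf])
        kf = parent[kf]
    pose = (0.0, 0.0, 0.0)                  # root frame is the origin
    for rel in reversed(chain):
        pose = compose(pose, rel)
    return pose

th, x, y = global_pose(3)
```

Only the key-frames along the chain of interest are touched, which mirrors the point above: a Euclidean view of the robot's local neighborhood can be computed without globally correcting the map, and closing a loop just adds one more relative transform to the graph.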

Additionally, as is suggested in (Holmes et al., 2009), the robot’s local environment can be represented with relative coordinates while at the same time the rest of the map can be globally corrected using bundle adjustment in a seamless framework.

This discussion of VSLAM has shown that the VSLAM problem can be broken into many sub-problems. Later chapters of this dissertation address the sub-problem of 6DOF pose estimation for multi-camera systems, and exploit the subdivisions of the VSLAM problem introduced here to parallelize VSLAM and map office-building-scale scenes online and in real time.

3.1 Introduction

Multi-camera systems are an attractive alternative to single cameras or omnidirectional cameras for VSLAM, where absolute, scaled ego-motion must be calculated. These systems have also been used to capture ground-based and indoor data sets for 3D reconstruction (Akbarzadeh et al., 2006; Uyttendaele et al., 2004). They are relatively inexpensive and provide wide scene coverage while at the same time allowing scaled ego-motion measurement.