# «MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING Brian Sanderson Clipp A dissertation submitted to the faculty of the University of North Carolina ...»

Additionally, in Figure 5.5 we show the vertical drift using only visual odometry of approximately 2 meters over a traveled distance of 70 meters. Figure 5.5 shows two side views of the map without (top) and with (bottom) loop detection and global map correction.

In the top pane, the accumulated vertical error of 2.0 meters is clearly visible while in the bottom pane it is eliminated. Note the regular pattern on the ground in the bottom pane that reﬂects the repetitive pattern in the carpet there. We have overlaid the results of our system with loop detection and correction over an architectural layout of the ofﬁce as illustrated in Figure 5.6. This ﬁgure demonstrates the accuracy of our system. The results shown here were processed from 8304 stereo frames of video at ﬁfteen frames per second including multiple loop detections and global bundle adjustments, one for each time a loop was detected.

We use the hallway sequence to demonstrate our system working in a less open environment where feature tracks are typically shorter and fewer in number. As shown in the panes of Figure 5.8 some of the hallways have large amounts of texture while others are largely textureless. The system robustly performs camera tracking in either case, at times tracking with fewer than ten features.

** Figure 5.10 shows an architectural drawing of the building’s hallways with dimensions.**

From the overlay of our model on top of the ﬂoor plan, it is clear that our system creates accurate maps. Taking the difference between the measured center-to-center distance of the longest mapped hallways on the ﬁgure and the distances between comparable camera centers in our reconstructed map we ﬁnd an error of 30cm in the horizontal direction and 40cm in the vertical direction in the ﬁgure. These errors equate to 0.6% error in the ﬁgure’s horizontal direction and 1.4% in the vertical.

The three modules’ timing results on this sequence are shown in Figure 5.7. Note the four spikes in the Global SLAM module’s processing time. These are times when loops were detected and the map was bundle adjusted. In the Scene Flow graph the spikes in processing time occur when new features are detected. In a typical frame 200-500 scene ﬂow features are tracked. In the global SLAM module there are typically 60-120 scene ﬂow features and 100-200 SIFT features, which are inliers to the motion model. Reprojection errors in the Global SLAM module’s results average approximately 0.6 throughout the sequence.

Another example of loop completion is illustrated in Figure 5.9 showing the map before loop detection and correction in the top pane and the bottom pane gives the result after loop detection. In these panes, the camera returned to the known area from the right. Note how the accumulated error in the camera poses and features is eliminated by the loop detection Figure 5.3: Sample frames from the left camera of the stereo pair for the ofﬁce sequence.

Note the reﬂective glass walls, textureless regions, and moving person in the images.

** Figure 5.4: Results of ofﬁce sequence top view.**

Top: Camera path calculated using visual odometry only. Bottom: Camera path calculated loop detection and correction global slam module. Green points are differentially tracked scene-ﬂow feature and blue points are SIFT features.

** Figure 5.5: Results of ofﬁce sequence side view.**

Top: Camera path calculated using visual odometry only. Bottom: Camera path calculated loop detection and correction global slam module. Note that the path processed through the global slam module is planar where visual odometry has large vertical error accumulation as shown by the red arrow. Green points are differentially tracked scene-ﬂow feature and blue points are SIFT features.

** Figure 5.6: Results of ofﬁce sequence.**

Camera path from global SLAM module using bundle adjustment for loop correction. The camera made two complete passes around both loops. Blue walls are glass. Grey walls are half ceiling height partitions. Green points are differentially tracked scene-ﬂow feature and blue points are SIFT features.

and global map correction. Finally, a side view of the hallway map is shown in Figure 5.11.

The side view demonstrates the planarity of the map, which matches the ﬂat ﬂoor of the halls.

5.5 The Challenge of Loop Correction Creating large-scale Euclidean maps in real-time remains an unsolved challenge in VSLAM.

Real-time is a somewhat ambiguous term in this case, so for the purposes here real-time will be considered ”fast enough to support online path planning.” In terms of the multithreaded approach to VSLAM presented in this chapter, real-time operation means that the system can globally correct the map in the time between subsequent loop completions.

This means that real-time operation depends both on the size of the environment and on the complexity of the robot’s path through the environment.

Before delving into a few promising approaches for real-time bundle adjustment, a brief introduction why bundle adjustment scales poorly as the problem size increases is in order.

1.5 Figure 5.7: Timing results on the hallway sequence. Note the processing time spikes in the scene ﬂow module when new features are extracted. The four large spikes in Global SLAM processing time are caused by bundle adjustments after loop completions. The average scene ﬂow processing rate is 37.1 frames per second (fps), visual odometry is 61.0 fps and global SLAM is 33.5 fps. The VO and Global SLAM processing rates are calculated as total processing time/video frames (not key-frames).

** Figure 5.8: Sample frames from the left camera of the stereo pair for the hallway sequence.**

Note the lack of texture in some images and the forward motion, which makes visual odometry more challenging than if the camera was pointing to the side.

** Figure 5.9: Results of hallway sequence.**

Top: Camera path intersection before loop detection and global map correction. Bottom: Corrected camera path. Note that the paths are now in the same plane. Green points are differentially tracked scene-ﬂow feature and blue points are SIFT features.

** Figure 5.10: Results of hallway sequence.**

The camera made three rounds of the left loop and one of the right. Green points are differentially tracked scene-ﬂow feature and blue points are SIFT features.

** Figure 5.11: The ﬁnal hallway sequence model viewed from the side.**

Note that the model is planar matching the planar structure of the building. Green points are differentially tracked scene-ﬂow feature and blue points are SIFT features.

Bundle adjustment uses non-linear minimization to optimize Equation 5.1. This equations assumes that the cameras’ calibrations are known. In that equation N cameras and M points are involved in the minimization. Ri and Ci are the rotation and center of camera i. Xj is the 3D location of feature j and xij is the measured feature j in image i. Π is the projection function and d() is a distance function, typically the reprojection error in pixels.

In this equation J is the Jaccobian of the projection function Π evaluated at the current state estimate and δ is the change in the state parameters that reduces the residual error

iteration of the minimization.

One approach to solve Equation 5.1 is to split the problem into two groups, cameras and 3D features, solve for the camera update and then back substitute to recover the feature updates. This can be done using the Schur complement as shown below.

In the above equations A is the portion of the augmented Hessian approximation containing the camera parameters, D contains the 3D features and B stores the constraints between features and cameras. δ, the update, has been split into camera, δc and a 3D point

The, now decomposed, augmented normal equation is then solved using Equation 5.11 and substituting δc into Equation 5.12 to ﬁnd the point update, δp. C −1 is easy to compute because it is block diagonal and so this efﬁciently reduces the least squares problem size.

However, the camera Hessian matrix (A − BC −1 B T ) can be fairly dense, making the solution still slow to compute. The more loops there are in the camera trajectory the more ﬁll-in there will be in this matrix. This is because any two cameras that see a 3D feature in common will have a non-zero off diagonal term in the matrix. Block reordering approaches exist to try to make the matrix as diagonal as possible. Ultimately then it is the remaining off-diagonal ﬁll-in that limits the speed of each bundle adjustment update iteration.

Conjugate Gradient approaches have shown some promise in large scale bundle adjustment (Byrod and Astrom, 2009; Agarwal et al., 2010).

Even with more efﬁcient solvers, for large bundle adjustment problems the number of iterations required for convergence grows super-linearly in the number of cameras and features. A collection of images in a bundle adjustment can be visualized as a graph with cameras as nodes and edges connecting cameras with features in common. In the ﬁrst few iterations local errors in the camera poses and 3D features are reduced fairly quickly. These local error corrections happen quickly because changes to cameras and 3D features need to only move a short distance through the graph to neighboring cameras. At this point what is left is a fairly smooth error function. Corrections to camera poses and features that reduce this error take more iterations to propagate through the graph because they have to pass through a longer path from the node to it’s neighbor’s neighbors etc. This slow propagation of camera pose and 3D point corrections is shown in Figure 5.12.

To this point the community has addressed this error propagation problem by reducing

** Figure 5.12: The propagation of error through a bundle adjustment problem.**

At iteration zero the distance d01 between the ﬁrst two cameras is too small (as given by some external measurement system). After iteration 0 the ﬁrst feature on the left has moved away from the cameras to match the now father apart cameras. This forces the distance between the second two cameras to extend as well as the distance to the second feature and so on. After iteration three the system has reached the minimum of the cost function.

is tied to the number of cameras and the number of features in the map and also limiting the number of measurements. The number of measurements must be reduced because they

One approach to reduce the number of cameras and measurements is to subdivide the map into multiple sections, solve them separately and then re-combine them in the ﬁnal result. This sort of approach is exempliﬁed by the out-of-core bundle adjustment approach of Ni et al. ((2007)) described previously in Section 2.5.4. Out-of-core bundle adjustment has shown some impressive results. However, because it splits the map into multiple section, solves the sections separately, and then re-combines them only optimizing the inter-submap variables, it may not reach the same global minimum that a full bundle adjustment would.

FrameSLAM (Konolige and Agrawal, 2008) presents another possible reduction technique based on permanently marginalizing out the features from the minimization and possibly some of the cameras. This technique was introduced along with other subdivisionbased loop correction techniques in Section 2.5.4 as well. Permanently marginalizing out the features also has its drawbacks. Once marginalized out the feature measurements cannot be re-linearized. This will likely lead to higher error than a full bundle adjustment that re-linearizes the measurements.

What is needed is an approach that makes the large-scale errors propagate through the bundle adjustment graph faster. A scale space approach applied to the bundle adjustment graph could speed the error propagation. A similar problem can be found in heat transfer problems. Say we have a sheet of material with its edges ﬁxed at a lower temperature and at its center heat is applied. Starting with these initial conditions, the temperature of the sheet at steady state can be found using a system of partial differential equations. The surface is discretized into a 2D grid with the temperature given at each point on the grid. An iterative approach such as Gauss-Seidel relaxation can then be used to ﬁnd the heat distribution.

Each iteration propagates the temperature correction a small distance on the grid. This meas that, in a similar way to bundle adjustment error reduction, small scale temperature errors are eliminated quickly while larger scale corrections take many more iterations to propagate through the graph.

The multigrid approach can be used to accelerate the temperature correction propagation and arrive at a solution in fewer, faster iterations. A more complete introduction to the multigrid method can be found in (Briggs et al., 2000) but it will brieﬂy be introduced here. The basic idea of multigrid is to ﬁrst eliminate the small scale error with one or more iterations. At this point larger scale error corrections will be all that is left. These larger scale errors will take longer to propagate through the graph. However, if we reduce the size of the graph by restriction, say dropping every other grid point we can reduce the problem size by a factor of two (see Figure 5.14). At this coarser scale the large scale error at the ﬁne scale are now smaller scale errors which can be efﬁciently eliminated. Once the errors

are corrected at the coarser scale the original grid points are re-inserted by an interpolation operator. This might be as simple as bi-linearly interpolating the temperature values of a coarse grid point’s neighbors. Then at the ﬁne level another few iterations are performed to eliminate the error caused by the interpolation. This ﬁne level iteration, restriction, coarse level iteration, interpolation, ﬁne level iteration cycle is referred to as a V-cycle in the multi-grid literature and is shown in Figure 5.13.