MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING

Brian Sanderson Clipp

A dissertation submitted to the faculty of the University of North Carolina ...
Another area where correlations are critical is loop completion. Consider the simple case where a camera is panned about the vertical axis so that it views the walls of a room. As it turns, it builds a map of the walls and an estimate of its pose. The farther it turns away from the origin (its starting pose), the more uncertain its pose estimate becomes, along with its feature estimates. When the camera turns all the way around and re-detects features that were mapped in the first image, it completes a loop. Recognizing that the features at the end of the loop are the same as those at the beginning should reduce the uncertainty of the features seen in the frames just before the loop was completed, as well as update their positions. Without modeling the correlations between features, this update is impossible and the loop cannot be properly completed. As Davison (2007) states, ignoring the correlations between features can lead to over-confidence in the accuracy of the feature estimates and to an inability to close loops or detect drift.
Azarbayejani and Pentland (1995) presented an Extended Kalman Filter (EKF) approach to recursively estimate the structure, camera motion, and camera focal length from an image sequence. Davison (2007) was the first to demonstrate a real-time VSLAM system using a single camera as its only measurement device. His work was also based on an Extended Kalman Filter and could map areas the size of a desktop or small room, detecting and completing loops. Davison's system modeled the correlations between mapped features. This made the filter's estimate of uncertainty consistent, but it limited the number of features that could be mapped in real time to fewer than one hundred.
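To make the role of the correlations concrete, the following sketch (with invented numbers, not taken from Davison's system) applies a standard Kalman update to a two-feature state with a full covariance matrix. Measuring only the first feature also corrects the second, through the off-diagonal covariance term.

```python
import numpy as np

# State: two scalar feature positions f1, f2 (a stand-in for the joint
# camera/feature state of an EKF VSLAM filter). Values are illustrative.
x = np.array([1.0, 1.0])           # estimated positions
P = np.array([[0.5, 0.4],          # covariance: the off-diagonal term
              [0.4, 0.5]])         # encodes the f1-f2 correlation

# Measure feature f1 directly: z = H x + noise, with H selecting f1.
H = np.array([[1.0, 0.0]])
R = np.array([[0.01]])             # measurement noise variance
z = np.array([1.3])                # observed position of f1

# Standard Kalman update.
S = H @ P @ H.T + R                # innovation covariance
K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
x = x + K @ (z - H @ x)
P = (np.eye(2) - K @ H) @ P

# Because P had a nonzero cross term, the unmeasured feature f2 moved too,
# and its variance shrank: the filter shares the loop-closure information.
print(x)
print(np.diag(P))
```

With a diagonal P (no modeled correlation), f2 and its variance would be untouched by the same measurement, which is exactly the over-confidence failure described above.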
A particle filter approach to VSLAM was presented by Eade and Drummond (2006). Their system could also map small office-scale environments, but the small number of particles that could be processed in real time limited their map size.
Recent work on VSLAM has focused primarily on overcoming speed limitations as the number of cameras or 3D features in the map increases. Clemente et al. (2007) proposed a sub-map approach in which the total map is made up of a set of smaller Euclidean maps connected by transformations. Since each Euclidean map is limited in size, it can be updated in real time. A major drawback of this approach is that global correction is done by fixing the sub-maps and varying the transformations between them. This forces all of the error accumulated in each of the sub-maps into the joints between the maps.
Eade and Drummond (2007) used a sub-map-based approach to accelerate the mapping process. In their work, each sub-map contained about fifty 3D features, with their associated uncertainties and correlations stored in a covariance matrix. They used the Laplacian of the projection function as a measure of its non-linearity in order to decide where to split the map into sub-maps. By limiting the sub-map sizes, they can fold the information from a new image into the map in real time. Additionally, they can optimize the sub-maps with respect to each other to arrive at a globally consistent map. This is critical because inconsistency can lead to problems in detecting and correcting loops.
Klein and Murray (2007) developed a VSLAM system they call "Parallel Tracking and Mapping," or PTAM. In PTAM, online, real-time camera pose estimation (tracking) and structure estimation (mapping) are separated into two threads. During tracking, the camera's pose is estimated by matching features extracted from the current image to features in the map. The map is represented as a set of point features and key-frame camera poses. Batch optimization processes such as bundle adjustment can be used in the second, parallel thread to optimize the mapped feature positions and key-frame poses. Bundle adjustment minimizes the difference between the measured and expected projections of the 3D point features by varying the camera poses and 3D feature locations in the map. This optimization need not run in real time, since camera tracking is done in real time in the first thread. Klein and Murray developed PTAM for use in augmented reality in small workspaces. Their use of parallel threads allows camera pose estimation to run at frame rate, supporting the addition of synthetic objects to the scene, while the map is refined using consistent batch optimization. Klein and Murray (2009) later developed a version of PTAM for the iPhone, which shows great potential as an augmented reality platform.
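A minimal, illustrative bundle adjustment sketch (not PTAM's implementation) can clarify the objective: a toy two-key-frame problem with an assumed focal length, identity rotations, and synthetic observations, jointly refining one camera translation and four 3D points with SciPy's generic least-squares solver.

```python
import numpy as np
from scipy.optimize import least_squares

f = 500.0  # assumed focal length in pixels (illustrative)

def project(t, X):
    """Pinhole projection of 3D points X (Nx3) into a camera at
    translation t; rotation is identity for brevity."""
    Xc = X - t
    return f * Xc[:, :2] / Xc[:, 2:3]

# Ground truth, used only to synthesize the measurements.
t_true = np.array([1.0, 0.0, 0.0])
X_true = np.array([[0.0, 0.0, 6.0], [1.0, -0.5, 5.0],
                   [-1.0, 0.5, 7.0], [0.5, 1.0, 8.0]])
obs0 = project(np.zeros(3), X_true)   # key-frame 0 (held fixed as gauge)
obs1 = project(t_true, X_true)        # key-frame 1

def residuals(p):
    # Reprojection error: measured minus expected projections,
    # as a function of the second pose and all 3D points.
    t, X = p[:3], p[3:].reshape(-1, 3)
    r0 = project(np.zeros(3), X) - obs0
    r1 = project(t, X) - obs1
    return np.concatenate([r0.ravel(), r1.ravel()])

# Perturbed initialization, then joint refinement of pose and structure.
p0 = np.concatenate([t_true + 0.1, (X_true + 0.2).ravel()])
sol = least_squares(residuals, p0)
print(sol.cost)   # driven close to zero
```

Real systems exploit the sparse block structure of the Jacobian instead of a dense generic solver, which is what makes bundle adjustment over many key-frames tractable.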
The VSLAM method presented in Chapter 5 differs from PTAM in that we can explore new areas while the map is globally corrected. In PTAM, new key-frames can only be added between global map corrections, limiting the rate of key-frame addition as the map grows. In our method, global map correction must complete between new loop completions rather than between new key-frames, significantly increasing the size of the areas that can be mapped online and in real time.
Scalability to larger map sizes was a focus of Konolige and Agrawal's (2008) "FrameSLAM" method. Like Klein and Murray, they selected key-frames from the image sequence and included only these in the map. Their innovation was to convert the constraints imposed by corresponding feature measurements in two images into a synthetic measurement, with an associated uncertainty, tying the two camera poses together with a sort of virtual spring. This permanently marginalizes the features out of the map, significantly speeding up the minimization used to correct for loops in the camera's path. The method achieves this speedup at the cost of a less accurate map structure, because the feature measurements cannot be re-linearized once the features are permanently marginalized out of the optimization.
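The marginalization step can be illustrated with a toy Gauss-Newton information (Hessian) matrix over scalar variables with invented values, not FrameSLAM's actual parameterization: eliminating a shared feature with a Schur complement leaves a direct pose-to-pose constraint.

```python
import numpy as np

# Toy information matrix over [pose_a, pose_b, feature], each 1-D.
# Both poses observe the feature; the poses are not directly coupled.
H = np.array([[4.0,  0.0, -2.0],
              [0.0,  4.0, -2.0],
              [-2.0, -2.0, 3.0]])

# Marginalize the feature (last variable) with a Schur complement.
Hpp = H[:2, :2]
Hpf = H[:2, 2:]
Hff = H[2:, 2:]
H_poses = Hpp - Hpf @ np.linalg.inv(Hff) @ Hpf.T

print(H_poses)
# The off-diagonal entry is now nonzero: the shared feature has become a
# direct pose-to-pose coupling (the "virtual spring"). The linearization
# point of the eliminated feature is frozen from here on, which is the
# source of the accuracy loss described above.
```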
2.1 Basic Structure of the VSLAM Problem

Visual SLAM can be broken up into three primary operations: correspondence finding, relative pose estimation, and global mapping. Correspondence finding involves determining which areas of the current image correspond to known features in the map or to areas in the previous image. Relative pose estimation uses these correspondences to find the camera motion from one frame to the next. Global mapping involves detecting long-range loops in the camera's path and correcting the map to reflect these loops and eliminate long-term mapping error. Figure 2.1 shows the basic architecture of a VSLAM system and the data that passes between the modules. Sections 2.2 through 2.5 will introduce these operations and discuss relevant prior work.

Figure 2.1: Basic VSLAM operations and the data that passes between them.
Any visual SLAM approach must complete each of the three operations above. Some, such as the one we have developed (Clipp et al., 2010), treat the operations as largely independent, with well-defined but limited data transfer between them. Others, such as the method of Davison et al. (2007), perform correspondence finding, relative pose estimation, and global mapping in a single, unified manner. Both the separate and unified approaches have benefits. Separating the operations introduces parallelism that can improve processing speed and allows exploration to continue while the map is globally updated after a loop completion. A unified approach also has benefits; for example, tying pose prediction to correspondence finding can aid in tracking features when camera motion is erratic.
2.2 Correspondence Finding

There are two major types of correspondence-finding approaches: unguided approaches, which find correspondences without modeling inter-correspondence relationships, and guided approaches, which are coupled to 3D relative camera pose estimation. Unguided approaches were the first to be developed by the computer vision community, so they are described first.
2.2.1 Unguided Correspondence

Unguided correspondence methods estimate optical flow without any camera motion model or other external sensors (e.g., inertial sensors). Unguided approaches can be roughly divided into two classes: differential-tracking approaches, which depend on a small-feature-motion assumption, and matching approaches, which can find correspondences in pairs of images with larger viewpoint changes.
In Kanade-Lucas-Tomasi (KLT) feature tracking (Lucas and Kanade, 1981; Tomasi and Kanade, 1991), a sparse set of corner features is extracted from the image. These features have a strong gradient in two orthogonal directions and so are well constrained in the image (Shi and Tomasi, 1994). Newton's method is then used to find the minimum of

ε(d) = ∫∫_W [ J(x + d) − I(x) ]² dx,

the error of a feature's displacement from one image to the next in a temporal sequence. This equation integrates the difference between patches of images I and J over a window W, parameterized by the patches' center location x and a displacement d. The KLT tracking formulation assumes sub-pixel feature motion from image to image; a scale-space pyramid is used for tracking motions greater than a single pixel. The KLT feature tracker treats each feature in an image independently and so can be easily parallelized and mapped to the graphics processor to achieve fast tracking. One implementation achieves greater than two hundred frames per second of tracking, including estimating the mean intensity change for all features, on monocular 720x560-pixel-resolution video (Zach, 2009).
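A minimal single-scale version of this tracker can be sketched as follows: one pure-translation Newton step on synthetic Gaussian-blob frames. This is illustrative only; a real KLT tracker adds per-feature windows, iteration with sub-pixel interpolation, and the pyramid.

```python
import numpy as np

def gaussian_image(cx, cy, sigma=3.0, size=32):
    """Synthetic frame: a smooth blob centered at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx)**2 + (y - cy)**2) / (2 * sigma**2))

# Frame I, and frame J in which the blob has moved by d = (0.3, -0.2) px.
I = gaussian_image(16.0, 16.0)
J = gaussian_image(16.3, 15.8)

# One Newton step for a pure translation d over the whole window:
# solve (sum of grad gradT) d = -(sum of grad * It).
Iy, Ix = np.gradient(I)          # spatial image gradients
It = J - I                       # temporal difference
A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
              [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
d = np.linalg.solve(A, b)
print(d)   # approximately (0.3, -0.2): recovered sub-pixel displacement
```

The matrix A is the structure tensor of the patch; it is well conditioned exactly when the feature has strong gradients in two orthogonal directions, which is why such corners are selected in the first place.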
Many approaches exist for finding feature correspondences over wide baselines. Harris features (Harris and Stephens, 1988) are extracted at strong corners in the image, as measured by the gradient magnitude in a local region. Harris corners use an approximation to the eigenvalues of the structure tensor to detect features. For this reason, they give a less accurate response than Shi and Tomasi's operator (1994) but are more computationally efficient.
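The eigenvalue approximation can be sketched using the common det − k·trace² form of the Harris response on a synthetic square; the parameter values here are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, sigma=1.5, k=0.04):
    """Harris corner response det(M) - k*trace(M)^2, where M is the
    Gaussian-smoothed structure tensor. det and trace approximate the
    product and sum of M's eigenvalues without computing them."""
    Iy, Ix = np.gradient(img.astype(float))
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    return (Sxx * Syy - Sxy**2) - k * (Sxx + Syy)**2

# A single bright square: strong response at its corners (two large
# eigenvalues), weak or negative response along its straight edges
# (one large eigenvalue).
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
corner = R[8, 8]     # a corner of the square
edge = R[8, 16]      # midpoint of the top edge
print(corner > edge)
```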
One of the limitations of Harris features is that they are point features, lacking an intrinsic scale or neighborhood size. Lowe (2004) showed how a scale-space approach to detecting blob-like features using convolution with a difference-of-Gaussian (DoG) operator could lead to scale invariance. The difference-of-Gaussian operator is an approximation to the Laplacian of Gaussian and can be written as

DoG(x) = G(x, Kσ) − G(x, σ),    where G(x, σ) = (1 / (2πσ²)) exp(−‖x‖² / (2σ²)),

x is the position with respect to the filter origin, σ is the filter's standard deviation, and K is a scaling factor tuned to make the DoG best approximate the Laplacian of Gaussian.
An example DoG filter is shown in Figure 2.2 with K = 1.6 and σ = 1.0. Lowe's SIFT feature (2004) combines scale-space blob detection with a patch orientation based on gradient magnitude and a descriptor based on a histogram of gradients to achieve correspondence between images across viewpoint changes of up to thirty degrees.
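The DoG filter can be sampled directly and compared numerically against the scale-normalized Laplacian of Gaussian it approximates; the grid size and normalization below are illustrative choices.

```python
import numpy as np

def gaussian2d(x, y, sigma):
    """2D Gaussian G(x, sigma) with unit integral."""
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# Sample the DoG filter with K = 1.6 and sigma = 1.0, as in Figure 2.2.
K, sigma = 1.6, 1.0
y, x = np.mgrid[-5:6, -5:6].astype(float)
dog = gaussian2d(x, y, K * sigma) - gaussian2d(x, y, sigma)

# Laplacian of Gaussian, scaled by (K - 1)*sigma^2 as in the standard
# approximation DoG ~ (K - 1) * sigma^2 * Laplacian(G).
log = (x**2 + y**2 - 2 * sigma**2) / sigma**4 * gaussian2d(x, y, sigma)
log *= (K - 1) * sigma**2

# The two center-surround filters are strongly correlated.
corr = np.corrcoef(dog.ravel(), log.ravel())[0, 1]
print(corr)
```

The practical appeal of the DoG form is that the Gaussian-blurred images are needed anyway to build the scale-space pyramid, so the blob detector comes almost for free as differences of adjacent pyramid levels.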
Another example of wide-baseline-matchable features is the maximally stable extremal region (MSER), first introduced by Matas et al. (2002). Rather than looking for a strong response to a corner or blob detector, MSER features are image regions that have a relatively stable size over a range of image thresholds. To calculate MSERs, the image is converted into a series of binary images at different thresholds, where pixels below the threshold are zero and pixels above it are one. Connected components are then found in these binary images, and components that have a relatively constant size over a wide range of thresholds are detected as features. MSER features are invariant to intensity changes and to projective distortion, so they are well suited to wide-baseline feature matching. Matas et al. showed that these regions can be extracted with an algorithm that is at worst O(n).
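A much-simplified, illustrative version of the threshold-sweep idea follows. Real MSER extraction does not rebuild binary images per threshold; it tracks every component incrementally with a union-find over intensity-sorted pixels, which is how the near-linear bound is achieved.

```python
import numpy as np
from scipy.ndimage import label

# Toy image: one bright square region on a dark background.
img = np.zeros((40, 40))
img[10:30, 10:30] = 0.8

# Sweep binary thresholds and record the size of the largest bright
# connected component at each threshold.
sizes = []
for t in np.linspace(0.1, 0.7, 13):
    components, n = label(img > t)
    if n:
        sizes.append(max(np.sum(components == i) for i in range(1, n + 1)))
sizes = np.array(sizes)

# The region's area barely changes across thresholds, so it would be
# reported as a "maximally stable" extremal region.
print(sizes.std() / sizes.mean())
```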
Each of the preceding feature extraction methods extracts features in each frame independently, without regard for the correlations between feature positions that arise from an underlying model of the scene geometry and camera motion. The following section introduces guided correspondence-finding methods that make use of this correlation to improve matching.
2.2.2 Guided Correspondence

Guided correspondence finding uses a camera motion model and inter-feature probabilistic relationships to match features. In its simplest incarnation, active matching can use a prediction of the camera's motion, usually based on a constant rotational and translational velocity model, together with the estimated 3D feature positions to predict the projections of the features in the next image. Davison (2003) used this approach to constrain the search area for correspondences; a refinement is then performed, starting at the predicted feature locations, to find the features. The advantages of this framework are that it can handle faster camera dynamics than a differential feature tracker, can cope to some extent with repetitive features, and is less computationally expensive than many feature extraction methods.
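A minimal sketch of this predict-and-gate idea follows, assuming a translation-only constant-velocity model, a known focal length, and a fixed gate radius; a real system also predicts rotation and sizes the gate from the projected state uncertainty.

```python
import numpy as np

f = 500.0   # assumed focal length in pixels, principal point at origin

def project(t, X):
    """Pinhole projection of one 3D point X from a camera at translation
    t (identity rotation, for brevity)."""
    Xc = X - t
    return f * Xc[:2] / Xc[2]

# Constant-velocity prediction: extrapolate the camera translation and
# project a mapped 3D feature into the next frame.
t_prev = np.array([0.0, 0.0, 0.0])
t_curr = np.array([0.1, 0.0, 0.0])
t_pred = t_curr + (t_curr - t_prev)   # predicted pose for the next frame

X = np.array([1.0, 0.5, 5.0])         # a mapped 3D feature (illustrative)
u_pred = project(t_pred, X)

# Search for the feature only inside a gate around the prediction.
gate_radius = 10.0                    # pixels (fixed here for simplicity)
u_meas = np.array([82.0, 52.0])       # a hypothetical detection
accepted = np.linalg.norm(u_meas - u_pred) < gate_radius
print(u_pred, accepted)
```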
This search-based approach does not take into account how informative a given feature is about the camera pose. Davison (2005) showed that the mutual information between the map state (3D feature positions and camera pose, with uncertainties) and the measurements can determine how much information is gained by finding a given feature. This expected information gain can then be combined with the cost of detecting the feature to determine its relative efficiency, in terms of information gain per unit computational cost. Features can then be found in the most informative order. As each feature is found, it also reduces the uncertainty of the other features, which in turn reduces the computational cost of finding them. Given a fixed time budget, ordering by relative information efficiency yields the most information possible about the map state.
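For scalar Gaussian states the mutual information has a closed form, and the ordering can be sketched as follows; the feature variances and costs are invented for illustration, and a real system would recompute them as each measurement arrives.

```python
import numpy as np

def expected_info_gain(var_pred, var_noise):
    """Mutual information (in nats) between a scalar Gaussian state and
    its measurement: 0.5 * log(innovation variance / noise variance)."""
    return 0.5 * np.log((var_pred + var_noise) / var_noise)

# Candidate features: predicted search-region variance (px^2) and an
# assumed per-feature matching cost (e.g. proportional to search area).
features = {
    "f1": {"var": 25.0,  "cost": 25.0},
    "f2": {"var": 4.0,   "cost": 4.0},
    "f3": {"var": 100.0, "cost": 100.0},
}
noise = 1.0   # measurement noise variance (px^2)

# Rank by information gained per unit cost, highest first.
ranked = sorted(
    features,
    key=lambda k: expected_info_gain(features[k]["var"], noise)
                  / features[k]["cost"],
    reverse=True)
print(ranked)
```

Note that the ordering is not simply by uncertainty: the very uncertain feature f3 ranks last because its large search region makes it expensive relative to the information it returns.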