MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING
Brian Sanderson Clipp
A dissertation submitted to the faculty of the University of North Carolina ...
Image motion can then be parameterized as the motion of the 3D feature in front of the camera. This gives accurate correspondences between the four views of a stereo pair at two different times. In addition, we track features that cannot be matched between the left and right images using a standard image-space KLT formulation. The image feature tracking runs at approximately seventy frames per second, including stereo feature matching and feature re-detection.
Estimating the scene structure and camera motion is done using a RANSAC framework to ﬁnd an initial pose estimate, followed by a local bundle adjustment to reﬁne the camera poses and 3D feature estimates. Structure from motion is performed only on key-frames, which are determined based on the average feature motion in the images. This considerably speeds up processing of a video sequence without signiﬁcant loss in accuracy. The RANSAC framework uses the minimal solver described in this chapter and makes the modiﬁcations to RANSAC mentioned in Section 4.4. Local bundle adjustment is performed on the previous seven key-frames and all of the features visible in at least two of those views.
Additionally, the two least recent camera poses are held ﬁxed to ensure the continuity of the SfM results.
Figure 4.13 shows a top-down view of the left camera path computed from a video sequence shot in an office environment.
Example images from the video sequence are shown in Figure 4.14. The office loop is approximately 18 by 10 meters, and the camera rounded the loop twice. The path was not exactly the same in both trips around the loop, which accounts for most of the variation between the paths. Note the upper left of Figure 4.13, where the camera path crosses over itself three times just before the clockwise turn. This location is a point of constriction in the environment, which forced the camera to take nearly the same position on each trip around the loop; it is shown in the top right image of Figure 4.14.
4.9 RANSAC with Multiple Solvers Typically, in solving for a stereo camera’s motion using RANSAC, one solution method is selected beforehand and then applied to the available correspondences. Selecting a solver beforehand, without knowledge of the available correspondences, may not produce optimal results. For example, suppose that in a given sequence four-view features have a very low inlier ratio while three-view features have a very high inlier ratio.
Then selecting case 1, 0, 3 as the solution method would be much less efﬁcient than the P3P method, case 0, 3, 0.
What is needed is a modified RANSAC approach that can find the best solver to use given the inlier ratios found in a video sequence. These ratios can be calculated with windowed averaging over the last few frames. Given these inlier ratios, finding which solver to try next in a RANSAC procedure is fairly simple: we choose the solver which, after it has run, will most increase the probability of the solution being correct, as shown in Equation 4.8. In that equation j, k, and l are the numbers of four-view, three-view and two-view features in a given solver, and i_n is the inlier ratio for n-view features.
|case j, k, l| then denotes the number of times the solver using j four-view, k three-view and l two-view features has been tried in the RANSAC iterations so far; equivalently, the function C(j, k, l) returns this count.
In this framework we can also take into account the relative cost of calculating each minimal solution. The five-point method (and so case 0, 1, 4) is perhaps a factor of twenty slower to compute than the P3P method, based on my experiments. If we did not take this higher cost into account, we would not find the correct solution in the fastest possible time. The RANSAC solution-method selection equation which takes computation cost into account is shown in Equation 4.9; the next solver selected in the RANSAC procedure is the one which maximizes this equation. In the equation, cost_{case j,k,l} is the time taken to compute a solution of case j, k, l.
Of course, Equations 4.8 and 4.9 require knowledge of the inlier ratios for the various types of features. If these inlier ratios are unknown, a Catch-22 appears: to select the best solver and obtain the solution as quickly as possible we need to know the inlier ratios, but until we find the correct solution we have no knowledge of the inlier ratios. For this reason, a RANSAC that selects between solution methods can only be applied in situations where the inlier ratios are at least roughly known a priori. After the first frame, visual odometry on a video sequence meets this requirement, since the inlier ratios from previous frames can be used as an approximation of the current frame’s inlier ratios.
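Given windowed inlier-ratio estimates from previous frames, the cost-weighted selection idea behind Equations 4.8 and 4.9 can be sketched as follows. The candidate cases, their relative costs, and the per-cost scoring rule are illustrative assumptions, not the dissertation's exact formulation:

```python
import math

# Hypothetical candidate solvers, keyed by (j four-view, k three-view,
# l two-view) feature counts, mapped to assumed relative computation costs.
SOLVERS = {
    (1, 0, 3): 20.0,  # stereo minimal solver (assumed ~20x P3P cost)
    (0, 3, 0): 1.0,   # P3P, cheap
    (0, 1, 4): 20.0,  # five-point based case
}

def sample_inlier_prob(case, ratios):
    """Probability that one minimal sample for this case is all inliers,
    given windowed per-type inlier ratio estimates (i4, i3, i2)."""
    j, k, l = case
    i4, i3, i2 = ratios
    return (i4 ** j) * (i3 ** k) * (i2 ** l)

def next_solver(ratios, costs=SOLVERS):
    """Pick the solver whose next run most increases the probability that
    a correct solution has been found, per unit of computation time."""
    best, best_gain = None, -1.0
    for case, cost in costs.items():
        eps = sample_inlier_prob(case, ratios)
        # One more run of this case multiplies the overall probability of
        # having missed the solution by (1 - eps), so the gain per unit
        # cost can be scored by -log(1 - eps) / cost.
        gain = -math.log(max(1.0 - eps, 1e-12)) / cost
        if gain > best_gain:
            best, best_gain = case, gain
    return best
```

With a low four-view but high three-view inlier ratio the rule favors P3P, and with the ratios reversed it favors the four-view solver, matching the behavior the text describes.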
In this chapter, we have introduced a novel minimal solution for the relative pose of a stereo camera using one feature observed in all four views, two two-view correspondences from the left (or right) camera and one two-view correspondence from the other camera. Our approach allows the scaled translation between poses to be estimated while at the same time enabling a wide total field of view to increase the relative motion estimation accuracy.
We have evaluated our solver on synthetic data with noise and outliers. Additionally, we demonstrated our solver’s application in a real-time visual odometry system.
This chapter completes the discussion of our work on motion estimation for rigid multicamera systems. The next chapter will introduce a parallel, real-time VSLAM system. This system takes advantage of the underlying parallelism in the VSLAM problem to enable the exploration and mapping of areas the size of a small ofﬁce building online and in real time.
Figure 4.12: Error in the length of the translation after RANSAC without outliers, for our solution method and P3P, with varying stereo camera overlap.
No reﬁnement is performed on the best RANSAC sample.
Figure 4.13: View from above of the reconstructed camera path showing the overlapping loops.
The camera made just over two laps around a roughly 18 × 10 m office environment.
No global bundle adjustment was performed. We have attempted to remove features on the ceiling and ﬂoor so the layout of the environment is visible. Left camera axes are drawn as well as a purple line for the baseline.
Figure 4.14: Sample frames from the left camera of the stereo pair for the ofﬁce sequence.
The images are ordered left to right, top to bottom according to their time in the sequence.
Real-Time Globally Euclidean VSLAM
5.1 Introduction In recent years the visual simultaneous localization and mapping (VSLAM) problem has become a focus of the robotics and vision communities. This effort has been made possible by advances in camera hardware and in the computational power available in a personal computer. In this chapter, we introduce a novel, real-time system for six-degree-of-freedom visual simultaneous localization and mapping. Our system fully decouples visual odometry from global map correction. This decoupling enables our system to create Euclidean maps of larger areas than have been mapped before in real time. Real-time operation at more than 15 frames per second is achieved by leveraging a combination of data-parallel algorithms on the GPU, parallel execution of compute-intensive operations, and producer/consumer thread relationships that effectively use modern multi-core CPU architectures.
Indoor environments, due to their lack of salient features, pose a particular problem for visual navigation. A combination of local tracking and global location recognition enables our system to operate robustly in these environments. The system is demonstrated on two challenging indoor sequences that include sections with very few salient features to track because of large textureless regions. To overcome the inherent drift of local feature tracking, the system detects loops using SIFT (Lowe, 2004) features once it re-enters an area it has mapped before. The loop-closing mechanism additionally enables re-initialization into the global model after local tracking failure. We demonstrate the improvement in the maps after loop detection and loop completion in comparison to using only visual odometry, which does not detect loops.
5.2 System Description Our parallel, real-time VSLAM system is composed of three primary modules: Scene Flow (SF), Visual Odometry (VO), and Global SLAM (GS) as shown in Figure 5.1. The Scene Flow module calculates the sparse optical ﬂow and selects key-frames based on the magnitude of the average ﬂow. It then passes the key-frames and the tracks to the Visual Odometry module, which calculates the inter-key-frame motion and passes this motion as well as the 3D features to the Global SLAM module. The Global SLAM module then performs loop detection, and global error correction of the map based on the detected loops. The ﬁnal result of our method is a globally consistent sparse 3D model of the environment made up of 3D feature points and camera poses for the key-frames.
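The three-module dataflow above can be sketched as a producer/consumer pipeline. This is a minimal illustration using Python threads and bounded queues; the worker bodies are placeholders, not the real GPU/CPU implementations:

```python
import queue
import threading

sf_to_vo = queue.Queue(maxsize=8)   # key-frames + feature tracks
vo_to_gs = queue.Queue(maxsize=8)   # key-frame poses + 3D features

def scene_flow(frames):
    """Stand-in for the Scene Flow module: flow + key-frame selection."""
    for frame in frames:
        sf_to_vo.put(("keyframe", frame))
    sf_to_vo.put(None)               # end-of-stream sentinel

def visual_odometry():
    """Stand-in for Visual Odometry: RANSAC pose + windowed refinement."""
    while (item := sf_to_vo.get()) is not None:
        vo_to_gs.put(("pose+points", item))
    vo_to_gs.put(None)

def global_slam(result):
    """Stand-in for Global SLAM: loop detection + global correction."""
    while (item := vo_to_gs.get()) is not None:
        result.append(item)

result = []
workers = [threading.Thread(target=scene_flow, args=(range(3),)),
           threading.Thread(target=visual_odometry),
           threading.Thread(target=global_slam, args=(result,))]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The bounded queues decouple the stages so each module runs at its own rate, which is the essence of the producer/consumer structure the system uses on multi-core CPUs.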
5.2.1 Scene Flow Module To determine the local motion of the camera we track features from frame to frame using the multi-camera scene flow proposed by Devernay et al. (2006), which is an extension of differential KLT tracking into three dimensions. To meet the real-time goal our system uses an efficient GPU-based implementation. In multi-camera scene flow, features are first extracted using the approach of Shi and Tomasi (1994) and then matched from the left to the right image, enforcing the epipolar constraint and cross-validating feature matches to eliminate outliers. After the features are matched, they are triangulated to establish 3D positions for the inlier features. At this point, features are tracked as small 3D planar surface patches in front of the cameras. The feature motion in 3D is determined through the temporal image flow of the 2D features in the stereo cameras. Using this parameterization, the epipolar constraint is enforced without resorting to stereo matching a feature in each frame.
Given the varying temporal redundancy in the video, which is mainly due to camera motion, we adaptively select key-frames through a threshold on the minimum average optical flow of features since the last key-frame. To minimize costly feature detection, detection is only performed in the selected key-frames, with the additional constraint that if too few features are tracked, a new key-frame is declared. Hence, images with small camera motion are not taken as key-frames. The 2D feature tracks and the triangulated 3D points are then passed to the Visual Odometry module.
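The key-frame test above amounts to a simple predicate. In this sketch the threshold values are illustrative, not taken from the system:

```python
def is_keyframe(avg_flow_since_last, tracked_count,
                flow_threshold=12.0, min_tracked=200):
    """Adaptive key-frame test: declare a key-frame when the average
    optical flow since the last key-frame is large enough, or when too
    few features remain tracked (illustrative thresholds, in pixels and
    feature counts)."""
    return (avg_flow_since_last >= flow_threshold
            or tracked_count < min_tracked)
```

Frames with small camera motion fail both conditions and are skipped, which is what keeps key-frame processing cheap on temporally redundant video.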
5.2.2 Visual Odometry Module The stereo camera system enables estimation of the 3D points in the Scene Flow module. These 3D points are then used to estimate the camera motion. Our method uses the three-point perspective pose method (Haralick et al., 1994) in a RANSAC (Fischler and Bolles, 1981) framework to determine the camera pose using tracks from the Scene Flow module. While this method is sufficient for tracking the differential camera motion, it accumulates small inaccuracies over time, which theoretically lead to unbounded drift.
To counter the drift our system detects camera path intersections using SIFT features (Lowe, 2004). SIFT features can be matched over large changes in viewpoint, in contrast to differentially tracked KLT features. To boost performance we use a CUDA-based GPU implementation (Wu, 2007). In addition to using the SIFT features for loop detection, we also use them to refine the incremental motion estimate. This refinement is performed using a windowed bundle adjustment (Engels et al., 2006), delivering refined camera poses and more accurate 3D points than those delivered by the scene flow described in Section 5.2.1. In our windowed bundle adjustment, a window spanning the last n (typically seven) key-frame poses is optimized. We selected seven key-frames because this is a good operating point in the trade-off between computation speed and mapping accuracy; this trade space is explored in detail in Engels et al. (2006).
The oldest two key-frame poses are held ﬁxed while the youngest n − 2 key-frame poses are varied along with all of the 3D features, both SIFT and KLT. The bundle adjustment uses a robust cost function so outliers have a limited inﬂuence on the result.
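The window structure described above can be sketched as follows. The `huber_weight` function is one common choice of robust weighting, included only as an example since the text does not name its robust cost function:

```python
def ba_window(keyframe_poses, n=7, n_fixed=2):
    """Partition the most recent n key-frame poses into fixed and free
    sets for windowed bundle adjustment: the oldest n_fixed poses are
    held constant to anchor the window, the rest are optimized along
    with the 3D features (sketch of the scheme in the text)."""
    window = keyframe_poses[-n:]
    return window[:n_fixed], window[n_fixed:]

def huber_weight(residual, delta=1.0):
    """Huber-style robust weighting: quadratic cost near zero, linear in
    the tails, so outlier residuals get down-weighted. This is an
    assumed example, not necessarily the system's cost function."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r
```

In a real optimizer these weights would scale each reprojection residual inside the normal equations, limiting the influence of outlier tracks exactly as the text describes.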
Combining the refined camera motion estimate based on KLT feature tracks with the 3D positions of the SIFT features, we can predict where the SIFT features should project into the current key-frame. We use this prediction to our advantage by limiting the candidate matches to nearby SIFT features in the current key-frame (see Figure 5.2). The benefits are twofold: we are less prone to problems caused by repetitive structures, and, given the smaller number of potentially matching features, we can reduce the number of SIFT-descriptor comparisons. Furthermore, we empirically found that this prediction allows us to relax Lowe’s SIFT matching uniqueness criterion (Lowe, 2004) while still being robust to repetitive structures in the scene.
Figure 5.2: Repetitive SIFT features can be disambiguated using geometry calculated from the scene ﬂow features.
Additionally, using the geometry reduces the number of potential SIFT feature correspondences tested, increasing matching efficiency.
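The predictive matching step amounts to projecting each known 3D SIFT feature with the predicted pose and gating candidates by a pixel radius. This sketch assumes a simple pinhole model without distortion; the function name and radius are illustrative:

```python
import numpy as np

def predict_and_gate(K, R, t, points_3d, kp_xy, radius=20.0):
    """For each known 3D SIFT feature, project it into the current
    key-frame using the predicted pose (R, t) and intrinsics K, and
    return the indices of keypoints within `radius` pixels of the
    prediction as match candidates (assumed gating scheme)."""
    candidates = {}
    for i, X in enumerate(points_3d):
        x_cam = R @ X + t
        if x_cam[2] <= 0:              # point behind the camera
            continue
        u = K @ (x_cam / x_cam[2])     # pinhole projection, u = (u, v, 1)
        d = np.linalg.norm(kp_xy - u[:2], axis=1)
        candidates[i] = np.nonzero(d < radius)[0]
    return candidates
```

Only the gated candidates need descriptor comparisons, which is what lets the system disambiguate repetitive structures while doing far fewer SIFT-descriptor tests.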
Following predictive SIFT matching, we match the remaining unmatched SIFT features between the left and right images of the current key-frame, using the stereo camera’s calibration to constrain the search for matches to the epipolar lines. These matches are then triangulated, and any still-unmatched SIFT features are discarded.
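For the special case of a rectified stereo pair, the epipolar constraint reduces to matching along (nearly) equal image rows, and triangulation reduces to a disparity formula. This simplified sketch assumes rectified images, whereas the system handles general epipolar lines and also cross-validates matches:

```python
def match_rectified_row(left_kps, right_kps, row_tol=1.0):
    """Greedy left-to-right matching for a rectified stereo pair: a
    match must lie on (nearly) the same row and have positive disparity.
    Returns (left_index, right_index) pairs; features left unmatched
    would be discarded, as in the text."""
    pairs, used_right = [], set()
    for i, (ul, vl) in enumerate(left_kps):
        for j, (ur, vr) in enumerate(right_kps):
            if j not in used_right and abs(vl - vr) <= row_tol and ul > ur:
                pairs.append((i, j))
                used_right.add(j)
                break
    return pairs

def depth_from_disparity(ul, ur, focal, baseline):
    """Triangulated depth for the rectified case: Z = f * b / (ul - ur)."""
    return focal * baseline / (ul - ur)
```

The row tolerance plays the role of the epipolar constraint here; with an assumed focal length and baseline, each surviving pair yields a 3D point via the disparity formula.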