«MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING Brian Sanderson Clipp A dissertation submitted to the faculty of the University of North Carolina ...»
At this point, the newest key-frame has been completely incorporated into the local map. It will be considered until it leaves the bundle adjustment window or the visual odometry fails and a new sub-map is started. Please note that as soon as a frame has an initial pose in the visual odometry module, its 3D pose with respect to the global map can be found. This pose will be locally accurate and will be reﬁned through the windowed bundle adjustment. The pose may be changed when loops are detected in the Global SLAM module but this should not affect tasks such as obstacle avoidance. After exiting the bundle adjustment window, key-frames are processed by the Global SLAM module.
5.2.3 Global SLAM Module The Global SLAM module ensures global consistency in our VSLAM system. It incorporates the information of all currently available key-frame poses, feature measurements, and initial 3D feature estimates from the Visual Odometry module. The ﬁnal result is a set of globally consistent, Euclidean sub-maps each of which has its own global coordinate frame. The sub-maps are disjoint, meaning they cover separate areas in the environment or cannot be linked by common 3D features due to limitations of wide-baseline feature matching.
The key element to improve the incremental motion estimation provided by the Visual Odometry module is the detection of loop completions. Loop completions provide additional constraints to the local constraints found in the VO module. Our system uses a vocabulary tree to detect loop completions, although an alternative approach like the Fab-Map approach of Cummins and Newman (2008) could be used instead. In our approach, SIFT feature descriptors are quantized into visual words using a pre-computed K-d tree over the descriptor space. The visual words seen in an image are then organized so that one can quickly find out which images a visual word is seen in. Finding images similar to a query image is then as simple as computing a vote to determine in which other images the query image’s visual words are found. In the vote, higher weight is given to the more discriminative visual words, which are found less frequently.
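The voting scheme described above can be sketched with an inverted file that maps each visual word to the images containing it. This is a minimal illustration, not the system's implementation: the class name `InvertedFileIndex`, the integer word identifiers, and the inverse-document-frequency weight log(N/n) are assumptions chosen for the example. The text only states that less frequent words receive higher weight. Quantizing descriptors into words is assumed to have happened already.

```python
import math
from collections import defaultdict


class InvertedFileIndex:
    """Maps each visual word to the set of images in which it was seen."""

    def __init__(self):
        self.word_to_images = defaultdict(set)
        self.num_images = 0

    def add_image(self, image_id, visual_words):
        self.num_images += 1
        for word in set(visual_words):
            self.word_to_images[word].add(image_id)

    def query(self, visual_words):
        """Return (image_id, score) pairs sorted by descending vote score."""
        scores = defaultdict(float)
        for word in set(visual_words):
            images = self.word_to_images.get(word)
            if not images:
                continue
            # Assumed weighting: words seen in few images get a larger vote.
            weight = math.log(self.num_images / len(images))
            for image_id in images:
                scores[image_id] += weight
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = InvertedFileIndex()
index.add_image("kf0", [1, 2, 3])
index.add_image("kf1", [3, 4, 5])
index.add_image("kf2", [3, 5, 6])
ranking = index.query([3, 4, 5])  # kf1 shares the rarest words, so it ranks first
```

Word 3 appears in every image and therefore contributes nothing to the vote, while word 4, seen only in kf1, contributes the most.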
The Global SLAM module can operate in one of two modes. When exploring new areas the system operates in loop seeking mode while in previously mapped regions the system operates in known location mode.
5.2.3.1 Loop Seeking Mode Loop seeking mode performs loop detection for each new key-frame and, after a successful loop identification, computes a global refinement through bundle adjustment. Loop detection begins by using the vocabulary tree to find a list of the most similar images to the current key-frame, sorted by similarity. Images from recent key-frames are removed from the list so that loops are only found to older sections of the map. In our system, a key-frame is considered recent if fewer than a given number of key-frames separate it from the current key-frame; this number is a selectable parameter. A more principled approach might use visibility constraints on the previous key-frames in the sequence to determine which ones could see the current key-frame’s features and therefore should be considered “recent”. Images in the list are tested in order of similarity until a matching image is found or the similarity score of the next best match is too low.
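The candidate selection just described might look like the following sketch. The function names, the `min_gap` and `min_score` parameters, and the `verify` callback (standing in for the geometric verification step) are all hypothetical names introduced for illustration.

```python
def loop_candidates(ranked, current_index, min_gap, min_score):
    """Yield key-frame indices worth verifying, best match first.

    ranked: (keyframe_index, similarity) pairs sorted by descending similarity.
    min_gap: key-frames closer than this to the current one count as recent.
    min_score: stop once the similarity falls below this value.
    """
    for keyframe_index, score in ranked:
        if score < min_score:
            break  # the list is sorted, so every remaining score is too low
        if current_index - keyframe_index < min_gap:
            continue  # recent key-frame: not a loop to an older map section
        yield keyframe_index


def detect_loop(ranked, current_index, min_gap, min_score, verify):
    """Return the first candidate accepted by verify, or None."""
    for candidate in loop_candidates(ranked, current_index, min_gap, min_score):
        if verify(candidate):
            return candidate
    return None


ranked = [(98, 0.9), (12, 0.7), (97, 0.6), (40, 0.5), (3, 0.1)]
candidates = list(loop_candidates(ranked, current_index=100, min_gap=20, min_score=0.3))
```

Here key-frames 98 and 97 are discarded as recent, key-frame 3 is never reached because its score is below the threshold, and key-frames 12 and 40 are passed on for verification in that order.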
Rather than simply match SIFT features from the query image to those visible in the next most similar image, we use the putative matching image to find a region of local 3D scene structure and match the query image to this structure. This can be seen as a form of query expansion based on 3D geometry. The expansion is done by finding the images near the next most similar image and including all of the 3D features visible in all of these images in the SIFT matching and geometric verification. The SIFT matching is then performed from the image to the 3D structure, that is, from the descriptors of the features in the current key-frame to the 3D features’ descriptors. We only try to match SIFT descriptors with the same associated visual word, which reduces the number of descriptor dot products performed. A RANSAC process using the three-point perspective pose method is then used to find the pose of the current camera, and the pose is non-linearly optimized afterwards.
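The restriction of descriptor comparisons to features sharing a visual word can be illustrated as follows. This sketch assumes unit-length descriptors, so that the dot product is a cosine similarity, and uses a fixed acceptance threshold `min_similarity`; both the threshold and the tuple layout are assumptions of the example, and the subsequent RANSAC pose estimation is not shown.

```python
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def match_by_visual_word(query_features, map_features, min_similarity=0.8):
    """Match query descriptors to 3D-feature descriptors with the same word.

    Each feature is a (feature_id, visual_word, descriptor) tuple.
    Returns a list of (query_id, map_id) putative matches.
    """
    by_word = defaultdict(list)
    for map_id, word, descriptor in map_features:
        by_word[word].append((map_id, descriptor))

    matches = []
    for query_id, word, descriptor in query_features:
        best_id, best_similarity = None, min_similarity
        # Only descriptors quantized to the same word are compared.
        for map_id, map_descriptor in by_word.get(word, []):
            similarity = dot(descriptor, map_descriptor)
            if similarity > best_similarity:
                best_id, best_similarity = map_id, similarity
        if best_id is not None:
            matches.append((query_id, best_id))
    return matches


map_features = [("p0", 7, (1.0, 0.0)), ("p1", 7, (0.0, 1.0)), ("p2", 9, (1.0, 0.0))]
query_features = [("q0", 7, (0.0, 1.0)), ("q1", 9, (1.0, 0.0)), ("q2", 5, (1.0, 0.0))]
matches = match_by_visual_word(query_features, map_features)
```

Feature q2 is never compared against anything because no 3D feature carries word 5, which is where the savings in dot products come from.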
If the above method finds a solution supported by enough inlier matches, it is considered a loop. The features associated with the RANSAC inlier measurements are linked so that they are treated as a single feature in bundle adjustment. Using 3D feature to 2D projection matching with geometric verification makes false positive loop detections much less likely than using an image-to-image, appearance-only matching approach. This is because our approach combines both visual and geometric similarity to detect loops. Still, truly repetitive 3D structures which also look the same can cause incorrect loops to be detected. Dealing with repetitive structures remains an open research problem.
If no loop has been detected, then the next key-frame is tested for a potential loop closing. If a loop was detected, the system performs a global correction to the current submap incorporating the newly detected loop. Since the newly detected loop features have high reprojection errors in the current key-frame, they would be deemed invalid by our bundle adjustment, which uses a robust cost function. Hence, they would not inﬂuence the error mitigation process. To overcome this effect we re-distribute the error before bundle adjustment. This initializes the bundle adjustment much closer to the global minimum of its cost function, increasing its convergence rate and decreasing the chance of converging to a local minimum.
We re-distribute the accumulated error by starting with the difference in the current key-frame pose and the current key-frame’s pose calculated w.r.t. the old features. This gives us the amount of drift that the system has accumulated since it left the last known location in the sub-map. This last known location is either the ﬁrst frame in the sequence if no loops have been found so far or the last place the system was operating in known location mode. The system is operating in known location mode when it has reacquired features it has mapped before and is tracking with respect to that known map. The system linearly distributes the error correction for the cameras back to the point it was operating in known location mode. Spherical linear interpolation (Shoemake, 1985) of the rotation error quaternion is used to interpolate the rotation error. Feature points are similarly corrected by moving them along with the camera that ﬁrst views them. A global bundle adjustment of the map is then performed. After bundle adjustment, outlier measurements are removed as well as features visible in fewer than two key-frames. These features give little information about the scene structure and are more likely to be incorrect since they do not match the camera’s motion. After successfully detecting the loop and correcting the accumulated error the Global SLAM module enters known location mode.
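The linear redistribution of drift can be sketched as below. The drift is assumed to be given as a unit rotation quaternion in (w, x, y, z) order plus a translation vector, and each camera between the last known location (index 0) and the loop-closing key-frame (last index) receives a fraction of it. Function names and the quaternion layout are assumptions of this example. How a correction is composed with a camera pose depends on the pose convention and is deliberately left out.

```python
import math

IDENTITY = (1.0, 0.0, 0.0, 0.0)  # unit quaternion, assumed (w, x, y, z) order


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    cos_theta = sum(a * b for a, b in zip(q0, q1))
    if cos_theta < 0.0:  # flip one quaternion to follow the shorter arc
        q1 = tuple(-c for c in q1)
        cos_theta = -cos_theta
    if cos_theta > 0.9995:  # nearly parallel: normalized linear interpolation
        mixed = tuple(a + t * (b - a) for a, b in zip(q0, q1))
        norm = math.sqrt(sum(c * c for c in mixed))
        return tuple(c / norm for c in mixed)
    theta = math.acos(cos_theta)
    w0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def distribute_drift(num_cameras, rotation_error, translation_error):
    """Return one (rotation, translation) correction per camera."""
    corrections = []
    for i in range(num_cameras):
        # Fraction of the total drift assigned to camera i.
        alpha = i / (num_cameras - 1) if num_cameras > 1 else 1.0
        rotation = slerp(IDENTITY, rotation_error, alpha)
        translation = tuple(alpha * c for c in translation_error)
        corrections.append((rotation, translation))
    return corrections


# Example drift: 90 degrees about the z axis and 0.3 units along x.
half = math.pi / 4
drift_rotation = (math.cos(half), 0.0, 0.0, math.sin(half))
corrections = distribute_drift(3, drift_rotation, (0.3, 0.0, 0.0))
```

With three cameras, the first receives no correction, the middle one receives half the drift (a 45 degree rotation and half the translation), and the last receives all of it.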
5.2.3.2 Known Location Mode After successfully identifying a loop, this mode continuously verifies that the robot is still moving in the previously mapped environment. Verification is done by linking the current 3D SIFT features to previously seen 3D SIFT features in the environment surrounding the current location. These matches are added to a windowed bundle adjustment in the GS module, which keeps the camera path consistent with the older previously computed parts of the map.
In known location mode, SIFT feature matching between the current key-frame and the old 3D SIFT features is done using the predictive approach described in the visual odometry module (see Section 5.2.2 for a discussion of visual odometry). Older features can be linked to the features visible in the current frame by projecting all of the 3D SIFT features seen in the previous key-frame and its neighboring images (two key-frames are neighbors if they see the same 3D feature) and comparing descriptors. If no matching older SIFT features are found, then the robot has left the previously observed parts of the environment and the system reenters loop seeking mode.
The windowed bundle adjustment in GS is much the same as the one performed in the Visual Odometry module. The only difference in this case is that the older key-frames are also included in the bundle but fixed. This ensures the new camera poses stay consistent with the existing map. Fixing the older cameras is also justified since they have already been globally bundle adjusted and are probably more accurate than the more recent key-frames. After the windowed bundle adjustment, processing begins on the next key-frame.
5.3 Implementation Details A key to the performance of our system is that each of the three modules (Scene Flow, Visual Odometry, and Global SLAM) operates independently and in parallel. To ensure that all captured information is used, only the Scene Flow module has to operate at frame-rate. The timing constraints on the visual odometry are dynamic and only depend on the frequency of key-frames. This module can lag behind by a few frames. The Global SLAM module is less time constrained since its corrections can be incorporated into the local tracking when they are available. The system’s modules operate in separate threads that each adhere to the individual module timing requirements.
5.3.1 Scene Flow Module The Scene Flow module begins by taking raw, Bayer pattern images off of the stereo cameras. These images must be converted to luminance images and radially undistorted before the sparse scene ﬂow can be measured. We use color cameras so that the video we record can later be used for dense stereo estimation and 3D modeling. While tracking could be performed on radially distorted images, we remove the radial distortion from the images so that later SIFT feature extraction in the Visual Odometry module can be done on undistorted images. Using undistorted images helps in SIFT matching when using cameras with a large amount of radial distortion.
De-mosaicing, radial undistortion, and sparse scene flow are all calculated on the graphics processing unit (GPU) using CUDA. To increase performance, we minimize data transfer between the CPU and the GPU by copying the raw image to the GPU for each frame, performing all computations in GPU memory, and then copying only the undistorted images for the key-frames, as well as the tracked feature positions, back to the CPU.
After each key-frame, the feature tracks (2D position and feature identifier) and the undistorted images are passed to the Visual Odometry module. While the Visual Odometry module processes the key-frame, the Scene Flow thread can track ahead of it, buffering new key-frames until the Visual Odometry module is able to process them. Hence, the speed of Visual Odometry does not constrain the Scene Flow module’s real-time performance. This is just one example of how parallelism adds robustness to our system.
5.3.2 Visual Odometry Module In this module we perform the incremental motion estimation from the KLT feature tracks and the detection of SIFT features in parallel. For efficiency, we use one thread for each of the two stereo images. After the SIFT detection we release the image buffers to save memory.
As described in Section 5.2.2 the Visual Odometry module’s outputs are the relative camera motion and the new 3D points. These outputs are stored in a queue and are removed from Visual Odometry’s local storage. Using a queue decouples processing in the VO and GS module threads. Whenever tracking fails all the VO module’s internal data (key-frame poses and 3D features) is queued for processing by the Global SLAM module.
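The decoupling of the VO and GS threads through a queue follows the usual producer and consumer pattern, sketched here with Python's standard library. The function names, the dictionary layout of a queued result, and the `STOP` sentinel are placeholders for illustration and do not reflect the system's actual data structures.

```python
import queue
import threading

STOP = object()  # sentinel telling the consumer to shut down


def visual_odometry(keyframes, out_queue):
    """Producer: queue each finished key-frame result and move on."""
    for keyframe_id in keyframes:
        result = {"keyframe": keyframe_id, "pose": None, "points": []}
        out_queue.put(result)  # does not wait for the consumer
    out_queue.put(STOP)


def global_slam(in_queue, processed):
    """Consumer: handle results whenever they become available."""
    while True:
        item = in_queue.get()  # blocks until the producer delivers something
        if item is STOP:
            break
        processed.append(item["keyframe"])


vo_to_gs = queue.Queue()
processed = []
consumer = threading.Thread(target=global_slam, args=(vo_to_gs, processed))
producer = threading.Thread(target=visual_odometry, args=(range(5), vo_to_gs))
consumer.start()
producer.start()
producer.join()
consumer.join()
```

Because the queue is unbounded, the producer never stalls when the consumer is slow, which mirrors the looser timing constraint placed on the Global SLAM module.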
5.4 Experimental Results In order to demonstrate the speed, accuracy, and long-term stability of our VSLAM system we present results from two video sequences of two indoor environments with different characteristics. The first sequence was taken in an office environment, which has a large, open floor plan. I will refer to this as the “office” sequence. The second sequence was shot in a building with long, but relatively narrow (1.7 m) hallways. It will be called the “hallway” sequence. The closed floor plan does not allow features to be tracked for long periods of time since they quickly leave the stereo camera’s field-of-view, yet the system successfully maps the halls accurately with an error of less than 30 cm over the 51.2 m length of the longest hall shown in Figure 5.10. This is an error of less than 0.6%.
Our setup uses a calibrated stereo camera pair consisting of two Point Grey Grasshopper cameras with 1224×1024 pixel resolution color CCD sensors delivering video at ﬁfteen frames (stereo pairs) per second. The system’s 7cm baseline is comparable to the median human inter-pupil distance. The cameras are mounted on a rolling platform with the computer. Using a rolling platform the planarity of the camera path can be used to evaluate the quality of the reconstruction results. However, the full six degrees of freedom are estimated for the camera’s motion. While performing real-time VSLAM the system also records the imagery to disk for debugging or archival purposes.
The ofﬁce sequence includes transparent glass walls and other reﬂective surfaces that make tracking more challenging (please see Figure 5.3 for example frames). It also has a hallway with relatively low texture, which our system successfully maps, showing it is robust to areas without a large amount of structure. In one section of the video a person moves in front of the camera, partially occluding some of the tracked features (see Figure 5.3). Even in this case, the system is able to reject the moving person’s feature tracks as outliers and continue tracking correctly.
Figure 5.4 shows the difference between operating only using visual odometry and performing the full VSLAM with loop detection and global map correction.
In the top pane of Figure 5.4, the map is shown using only visual odometry, where the relative motion from frame to frame is accumulated to form the camera path. In visual odometry, no loop detection or global map correction is performed; hence, the system drifts over time. In this scene, VO accumulated a drift of approximately 3 m over a path of approximately 150 m. In the bottom pane, the results of our Global SLAM module are shown. Clearly, the long-term drift of visual odometry is eliminated by loop detection and the succeeding error mitigation through bundle adjustment.