MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING
Brian Sanderson Clipp
A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill
in partial fulﬁllment of the requirements for the degree of Doctor of Philosophy in the
Department of Computer Science.
Jan-Michael Frahm
Gary Bishop
Svetlana Lazebnik
Jongwoo Lim
Gregory Welch

© 2010 Brian Sanderson Clipp
ALL RIGHTS RESERVED

ABSTRACT

BRIAN SANDERSON CLIPP: Multi-Camera Simultaneous Localization and Mapping
(Under the direction of Marc Pollefeys and Jan-Michael Frahm)

In this thesis, we study two aspects of simultaneous localization and mapping (SLAM) for multi-camera systems: minimal solution methods for the scaled motion of non-overlapping and partially overlapping two-camera systems, and enabling online, real-time mapping of large areas using the parallelism inherent in the visual simultaneous localization and mapping (VSLAM) problem.
We present the only existing minimal solution method for six-degree-of-freedom structure and motion estimation using a non-overlapping, rigid two-camera system with known intrinsic and extrinsic calibration. One example application of our method is the three-dimensional reconstruction of urban scenes from video. Because our method does not require the cameras' fields-of-view to overlap, we are able to maximize coverage of the scene and avoid processing redundant, overlapping imagery.
Additionally, we developed a minimal solution method for partially overlapping stereo camera systems, which overcomes degeneracies inherent to non-overlapping two-camera systems while retaining a wide total field of view. The method takes two stereo images as its input. It uses one feature visible in all four views and three features visible across two temporal view pairs to constrain the camera system's motion. We show in synthetic experiments that our method produces rotation and translation estimates that are more accurate than those of the perspective three-point method as the overlap in the stereo camera's fields-of-view is reduced.
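The key intuition behind using even a single feature seen in all four views is that a feature visible in both cameras of the stereo pair can be triangulated in metric units, because the baseline between the cameras is known from the extrinsic calibration. The following toy sketch (not the dissertation's solver; all numbers are illustrative assumptions) shows how a known baseline turns image disparity into absolute depth for a rectified stereo pair:

```python
import numpy as np

def stereo_depth(x_left, x_right, focal, baseline):
    """Metric depth from horizontal disparity for a rectified stereo pair."""
    disparity = x_left - x_right
    return focal * baseline / disparity

# Assumed example values: focal length in pixels, baseline in meters.
focal = 500.0      # pixels
baseline = 0.12    # meters, known from extrinsic calibration

# Synthetic point 3 m in front of the left camera, on its optical axis.
Z = 3.0
X = 0.0
x_left = focal * X / Z                 # projects to the image center
x_right = focal * (X - baseline) / Z   # shifted by the stereo disparity

Z_est = stereo_depth(x_left, x_right, focal, baseline)
print(Z_est)  # recovers the metric depth, 3.0
```

This is why any residual overlap, however small, suffices to fix the absolute scale that a purely non-overlapping system can lose in degenerate motions.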
A final part of this thesis is the development of an online, real-time visual SLAM system that achieves real-time speed by exploiting the parallelism inherent in the VSLAM problem. Operations such as loop detection and loop correction can be effectively parallelized. Additionally, we demonstrate that a combination of short-baseline, differentially tracked corner features, which can be tracked at high frame rates, and wide-baseline-matchable but slower-to-compute features, such as the scale-invariant feature transform (SIFT) (Lowe, 2004), can facilitate high-speed visual odometry while at the same time supporting location recognition for loop detection and global geometric error correction.
ACKNOWLEDGEMENTS

I would like to thank my advisor Marc Pollefeys and co-advisor Jan-Michael Frahm for their support of this work. There were times, particularly close to conference deadlines, when it seemed like some of the methods developed in this dissertation might not succeed. I am particularly grateful for Marc and Jan's suggestions and encouragement at those times, which helped me push through to solutions.
Thanks also to my committee members, Gary Bishop, Jongwoo Lim, Lana Lazebnik and Gregory Welch, who gave helpful advice and suggestions for this work.
My co-authors have been instrumental in completing this dissertation. They include Richard Hartley, Jae-Hak Kim, Jongwoo Lim, Jan-Michael Frahm, Marc Pollefeys, Rahul Raguram, Gregory Welch, and Christopher Zach, all of whom I would like to thank. Thanks in particular to Christopher Zach, who helped me turn my geometric intuition into working algebraic solution methods. Also, thanks to Greg for the many interesting discussions we have had over the last few years. I am also grateful to Phillipos Mordohai who, as a postdoc at UNC, along with Marc and Jan-Michael, helped me gain my footing in multi-view geometry and computer vision.
My work in multi-camera systems involved building a lot of strange contraptions, including a backpack-mounted data collection system, a helmet-mounted stereo camera, and a rooftop recording platform for use on the department van. I owe a great deal of gratitude to John Thomas, who built these systems and helped make many of my experiments, whether presented in this dissertation or not, possible. Thanks John.
David Gallup and I have shared an office since July of 2005, when I came to UNC. Thanks, Dave, for being a good friend, putting up with whatever annoying habits I am sure I have, and joining me in procrastination sessions and mental breaks disguised as philosophical discussions.
Additionally, I would like to thank my family for their constant support and encouragement as I pursued my graduate studies. You all helped to keep me sane through this process by reminding me there are important things outside of academia.
Finally, to my wife Rachel, I could not have done this without your patience, sacriﬁce, understanding, and support. I am so glad you were on board with going to graduate school together. Sharing this experience has been wonderful.
4.1 Geometry of partially overlapping stereo camera pose problem
4.2 Minimal feature combinations for 6DOF stereo camera motion estimation
5.4 Office sequence, top view with and without loop detection and correction
5.5 Office sequence, side view with and without loop detection and correction
5.6 Office sequence, overlaid on architectural layout
5.10 Hallway sequence, results on architectural layout viewed from above
5.11 Hallway sequence, results viewed from side
5.12 Error propagation through a bundle adjustment graph
CUDA Compute Unified Device Architecture
DOF Degree of Freedom
DoG Difference of Gaussians
EKF Extended Kalman Filter
FAB-MAP Fast Appearance Based Mapping
fps Frames per Second
GNC Graduated Non-Convexity
GPU Graphics Processing Unit
GS Global SLAM
ICP Iterative Closest Point
KLT Kanade-Lucas-Tomasi feature tracker
LIDAR Light Detection and Ranging
MSER Maximally Stable Extremal Region
PTAM Parallel Tracking and Mapping
RANSAC Random Sample Consensus
SBA Sparse Bundle Adjustment
SIFT Scale-Invariant Feature Transform
SLAM Simultaneous Localization and Mapping
TF-IDF Term-Frequency Inverse Document Frequency
VIP Viewpoint Invariant Patch
VO Visual Odometry
VSLAM Visual Simultaneous Localization and Mapping
Visual simultaneous localization and mapping (VSLAM) is the problem of using a moving sensor system with one or more cameras to map an unknown environment and simultaneously keep track of the sensor system's pose within the map. The sensor system might be as simple as a single camera or could be a multi-camera system including other sensors such as accelerometers, gyroscopes, and wheel encoders. Like many problems in artificial intelligence, VSLAM is something that most humans do fairly easily but that is highly complex and difficult to automate.
The more general simultaneous localization and mapping (SLAM) problem has been studied extensively in the robotics community (Kaess et al., 2007; Paskin, 2003; Thrun et al., 2005; Smith and Cheeseman, 1987). The sensors used in SLAM typically include Light Detection and Ranging (LIDAR), acoustic range sensors, bump sensors, as well as accelerometers, gyroscopes, and wheel encoders. What sets visual SLAM apart is the use of cameras as sensors. In contrast to LIDAR, cameras are purely passive sensors and so do not emit any electromagnetic radiation. Because cameras are non-emissive, they typically require less power and are suitable for applications where stealth or a lack of interference between multiple systems is crucial. Additionally, cameras are less expensive than specialized LIDAR sensors and are more pervasive in our world today. Most people today carry a mobile phone that includes a camera, which can be used for SLAM, as well as location recognition, which can support location-based services.
The peculiarities of cameras, in comparison to other sensing modalities, make the VSLAM problem a separate class of problem from general SLAM. Cameras provide bearing-only information, i.e. the direction to a target but not the distance to the target. Cameras also have effectively unlimited range; they detect the first object a ray encounters as it emanates from the camera. In contrast, the range of LIDAR and acoustic sensors is limited by the amount of energy the sensor can broadcast into the environment and the reflectivity or absorption of the environment's surfaces. This limited range actually simplifies the SLAM problem, since only what is near the sensor can be measured by the sensor. This can lead to certain subdivisions of the map, which can simplify the SLAM problem. In contrast, the position of a camera system may have little to do with the spatial distribution of the objects it measures.
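The bearing-only property can be seen directly in the pinhole projection model: points at any depth along the same viewing ray project to the same pixel, so a single image constrains direction but not distance. A minimal sketch (illustrative; the intrinsic matrix values are assumed for the example):

```python
import numpy as np

def project(K, X):
    """Pinhole projection of a 3D point X given in camera coordinates."""
    x = K @ X
    return x[:2] / x[2]  # perspective division discards the depth

# Assumed intrinsic calibration: 500 px focal length, 640x480 image.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

ray = np.array([0.2, -0.1, 1.0])   # viewing direction from the camera
near = 2.0 * ray                    # a point 2 units along the ray
far = 50.0 * ray                    # a point 50 units along the ray

print(project(K, near))  # identical pixel coordinates for both points:
print(project(K, far))   # depth is unobservable from one view
```

Recovering depth therefore requires observing the same point from at least two distinct viewpoints, which is what makes the motion of the camera itself part of the estimation problem.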
The VSLAM problem is important because it has applications in augmented reality, robotic navigation, remote sensing, and generating dense three-dimensional models from video. In augmented reality, a user views the world through some form of output device, generally either with a head-mounted display or hand-held device such as a cell phone.
Synthetic objects are then placed on top of the real scene in the user's view. These objects could include information about the environment or synthetic game characters. In any case, to insert synthetic objects accurately, SLAM must be used to measure the pose of the display device in the environment. Visual SLAM (VSLAM) is an attractive option for augmented reality because of the low cost and power requirements of cameras and their relatively high angular resolution.
SLAM is also necessary for a robotic system to autonomously navigate its environment.
It must have some way to create a map of its surroundings and measure its pose in the environment. The use of cameras in SLAM for robots is motivated by many of the same factors as in augmented reality. In particular, low power requirements can drive the choice of using VSLAM.
The VSLAM problem is known in the vision community as Structure from Motion (SfM) and is the first step toward creating three-dimensional models of the world from video. Given the camera poses from VSLAM, dense image matching can be performed to find the depth of the scene with respect to the cameras, and from these depth estimates the shape of the scene can be recovered in a global coordinate frame. Once the scene shape is recovered, it can be textured with the imagery to create visually appealing virtual models of the measured environment. Some example models are shown in Figure 1.1.

Figure 1.1: Textured 3D models reconstructed from multiple images.
This thesis introduces the VSLAM problem and addresses two fundamental issues in VSLAM. The first is the apparent trade-off in two-camera systems between field-of-view overlap and accurate scaled motion estimation. We show that this trade-off is false and that non-overlapping and partially overlapping two-camera systems can be used in absolutely scaled VSLAM. The second issue is real-time performance. Through a principled analysis of the VSLAM problem, we show how a combination of tolerable latency, parallelism, and integration of 3D pose estimation with 2D feature matching can accelerate six-degree-of-freedom (DOF) VSLAM to a previously unachieved level of performance, combining speed with accurate structure and motion computation.
The Visual SLAM Problem

The "Visual SLAM" problem, which is also known as "Structure from Motion", has been studied extensively in the robotics and computer vision fields. This chapter will give a brief history of the VSLAM problem as well as introduce the state of the art in VSLAM. It will then discuss the structure of the VSLAM problem and the sub-processes that must be performed in any VSLAM system, namely correspondence finding, relative pose estimation, and global mapping.
Harris and Pike demonstrated one of the first VSLAM methods on an image sequence (Harris and Pike, 1988). Their work contained many of the major components of a VSLAM system, including feature matching, relative pose estimation, and a Kalman filter based method for fusing the measurements from multiple views. Using a stereo camera, their system created a map of point and line features with covariance matrices representing their uncertainties. However, they neglected the correlations between features, which can create problems.
With estimated correlations between features, if feature A is detected in an image but not feature B, then the measurement update of A can be propagated to an update of B through their cross-covariance. This reflects what we would expect. If the system has previously built a map of the outside of my home and it detects my home's front door in an image, then that also gives information about where the front window is, even if the window was not seen in the same image as the door. Without modeling the correlation between window and door, we may become over-confident in the door's position with respect to the window and, when we finally see the window, reject its features as outliers.
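The door/window effect falls directly out of the standard Kalman update equations when the cross-covariance is kept. The following toy example (a 1-D sketch with assumed numbers, not the Harris-Pike filter) measures only the door, yet the window estimate improves because the two map errors are correlated:

```python
import numpy as np

# State: [door_x, window_x]. True positions are 0.0 and 2.0, but the prior
# is off by the same 0.5 for both -- a correlated map error.
mu = np.array([0.5, 2.5])
P = np.array([[0.25, 0.20],    # strong positive cross-covariance
              [0.20, 0.25]])

# Measure only the door: z = H x + noise, where H selects component 0.
H = np.array([[1.0, 0.0]])
R = np.array([[0.01]])
z = np.array([0.0])            # the door is observed at its true position

# Standard Kalman measurement update.
S = H @ P @ H.T + R                       # innovation covariance
K_gain = P @ H.T @ np.linalg.inv(S)       # Kalman gain, 2x1
mu_post = mu + (K_gain @ (z - H @ mu)).ravel()

print(mu_post)  # the window estimate moves toward 2.0 without being seen
```

Setting the off-diagonal terms of P to zero leaves the window estimate at 2.5, which is exactly the over-confidence failure described above: the filter would later treat the correctly placed window as an outlier.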