MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING
Brian Sanderson Clipp
A dissertation submitted to the faculty of the University of North Carolina ...
Frese and Duckett (2003) showed that a multigrid-style approach could be applied to the general SLAM problem. Their approach uses the temporal connections between robot poses as the grid. This temporal sequence can be sub-sampled to arrive at coarser representations of the robot's path. Their approach uses scan matching or odometry to relate one robot pose to the next, as well as to relate poses when loops close. Each measurement is then a transformation, with associated uncertainty, between two robot poses. When coarsening (restricting) the graph, they simply drop every other pose and combine the dropped pose's two incident measurements into a single measurement, with uncertainty, linking its neighbors.
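Under the simplifying assumption of independent Gaussian measurement errors, this coarsening step can be sketched in one dimension. The function names are illustrative, not from Frese and Duckett's implementation; the full 6-DOF case additionally requires transforming the second covariance through the Jacobian of the pose composition.

```python
def compound(m_ab, m_bc):
    """Compound two relative measurements a->b and b->c into a->c.

    Each measurement is a (mean, covariance) pair; for independent
    Gaussian errors in 1D, the means add and the covariances add.
    """
    mean_ab, cov_ab = m_ab
    mean_bc, cov_bc = m_bc
    return (mean_ab + mean_bc, cov_ab + cov_bc)

def restrict(measurements):
    """Coarsen a chain of relative measurements by dropping every other
    pose: each consecutive pair is compounded into one measurement
    linking the dropped pose's neighbors (even-length chain assumed)."""
    return [compound(measurements[i], measurements[i + 1])
            for i in range(0, len(measurements) - 1, 2)]
```

For example, restricting four unit-spaced odometry measurements yields two coarse measurements whose uncertainties are the sums of the pairs they replace.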
Converting back from coarse to fine (prolongation) in the upward section of the V-cycle is performed using a linear interpolation of the camera poses at the coarse level.
With a general bundle adjustment problem, the situation is not so simple as Frese and Duckett’s case because features are measured in multiple cameras. This means that the
Figure 5.14: The finest-level heat distribution solution grid and its down-sampled counterparts.
Grey dots are the grid locations of the heat distribution that are kept on the next coarsest level.
simple inter-robot-pose measurement restriction operator they use cannot be applied to the measurements in bundle adjustment. Instead, I have developed a bundle adjustment method that restricts the number of cameras in the bundle adjustment while leaving the number of features and measurements the same. Since the 3D features can be eliminated efficiently using the Schur complement, this restriction operator should allow the error correction to propagate through a bundle adjustment problem more quickly than standard bundle adjustment.
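As a sketch of why the features can be eliminated cheaply, the Schur complement reduction of the bundle adjustment normal equations can be written as follows. This is a generic dense illustration; real implementations exploit the fact that the point block is block-diagonal (one 3x3 block per feature), so its inverse is trivial.

```python
import numpy as np

def schur_reduce(H_cc, H_cp, H_pp, b_c, b_p):
    """Eliminate the point (feature) parameters from the normal equations

        [H_cc  H_cp ] [dc]   [b_c]
        [H_cp' H_pp ] [dp] = [b_p]

    via the Schur complement, solving first the reduced camera system
    and then back-substituting for the points.
    """
    H_pp_inv = np.linalg.inv(H_pp)
    S = H_cc - H_cp @ H_pp_inv @ H_cp.T       # reduced camera system
    rhs = b_c - H_cp @ H_pp_inv @ b_p
    dc = np.linalg.solve(S, rhs)              # camera update
    dp = H_pp_inv @ (b_p - H_cp.T @ dc)       # back-substituted point update
    return dc, dp
```

The reduced system S has only the cameras' dimensions, which is what makes keeping all features while restricting the cameras tractable.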
The idea behind this multigrid approach to bundle adjustment is not to adjust the camera poses themselves, but to find a temporal error function that describes how the camera-pose error accumulates over time. Since the initial camera poses from a VSLAM system come from visual odometry, the error on the poses is an accumulation of small inter-key-frame errors. This temporal summation of error generates a random walk in 6DOF error space. The random error walk contains trends (generally increasing or decreasing segments) at various temporal scales. This sort of multi-scale error growth can be seen in the one-dimensional random walk in Figure 5.15. The random walk in this figure is the summation of samples from a Gaussian distribution with standard deviation one.
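Such a walk can be generated in a few lines of NumPy (the seed is arbitrary):

```python
import numpy as np

# Cumulative sum of 100 samples from a zero-mean, unit-standard-deviation
# normal distribution, as in Figure 5.15.
rng = np.random.default_rng(42)
steps = rng.normal(loc=0.0, scale=1.0, size=100)
walk = np.cumsum(steps)
```

Plotting `walk` against the sample index shows the characteristic increasing and decreasing trends at several temporal scales.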
Assuming that the initial camera path has error that can be modeled as a random walk,
Figure 5.15: A random walk generated by summing one-hundred samples of a normally distributed random variable with standard deviation one.
Note the increasing and decreasing trends at various temporal (sample number) scales.
it should be possible to use a sort of multigrid V-cycle to estimate the random error walk's shape. Once the error's shape is known, the inverse error can be applied to the cameras in order to eliminate it and arrive at the minimum of the bundle adjustment function, Equation 5.1.
In Figure 5.16 the bottommost pane shows the largest-scale error approximation. The two purple cameras are control nodes. The error correction applied to the control nodes is linearly interpolated between any two neighboring control nodes in the sequence and applied to the blue cameras in between. The interpolation weight is based on the temporal distance between the blue non-control node and its neighboring control nodes in the sequence. The green lines show which error corrections are linearly interpolated to find the error correction of a blue camera. In this way, in the bottom pane, all the cameras can be moved by modifying only the six degrees of freedom of camera seven, not the full thirty-six degrees of freedom of all the cameras (camera zero fixes the gauge). In the middle pane an additional control node is added, allowing the error function to be approximated at a smaller, finer scale. In the top pane all cameras are control nodes and the problem is equivalent to a standard bundle adjustment. Performing a V-cycle, starting at the top pane, moving to the bottom, and returning to the top should approximate the error function, and therefore eliminate the error, more efficiently than standard bundle adjustment because the multi-scale nature of the error is taken into account.
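A minimal sketch of the interpolation step, assuming corrections are represented as 6-vectors and the rotation components are small enough to interpolate linearly (the function name and argument layout are illustrative):

```python
import numpy as np

def interpolated_corrections(n_cams, control_idx, control_err):
    """Spread 6-DOF error corrections from control nodes to all cameras.

    control_idx : sorted camera indices of the control nodes
    control_err : array of shape (len(control_idx), 6), the corrections
                  estimated by the coarse-level adjustment

    Each non-control camera receives a correction linearly interpolated
    between its two neighboring control nodes, weighted by temporal
    (index) distance.
    """
    corrections = np.empty((n_cams, 6))
    for d in range(6):
        # np.interp interpolates one error dimension across all cameras
        corrections[:, d] = np.interp(np.arange(n_cams),
                                      control_idx, control_err[:, d])
    return corrections
```

For the bottom pane of Figure 5.16, with control nodes at cameras zero and seven, a blue camera at index three would receive 3/7 of the correction estimated at camera seven.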
Unfortunately, in practice this has not proven to be the case. A series of experiments was performed comparing standard and multigrid, error-state bundle adjustment. The performance of both methods scaled approximately equally as the number of cameras in a minimization was increased. This may be because numerical derivatives were used to find the Jacobian of the multigrid BA projection function; finding the analytic Jacobians of the interpolated error-correction values proved to be very complicated, although it may be possible in the future. A more likely reason that no performance increase was found is that the measurement space was not down-sampled along with the cameras in the state. In standard multigrid, both the hidden state (the temperature values in a Poisson heat-distribution problem) and the measurement error function (the deviation of the temperature values from the values predicted by the Poisson heat equation) are down-sampled. In my approach, however, the measurements are not down-sampled; all reprojection errors are measured at all scales. A possible way around this might be to use synthetic features and measurements, in a manner similar to the Virtual Key Frame work of Shum et al. (1999). These synthetic features and measurements could summarize the information in a collection of measurements and reduce the measurement-space size in addition to the state size.
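The numerical derivatives mentioned above can be realized as a simple central-difference approximation; this is a generic sketch, not the dissertation's implementation:

```python
import numpy as np

def numeric_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m evaluated at x.

    Each column i is (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps), where
    e_i is the i-th unit vector. Accurate to O(eps^2), but it costs
    2n function evaluations, which is one reason an analytic Jacobian
    would be preferable.
    """
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    J = np.empty((f0.size, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2 * eps)
    return J
```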
5.6 Conclusion

In this chapter, we introduced a VSLAM system that fully exploits the recent parallelism gains in consumer computer hardware. The system uses imagery from a stereo camera as its only input. We implemented a two-view consistent 2D tracking module on the GPU
Figure 5.16: Three levels of multigrid, error-state bundle adjustment.
The top pane shows the finest-scale approximation to the error function, which is equivalent to standard bundle adjustment. The middle and bottom panes show larger temporal-scale approximations to the camera path error. Purple cameras are control nodes, which are included in the bundle adjustment problem state. The correction applied to blue cameras is calculated by linearly interpolating the error approximation E_vw between the blue camera's two closest control nodes.
to ﬁnd sparse optical ﬂow. We also used the GPU to extract wide-baseline features for use in loop detection. Our system exploits modern multi-core processors using multiple concurrent threads to perform sparse scene ﬂow, visual odometry and global mapping in parallel. This parallelism allows us to perform the full Euclidean map reconstruction in real time.
While our system operates in real time on office-building-scale scenes, extension to larger environments remains a challenge. Hierarchical bundle adjustment methods such as those of Ni et al. (2007a) or Steedly et al. (2003) may help to overcome these scaling issues. They would allow the system to optimize, in parallel, several sub-problems that are only weakly dependent on each other. A global combination step could then create a globally consistent map.
This dissertation has introduced several novel contributions to multi-camera visual simultaneous localization and mapping. A concise summary of this dissertation's contributions is given in my thesis statement:
• Rigid, two-camera systems with little or no ﬁeld-of-view overlap can be used in efﬁcient, absolutely scaled visual simultaneous localization and mapping.
• Further, the inherent parallelism in the visual simultaneous localization and mapping problem can be exploited to allow real-time, Euclidean mapping of large areas by decoupling online exploration from loop detection and correction.
In Chapter 3, we developed a solution method for the scaled, six-degree-of-freedom motion of a non-overlapping two-camera system that uses a minimal set of six temporal feature correspondences. This solution method was studied in detail, and degenerate cases were identified in which the scale of the motion cannot be calculated: primarily pure translational motion, and motion with the cameras' baseline aligned with the axle of a vehicle in a constant-velocity turn.
A second minimal solution method was developed to overcome the inherent degeneracies of non-overlapping stereo cameras while still allowing a wide total field of view. The method, presented in Chapter 4, uses a stereo camera with a small region of view overlap. In contrast to the perspective three-point (P3P) method (Haralick et al., 1994), our solution method needs to find only one feature correspondence in the cameras' region of overlap. This correspondence is used to find the camera system's translation, while the three rotational degrees of freedom are solved using features in the non-overlapping regions. Our solution method was shown to be more accurate than the P3P method as the cameras' region of overlap is reduced.
Real-time visual SLAM requires much more than accumulating relative motion estimates to find the camera system's path. It requires real-time feature tracking, visual odometry, loop detection and loop correction. We showed in Chapter 5 that the inherent parallelism in the VSLAM task and a small amount of acceptable latency can be combined to create an effective online, real-time VSLAM system. Three primary computational processes run in parallel in our implementation: scene flow estimation for temporally local correspondence finding, visual odometry for relative ego-motion estimation, and global SLAM for loop detection and correction. By splitting the processing into these threads of execution, feature tracking can always keep up with the cameras' frame rate and select key-frames, based on optical flow, for the visual odometry module to process. The visual odometry module can then estimate the relative pose of the cameras using RANSAC while simultaneously extracting SIFT features for wide-baseline matching. These relative poses and measurements are then passed to the global SLAM module, which finds loops and corrects the map to reflect them. Current loop-correction implementations based on bundle adjustment are too slow to keep up with the rate of key-frames. However, our system allows the global SLAM module to fall behind the visual odometry module while correcting a loop and to catch up later, after the loop is completed. By separating visual odometry from loop correction, the camera system can continue to explore its environment while it corrects the map to reflect loops it has already found.
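The three-thread decomposition described above can be sketched with queues decoupling the stages. This is a hypothetical skeleton (the names, stand-in payloads, and end-of-stream sentinel are illustrative): the unbounded queues are what allow the global SLAM stage to fall behind during loop correction and catch up later without stalling tracking.

```python
import queue
import threading

keyframes = queue.Queue()   # tracking -> visual odometry
rel_poses = queue.Queue()   # visual odometry -> global SLAM

def tracking(frames):
    # Stand-in for optical-flow-based key-frame selection at camera rate.
    for f in frames:
        keyframes.put(f)
    keyframes.put(None)              # end-of-stream sentinel

def visual_odometry():
    # Stand-in for RANSAC relative-pose estimation per key-frame.
    while (kf := keyframes.get()) is not None:
        rel_poses.put(("rel_pose", kf))
    rel_poses.put(None)

def global_slam(out):
    # Stand-in for loop detection and map correction; may run slower
    # than the producers without blocking them.
    while (m := rel_poses.get()) is not None:
        out.append(m)

results = []
threads = [threading.Thread(target=tracking, args=(range(5),)),
           threading.Thread(target=visual_odometry),
           threading.Thread(target=global_slam, args=(results,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each queue has a single consumer, relative poses arrive at the global SLAM stage in key-frame order even though the stages run concurrently.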
Many interesting, practical problems remain to be solved in visual SLAM. Feature matching over large changes in viewpoint remains a challenge. Features such as SIFT (Lowe, 2004) or MSER (Matas et al., 2002) can be reliably matched at up to approximately a thirty-degree change in viewing direction. This makes loop detection possible when a camera returns to the same part of a scene from roughly the same direction. However, when entering a hallway from the opposite direction, for example, a loop cannot be recognized because features cannot be matched. In the future, a variant of the Viewpoint Invariant Patch (VIP) of Wu et al. (2008) may help with this matching. VIP features can be matched at up to a ninety-degree difference in viewing direction. While they will make loop completion from greatly varying viewing directions possible, they will still not handle the most common case in indoor scenes: returning to the same hallway from the opposite direction.
The VIP is a kind of hybrid between sparse, image-based matching methods and dense matching, in which a depth value is assigned to most or all of the pixels in an image. With dense matching, a more complete scene geometry can be recovered. In the future, this more complete geometry may allow loop completion using a form of iterative closest point (ICP), which uses geometry rather than appearance to join models or detect loops.
Another major problem is extending VSLAM to map larger areas. Promising methods exist, such as the out-of-core bundle adjustment of Ni et al. (2007a). Their work uses a single-level decomposition of the environment to split the bundle adjustment problem into multiple sub-problems. In the future, a multi-level, hierarchical decomposition may allow mapping of areas of virtually unlimited size and complexity (limited only by disk space).
Dynamic, changing scenes are another challenge for VSLAM. Current VSLAM approaches assume a static world and try to exclude non-static objects from the map. But over time the environment can change. In offices, chairs move, cars enter and leave parking spaces, buildings are built and torn down, and even the terrain can change through erosion and other processes. Detecting these changes and incorporating them into the map will be necessary for any autonomous system using VSLAM over a long period of time. To know whether a point feature has moved or is simply an outlier, it must be grouped with other points on a single object or re-detected in the same location over a number of frames.
Points on a single object must have moved with a single, consistent motion to be considered inlier features. If a consistent, non-static motion is found for a group of features, even in a single frame, then the points could be grouped as an object and considered inliers to both the camera's motion and the dynamic object's motion with respect to the fixed scene.
One possible approach to dealing with dynamic scenes is to create a sort of four-dimensional map. Static parts of the scene, or parts that have not yet been detected to change, could be placed in one map, while a temporal map is layered on top of it. The temporal map would store objects that have changed pose or disappeared, along with their pose information and a time of detection. This would allow the system to form a more complete understanding of its environment.