MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING
Brian Sanderson Clipp
A dissertation submitted to the faculty of the University of North Carolina ...
Each of the processes described to this point can operate with only temporally or spatially local information, including feature extraction, feature tracking, and relative pose estimation. The next section will give background on global mapping operations, which use non-local information to create a map of the camera system’s operating environment.
2.5 Global Mapping
In this dissertation, global mapping refers to the processes that a SLAM system must perform to create an internal representation of its operating environment when the entire environment is not visible from a single point of view. At a minimum, the global map must reflect the topology of the operating environment, i.e., the connectivity between the various parts of the environment. This requires that the VSLAM system perform loop detection.
However, for many practical applications, a topological map is not enough and a Euclidean map is required. Some example applications where a Euclidean map is required are efﬁcient path planning and automated targeting. A Euclidean map is the sort of map we are generally accustomed to where each point on the map is known with respect to every other point and the map has a single scale. Additionally, in a Euclidean map distances and angles are preserved. When creating a Euclidean map it is not simply enough to make note of a loop when it is detected. The system must also perform loop correction to make the geometry of the map reﬂect the Euclidean geometry, not just the topology, of the operating environment. Additionally, the global geometry of a Euclidean map can help to eliminate false loop detections. For example, when mapping a building by moving around it, if two corners of a building look the same and have similar geometry, we can use the shape of the path taken around the building to disambiguate the corners.
2.5.1 Loop Detection
Visual loop detection has been studied extensively in the vision and robotics communities. Some authors chose to tackle loop detection based only on appearance (Cummins and Newman, 2008), while others included geometry in their approaches (Chli and Davison, 2008; Davison, 2003; Eade and Drummond, 2006; Handa et al., 2010; Irschara et al., 2009; Sivic and Zisserman, 2003; Williams et al., 2007).
Sivic and Zisserman (2003) were the first to apply text retrieval techniques to find objects within a video sequence. Their approach is suitable for loop detection since it finds images with common structure in a video sequence. They use both SIFT descriptors (Lowe, 2004) and maximally stable extremal regions (MSER) (Matas et al., 2002). Their approach consists of a preprocessing step followed by a retrieval phase. In the preprocessing step, SIFT and MSER features are extracted in each of the images. A subset of these images' features is then grouped using K-means clustering, and each of the resulting clusters is referred to as a visual word. They use approximately 500 images in the clustering and generate approximately 10,000 distinct visual words; the complexity of K-means clustering with such a large number of clusters limits the vocabulary size they can generate. Each image's features are then quantized into visual words. An inverted file is stored for each visual word, recording the images in which that word is found and the word's position within each image.
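A minimal sketch of the inverted-file structure described above, assuming a simple layout in which each visual-word ID maps to (image, position) pairs; the names and data layout are illustrative rather than Sivic and Zisserman's exact implementation:

```python
from collections import defaultdict

def build_inverted_file(quantized_images):
    """quantized_images: dict image_id -> list of (word_id, (x, y)) pairs."""
    inverted = defaultdict(list)
    for image_id, features in quantized_images.items():
        for word_id, position in features:
            # Record which image the word occurs in and where
            inverted[word_id].append((image_id, position))
    return inverted

images = {
    "frame_000": [(12, (34.0, 50.5)), (87, (10.0, 22.0))],
    "frame_001": [(12, (36.2, 48.9))],
}
index = build_inverted_file(images)
# Visual word 12 occurs in both frames:
print(sorted({img for img, _ in index[12]}))  # ['frame_000', 'frame_001']
```

At query time, only the inverted-file entries for the query's visual words need to be touched, which is what makes retrieval fast compared with scanning every image.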
When querying to find an object in the video, the user selects a region of interest containing the object. The system then finds the descriptors in this region and their visual words. It then uses a term frequency-inverse document frequency (TF-IDF) weighting formula to find other images in the video that contain the same visual words. The TF-IDF weighting takes into account both the frequency of a given visual word in the current image and the log inverse frequency of the visual word across the documents. The TF-IDF weight of visual word i in document d is

t_i = (n_id / n_d) log(N / n_i),

where n_id is the number of occurrences of visual word i in document d, n_d is the total number of visual words in document d, N is the number of documents in the database and n_i is the number of documents containing visual word i.
Each document in the database (video) is represented by a vector of TF-IDF values, one per visual word:

v_d = (t_1, t_2, ..., t_V)^T,

where V is the number of visual words in the vocabulary.
The dot products of the vector representing the query region and each document in the database can then be taken to ﬁnd images that are likely to contain the object of interest.
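The weighting and ranking steps above can be sketched in Python; the function names and the toy three-document database are illustrative, but the weight formula and dot-product ranking follow the description:

```python
import math

def tfidf_vector(doc_counts, doc_freq, num_docs):
    """TF-IDF weights t_i = (n_id / n_d) * log(N / n_i) for one document."""
    n_d = sum(doc_counts.values())  # total visual words in the document
    return {
        w: (n_id / n_d) * math.log(num_docs / doc_freq[w])
        for w, n_id in doc_counts.items()
    }

def score(query_vec, doc_vec):
    # Dot product over the visual words shared by query and document
    return sum(val * doc_vec.get(w, 0.0) for w, val in query_vec.items())

# Toy database: document -> {visual word id: occurrence count}
docs = {
    "A": {1: 3, 2: 1},
    "B": {1: 1, 3: 2},
    "C": {3: 4},
}
doc_freq = {1: 2, 2: 1, 3: 2}  # number of documents containing each word
vecs = {d: tfidf_vector(c, doc_freq, len(docs)) for d, c in docs.items()}
query = tfidf_vector({1: 2, 2: 1}, doc_freq, len(docs))
best = max(vecs, key=lambda d: score(query, vecs[d]))
print(best)  # "A": it shares the most informative words with the query
```

The rare word 2 carries a large inverse-document-frequency weight, which is why document A dominates the ranking despite word 1 appearing elsewhere.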
The method then performs a ﬁnal geometric veriﬁcation step. Their method starts with two matched feature regions and looks at the local neighborhood of each feature. The same features in the local neighborhood of the matched features should be found in both frames.
This is a form of loose geometric veriﬁcation, which does not enforce any sort of global model such as a homography or essential matrix.
Sivic and Zisserman show that their system vastly outperforms a naive direct descriptor-matching approach in terms of run time as well as precision-recall curves. The improvement in precision and recall is primarily due to the TF-IDF weighting of the visual words and their weak geometric verification.
Nistér and Stewénius (2006) extended this work to large-scale image retrieval. Their vocabulary tree approach uses a hierarchical k-means clustering of the feature space to partition features into visual words. The tree-based approach overcomes the scaling issues that limited the vocabulary size in Sivic and Zisserman (2003). With a larger number of visual words, the descriptor space can be broken up into more discriminative, smaller clusters, which improves image retrieval performance. Additionally, they found that they did not need to use geometric verification to achieve good performance. They show results on a database of up to one million images, three orders of magnitude larger than Sivic and Zisserman's results.
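Hierarchical quantization through a vocabulary tree can be sketched as follows; the tiny hand-built two-level tree stands in for one trained by hierarchical k-means on real descriptors:

```python
def nearest(centers, desc):
    # Index of the cluster center closest to the descriptor (squared L2)
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], desc)))

def quantize(node, desc):
    """Descend the tree, comparing only against each node's k centers.
    node: {'centers': [...], 'children': [...]} or {'word': id} at a leaf."""
    while "word" not in node:
        node = node["children"][nearest(node["centers"], desc)]
    return node["word"]

# Toy 2-level tree with branching factor k = 2 over 2D "descriptors"
tree = {
    "centers": [(0.0, 0.0), (10.0, 10.0)],
    "children": [
        {"centers": [(0.0, 0.0), (0.0, 4.0)],
         "children": [{"word": 0}, {"word": 1}]},
        {"centers": [(10.0, 8.0), (14.0, 14.0)],
         "children": [{"word": 2}, {"word": 3}]},
    ],
}
print(quantize(tree, (0.5, 3.5)))   # branch 0, closer to (0, 4) -> word 1
print(quantize(tree, (13.0, 13.0))) # branch 1, closer to (14, 14) -> word 3
```

With branching factor k and depth D the tree yields k^D words but only k*D distance computations per descriptor, which is what makes very large vocabularies practical.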
Cummins and Newman (2008) developed a probabilistic approach to loop detection using only image information, without geometric verification. Their fast appearance-based mapping (FAB-MAP) approach uses a bag-of-words representation (Sivic and Zisserman, 2003) in which image features are quantized into visual words. However, rather than use TF-IDF to determine the likelihood of two images matching, Cummins and Newman learn a generative model of the visual words. This model, learned from training data, reflects the fact that certain visual words are likely to co-occur in an image because they come from the same object. Knowing which features are likely to appear or disappear together because they are part of the same object or objects, the system in effect recognizes unique locations by finding unique groupings of objects in the imagery. The system also recognizes which visual words appear in many locations because of repetitive scene parts.
In the TF-IDF scheme, such repetitive visual words are down-weighted by the inverse document frequency term, while in Cummins and Newman's model a probabilistic approach is taken.
Cummins and Newman make a comparison between a naive Bayesian assumption that the likelihood of a visual word being found in an image is independent of the other visual words in the image and a model that takes into account the correlations between visual word occurrences. They show that using a simpliﬁcation of the full correlation between all features based on a Chow Liu tree (Chow and Liu, 1968) can provide signiﬁcant performance improvements over the naive Bayesian approach at little additional computational cost. The Chow Liu tree is a maximal spanning tree over the correlation between features which, while it is a simpliﬁcation over the full correlation matrix, provides good performance with close to real-time computation.
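A minimal sketch of Chow-Liu tree construction over binary visual-word occurrence data: pairwise mutual information is computed from training co-occurrences, and a maximum-weight spanning tree is kept. The toy occurrence matrix is illustrative, not FAB-MAP's training data:

```python
import math
from itertools import combinations

def mutual_information(xs, ys):
    """MI between two binary variables, estimated from paired samples."""
    n = len(xs)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for x, y in zip(xs, ys) if x == a and y == b) / n
            p_a = sum(1 for x in xs if x == a) / n
            p_b = sum(1 for y in ys if y == b) / n
            if p_ab > 0:
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(occurrences):
    """occurrences: one binary occurrence vector per visual word.
    Returns the edges of the maximum-MI spanning tree (Kruskal)."""
    words = range(len(occurrences))
    edges = sorted(((mutual_information(occurrences[i], occurrences[j]), i, j)
                    for i, j in combinations(words, 2)), reverse=True)
    parent = list(words)
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    kept = []
    for _, i, j in edges:  # take the highest-MI edges that don't form a cycle
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            kept.append((i, j))
    return kept

# Words 0 and 1 always co-occur (same object); word 2 is largely independent.
occ = [[1, 1, 0, 0, 1, 0],
       [1, 1, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1]]
tree_edges = chow_liu_edges(occ)
print(tree_edges)  # includes the strongly correlated pair (0, 1)
```

The tree keeps only V-1 of the O(V^2) pairwise dependencies, which is why evaluating the approximate joint distribution stays close to real time.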
Loop detection methods that use geometry as well as appearance can be divided into two classes: those that first use a camera pose prior and scene geometry to guide appearance-based matching (Chli and Davison, 2008; Davison, 2003; Eade and Drummond, 2006; Handa et al., 2010), and those that first match based on appearance and then verify the match geometrically (Irschara et al., 2009; Wang et al., 2005; Williams et al., 2007). Geometry-first approaches presume that there is some non-uniform prior on the camera's pose within the map and use this prior to guide feature matching. The mean and variance of the 3D features are projected into the expected next camera view. In combination with the uncertainty on the current camera pose, this gives a search region in the image in which to look for each feature. Chli and Davison (2008) showed that when sequentially finding features in this way, the information from one feature match can reduce the pose uncertainty and therefore limit the search region size for subsequent features. In later work, they developed more efficient methods to represent the probabilistic constraints between the feature projections (Handa et al., 2010). These methods make real-time probabilistic feature matching possible.
Geometry-ﬁrst methods have been successfully demonstrated on small room-sized environments. However, were they to be used in mapping larger environments, the large uncertainty in the camera pose prior after traveling long distances would become an issue.
With a large pose uncertainty, such as one would have after traversing a large loop, the projections of the feature uncertainty regions into the image would be extremely large. Also, the number of features that might project into the image grows, since the camera may take on a wide range of poses. A more efficient approach is needed.
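The effect can be made concrete with a small sketch: given a feature's predicted image position and its 2x2 projected covariance, matching is restricted to the 3-sigma ellipse, whose area grows with the covariance. The covariance values below are illustrative, not from any particular system:

```python
import math

def search_region_area(S, n_sigma=3.0):
    """Area of the n-sigma ellipse of a 2x2 covariance S = [[a, b], [b, c]]."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    half_trace = (a + c) / 2.0
    root = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam1, lam2 = half_trace + root, half_trace - root  # eigenvalues
    # Semi-axes are n_sigma * sqrt(eigenvalue)
    return math.pi * (n_sigma ** 2) * math.sqrt(lam1 * lam2)

def in_search_region(pred, S, meas, n_sigma=3.0):
    """Mahalanobis gate: (z - z_hat)^T S^-1 (z - z_hat) <= n_sigma^2."""
    dx, dy = meas[0] - pred[0], meas[1] - pred[1]
    a, b, c = S[0][0], S[0][1], S[1][1]
    det = a * c - b * b
    d2 = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return d2 <= n_sigma ** 2

tight = [[4.0, 0.0], [0.0, 4.0]]      # well-localized camera: sigma = 2 px
loose = [[400.0, 0.0], [0.0, 400.0]]  # after a long loop: sigma = 20 px
print(round(search_region_area(tight)))  # 113 px^2
print(round(search_region_area(loose)))  # 11310 px^2, 100x larger
print(in_search_region((320.0, 240.0), tight, (323.0, 241.0)))  # True
```

A 10x growth in per-axis uncertainty inflates the search region a hundredfold, so nearly every feature in the image falls inside every gate and active search loses its advantage.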
Appearance-first approaches can detect loops of any size because they do not rely on a pose prior. In the approach of Irschara et al. (2009), for example, a vocabulary tree (Nistér and Stewénius, 2006) and inverted files are used to find visually similar parts of a database of images.
These images have been previously processed into a 3D point cloud model of the environment. Features are then matched between the query image and the map based on descriptor similarity (dot product of SIFT descriptors). The perspective-three-point method (Haralick et al., 1994) in a RANSAC framework is then used to verify that the geometry of the scene could indeed generate the feature distribution in the query image. However, Irschara et al.
do more than just this. They create virtual views of the 3D point cloud model and use these as the database that query images are compared with. This signiﬁcantly reduces the number of images in the database, speeding up the histogram generation from the inverted ﬁles.
They also make extensive use of the graphics processing unit (GPU) to speed up processing query image features into visual words through the vocabulary tree.
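The geometric-verification step can be sketched as a generic RANSAC skeleton. A real system would plug in a perspective-three-point pose solver as the minimal model; here a one-point 2D translation model between matched points stands in so the sketch stays runnable and self-contained:

```python
import random

def ransac(matches, solve, residual, threshold, iters=100, seed=0):
    """Hypothesize models from minimal samples, keep the one with most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        sample = rng.choice(matches)  # 1-point minimal sample for this toy model
        model = solve(sample)
        inliers = [m for m in matches if residual(model, m) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Stand-in minimal solver and residual: a 2D translation between matched points
solve = lambda m: (m[1][0] - m[0][0], m[1][1] - m[0][1])
residual = lambda t, m: abs(m[1][0] - m[0][0] - t[0]) + abs(m[1][1] - m[0][1] - t[1])

# Eight consistent matches (shift by (5, 2)) plus two gross outliers
good = [((x, x + 1.0), (x + 5.0, x + 3.0)) for x in range(8)]
bad = [((0.0, 0.0), (40.0, -7.0)), ((1.0, 2.0), (-30.0, 9.0))]
model, inliers = ransac(good + bad, solve, residual, threshold=1.0)
print(model, len(inliers))  # (5.0, 2.0) supported by 8 inliers
```

A putative loop closure is accepted only if the inlier count exceeds a threshold; the two outliers here are rejected because no translation explains them together with the consistent matches.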
A wholly different approach to feature or key-point identification was developed by Ozuysal et al. (2007, 2010) and demonstrated in a real-time monocular SLAM system by Williams et al. (2007). Their approach, which they call ferns, uses a collection of simple Bayesian classifiers to separate key-points into a set of classes. These classifiers (ferns) start with an image patch representing the key-point. Each classifier is made up of a set of binary decisions: two pixels are randomly selected in the patch, and the decision value is one if the first pixel is brighter than the other, zero otherwise. Applying a large number of these random binary tests to the image patch, a distribution of the resulting binary numbers (fern outputs) can be estimated over a large training set of examples of the same image patch. These examples can be synthetically generated affine warps of an original image patch. In practice, the number of binary decisions (leaves in the fern) must be large (L = 300) to achieve acceptable classification performance. For each key-point, the method must calculate the probability that it takes on any one of the 2^300 possible fern output values. This is intractable, so the authors simplify the computation by breaking the large fern into M smaller ferns, each of which can take on 2^(L/M) values.
The classiﬁcation results of this group of ferns are then multiplied together to form the ﬁnal probability of a feature being of a particular class. Grouping the binary decisions into small ferns performs better experimentally than assuming that each of the binary decisions is statistically independent.
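The fern classification scheme can be sketched as follows; the 8x8 patches, hand-picked pixel pairs, and two-class setup are toy stand-ins for the random tests and many key-point classes of the original method:

```python
import math
import random

def fern_value(fern, patch):
    """Binary pixel-pair comparisons packed into one integer bin index."""
    v = 0
    for p, q in fern:
        v = (v << 1) | (1 if patch[p] > patch[q] else 0)
    return v

def train(ferns, examples, num_classes):
    # counts[fern][class][bin], with add-one smoothing so unseen bins
    # never zero out the product of fern posteriors
    counts = [[[1] * (1 << len(f)) for _ in range(num_classes)] for f in ferns]
    for patch, cls in examples:
        for f, fern in enumerate(ferns):
            counts[f][cls][fern_value(fern, patch)] += 1
    return counts

def classify(ferns, counts, patch, num_classes):
    # Multiply per-fern class posteriors: a sum of logs in practice
    def log_posterior(cls):
        return sum(math.log(counts[f][cls][fern_value(fern, patch)] /
                            sum(counts[f][cls]))
                   for f, fern in enumerate(ferns))
    return max(range(num_classes), key=log_posterior)

# Two toy 8x8 patch classes (flat lists of 64 intensities):
# bright top half vs bright bottom half, with additive noise.
rng = random.Random(1)
bright_top = [255] * 32 + [0] * 32
bright_bottom = [0] * 32 + [255] * 32
noisy = lambda p: [v + rng.randrange(40) for v in p]
ferns = [[(0, 63), (5, 60)], [(2, 40), (7, 33)]]  # M=2 ferns, S=2 tests each
examples = [(noisy(bright_top), 0) for _ in range(20)] + \
           [(noisy(bright_bottom), 1) for _ in range(20)]
counts = train(ferns, examples, 2)
print(classify(ferns, counts, noisy(bright_top), 2))     # 0
print(classify(ferns, counts, noisy(bright_bottom), 2))  # 1
```

Grouping the tests into ferns means each table has only 2^(L/M) entries, while the product over ferns still captures correlations among the tests within each fern.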
Ferns were an outgrowth of and an improvement on Lepetit and Fua's work on randomized forests (Lepetit and Fua, 2006). Lepetit and Fua were the first to consider the wide-baseline matching problem as a classification problem in which a key-point should be mapped to a class or view-set. This is the same approach they later took with ferns, with the difference that randomized forests make use of randomized trees (Amit et al., 1996) for the classification while ferns use the non-hierarchical fern structure.
Ozuysal et al.'s ferns perform the same sort of quantization as a combination of affine-invariant feature extraction (SIFT, for example) followed by hierarchical quantization through a vocabulary tree. However, ferns do not require the extraction of an affine-invariant feature with a descriptor vector, which is usually costly to compute. Also, the number of ferns and the number of binary decisions in each fern can be tailored to the computational resources available on a given platform: fewer ferns might be used on an embedded system, for example, and more on a laptop. Of course, this comes at a price. With fewer ferns, the features can be broken into fewer classes, reducing the classes' discriminative power.
Williams et al. (2007) showed that ferns could be used to recover from a loss of tracking in monocular VSLAM. While loss of tracking is not exactly the same problem as loop detection, Eade and Drummond (2008) have pointed out that the same location recognition mechanism can be used for both. Williams's SLAM system is based on an extended Kalman filter for map and pose estimation (Davison et al., 2007). Each feature that is added to the map is passed to a background process that warps a patch about the feature in various ways, such as rotation, scaling, and perspective warping. These warped patches are then passed into the ferns, which learn the distribution of appearances that may occur for a given feature patch. Later, when the camera becomes lost, the system can classify the features in a new image using the ferns to find likely matches between the features in the current image and those in the map.