# «MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING Brian Sanderson Clipp A dissertation submitted to the faculty of the University of North Carolina ...»

Our solution method is based on the observation that a feature visible in all four cameras (corresponding to a pair of cameras at two poses) constrains the relative pose of the second stereo camera to be on a sphere around this particular feature, which has a known position relative to the ﬁrst stereo camera pose from triangulation, as shown in Figure 4.1. Features seen in only the left or right camera at both poses (two-view features) are labeled S1..3, and the feature seen in all four cameras is labeled Q (for quad or four-view feature). The constraint imposed by the four-view feature leaves three degrees of freedom: two degrees for the location of the second camera on the induced sphere, and one for the rotation in the tangent plane to the sphere.

It appears natural to employ two-view correspondences visible only either in the left or right view. There is a surprising degeneracy in this setting, if e.g. the left camera of the stereo pair has the same distance from the triangulated four-view feature in both poses (which can be readily veriﬁed e.g. using Macaulay 2 (Grayson and Stillman). More importantly, this degeneracy also has an impact on the numerical accuracy of the returned solution, if the camera-point distances at both time instances do not vary signiﬁcantly (which is usually the case in VSLAM settings). To avoid this degeneracy and to obtain well-conditioned solutions even in the small motion case, we use a pair of two-view correspondences in the left (or right) camera and one two-view correspondence in the other camera to solve for the remaining three degrees of freedom.

The major advantage of our approach is that it uses the total ﬁeld-of-view of the stereo camera system to determine the rotation, which maximizes accuracy. Given a stereo pair with a small overlap between its views, one could triangulate 3D points in the region of overlap in the ﬁrst stereo camera. And, use the three-point perspective pose solution (Har

the second stereo camera with respect to the ﬁrst camera. Due to the small overlap, the relative rotation and translation from these methods could be inaccurate because most of the 3D points would be in approximately the same viewing direction relative to the ﬁrst camera. Our method ﬁxes the rotation with features which do not have to be in the region Figure 4.1: Geometry of stereo pair relative pose problem. Features seen in only the left or right camera are labeled Si (two-view features). The feature seen in both left and right cameras at both times is labeled Q (four-view feature).

of overlap. With this method we get a more accurate rotation with small overlap, which in turn feeds into a more accurate translation measurement.

Our solver necessitates some modiﬁcation to standard RANSAC (Bolles and Fischler,

1981) to account for the differing amount of information in a two-view correspondence and a twice-triangulated four-view correspondence. We develop a method to weigh the relative information in each type of inlier to determine the best solution in the RANSAC process.

We also describe how to modify the RANSAC stopping criteria to account for two classes of measurement.

4.2 Background Estimating SfM using multi-camera systems has been a topic of recent interest in the literature. In some instances, one can take advantage of a reduced degree of freedom (DOF)

˚o Astr¨ m, 2004) where they theoretically derive solutions for planar SfM for many different combinations of rigidly mounted cameras, features, and multi-camera system poses. Scaramuzza et al. (Scaramuzza et al., 2009) take advantage of the fact that a vehicle moves with planar non-holonomic motion to calculate the absolutely scaled structure and motion using only one point correspondence. In contrast, our approach is to be used with stereo camera systems allowed to move in all 6DOF.

A few approaches to solve for six degree of freedom motion of a multi-camera system exist which can solve for the system’s motion without any overlapping views. Kim et al.

described one such method in (Kim et al., 2007). They ﬁrst calculate essential matrices for all the system’s cameras and decompose these essential matrices into rotation and translation direction. They show that the scale of the motion can then be solved as a triangulation problem.

Another approach was presented in Chapter 3 and in Clipp et al. (2008) where the same problem is addressed using a minimal number of six feature correspondences: ﬁve from one camera are used to determine the rotation and translation direction, and the sixth one from a different camera determines the translation scale. Both methods (Clipp et al., 2008;

Kim et al., 2007) are based on decomposing the translation of the individual cameras into translation of the total system and the rotation induced translation. This can only be done if these two vectors are not parallel or close to parallel. Accordingly, these methods are only applicable to certain key-frames in any general video sequence.

The generalized epipolar constraint of multi-camera systems was developed by Pless in (Pless, 2003) to allow a network of rigidly-coupled cameras to be treated as a single imaging device for SfM. Pless suggests a linear 17-point solution for relative motion using the generalized epipolar constraint but left numerical results for this approach to future work. Li et al. show in (Li et al., 2008) that this standard linear 17-point approach to solve for relative motion using the generalized epipolar constraint based on singular value decomposition does not work in many common scenarios and propose a non-minimal, linear solution that does.

for the relative pose of a generalized camera and develop a solution method for the relative pose of stereo cameras from six ray correspondences. They demonstrate that up to 64 solutions may exist for the proposed constraints. Their solution is degenerate if all cameras in the multi-camera system are on a line, which of course is the case with a stereo camera (our target here).

At the other end of the spectrum are methods that require all features used in pose estimation to be in the multi-camera system’s region of overlap in the ﬁrst pose. These include the before mentioned three-point perspective pose problem (Haralick et al., 1994) and the

not vulnerable to the pure translation degeneracy, in contrast to (Clipp et al., 2008; Kim et al., 2007), they require sufﬁcient overlap between the cameras’ ﬁelds-of-view, necessitating a large trade-off between overlap and total ﬁeld-of-view.

Our approach, using one four-view feature to ﬁx the camera translation and three twoview correspondences to ﬁnd the rotation, occupies a middle ground in the space of problems considered until now.

4.3 Solution Method In this section, we describe our approach for relative pose estimation between two poses of a stereo rig. Let P0 and P1 denote the projection matrices of the left and the right camera for the ﬁrst time instance, and P0 and P1 are those for the second time instance. Without loss of generality, we assume a rectiﬁed stereo system, i.e. P0 = (I|0) and P1 = (I|b), where −b is the baseline between the cameras on the stereo rig. General conﬁgurations can be reduced to this case by appropriate rotation of the 2D feature positions (corresponding to 3D camera rays emerging from the projection center). We also assume the camera intrinsics and lens distortion parameters are known, and that feature positions in the image are undistorted and normalized according to the intrinsic parameters.

Let (R|t) denote the Euclidean transformation between the time instances, i.e.

The 3D point visible in both cameras has coordinates x for the ﬁrst time instance and y in the second instance (always with respect to the left camera in the rig). The 3D points x and y are found through triangulating their associated correspondences in each stereo pair.

Hence,

where we used the fact that [Rx]× Ry = (Rx) × (Ry) = R(x × y) = R[x]× y for rotation matrices R.

Overall, Equations 4.1 and 4.2 allow us to express both essential matrices in terms of the unknown rotation R, i.e.

In general, with two correspondences in the left camera and one correspondence in the right camera, there are three equations for the three degrees of freedom of the rotation. Using e.g.

unit quaternions to represent the rotation matrix R, a polynomial system of four equations can be formulated, which contains the three epipolar constraints (two constraint equations for the two-view correspondence in the left camera and one constraint equation for the twoview correspondence in the right camera, or one left and two right)and the unit quaternion constraint. By computing the elimination ideal (e.g. (Cox et al., 1997)), a 16th degree univariate polynomial (with only even powers) is obtained. The two solutions differing only in sign correspond to the quaternions (q0, q1, q2, q3 ) and (−q0, −q1, −q2, −q3 ), which actually represent the same rotation.

of equations described above (using exact arithmetic in ﬁnite prime ﬁelds, as suggested in (Stew´ nius, 2005)), and generate efﬁcient code automatically using Buchberger’s algoe rithm to solve real instances of the pose estimation problem. In order to allow real-time processing, we utilize a root ﬁnding procedure based on Sturm bracketing instead of using one of the numerically more stable, but substantially slower approaches (e.g. (Bujnak

grades with decreasing baselines in terms of the 3D scene depths, which reduces its utility in real-world situations. Nevertheless, the assumption of small motion (in particular small rotations) of the camera system over time (also commonly employed in differential feature tracking methods) allows us to simplify the polynomial system as follows.

First, we represent the rotation R by modiﬁed Rodrigues parameters σ (Schaub and

θ/4 instead of e.g. θ/2 for the classical Rodriguez parametrization. Hence, this particular representation relaxes the assumption of small rotation angles in comparison with other representations.

Under the assumption of small rotations, we approximate R(σ) by its second order Taylor approximation and insert the resulting expression into Eqs. 4.3 and 4.4. We chose to use a second order Taylor series approximation so that our solution would work with larger rotations than a ﬁrst order approximation would allow. The resulting polynomial system has three equations of degree two. The corresponding Groebner basis trace leads to an 8th degree polynomial and consists of only a few steps, hence the induced solution procedure is considered to be numerically stable. Root ﬁnding and back-substitution give the modiﬁed Rodrigues parameters σ and the corresponding rotation through Eq. 4.5, which is only an approximate solution to the original problem. The reported possible rotations are nonlinearly reﬁned to satisfy Eqs. 4.3 and 4.4 precisely.

4.4 RANSAC Considerations The minimal solution method described in the previous section uses two modes of data points—point feature matches in 3D space, and feature correspondences in 2D. Hence, we have to deviate from the uniform treatment of samples employed in traditional RANSAC settings.

The ﬁrst modiﬁcation addresses the reﬁned stopping criterion to account for the potentially different inlier ratios for the two different types of correspondences. The algorithm

both cameras (four-view features). Without loss of generality we assume that two temporal correspondences from the left camera and one correspondence from the right camera are utilized in the minimal solver. Then the modiﬁed stopping criterion is given by

where n is the number of samples required to achieve conﬁdence α in the solution. Clearly, the term is the probability of selecting only inliers as required by our solution method lrd (Section 4.3).

The inlier scoring function for a pose hypothesis also needs adjustment. In standard RANSAC the total count of inliers is used to score the current hypothesis, which assumes that all data points contain the same amount of information. In the given setup we face a mixed set of data points consisting of 3D points visible in 4 images in total, and 2D feature correspondences. In both cases the latent variables have 3 degrees of freedom per sample—the coordinates of the underlying 3D point. However, the dimension of the observed measurement is either 4-dimensional (two 2D feature positions in either the left or the right camera) or 8-dimensional (2D positions in both cameras at both points in time).

Consequently, the residual error lies in the space orthogonal to the ﬁtted manifold (Torr, 2002). Accordingly, the error space for 3D points visible in all cameras is 5-dimensional (8-3), and the residuals for points visible either in the left or right camera is 1-dimensional (4-3). Therefore, inlier features visible in four views carry ﬁve times more information than two view correspondences. Thus, the utilized weighting to combine the respective inlier counts is given by

where #Inliers(3D) denotes the number of inliers among fully visible 3D points (fourview features), and #Inliers(lef t) and #Inliers(right) designate the inlier counts of the respective two-view correspondences.

4.5 Minimal Sets of Correspondences for Stereo Cameras Our method uses a novel combination of image correspondences to determine the relative pose of a stereo camera. This leads to the question of what other combinations of correspondences can be used to solve for a stereo camera’s relative pose. Four images are available as input to compute the relative pose between stereo cameras. Thus, a point feature may be detected and matched in all four (4-V), three (3-V) or only two views (2-V).

** Table 4.5 shows all of the combinations of respective feature correspondences, which result in a minimal set of constraints fully determining the relative pose of a stereo camera.**