# MULTI-CAMERA SIMULTANEOUS LOCALIZATION AND MAPPING

Brian Sanderson Clipp. A dissertation submitted to the faculty of the University of North Carolina ...

Only three of the six possible sets of constraints have been used for relative pose estimation of stereo cameras in the literature so far.

The displayed minimal sets of correspondences are established by counting the independent constraints induced by a combination of correspondences. A four-view correspondence constrains 3 DOF of the relative pose between the two stereo camera positions, a three-view correspondence 2 DOF and a two-view correspondence 1 DOF. For a minimal case, the total number of constraints generated by the correspondences must be exactly six. However, one must be careful to consider redundancy in combinations of correspondences. A simple counting argument would suggest that a relative pose problem with two four-view correspondences and no other correspondences is minimal. In fact this configuration is under-determined: the camera system can still rotate about the line through the two 3D points, so one degree of freedom remains free.
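The counting argument above can be made concrete with a small sketch. The function and its name are illustrative, not part of the dissertation's solver; as the docstring notes, counting to six is necessary but not sufficient for a well-posed minimal case.

```python
# DOF constrained per correspondence type, as given in the text.
CONSTRAINTS = {"four_view": 3, "three_view": 2, "two_view": 1}

def classify(n4, n3, n2):
    """Classify a correspondence combination by the simple counting argument.

    Counting alone is necessary but not sufficient: e.g. (2, 0, 0) counts
    to six constraints yet is geometrically under-determined, because the
    rig can still rotate about the line through the two 3D points.
    """
    total = 3 * n4 + 2 * n3 + 1 * n2
    if total < 6:
        return "under"
    if total == 6:
        return "minimal"
    return "over"

print(classify(0, 0, 6))  # case 0,0,6 (non-overlapping rig, Chapter 3) -> minimal
print(classify(1, 1, 1))  # case 1,1,1 -> minimal
print(classify(2, 0, 0))  # counts as minimal, but geometrically degenerate
```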

Figures 4.3 to 4.6 illustrate additional minimal geometric configurations that could give rise to minimal solution methods. None of these cases has yet been verified by developing an algebraic solution method based on its geometry. These cases are described by their numbers of four-view, three-view and two-view correspondences in the format case (#four-view, #three-view, #two-view). For example, the solution method for a non-overlapping stereo camera described in Chapter 3 is denoted case 0, 0, 6.

In the figures, solid thick lines connect camera centers to points where both the ray direction and the point depth are known. Dashed lines connect camera centers to points where only the ray direction is known. Since the second camera in the stereo pair is only used to find the depth of three-view and four-view features, it is not drawn in the figures.

Case 1, 1, 1 is shown in Figure 4.3. This geometry clearly fully constrains the two cameras since the four-view and three-view features constrain the two cameras so that their only degree of freedom is rotation about the vertical line through the two features. A single additional two-view correspondence should provide the additional constraint required to fully constrain the relative pose of the two cameras, P0 and P1.

**Figure 4.2: Over-constrained, minimally-constrained and under-constrained combinations of features for six degree of freedom motion estimation for a rigid, calibrated two-camera system.**

Figure 4.4 illustrates case 0, 1, 4, which also clearly reflects a fully constrained geometry. This can be seen by considering the solution method that might be applied to find the relative pose. The five-point method can be used with the five correspondences, ignoring for the moment that one of the correspondences has a known depth. The five-point method finds the transformation's rotation and translation direction, so at this point the transformation is known up to an unknown scale factor. Using the three-view correspondence's depth, this scale factor can be determined to recover the six degree of freedom, absolutely scaled transformation from P0 to P1. Additional combinations of three-view and two-view features may lead to other geometries for case 0, 1, 4: rather than all of the two-view features coming from the same camera, e.g. the left camera in the stereo pair, they might come from some combination of left-to-left and right-to-right camera correspondences. These combinations of correspondences might be examined in future work.
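The scale-recovery step described above can be sketched as follows. This is an assumption-laden illustration, not the dissertation's implementation: it assumes the pose convention X_p0 = R X_p1 + t, and the hypothetical helper `recover_scale` uses a simple linear least-squares two-ray triangulation.

```python
import numpy as np

def recover_scale(R, t_unit, x0, x1, known_depth):
    """Fix the unknown scale of an up-to-scale relative pose (R, t_unit).

    Convention (an assumption of this sketch): X_p0 = R @ X_p1 + t.
    x0, x1 are unit ray directions to the three-view feature in P0 and P1;
    known_depth is the feature's depth along x0 from stereo triangulation.
    With ||t|| = 1 the two rays triangulate to a depth d0; scaling so that
    d0 matches known_depth yields the absolutely scaled translation.
    """
    # Two-ray triangulation as linear least squares:
    # d0 * x0 - d1 * (R @ x1) = t_unit
    A = np.stack([x0, -(R @ x1)], axis=1)        # 3x2 design matrix
    depths, *_ = np.linalg.lstsq(A, t_unit, rcond=None)
    scale = known_depth / depths[0]              # true depth over unit-baseline depth
    return scale * t_unit                        # absolutely scaled translation
```

For example, with identity rotation, a true translation of 2 m along x, and a feature 5 m ahead of P0, the unit-baseline triangulated depth is 2.5 m, so the recovered scale is 2.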

Two possible geometries exist for case 0, 2, 2, which we will refer to as case 0, 2, 2 a and case 0, 2, 2 b. Case 0, 2, 2 a fully constrains the relative pose of P0 and P1. After the two three-view correspondences are included in the geometry, camera P1 has two remaining degrees of freedom: it can rotate about the line through the two three-view features and it can move along the circle shown in Figure 4.5. These two degrees of freedom should be resolved by the two two-view correspondences. Upon inspection, case 0, 2, 2 b (Figure 4.6) does not appear to fully constrain the relative pose, but it is included here because it warrants additional study.

## 4.6 Degenerate Cases

In this section, we describe certain configurations of features that lead to degeneracies in our solution for case 1, 0, 3. The major reason we use two two-view correspondences in the left (or right) camera and one two-view correspondence in the other camera to solve for the rotation is a degeneracy that can occur when using correspondences from only one of the cameras. If all three two-view correspondences are selected from either the left or

**Figure 4.5: The first case of the minimal geometry consisting of two three-view and two two-view correspondences, case 0, 2, 2 a.**

While an algebraic solution is not provided in this dissertation, this geometry appears to fully constrain the relative camera poses.

**Figure 4.6: The second case of the minimal geometry consisting of two three-view and two two-view correspondences, case 0, 2, 2 b.**

It is not immediately clear that this geometry fully constrains the relative camera poses, but it may.

**Figure 4.7: The first degenerate case.**

When all the features (two-view and four-view) lie on a 3D line through the four-view feature, the second camera can be anywhere on a circle: the intersection of the sphere with a plane orthogonal to the line containing the four-view feature.

the right camera to solve for the rotation, and the four-view feature is equidistant from the left camera at both poses, then the configuration is degenerate and a one-dimensional variety of solutions exists. This situation is resolved by using two-view correspondences from both the left and right cameras to solve for the rotation.

One truly degenerate case that may arise with our method in practice is when all three features that give rise to the two-view correspondences lie on a line through the 3D point used in the minimal solution at the center of the sphere. In this case, the camera can be anywhere on a circle described by the intersection of the sphere and a plane through the center of the sphere orthogonal to the line. This configuration is depicted in Figure 4.7.

This configuration might occur in man-made environments where straight lines are present. However, it is a common degeneracy for all relative pose solution methods and can easily be avoided by never selecting a sample of correspondences that are all collinear in both images.
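A sample-rejection test of the kind just described could look like the sketch below. The SVD-based collinearity measure, the tolerance and the function name are all illustrative assumptions, not the dissertation's actual check.

```python
import numpy as np

def sample_is_collinear(points, tol=1e-6):
    """Return True if the sampled image points are (nearly) collinear.

    points: Nx2 array of image coordinates from one view. If all points
    lie on a line, the configuration is degenerate (cf. Figure 4.7) and
    the RANSAC sample should be redrawn. The smallest singular value of
    the centered point matrix serves as the collinearity measure.
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return bool(s[-1] < tol * max(s[0], 1.0))

print(sample_is_collinear([[0, 0], [1, 1], [2, 2], [3, 3]]))  # True
print(sample_is_collinear([[0, 0], [1, 0], [0, 1], [1, 1]]))  # False
```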

## 4.7 Synthetic Experiments

In this section, we evaluate the performance of our minimal solver using two synthetic experiments. First, we evaluate the performance of the solver after non-linear refinement under varying levels of image noise, with and without outliers, to test its ability to deal with corrupted data. Second, we compare our solver to the three-point perspective pose solution without refinement while decreasing the overlap of the stereo pair by rotating the cameras of the rigid system around their vertical axes. In these and the experiments in Section 4.8, the solution method based on the small-angle approximation is used, as it considerably simplifies the polynomial equations that must be solved to find the camera system motion.

The first experimental setup tests random motions of a stereo camera. For ease of explanation, we assume all units of length are in meters. The two cameras have a baseline of 0.5 m, have parallel optical axes, and are placed in a standard stereo configuration in which both camera centers lie on a line orthogonal to the two optical axes. The first camera is placed with identity rotation and with the left camera of the stereo head at the origin. Three-dimensional feature points are distributed in a 20 × 20 × 20 m volume in front of the camera. The second camera pose is generated by first randomly translating the camera between 0.2 and 3.5 meters in a random direction. The minimum translation reduces the effect of being close to the degenerate case in which the two cameras are the same distance from the 3D feature at the center of the sphere. The second stereo pair is then rotated randomly by up to five degrees about each of the three rotation axes. Based on visibility, we divide the 3D features into three classes: those that can be seen in both cameras of the stereo system at both times (four-view features), those that can be seen only in the left camera at both times, and those that can be seen only in the right.

In this way we model the effect of a limited overlap in the fields of view.
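The random second pose described above could be generated along these lines. This is a sketch under the stated experimental parameters; the dissertation's actual sampling code is not shown, and the XYZ Euler-angle composition is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_second_pose():
    """Draw a random second pose as described in the text (meters, degrees).

    Translation: 0.2-3.5 m in a uniformly random direction.
    Rotation: up to +/-5 degrees about each of the three axes.
    """
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)          # uniform random direction
    t = rng.uniform(0.2, 3.5) * direction           # enforce minimum translation

    angles = np.deg2rad(rng.uniform(-5.0, 5.0, size=3))
    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx, t

R, t = random_second_pose()
```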

We test the performance of our proposed minimal solver under varying levels of noise and outliers. We use RANSAC (Bolles and Fischler, 1981) to find an initial solution and run a bundle adjustment on the inliers to refine the relative pose. Figure 4.8 shows the rotation and translation direction error under varying levels of noise added to the features. The translation error is the angle between the true and estimated translation vectors. We use the translation direction rather than the distance because, in situations where the features are relatively far from the camera compared to the stereo baseline, the magnitude of the translation is only weakly observable; by comparing translation directions we compare quantities that have greater observability in the data. Given that our method uses the 3D location of one feature, we triangulate the noisy image measurements of this feature in both stereo pairs independently and use the triangulated point locations as input to our solver. Image noise is reported in degrees. Note that 0.1° corresponds to about two pixels for a camera with a sixty-degree field of view and 1200 pixels of horizontal resolution.
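The quoted correspondence between angular noise and pixels can be checked with a short pinhole-model calculation, applied near the image center:

```python
import math

def angular_noise_to_pixels(noise_deg, fov_deg=60.0, width_px=1200):
    """Approximate pixel displacement for an angular perturbation near the
    image center, for a pinhole camera with the given horizontal FOV."""
    # Focal length in pixels from the horizontal field of view.
    f = (width_px / 2.0) / math.tan(math.radians(fov_deg / 2.0))
    return f * math.tan(math.radians(noise_deg))

print(round(angular_noise_to_pixels(0.1), 2))  # 1.81 px, i.e. about two pixels
```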

Figures 4.8 and 4.9 clearly show that our solver is able to separate inliers from outliers in the presence of noise.

The second experiment is designed to test our method's performance against the three-point perspective pose method (P3P) in a typical indoor scenario. The camera is placed in a corridor that has 3D features randomly distributed in its 0.1 m thick walls. The corridor is 20 m long, 2 m high and 2 m wide. The first stereo camera is placed in the middle of the corridor pointing down the corridor at the far wall, which is 10 m away from the camera. The second stereo camera is randomly translated and rotated such that it moves down the hall and points on the far wall remain visible.
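The corridor scene might be synthesized along these lines. This is an illustrative sketch only: whether features populate all four walls or just the two side walls is an assumption, as is the coordinate frame (camera looks along +z, corridor centered on the z axis).

```python
import numpy as np

rng = np.random.default_rng(1)

def corridor_features(n=500):
    """Scatter n 3D features in the 0.1 m thick walls of a corridor that is
    20 m long and 2 m x 2 m in cross-section, centered on the z axis."""
    pts = []
    for _ in range(n):
        wall = rng.integers(4)                  # pick one of the four walls
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        z = rng.uniform(-10.0, 10.0)            # along the corridor
        d = rng.uniform(0.0, 0.1)               # depth inside the wall
        if wall == 0:
            pts.append((1.0 + d, y, z))         # right wall
        elif wall == 1:
            pts.append((-1.0 - d, y, z))        # left wall
        elif wall == 2:
            pts.append((x, 1.0 + d, z))         # ceiling
        else:
            pts.append((x, -1.0 - d, z))        # floor
    return np.array(pts)

pts = corridor_features()
```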

We progressively reduce the overlap between the cameras by rotating the left and right cameras' optical axes away from each other. We compare the accuracy of the relative pose calculated using our method and P3P after RANSAC but without non-linear refinement.

This provides a measure of how close the RANSAC solution is to the true solution: the closer it is, the more likely it is that non-linear refinement will find the globally minimal error solution. We test both methods on exactly the same random motions, 3D feature locations and noisy feature measurements over 1000 trials and report the average results.

**Figure 4.8: Absolute rotation error and translation direction error after non-linear refinement under varying noise with and without outliers.**

The slightly larger error with outliers at larger noise values may be due to the fact that fewer inlier features are used to calculate these solutions than in the pure inlier sets. For example, if 100 features were used in the experiment, then with 10% outliers only 90 features would be used to calculate the motion vs. 100 with 0% outliers.

**Figure 4.9: Error in the scale of camera translation under varying noise with and without outliers.**

The displayed error is given as (‖T_est‖ − ‖T_true‖) / ‖T_true‖, where ‖T_est‖ and ‖T_true‖ are the estimated and true translation magnitudes.

For the P3P method we first triangulate 3D points between the left and right cameras of the stereo system at the initial pose, P0. We then use the projections of these triangulated features in the left image of the stereo head at the second pose, P1, to calculate the relative pose. We calculate inliers and outliers and score both the P3P solution and our solution method in the same manner. We use an adaptive stopping criterion so that we can compare the number of RANSAC samples required to reach 99% confidence. We also compare the rotation error, translation direction error and scale error of the two methods. In Figures 4.10 through 4.12, the percentage in the legend shows the percent overlap at infinity between

**Figure 4.10: Comparison of absolute rotation error after RANSAC without outliers.**

This is the error for the best sample from RANSAC without non-linear refinement.

the cameras. The cameras have a 60° field-of-view horizontally and vertically.
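The adaptive stopping criterion mentioned above follows the standard RANSAC trial-count formula, N = log(1 − p) / log(1 − w^s) for confidence p, inlier ratio w and sample size s. The sample size of four for our solver (one four-view plus three two-view correspondences, case 1, 0, 3) is inferred from the method description; this sketch is not the dissertation's implementation.

```python
import math

def ransac_trials(inlier_ratio, sample_size, confidence=0.99):
    """Samples needed to draw an all-inlier minimal sample with the given
    confidence: N = log(1 - p) / log(1 - w**s)."""
    good = inlier_ratio ** sample_size     # probability one sample is all inliers
    if good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - good))

print(ransac_trials(0.5, 3))  # P3P, sample size 3 -> 35
print(ransac_trials(0.5, 4))  # our solver, sample size 4 -> 72
```

The smaller sample of P3P needs fewer trials at the same inlier ratio, which is part of what the comparison of required RANSAC samples measures.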

Comparing Figures 4.10 and 4.11 one can clearly see that the performance of the P3P method decreases with decreased overlap in the cameras while our method has virtually constant performance regardless of overlap. With 100% overlap, P3P outperforms our method. However, with 25% overlap, the two methods perform comparably and with 5% overlap, our minimal solution outperforms P3P for typical noise values. Figure 4.12 shows that our method performs with roughly the same scale error regardless of overlap while the P3P method degrades.

**Figure 4.11: Comparison of translation direction error after RANSAC without outliers.**

This is the error for the best sample from RANSAC without non-linear refinement.

To demonstrate our minimal solver, we have incorporated it into a real-time (12 fps processed), stereo-camera-based structure from motion system. The system uses a stereo pair of 1024×768 resolution cameras with approximately 40° by 30° fields of view to collect video input. The system performs 2D feature tracking using a graphics processing unit (GPU) implementation of multi-camera scene flow (Devernay et al., 2006), which is an extension of Kanade-Lucas-Tomasi (KLT) tracking (Lucas and Kanade, 1981) into three dimensions. Features are matched between the two cameras in the stereo head and triangulated.