3.5 3D EAR RECOGNITION LLE (43%) is improved significantly by their method to 60.75%. If the pose variation is only 10 degrees, the improved LLE approach achieved a rank-1 performance of 90%.

In their approach Nanni and Lumini [132] propose to use Sequential Forward Floating Selection (SFFS), which is a statistical iterative method for feature selection in pattern recognition tasks. SFFS tries to find the best set of classifiers by creating a set of rules, which best fits the current feature set. The sets are created by adding one classifier at a time and evaluating its discriminative power with a predefined fitness function. If the new set of rules outperforms the previous version, the new rule is added to the final set of rules. The experiments were carried out on the UND collection E and the single classifiers are fused by using the weighted sum rule. SFFS selects the most discriminative sub-windows which correspond to the fittest set of rules. Nanni and Lumini report a rank-1 recognition rate of 80% and a rank-5 recognition rate of 93%. The EER varies between 6.07% and 4.05% depending on the number of sub-windows used for recognition.
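The floating selection loop described above can be sketched as follows. This is a minimal, generic SFFS sketch and not the implementation used in [132]: the function name `sffs`, the integer sub-window identifiers and the additive toy fitness function are assumptions made for illustration. In practice the fitness function would be a recognition score measured on validation data.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


def sffs(candidates: Iterable[int],
         fitness: Callable[[FrozenSet[int]], float],
         target_size: int) -> FrozenSet[int]:
    """Sequential Forward Floating Selection over candidate identifiers."""
    pool = sorted(set(candidates))
    if not 0 < target_size <= len(pool):
        raise ValueError("target_size must be between 1 and the pool size")
    selected: FrozenSet[int] = frozenset()
    # Best (score, subset) seen so far for every subset size.
    best: Dict[int, Tuple[float, FrozenSet[int]]] = {}

    while len(selected) < target_size:
        # Forward step: add the candidate that yields the fittest enlarged set.
        remaining = [c for c in pool if c not in selected]
        selected = max((selected | {c} for c in remaining), key=fitness)
        score = fitness(selected)
        size = len(selected)
        if size not in best or score > best[size][0]:
            best[size] = (score, selected)

        # Floating step: remove members as long as the reduced set beats
        # the best set of that smaller size found so far.
        while len(selected) > 2:
            reduced = max((selected - {c} for c in sorted(selected)), key=fitness)
            reduced_score = fitness(reduced)
            if reduced_score > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (reduced_score, reduced)
            else:
                break

    return best[target_size][1]


# Hypothetical per-sub-window scores; a real fitness would not be additive.
weights = {0: 1, 1: 5, 2: 3, 3: 9}


def toy_fitness(subset: FrozenSet[int]) -> float:
    return sum(weights[c] for c in subset)


chosen = sffs(weights, toy_fitness, target_size=2)
```

With an additive fitness the floating step never fires; it only becomes relevant when candidates interact, for example when two sub-windows carry redundant information.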

Yiuzono et al. consider the problem of finding corresponding features in ear images as an optimization problem and apply genetic local search for solving it iteratively [210]. They select local sub-windows of varying size as the basis for the genetic selection. In [210] Yiuzono et al. present detailed results, which describe the behavior of genetic local search under different parameters, such as different selection methods and different numbers of chromosomes. On a database of 110 subjects they report a recognition rate of 100%.

Yaqubi et al. use features obtained by a combination of position and scale-tolerant edge detectors over multiple positions and orientations of the image [204]. This feature extraction method is called the HMAX model; it is inspired by the visual cortex of primates and combines simple features into more complex semantic entities. The extracted features are classified with an SVM and a kNN classifier. The rank-1 performance on a small dataset of 180 cropped ear images from 6 subjects varies between 62% and 100% depending on the kind of basis features.

Moreno et al. implement a feature extractor, which locates seven landmarks on the ear image, which correspond to the salient points from the work of Iannarelli. Additionally they obtain a morphology vector, which describes the ear as a whole. These two features are used as the input for different neural network classifiers. They compare the performance of each of the single feature extraction techniques with different fusion methods.

The proprietary test database is composed of 168 manually cropped ear images from 28 subjects. The best result of 93% rank-1 performance was measured using a compression network. Other configurations yielded error rates between 16% and 57%.

Gutierrez et al. [73] divide the cropped ear images into three equally sized parts. The upper part shows the helix, the middle part shows the concha and the lower part shows the lobule. Each of these sub-images is decomposed by wavelet transform and then fed into a modular neural network. In each module of the network a different integrator and learning function were used. The results of the modules are fused in the last step to obtain the final decision. Depending on the combination of integrator and learning function, the results vary between 88.4% and 97.47% rank-1 performance on the USTB I database. The highest rank-1 performance is achieved with Sugeno measure and conjugate gradient.

In [133] Nasseem et al. propose a general classification algorithm based on the theory of compressive sensing. They assume that most signals are compressible in nature and that any compression function results in a sparse representation of this signal. In their experiments on the UND database and the FEUD database, Nasseem et al. show that their sparse representation method is robust against pose variations and varying lighting conditions. The rank-1 performance varied between 89.13% and 97.83%, depending on the dataset used in the experiment.



Table 3.4: Summary of approaches for 3D ear recognition. Performance (Perf.) always refers to rank-1 performance.


3.5 3D Ear Recognition

In 2D ear recognition pose variation and variation in camera position, so-called out-of-plane rotations, are still unsolved challenges. A possible solution is using 3D models instead of photos as references, because a 3D representation of the subject can be adapted to any rotation, scale and translation. In addition to that, the depth information contained in 3D models can be used for enhancing the accuracy of an ear recognition system. However, most 3D ear recognition systems tend to be computationally expensive. In Table 3.4 all 3D ear recognition systems described in this section are summarized.

Although the iterative closest point (ICP) algorithm was originally designed as an approach for image registration, the registration error can also be used as a measure for the dissimilarity of two 3D images. Because ICP is designed to be a registration algorithm, it is robust against all kinds of translations and rotations. However, ICP tends to terminate too early, because it gets stuck in local minima. Therefore ICP requires the two models to be coarsely pre-aligned before fine alignment using ICP can be performed. Chen and Bhanu extract point clouds from the contour of the outer helix and then register these points with the reference model by using ICP [47]. In a later approach Chen and Bhanu use local surface patches (LSP) instead of points lying on the outer helix [49]. As the LSP consist of fewer points than the outer helix, this reduces the processing time while enhancing the rank-1 performance from 93.3% with the outer helix points to 96.63% with LSP.
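The idea of using the ICP registration error as a dissimilarity score can be illustrated with a small sketch. To stay self-contained it works on 2D points, where the optimal rotation has a closed form, and uses brute-force nearest-neighbour search. The function names and the toy point sets are assumptions for illustration and do not reproduce the systems in [47] or [49]. The two test clouds are deliberately only slightly misaligned, which reflects the need for coarse pre-alignment.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest(p: Point, cloud: Sequence[Point]) -> Point:
    """Brute-force nearest neighbour of p in cloud."""
    return min(cloud, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def rigid_fit(src: Sequence[Point], dst: Sequence[Point]) -> Tuple[float, float, float]:
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    n = len(src)
    csx, csy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    cdx, cdy = sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay, bx, by = sx - csx, sy - csy, dx - cdx, dy - cdy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, cdx - (c * csx - s * csy), cdy - (s * csx + c * csy)


def rms_error(cloud: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root mean square distance from each point to its nearest reference point."""
    total = 0.0
    for p in cloud:
        q = nearest(p, reference)
        total += (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.sqrt(total / len(cloud))


def icp_dissimilarity(probe: Sequence[Point], reference: Sequence[Point],
                      max_iter: int = 50, tol: float = 1e-12) -> float:
    """Align probe to reference with ICP and return the remaining RMS error."""
    current: List[Point] = list(probe)
    previous = float("inf")
    error = rms_error(current, reference)
    for _ in range(max_iter):
        # Pair every probe point with its closest reference point.
        matches = [nearest(p, reference) for p in current]
        theta, tx, ty = rigid_fit(current, matches)
        c, s = math.cos(theta), math.sin(theta)
        current = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in current]
        error = rms_error(current, reference)
        if abs(previous - error) < tol:  # no further improvement
            break
        previous = error
    return error


shape = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
angle = 0.05  # small rotation in radians, i.e. a coarsely pre-aligned probe
moved = [(math.cos(angle) * x - math.sin(angle) * y + 0.1,
          math.sin(angle) * x + math.cos(angle) * y - 0.05) for x, y in shape]
scaled = [(1.5 * x, 1.5 * y) for x, y in shape]  # a genuinely different shape

same = icp_dissimilarity(shape, moved)
other = icp_dissimilarity(shape, scaled)
```

In this toy setting `same` is close to zero because the two clouds differ only by a rigid motion, while `other` stays clearly larger because no rotation and translation can map the shape onto its scaled copy.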

Yan and Bowyer decompose the ear model into voxels and extract surface features from each of these voxels. For speeding up the alignment process, each voxel is assigned an index in such a way that ICP only needs to align voxel pairs with the same index [200] (see Figure 3.8). In [201] Yan and Bowyer propose the usage of point clouds for 3D ear recognition. In contrast to [47] all points of the segmented ear model are used. The reported performance measures of 97.3% in [200] and 97.8% in [201] are similar but not directly comparable, because different datasets were used for evaluation.

Cadavid et al. propose a real-time ear recognition system that reconstructs 3D models from 2D CCTV images using the shape-from-shading technique [41]. The 3D model is then compared to the reference 3D images stored in the gallery. Both model alignment and the computation of the dissimilarity measure are done by ICP. Cadavid et al. report a recognition rate of 95% on a database of 402 subjects. It is stated in [41] that the approach has difficulties with pose variations.

In [219] Zhou et al. use a combination of local histogram features and voxel models. Zhou et al. report that their approach is faster and, with an EER of 1.6%, also more accurate than the ICP-based comparison algorithms proposed by Chen and Bhanu and by Yan and Bowyer.

Similarly to Cadavid et al., Liu et al. reconstruct 3D ear models from 2D views [115]. Based on the two images of a stereo vision camera, a 3D representation of the ear is derived. Subsequently the resulting 3D meshes serve as the input for PCA. However, Liu et al. do not report the accuracy of their system, and since they have not published any further results on their PCA mesh approach, it seems that it is no longer pursued.

Passalis et al. take a different approach to comparing 3D ear models in order to make the comparison suitable for a real-time system [144]. They compute a reference ear model which is representative of the average human ear. During enrolment, all enrolled ear models are deformed until they fit this reference ear model. All translations and deformations that were necessary to fit the ear to the reference model are then stored as features. If a probe for authentication is given to the system, the probe model is likewise fitted to the annotated reference ear model in order to obtain its deformation data. Subsequently the deformation data is used to search for an associated reference model in the gallery. In contrast to the previously described systems, only one deformation has to be computed per authentication attempt. All other deformations can be computed before the actual identification process is started.

This approach is reported to be suitable for real-time recognition systems, because it takes less than 1 millisecond to compare two ear templates. The increased computing speed is achieved by lowering the complexity class from O(n²) for ICP-based approaches to O(n) for their approach. The rank-1 recognition rate is reported to be 94.4%. The evaluation is based on non-public data, which was collected using different sensors.
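Why a template comparison of this kind is linear in the template length can be seen from a small sketch. It assumes that each enrolled ear is stored as a fixed-length vector of deformation coefficients and that templates are compared with the L1 distance; both the metric and the names `l1_distance` and `identify` are assumptions for illustration, since the text does not specify how [144] compares the stored deformation data.

```python
from typing import Dict, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One pass over the n coefficients, hence O(n) per comparison."""
    if len(a) != len(b):
        raise ValueError("templates must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def identify(probe: Sequence[float], gallery: Dict[str, Sequence[float]]) -> str:
    """Return the gallery identity whose stored template is closest to the probe."""
    return min(gallery, key=lambda subject: l1_distance(probe, gallery[subject]))


# Hypothetical templates that were computed once during enrolment.
gallery = {"subject_a": [0.0, 0.0, 0.0], "subject_b": [1.0, 1.0, 1.0]}
match = identify([0.9, 1.1, 1.0], gallery)
```

The expensive fitting step happens once per probe; every subsequent gallery comparison is only a vector distance, in contrast to ICP, which has to search for point correspondences for every probe and gallery pair.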

Heng and Zhang propose a feature extraction algorithm based on slice curve comparison, which is inspired by the principles of computed tomography [116]. In their approach the 3D ear model is decomposed into slices along the orthogonal axis of the longest distance between the lobule and the uppermost part of the helix. The curvature information extracted from each slice is stored in a feature vector together with an index value indicating the slice’s former position in the 3D model. For comparison the longest common sequence between two slice curves with similar indices is determined. Their approach is only evaluated on a non-public dataset, which consists of 200 images from 50 subjects. No information about pose variations or occlusion during the capturing experiment is given.

Heng and Zhang report a rank-1 performance of 94.5% for the identification experiment and an EER of 4.6% for the verification experiment.
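The comparison of two slice curves can be sketched with the classic dynamic programme for the longest common subsequence. Interpreting "longest common sequence" as a subsequence, matching curvature values within a tolerance `tol`, and normalizing by the longer curve are assumptions of this sketch; the text does not state how [116] quantizes or matches curvature values.

```python
from typing import Sequence


def lcs_length(a: Sequence[float], b: Sequence[float], tol: float = 0.0) -> int:
    """Length of the longest common subsequence; values match if within tol."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if abs(a[i - 1] - b[j - 1]) <= tol:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def slice_similarity(a: Sequence[float], b: Sequence[float], tol: float = 0.0) -> float:
    """LCS length normalized by the longer curve, giving a value in [0, 1]."""
    longest = max(len(a), len(b))
    return lcs_length(a, b, tol) / longest if longest else 0.0


# Hypothetical curvature samples of two slices with similar index.
probe_slice = [0.1, 0.5, 0.9]
gallery_slice = [0.12, 0.9]
score = slice_similarity(probe_slice, gallery_slice, tol=0.05)
```

Here two of the three probe samples find a partner in the gallery slice, so the score is 2/3.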

Figure 3.8: Examples for surface features in 3D ear images. The left image shows an example for ICP-based comparison as proposed in [47], whereas the right figure illustrates feature extraction from voxels as described in [200].

Figure 3.9: Example for local surface patch (LSP) features as proposed in [86].

Islam et al. convert point clouds describing 3D ear models into meshes and iteratively reduce the number of faces in the mesh [85]. These simplified meshes are then aligned with each other using ICP and the alignment error is used as the similarity measure for the two simplified meshes. In a later approach Islam et al. extract local surface patches as shown in Figure 3.9 and use them as features [86]. For extracting those LSP, a number of seed points is selected randomly from the 3D model. Then the data points which are closer to the seed point than a defined radius are selected. PCA is then applied to find the most descriptive features in the LSP. The feature extractor repeats selecting LSP until the desired number of features has been found. Both approaches were evaluated using images from UND. The recognition rate reported for [85] is 93.98% and the recognition rate reported for [86] is 93.5%. However, none of the approaches has been tested with pose variation and different scaling.
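The patch selection step used for local surface patches (random seed points, then all points within a fixed radius) can be sketched as follows. The function name `extract_patches`, the toy point cloud and the parameter values are assumptions for illustration; the subsequent PCA step is omitted.

```python
import random
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def extract_patches(points: Sequence[Point3], n_patches: int, radius: float,
                    rng: random.Random) -> List[Tuple[Point3, List[Point3]]]:
    """Pick random seed points and collect all points within radius of each seed."""
    seeds = rng.sample(list(points), n_patches)
    r2 = radius * radius
    patches = []
    for seed in seeds:
        patch = [p for p in points
                 if (p[0] - seed[0]) ** 2 + (p[1] - seed[1]) ** 2
                 + (p[2] - seed[2]) ** 2 <= r2]
        patches.append((seed, patch))
    return patches


# Toy point cloud: a flat 4 x 4 grid standing in for a scanned ear surface.
cloud = [(float(x), float(y), 0.0) for x in range(4) for y in range(4)]
patches = extract_patches(cloud, n_patches=3, radius=1.0, rng=random.Random(0))
```

Each returned pair holds a seed point and its neighbourhood; a descriptor such as the PCA-based one mentioned in the text would then be computed per neighbourhood.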

Zheng et al. extract the shape index at each point in the 3D model and use it for projecting the 3D model to 2D space [211]. The 3D shape index at each pixel is represented by a grey value at the corresponding position in the 2D image. Then SIFT features are extracted from the shape index map. For each of the SIFT points a local coordinate system is calculated where the z-axis corresponds to the feature point’s normal. Hence the z-values of the input image are normalized according to the normal of the SIFT feature point they were assigned to. As soon as the z-values have been normalized, they are transformed into a grey level image. As a result, Zheng et al. get a local grey level image for each of the selected SIFT features. Next, local binary patterns (LBP) are extracted for feature representation in each of these local grey level images. Comparison is first performed by coarsely comparing the shape indexes of key points and then using the Earth mover’s distance for comparing LBP histograms from the corresponding normalized grey images. Zheng et al. evaluated their approach on a subset of the UND-J2 collection and achieved a rank-1 performance of 96.39%.
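Two building blocks of this pipeline can be sketched in a few lines: mapping principal curvatures to a shape index grey value, and comparing two histograms with the Earth mover's distance. The sketch uses one common definition of the shape index, scaled to [0, 1]; whether [211] uses exactly this variant is not stated in the text. The 8-bit grey mapping, the one-dimensional ground distance between neighbouring histogram bins and all function names are assumptions for illustration.

```python
import math
from itertools import accumulate
from typing import Sequence


def shape_index(k1: float, k2: float) -> float:
    """Shape index in [0, 1] computed from the two principal curvatures."""
    k_max, k_min = max(k1, k2), min(k1, k2)
    # atan2 also covers k_max == k_min; a perfectly flat point (both zero)
    # has no defined shape index and is mapped to 0.5 here by convention.
    return 0.5 - math.atan2(k_max + k_min, k_max - k_min) / math.pi


def to_grey(index: float) -> int:
    """Map a shape index in [0, 1] to an 8-bit grey value."""
    return round(255 * index)


def emd_1d(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Earth mover's distance between two 1D histograms with unit bin spacing."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    s1, s2 = sum(h1), sum(h2)
    if s1 <= 0 or s2 <= 0:
        raise ValueError("histograms must have positive mass")
    # For normalized 1D histograms the EMD equals the summed absolute
    # difference of the cumulative distributions.
    c1 = accumulate(x / s1 for x in h1)
    c2 = accumulate(x / s2 for x in h2)
    return sum(abs(a - b) for a, b in zip(c1, c2))


saddle = shape_index(1.0, -1.0)           # symmetric saddle point
distance = emd_1d([1, 0, 0], [0, 0, 1])   # all mass moved by two bins
```

Which end of the index range denotes a concave or a convex surface depends on the orientation chosen for the surface normals.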

3.6 Open challenges and future applications

As the most recent publications on 2D and 3D ear recognition show, the main application of this technique is personal identification in unconstrained environments. This includes applications for smart surveillance, such as in [40] but also the forensic identification of perpetrators on CCTV images or for border control systems. Traditionally these application fields are part of face recognition systems but as the ear is located next to the face, it can provide valuable additional information to supplement the facial images.

Multimodal ear and face recognition systems can serve as a means of achieving pose invariance and more robustness against occlusion in unconstrained environments. In most public venues surveillance cameras are mounted overhead in order to capture as many persons as possible and to protect the cameras from vandalism. In addition, most persons will not look straight into the camera, so in most cases no frontal images of them will be available. This poses serious problems for biometric systems that use facial features for identification. If the face is not visible from a frontal angle, the ear can serve as a valuable additional characteristic in these scenarios.

Because of the physical proximity of the face and the ear, there are also many possibilities for the biometric fusion of these two modalities. Face and ear images can be fused on the feature level, on the template level and on the score level. Against the background of this application, there are some unsolved challenges, which should be addressed by future research in this field.
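Of the three fusion levels, score-level fusion is the simplest to illustrate. The sketch below normalizes the comparison scores of a face and an ear comparator with min-max normalization and combines them with a weighted sum. It assumes similarity scores (higher means more similar); the function names and the weight are hypothetical and not taken from any of the cited systems.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant score lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(face_scores: Sequence[float], ear_scores: Sequence[float],
         ear_weight: float = 0.5) -> List[float]:
    """Weighted sum of normalized face and ear scores, one value per gallery entry."""
    if len(face_scores) != len(ear_scores):
        raise ValueError("both comparators must score the same gallery entries")
    face_n, ear_n = min_max(face_scores), min_max(ear_scores)
    return [(1.0 - ear_weight) * f + ear_weight * e for f, e in zip(face_n, ear_n)]


# Hypothetical profile-view probe: the face score is unreliable, so the ear
# comparator receives the larger weight.
fused = fuse([0.9, 0.1, 0.5], [5.0, 20.0, 10.0], ear_weight=0.75)
best_candidate = fused.index(max(fused))
```

In this toy example the second gallery entry wins after fusion, although the face comparator alone would have preferred the first one.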

