Ear Recognition Biometric Identification using 2- and 3-Dimensional Images of Human Ears Anika Pflug Thesis submitted to Gjøvik University College

Deutscher IT-Sicherheitskongress, 14-16 May 2013

• [162] R. RAGHAVENDRA, KIRAN B. RAJA, ANIKA PFLUG, BIAN YANG, CHRISTOPH BUSCH, 3D Face Reconstruction and Multimodal Person Identification from Video Captured Using a Smartphone Camera, 13th IEEE Conference on Technologies for Homeland Security (HST), 2013

• [149] ANIKA PFLUG, DANIEL HARTUNG, CHRISTOPH BUSCH, Feature Extraction from Vein Images using Spatial Information and Chain Codes, Information Security Technical Report, Volume 17, Issues 1-2, February 2012, pp. 26-35

• [77] DANIEL HARTUNG, ANIKA PFLUG, CHRISTOPH BUSCH, Vein Pattern Recognition Using Chain Codes, Spatial Information and Skeleton Fusing, GI-Sicherheit, 2012

1.5 Contribution

The following questions were derived from the requirements of the GES-3D project on the one hand and from questions that occurred during the ongoing research on the other hand.

These research questions form the common thread through this thesis.

Q0: What is the current state of the art in ear recognition?

(Addressed in Chapter 3. Also see [147]) In preparation for defining the research questions, an extensive survey of the state of the art in ear recognition was compiled [147]. The survey consists of three main parts. In the first part, we give an overview of the publicly available datasets that can be used for evaluating ear recognition systems. Then we describe different approaches for ear detection in 2D and 3D images and compare their detection accuracy. Finally, we move on to ear recognition systems and give a complete overview of different approaches and their recognition performance. This survey serves as the basis for the further work in this thesis. A summary of more recent work on ear recognition that was published after the literature survey is provided in Appendix C.

Q1: How can the outer ear be automatically detected from 2D and 3D images?

(Addressed in Chapters 4, 5 and 6. Also see [154], [155], [146]) A reliable segmentation is important for any recognition system. In this work, we explore the geometrical features of the outer ear and how they can be used for segmentation. Special focus is placed on the fusion of texture and depth information and on robustness to pose variations.

For the detection of ears in 2D images, state-of-the-art techniques from face recognition, such as Haar-like features [183], yield satisfactory segmentation accuracy as long as the capture setting is carefully controlled and the image quality is sufficient. High detection accuracy can also be achieved with LBP (original implementation as suggested by Ahonen et al. [9]).
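To make the LBP feature mentioned above more concrete, the following is a minimal sketch of the basic 8-neighbour LBP operator and its histogram. It is not the detector used in this work; the function names, the neighbour ordering and the use of plain nested lists for the grey-level image are assumptions made for illustration only.

```python
def lbp_code(img, r, c):
    """Basic 3x3 LBP code of the pixel at row r, column c.

    img is a 2D list of grey values (assumed representation).
    """
    center = img[r][c]
    # Neighbours are visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

In a detection or recognition pipeline, such histograms would typically be computed per image block and concatenated; that step is omitted here.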

It is also possible to detect the outer ear with Cascaded Pose Regression [148] from coarsely segmented images (i.e. face profile images). To ensure that ear detection in 2D images performs well enough, the ear should not be smaller than 50 × 80 pixels.

Ears in depth images (3D) can be segmented by searching for the unique surface structure in the ear region. In [146] the ear detection approach by Zhou et al. [218] is extended to detect the outer ear in 3D profile images under different in-plane rotations. Due to the projection of the ROI from Cartesian to polar coordinates, the detection accuracy drops.

In [154] we introduced a novel ear detection method, in which we reconstruct the ear outline by combining regions with high surface curvature. This work was extended in [155], where we added edge information from the co-registered texture image to the reconstructed 3D shapes. The high detection performance confirms that the surface structure and texture information in the ear region are clearly distinguishable from the surrounding areas.

Q2: How can cropped ear images be normalized with respect to rotation and scale?

(Addressed in Chapter 7. Also see [148]) In order to apply ear recognition in more challenging environments (as in GES-3D), the ear region needs to be normalized with respect to rotation and scale.

Cascaded Pose Regression (CPR) is an approach for face normalization, originally proposed by Dollár et al. [62]. CPR optimizes a loss function that tries to minimize the difference between local grey level-based features within the ellipse. Instead of localizing a number of visually defined landmarks, CPR uses weak features for estimating the orientation of the ear. Given a sufficient number of training images, CPR can also be optimized towards being robust to partial occlusions and for normalizing ear images in different poses.

Using Cascaded Pose Regression (CPR), we fit an ellipse around the ear, where the major axis of the ellipse represents the largest distance between the lobule and the upper helix [148]. We then compensate for scale and rotation by adjusting the center, the length of the major axis and the tilt of the ellipse such that the major axis is vertical and has a fixed length.
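The compensation step described above can be sketched as a similarity transform. The sketch does not implement CPR itself: it assumes that the two end points of the major axis (upper helix and lobule) are already known from the fitted ellipse. The function names, the image coordinate convention (y grows downwards) and the target length of 100 pixels are hypothetical.

```python
import math


def normalization_transform(helix, lobule, target_len=100.0):
    """Return (center, angle, scale) that make the major axis vertical.

    helix and lobule are (x, y) end points of the major axis in image
    coordinates, with y growing downwards (assumed convention).
    """
    cx = (helix[0] + lobule[0]) / 2.0
    cy = (helix[1] + lobule[1]) / 2.0
    dx, dy = helix[0] - lobule[0], helix[1] - lobule[1]
    # Scale so that the major axis gets the fixed target length.
    scale = target_len / math.hypot(dx, dy)
    # Rotate so that the lobule-to-helix vector points straight up,
    # i.e. towards negative y in image coordinates.
    angle = -math.pi / 2.0 - math.atan2(dy, dx)
    return (cx, cy), angle, scale


def apply_transform(point, center, angle, scale):
    """Map an image point into the normalized, ellipse-centred frame."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    x = scale * (dx * math.cos(angle) - dy * math.sin(angle))
    y = scale * (dx * math.sin(angle) + dy * math.cos(angle))
    return x, y
```

After the transform, the upper helix lies half the target length above the ellipse centre and the lobule lies the same distance below it, independent of the original tilt and size of the ear.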

We show that the recognition performance of a pipeline using CPR prior to extracting the feature vector is significantly higher than that of the same pipeline without normalization. We also show that CPR crops the ear region accurately for different ROIs representing different capture settings. The benefit of using CPR increases with a larger variation in rotation and scale in the dataset.

Q3: Is it possible to combine 2D and 3D data in order to obtain a better descriptor that yields a better performance than 2D or 3D alone?

(Addressed in Chapter 8. Also see [146]) Motivated by the performance increase with fused texture and depth data in ear segmentation, we propose a combined descriptor for co-registered pairs of texture and depth images.

We consider the texture image and the depth image as separate channels of information that are merged into a fixed-length histogram descriptor. The optimal settings for the merged feature vector are determined in a series of experiments using three different datasets.

The combined 2D/3D descriptor uses surface curvature information for determining the histogram bin and texture information for determining the bin magnitude. The method can be applied within a sliding window, which results in a spatial histogram. Our experiments show that the method has some potential. However, it is vulnerable to noise, especially in the depth channel.
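The principle of letting one channel select the bin and the other channel set the magnitude can be illustrated with a short sketch. The concrete curvature measure, the texture measure, the number of bins and the value range are not specified here, so the parameters `n_bins`, `lo` and `hi` and the function name are assumptions for illustration.

```python
def combined_histogram(curvature, texture, n_bins=8, lo=-1.0, hi=1.0):
    """Fixed-length histogram from co-registered curvature and texture maps.

    curvature and texture are equally sized 2D lists (assumed layout).
    """
    hist = [0.0] * n_bins
    width = (hi - lo) / n_bins
    for curv_row, tex_row in zip(curvature, texture):
        for k, t in zip(curv_row, tex_row):
            # The curvature value selects the bin (clamped to the range).
            b = min(n_bins - 1, max(0, int((k - lo) / width)))
            # The texture value is the magnitude added to that bin.
            hist[b] += t
    total = sum(hist)
    # Normalize so that descriptors of different windows are comparable.
    return [h / total for h in hist] if total > 0 else hist
```

Applying this function to each position of a sliding window and concatenating the results would give the spatial histogram mentioned above. The sketch also shows why depth noise is harmful: a small error in the curvature value moves the whole texture contribution into a different bin.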

Along with this, we conduct a study on different techniques for texture description and empirically determine the optimal algorithm and parameter settings for the capture settings represented by three publicly available datasets. We conclude that the optimal parameter set for each of the texture and surface descriptors is highly dependent on the resolution and the quality of the input images.

Q4: How can ear templates be represented in order to enable fast search operations?

(Addressed in Chapter 9. Also see [151]) Given a fixed-length histogram descriptor that yields satisfactory performance in a given scenario, we would like to make search operations as fast and reliable as possible. Based on the observation that many histogram descriptors are sparsely filled, we propose a sequential identification system (1:N search) that uses binary descriptors in the first stage and real-valued descriptors (i.e. double precision numbers) in the second stage.

In our test system (implemented in C++), the comparison of binary feature vectors is up to ten times faster than the comparison with real-valued feature vectors of the same length.

Obviously, there is a loss of information during the binarization process, which results in a lower true positive identification rate (rank-1 recognition rate). Despite this, the probability that the correct identity is among the first n subjects is high, such that we can use the binary feature vector for retrieving a short list from the database. Subsequently, we use the real-valued feature vectors for re-sorting the short list.

We show that we can perform a 1:N search in 30% of the time required for an exhaustive search using the real-valued feature vectors only. Additionally, we reduce the chance of false positives.
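The two-stage search can be sketched as follows. The binarization rule (a bit is set when a histogram bin is non-empty), the L1 distance in the second stage and all function names are assumptions chosen for illustration; the test system described above is implemented in C++ and may differ in each of these points.

```python
def binarize(vec, threshold=0.0):
    """Pack a real-valued vector into an int; bit i is set if vec[i] > threshold."""
    bits = 0
    for i, v in enumerate(vec):
        if v > threshold:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two packed binary descriptors."""
    return bin(a ^ b).count("1")


def l1_distance(a, b):
    """Sum of absolute differences between two real-valued descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def two_stage_search(probe, gallery, shortlist_size):
    """Return the short list of subject ids, best match first.

    gallery maps a subject id to its real-valued descriptor. In a real
    system the binary templates would be precomputed and stored.
    """
    probe_bits = binarize(probe)
    # Stage 1: rank all subjects by Hamming distance of binary descriptors.
    ranked = sorted(
        gallery, key=lambda s: hamming(probe_bits, binarize(gallery[s]))
    )
    shortlist = ranked[:shortlist_size]
    # Stage 2: re-sort only the short list with the real-valued descriptors.
    return sorted(shortlist, key=lambda s: l1_distance(probe, gallery[s]))
```

The expensive real-valued comparison is only carried out `shortlist_size` times instead of once per enrolled subject, which is where the time saving comes from.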

Q5: Which impact does signal degradation have on the performance of ear recognition systems?

(Addressed in Chapters 10 and 11. Also see [186],[153]) The performance of every biometric system is dependent on the quality of the input images. The quantification of image quality, however, is always dependent on the scenario.

We have conducted a series of experiments to learn more about the impact of noise and blur on the performance of ear recognition systems.

We have generated a series of degraded images and computed the Peak Signal to Noise Ratio (PSNR) in order to quantify the quality of the degraded images. The PSNR is defined via the mean squared error between an image of optimal quality and a degraded image and can be regarded as the ground truth information here. We then measured the decline of segmentation and recognition performance with different features for detection [186] and recognition [153]. In general, we noticed that noise has a greater effect on segmentation and recognition performance than blur.
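For reference, the PSNR used as the quality measure above follows the standard definition PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference image and the degraded image and MAX is the largest possible pixel value. A minimal sketch, assuming 8-bit grey-level images stored as nested lists (the function name and the data layout are illustrative assumptions):

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equally sized images."""
    squared_error = 0.0
    n_pixels = 0
    for ref_row, deg_row in zip(reference, degraded):
        for r, d in zip(ref_row, deg_row):
            squared_error += (r - d) ** 2
            n_pixels += 1
    mse = squared_error / n_pixels
    if mse == 0:
        # Identical images: the PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A lower PSNR indicates a stronger degradation, so the recognition and segmentation performance can be plotted against it.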

A similar series of experiments was conducted on images that are compressed with JPEG and JPEG 2000 (also see Part III 13.1.4). JPEG 2000, also referred to as J2K after its codec, is a wavelet-based image compression method that, according to the ISO standard, should be preferred over the DCT-based JPEG compression for biometric identification systems. These experiments show that the detection and recognition performance varies with a decreasing bit rate when compressing with JPEG. We suppose that there is a correlation between the size of the compression artefacts and the radius of the texture descriptors.

Q6: Is it possible to automatically find categories of ear images?

(Addressed in Chapter 12. Also see [152]) Most of the research efforts in ear recognition concentrate on achieving a high recognition performance in closed datasets. The next step towards an operational system is to provide techniques for fast and efficient search in large databases.

We analyse texture feature spaces with respect to cluster tendencies, which could be exploited for reducing the number of candidates in a 1:N search. We create feature subspaces using linear and non-linear methods and analyse these subspaces for cluster tendencies. We apply different metrics for estimating the goodness of the clustering solutions and show that the feature subspaces can be organized as convex clusters using K-means.

We show that clustering of 2D ear images is possible using K-means, with PCA for subspace projection and LPQ as the texture feature. For this particular configuration, the search space can be reduced to less than 50% of the database with a chance of 99.01% that the correct identity is contained in the reduced dataset. We also show that a search that is extended to up to three adjacent clusters yields a better performance than a single cluster search. The classes depend on the skin tone of the subjects, but also on the capture settings of the dataset they originally came from. We also observe that feature vectors with a high performance in classification do not necessarily yield high recognition rates.
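The extended cluster search can be sketched as follows. The sketch assumes that K-means has already been run on the projected feature vectors, so that the cluster centres and the list of enrolled subjects per cluster are available; the function names and the data layout are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def candidate_subjects(probe, centroids, members, n_clusters=3):
    """Collect the subjects of the n_clusters centres closest to the probe.

    centroids is a list of cluster centres; members[i] lists the subject
    ids assigned to cluster i (assumed layout).
    """
    # Order the cluster indices by distance between probe and centre.
    order = sorted(
        range(len(centroids)),
        key=lambda i: squared_distance(probe, centroids[i]),
    )
    candidates = []
    for i in order[:n_clusters]:
        candidates.extend(members[i])
    return candidates
```

With `n_clusters=1` only the nearest cluster is searched; larger values trade a bigger candidate list for a higher chance that the correct identity is included.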

In our project GES-3D, we develop a demonstrator system for exploring the virtues and limitations of 3D imagery in forensic identification in a semi-constrained environment. The project is conducted within a consortium of seven partners, among which are technical partners, consulting partners and the German Criminal Police (BKA) as the stakeholder.

The goal of the project is to develop a fully integrated identification system that uses 3D head models as references and crime scene videos as probes. The reference data is collected under controlled conditions by a police officer, who sends a 3D head model and a series of photographs to a central database where the data is assembled and stored.

If a subject is to be identified from video material, the identification system automatically extracts face and ear features from a CCTV video and returns a list of the n most likely candidates. In practice, a forensic expert would then further analyze the retrieved images and give an estimate of the similarity between the reference and the probe. Manual analysis is not part of the project, though. GES-3D concentrates only on the automated retrieval of the n most likely candidates. After the retrieval process, the 3D model from the database can be used to assist the forensic expert by offering the opportunity to adjust the pose of the reference according to the pose in the input video.

For evaluating the system, we develop a test scenario in which we simulate a typical situation in the entrance hall of a bank. A person enters the room and walks towards an ATM. Whilst inside the bank, the subject is filmed by different off-the-shelf CCTV cameras from four different viewpoints. The selection of the viewpoints was proposed by BKA, based on their experience with typical surveillance cameras. Figure 2.1 shows a floor plan that illustrates the data collection setup. In existing identification systems and in many research experiments, reference and probe images are collected under the same conditions with the same capture device. One of the main challenges in this project is to compare images from different media (2D, 3D and video), different capture devices and different camera viewpoints. Based on this experiment, we formulate the following requirements for the prospective demonstrator system.

• Usability: The user interface and the capture device should be easy to use and to understand for police officers, who do not have any image processing or biometrics background.

• Data protection: For data protection reasons, all images should be stored in a central database and they should not be distributed to any third-party system.

• Transparency: All decisions that are part of the identity retrieval should be made transparent to the forensic investigator. The forensic investigator should be able to review the search result for every single biometric service provider.

