BY GEORGIOS DIAMANTOPOULOS. A THESIS SUBMITTED TO THE UNIVERSITY OF BIRMINGHAM FOR THE DEGREE OF DOCTOR OF PHILOSOPHY. DEPARTMENT OF ELECTRONIC, ...
Future developments in this kind of research may require that both the subject's and the experimenter's eye movements are tracked simultaneously. If two or more cameras are required for each pair of eyes tracked, the setup quickly becomes more expensive and even harder to conceal. Furthermore, the more cameras involved, the harder the system becomes to set up, and simplicity is considered a key requirement in this application.
Head-pose estimation is required, and it comes at the cost of additional computational complexity and, where more than one camera is involved, additional hardware costs. In fact, as referenced earlier, some remote eye-trackers that use infrared illumination and a glint-centred reference system go to great lengths to ensure head-pose invariance by using multiple infrared light sources.
Depending on the accuracy and limits of head-pose estimation, head movement may be restricted. Moreover, regardless of how good the head-pose estimation is, there may be cases where the pupil is not captured sufficiently well, or at all, by the remote camera. This may happen if, for example, the camera is placed below head level, the subject's head is tilted upwards and the subject performs an extreme upwards eye movement.
A fully calibrated system may be required (e.g. Meyer et al., 2006). This means that not only the camera's intrinsic parameters need to be known but also the geometric topology of the camera(s) and the subject. While the calibration process is feasible for a rigid setup in front of a computer monitor, doing so for an interview-type experiment may be impossible, as much more flexibility is required.
While processing power is much cheaper than it used to be, the additional processing a remote system requires significantly reduces its attractiveness.
BUILDING A HEAD-MOUNTED EYE-TRACKER
A lightweight head-mounted eye-tracker can be minimally invasive (no contact with the eye, low weight, a user experience similar to wearing vision glasses) and it does not require the integration of separate, isolated systems (remote systems require head-pose estimation, which is usually done with a second camera). In terms of accuracy, both types of system can perform well depending on the hardware and setup used (e.g. Meyer et al., 2006; Villanueva et al., 2007).
Further, given that there exists a publicly available, low-cost design of a lightweight head-mounted eye-tracker that is also easy to assemble (Babcock et al., 2003; Babcock and Pelz, 2004), a head-mounted design is a good choice for this project.
Amongst head-mounted eye-trackers, there are more design and implementation choices to be made, those of light source (natural light or passive versus infrared light or active) and gaze estimation method.
Adding infrared light source(s) is a relatively easy procedure, and the choice between active and passive illumination significantly affects the formulation of the problem. When passive illumination is used, the input image is subject to severe brightness changes depending on the environment's light sources, as well as artefacts caused by shadows and reflections from objects in the environment, and pupil tracking becomes very hard or even impossible depending on the subject's eye colour. In fact, as seen earlier, it is common for passive eye-trackers to track the iris instead, as the pupil is often indistinguishable from it.
Of course, using infrared light does not solve every problem. One very important challenge of designing an eye-tracker (whether active or passive) is that of finding a reliable reference feature point. In most cases, a glint-centred reference system is used (e.g. Ebisawa et al., 2002; Li et al., 2005; Ohno et al., 2002; Benoit et al., 2005) where the distance between the pupil and the glint is what determines the output of the eye-tracker (usually a mapping to the screen or 3D world).
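To make the glint-centred approach concrete, a minimal sketch of how a pupil-glint vector is commonly mapped to screen coordinates is shown below. The second-order polynomial form is a widely used choice, but the function names and calibration procedure here are illustrative assumptions, not taken from any of the cited systems:

```python
import numpy as np

def fit_gaze_mapping(pg, scr):
    """Fit a second-order polynomial mapping from pupil-glint
    vectors (dx, dy) to screen coordinates via least squares.

    pg:  (N, 2) pupil-minus-glint vectors from calibration frames
    scr: (N, 2) known screen points fixated during calibration
    """
    dx, dy = pg[:, 0], pg[:, 1]
    # Design matrix with the usual quadratic terms
    A = np.stack([np.ones_like(dx), dx, dy, dx * dy, dx**2, dy**2], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, scr, rcond=None)
    return coeffs  # shape (6, 2): one column per screen axis

def apply_gaze_mapping(coeffs, v):
    """Map a single pupil-glint vector to a screen point."""
    dx, dy = v
    feats = np.array([1.0, dx, dy, dx * dy, dx**2, dy**2])
    return feats @ coeffs

# Hypothetical 3x3 calibration grid of pupil-glint vectors
pg = np.array([[x, y] for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)])
scr = np.stack([2 * pg[:, 0] + 3, 5 * pg[:, 1] - 1], axis=1)
coeffs = fit_gaze_mapping(pg, scr)
```

In a real calibration session the subject fixates each of the known screen points in turn while `pg` is measured; the fitted polynomial then interpolates gaze for arbitrary pupil-glint vectors.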
Specifically for remote systems, while eye corners are often used for head-pose estimation (Lam and Yan, 1996; Zhang, 1996; Feng and Yuen, 1998; Tian et al., 2000; Sirohey et al., 2002; Wang et al., 2005; Xu et al., 2008), they are much less preferred as reference points (Zhu and Yang, 2002; Valenti et al., 2008).
In terms of gaze estimation, 2D regression-based estimation requires a surface to map eye movements onto; in the case of an interview, where the subject is not looking at a screen, this method cannot be applied, which rules out the design for this application. In the case of 3D model-based gaze estimation, the camera's intrinsic parameters and the scene geometry (relative position of screen, subject, camera and light source(s)) need to be known. This is a great disadvantage, especially for head-mounted systems, where the scene geometry is significantly affected by the subject's face morphology (i.e. even if the camera is always placed at a fixed location, the variation of eye cavities between subjects will determine the exact distance between the camera and the eye). Furthermore, this information can be difficult to obtain, and it restricts the eye-tracker to a rigid setup. Thus, gaze mapping techniques are unsuitable for this application.
From the above survey, it is appropriate to conclude with a remark made by Hansen and Ji (2010), who state that “each technique has its advantages and limitations, but the optimal performance of any technique also implies that its particular optimal conditions with regard to image quality are met”. In other words, each technique will work well for the image quality that it has been designed for. In the case of the REACT eye-tracker, the application requirements have driven the selection of the design that offers the best compromise and in Chapter 4 the feature detection algorithms that make tracking with the REACT eye-tracker possible are outlined.
CHAPTER 4: FEATURE EXTRACTION
After the review of eye-tracking systems in Chapter 3 and the relevant discussion of their requirements, this chapter introduces a novel eye-tracker which fulfils them. The hardware design of the eye-tracker is deliberately omitted from this chapter, as it is largely based on a previous design by Babcock and Pelz (2004), and is described in Appendix A. Instead, this chapter focuses on the algorithms involved in extracting the features from the input images and calculating the 2D gaze angle.
Thus, what follows is an extensive discussion of the image acquisition and image properties that will help gain insight into the computer vision problem of extracting the eye features, and into its complexity. Then, the algorithms responsible for extracting the pupil, iris radius and eye corners are described in depth. Finally, the calculation of the 2D gaze vector is also described in depth before moving on to the evaluation of these components in Chapter 5. In each section, the intermediate steps are visually illustrated; sample visualisations of the complete set of features for each subject are shown in Table 8 at the end of this chapter, while randomly selected frames from each test sequence with the pupil marked are shown in Table 9, also located at the end of this chapter. The last section of this chapter discusses the computational complexity of the algorithms involved in extracting the features.
ACQUISITION AND IMAGE PROPERTIES
This section examines in detail the properties of the source images taken with the REACT eye-tracker in order to better understand the problem at hand and how it may best be solved.
Camera and light source
The camera used is a Supercircuits PC206XP, a grey-level pinhole camera which captures images 640 pixels wide by 480 pixels tall at 29.97 frames per second (NTSC). A standard infrared (IR) LED that emits light at a wavelength of 940nm is used as the light source, and a Kodak Wratten 87c IR filter is placed over the camera to filter out non-infrared light. For further details on the hardware of the REACT eye-tracker, please refer to Appendix A.
First and foremost, the input images are interlaced, which introduces a severe artefact to be dealt with. Interlacing is a technique first developed with the introduction of cathode ray tube (CRT) televisions as a means to improve picture quality without increasing bandwidth requirements (Luther and Inglis, 1999).
Interlacing works by dividing each frame into odd and even fields (a field being the set of odd- or even-numbered scan lines) and refreshing the odd fields at different times to the even fields (Figure 5).
For vision processing, non-progressive (interlaced) images create several problems: if there is any motion in the image, no entity is continuous (this is often described as motion blur). An example of this can be seen in Figure 6 (a); as the eye is moving quickly upwards and the two fields are updated at different points in time, three different pupil outlines can be seen in the image: one from the pupil position at time T (odd field), another from the pupil position at time T+1 (even field), and their overlap. Thus, computer vision algorithms would face significant difficulties in detecting the pupil (or other features, for that matter) from the original image.
De-interlacing methods are largely undocumented in the computer vision literature and are mostly discussed informally on the World Wide Web (e.g. Wikipedia, 2009). Of the reported methods of de-interlacing a video, some are more complex (and slower) than others. For the REACT eye-tracker, the simplest of methods3 was used to de-interlace the video: splitting each field into a frame of its own (referred to as half-sizing; Wikipedia, 2009). Thus, the odd fields of each frame are collated to compose one new frame and the even fields are collated to compose the next. The resulting video has twice the frame rate of the original; an example output of the de-interlacing may be seen in Figure 6 (b). When the pupil or any other object is moving very fast, some smearing will still be visible; however, this smearing cannot be removed completely by de-interlacing, as it is a limitation imposed by the rate at which the camera captures frames.
3 More complex schemes usually involve using interpolation to recover missing samples and/or motion compensation (Wikipedia, 2009).
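The half-sizing scheme described above amounts to simple row slicing. The sketch below (the function name is my own) assumes an interlaced frame stored as a NumPy array:

```python
import numpy as np

def deinterlace_half_size(frame):
    """Split an interlaced frame into its two fields, each becoming
    a half-height frame of its own (the 'half-sizing' scheme)."""
    field_a = frame[0::2, :]  # even-numbered scan lines
    field_b = frame[1::2, :]  # odd-numbered scan lines
    return field_a, field_b

# A synthetic 480x640 interlaced frame whose two fields were
# captured at different instants:
frame = np.zeros((480, 640), dtype=np.uint8)
frame[0::2, :] = 50    # field captured at time T
frame[1::2, :] = 200   # field captured at time T+1
f_a, f_b = deinterlace_half_size(frame)
print(f_a.shape, f_b.shape)  # (240, 640) (240, 640)
```

Each output frame has half the vertical resolution of the original, which is the trade-off this method makes for its simplicity and speed.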
As discussed in Chapter 2, the REACT eye-tracker was purposefully built to use infrared lighting due to the immediate advantages it offers over visible light. Reiterated here for completeness, recording the eye-tracker images without the infrared light source and the IR filter can potentially introduce severe changes in the output's histogram, contrast and brightness. Natural light eye-trackers (e.g. Hansen and Pece, 2005) face a great challenge in dealing with such variations, which can arise from two sources:
First, natural environmental light sources (e.g. the sun) vary in intensity over the course of the day.
Secondly, because we use the visible light spectrum in our everyday lives, it is open to interference from several sources, whether working indoors or outdoors. These interferences would be difficult and potentially expensive to control. In a typical home or office environment, there are several types of light source that can cause such interference; this may happen either when the light source varies the intensity of its output over time, as noted earlier, or when the subject changes the orientation of his or her head with respect to the light source (for example, when turning the head towards or away from a lamp).
Thus, the infrared setup of the REACT eye-tracker is ideal with respect to the above-mentioned problems: infrared lighting provides a consistent light source that does not fluctuate over time, and it is not affected by the subject's position in space, as the light source is fixed relative to the subject's head. This effect can be clearly seen in Figure 7. In (a), where the eye is illuminated by visible light, the reflection of a window can be seen on top of the pupil, as well as the non-uniform lighting distribution across the image. In contrast, in (b), where the eye is illuminated by infrared light, the image brightness depends solely on the diffusion properties of the infrared source (the infrared LED).
Furthermore, the infrared setup offers the significant advantage of the dark pupil effect, whereby the pupil appears very dark and the iris very light regardless of the subject's eye colour, owing to the different infrared reflection properties of the pupil and the iris. In comparison, the visible light image makes the separation of iris and pupil very hard, especially considering the reflections of other objects on the eye.
FIGURE 7: ILLUSTRATION OF IMAGES TAKEN IN THE VISIBLE (A) VERSUS INFRARED (B) LIGHT SPECTRUM.
Figure 8 illustrates a set of example source images and their corresponding heat map images.
Heat maps are generated by mapping the [0, 255] grey-scale range to a special ordering of the RGB (Red-Green-Blue) colour space, where low intensity values are coloured blue (cold) and high intensity values are coloured red (hot). Heat map images were found to be useful in visually inspecting grey-scale images, as they expand the 8-bit [0, 255] range into a much wider range of colours and can provide insight into the structure of the image by increasing the visibility of intricacies that would otherwise be hardly distinguishable.
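As an illustration, a minimal blue-to-red mapping in the spirit described above might look as follows; the exact colour ramp used to produce the thesis figures is not specified here, so this piece-wise linear "jet-like" ramp is an assumption:

```python
import numpy as np

def heat_map(grey):
    """Map an 8-bit grey image to a blue (cold) -> red (hot) RGB image.

    A simple piece-wise linear ramp: low intensities get high blue,
    high intensities get high red, and green peaks mid-range.
    """
    g = grey.astype(np.float32) / 255.0
    r = np.clip(1.5 * g - 0.25, 0.0, 1.0)          # ramps up in the hot half
    b = np.clip(1.25 - 1.5 * g, 0.0, 1.0)          # ramps down from the cold half
    gr = np.clip(1.0 - np.abs(2.0 * g - 1.0) * 1.5, 0.0, 1.0)  # mid-range peak
    return (np.stack([r, gr, b], axis=-1) * 255).astype(np.uint8)

# Black maps to pure blue, white to pure red, mid-grey to a greenish hue
grey = np.array([[0, 128, 255]], dtype=np.uint8)
rgb = heat_map(grey)
```

The mapping is monotone in "temperature", so relative intensity orderings in the grey image are preserved in the colour image.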
As can be seen in Figure 8, the infrared light source points directly at the eye centre from below, and thus the majority of the "heat" can be seen along the lower eyelid and on the eyeball. The dark pupil effect can also easily be noticed in both the original and heat map images. The darkest/coldest area is usually the pupil, whereas the brightest/hottest area is the glint4. Once again, it can be seen that the pupil is always dark whereas the iris appears as a shade of grey, regardless of the subject's eye colour. As already mentioned, this is because of the different reflection properties of the pupil and iris.
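The observation that the pupil is usually the darkest region and the glint the brightest suggests a simple first-pass localisation. The sketch below is illustrative only, not the REACT detection algorithm; it takes the box-filtered minimum and maximum, which is more robust to single-pixel sensor noise than a raw argmin/argmax:

```python
import numpy as np

def darkest_and_brightest(grey, k=15):
    """Return (row, col) of the darkest (pupil candidate) and brightest
    (glint candidate) k-by-k neighbourhoods of a grey eye image."""
    pad = k // 2
    padded = np.pad(grey.astype(np.float64), pad, mode="edge")
    # Integral image (summed-area table) for a fast box filter
    c = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))
    h, w = grey.shape
    local_sum = (c[k:k + h, k:k + w] - c[:h, k:k + w]
                 - c[k:k + h, :w] + c[:h, :w])
    mean = local_sum / (k * k)
    pupil_yx = np.unravel_index(np.argmin(mean), mean.shape)
    glint_yx = np.unravel_index(np.argmax(mean), mean.shape)
    return pupil_yx, glint_yx

# Synthetic eye image: a dark blob and a bright blob on a grey field
img = np.full((40, 40), 128, dtype=np.uint8)
img[10:17, 10:17] = 10    # "pupil"
img[25:32, 25:32] = 250   # "glint"
(py, px), (gy, gx) = darkest_and_brightest(img, k=15)
```

Such a coarse estimate would typically only seed a proper detector, since eyelashes, shadows and eyelids can also produce dark regions.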