Figure-Ground Segmentation Using Multiple Cues
Peter Nordlund
Dissertation, May 1998
Computational Vision and Active Perception Laboratory (CVAP)
Department of Numerical Analysis and Computing Science
TRITA-NA-P98/05 • ISSN 1101-2250 • ISRN KTH/NA/P--98/05--SE • CVAP218
An important aspect of the thesis is that the problems are addressed from a systems perspective: it is the performance of the entire system that is important, not that of component algorithms. Hence, we regard the processes as part of perception-action cycles and investigate approaches that can be implemented for real-time performance.
The thesis begins with a general discussion of the problem of figure-ground segmentation, followed by a discussion of attention. Experiments showing some implementations of attentional mechanisms with emphasis on real-time performance are presented. We also provide experimental results on closed-loop control of a head-eye system pursuing a moving object.
A system integrating motion detection, segmentation based on motion and segmentation based on stereo is also presented. Maintenance of an already achieved figure-ground segmentation is discussed. We demonstrate how an initially obtained figure-ground segmentation can be maintained by switching to another cue when the initial one disappears. The use of multiple cues is exemplified by a method of segmenting a 2-D histogram using a multi-scale approach. This method is further simplified to suit our real-time performance restrictions. Throughout the thesis we stress the importance of systems capable of operating continuously on images coming directly from cameras. We thereby demonstrate that our systems consist of a complete processing chain, with no links missing, which is essential when designing working systems.
Acknowledgments
First of all I am grateful to my supervisor Jan-Olof Eklundh, who has enthusiastically guided me through the world of computer vision. Without his support and interest this thesis could not have been written.
I am also very grateful to Tomas Uhlin who has provided lots of invaluable help both as an inspiring collaborator, and also with programming, especially for the MaxVideo and Transputers.
Another person formerly at this laboratory to whom I wish to pay special tribute is my co-author Atsuto Maki, who has now returned to Japan.
Throughout my work at the Computational Vision and Active Perception laboratory I have been working with the head-eye system designed and constructed by Kourosh Pahlavan, without which much of this work would not have been possible.
I would also like to thank Daniel Fagerström for long and interesting discussions over the years, Jonas Gårding for being a constant source of good ideas, Magnus Andersson, Demetrios Betsis, Henrik Christensen, Ambjörn Naeve, Stefan Carlsson, Lars Bretzner, Niklas Nordström, Göran Olofsson, Tony Lindeberg, Kjell Brunnström, Mengxiang Li, Antônio Fransisco, Antonis Argyros, Anders Orebäck, Mattias Lindström, Mårten Björkman, Jorge Dias, Fredrik Bergholm, Pär Fornland, Peter Nillius, Kristian Simsarian and David Jacobs for stimulating discussions.
I would like to thank Harald Winroth and Matti Rendahl for their support in all matters concerning programming and computers. Carsten Bräutigam is another member of the group that I would like to thank for his tremendous patience regarding my LaTeX questions.
I would like to thank Elenore Janson for proof-reading, and for her whole-hearted support during the time of writing this thesis.
In one of my ﬁrst projects I had the exciting opportunity to collaborate with Jean-Paul Bernoville, Henri Lamarre and Yann Le Guilloux, Michel Dhome, Jean-Tierry Lapreste and Jean-Marc Lavest.
This work has in part been supported by TFR, The Swedish Research Council for Engineering Science. It has during its latter stages also received support from CAS, The Centre for Autonomous Systems. We gratefully acknowledge this support.
Finally I would like to mention all the other members and former members of the group, Birgit Ekberg-Eriksson, Ann Bengtsson, Lars Olsson, Anders Lundquist, Kiyoyuki Chinzei, Kenneth Johansson, Kazutoshi Okamoto, Anna Thorsson, Pascal Grostabussiat, Akihiro Horii, Wei Zhang, Cornelia Fermüller, Yannis Aloimonos, Carla Capurro, Mattias Bratt, Björn Sjödin, Martin Eriksson, Danica Kragic, Lars Petersson, Danny Roobaert, Mikael Rosbacke, Hedvig Sidenbladh, Daniel Svedberg, Dennis Tell, Maria Ögren, Ron Arkin, Patric Jensfelt, Oliver Mertschat, Olle Wijk, Andres Almansa, Svante Barck-Holst, Josef Bigün, Jørgen Bjørnstrup, Henrik Juul-Hansen, Rosario Cretella and Uwe Schneider. Thank you all.
This thesis also exists in a long version with the papers below included. This version of the thesis only contains part one, the summary.
Paper A Nordlund, P. and Uhlin, T. (1996). Closing the loop: Detection and pursuit of a moving object by a moving observer, Image and Vision Computing 14(4): 265–275.
Paper B Uhlin, T., Nordlund, P., Maki, A. and Eklundh, J.-O. (1995a). Towards an active visual observer, Proc. 5th International Conference on Computer Vision, Cambridge, MA, pp. 679–686.
Paper C Uhlin, T., Nordlund, P., Maki, A. and Eklundh, J.-O. (1995b). Towards an active visual observer, Technical Report ISRN KTH/NA/P--95/08--SE, Dept. of Numerical Analysis and Computing Science, KTH, Stockholm, Sweden. Shortened version in Proc. 5th International Conference on Computer Vision pp 679–686.
Paper D Maki, A., Nordlund, P. and Eklundh, J.-O. (1998). Attentional Scene Segmentation: Integrating Depth and Motion from Phase, Extended version of tech report ISRN KTH/NA/PSE submitted to Computer Vision and Image Understanding.
Paper E Nordlund, P. and Eklundh, J.-O. (1997b). Towards a seeing agent, First Int. Workshop on Cooperative Distributed Vision, Kyoto, Japan, pp. 93–123. Also in tech report ISRN KTH/NA/PSE.
Paper F Nordlund, P. and Eklundh, J.-O. (1998). Real-time figure-ground segmentation, Technical Report ISRN KTH/NA/P--98/04--SE, Dept. of Numerical Analysis and Computing Science, KTH, Stockholm, Sweden. Will be submitted to Int. Conf. on Vision Systems, Jan 99.
When any of the papers above is cited, the citation is in boldface. Some of the papers also exist as technical reports or conference contributions; these references are listed here:
Paper A Nordlund, P. and Uhlin, T. (1995). Closing the loop: Pursuing a moving object by a moving observer, Technical Report ISRN KTH/NA/P--95/06--SE, Dept. of Numerical Analysis and Computing Science, KTH, Stockholm, Sweden. Shortened version in Proc. 6th International Conf. on Computer Analysis of Images and Patterns. Also in Image and Vision Computing vol. 14, no 4, May 1996, pp 265–275.
Nordlund, P. and Uhlin, T. (1995). Closing the loop: Pursuing a moving object by a moving observer, in V. Hlaváč and R. Šára (eds), Proc. 6th International Conf. on Computer Analysis of Images and Patterns, Prague, Czech Republic, pp. 400–407.
Paper D Maki, A., Nordlund, P. and Eklundh, J.-O. (1996). A computational model of depth-based attention, Technical Report ISRN KTH/NA/P--96/05--SE, Dept. of Numerical Analysis and Computing Science, KTH, Stockholm, Sweden. Shortened version in Proc. 13th International Conference on Pattern Recognition.
Maki, A., Nordlund, P. and Eklundh, J.-O. (1996). A computational model of depth-based attention, Proc. 13th International Conference on Pattern Recognition, Vol. IV, IEEE Computer Society Press, Vienna, Austria, pp. 734–739.
Paper E Nordlund, P. and Eklundh, J.-O. (1997). Maintenance of figure-ground segmentation by cue-selection, Technical Report ISRN KTH/NA/P--97/05--SE, Dept. of Numerical Analysis and Computing Science, KTH, Stockholm, Sweden. Also in First Int. Workshop on Cooperative Distributed Vision 1997, Kyoto, Japan, pp 93–123.
Introduction
Humans looking around in the world can, seemingly without effort, segment out and distinguish different objects. Despite this, the corresponding capability has largely eluded the efforts of researchers in computer vision. Figure-ground segmentation, or, as it is generally regarded in computer vision, image segmentation, remains a difficult problem even when a more precise definition is given. Of course, it is worth noting that most work in computer vision concerns pictorial vision, i.e. analysis of (usually pre-recorded) images, and not any kind of natural vision, in which the observer directly samples the three-dimensional world. One can argue that figure-ground segmentation is different in the two cases. Indeed, the work presented in this thesis supports this view. Nevertheless, one can often assume that very much the same information is available in the two cases, so the discrepancy in performance between humans and machines may appear puzzling.
In this work we do not consider the segmentation or grouping problems in their full scope. Rather, we investigate the applicability in computer vision of psychophysical results about the human visual system indicating that three-dimensional cues play an important role in such processes (see e.g. Nakayama and Silverman, 1986). More generally, there are findings indicating that the percepts, and not only the retinal information as such, are crucial. A recent collection of results in this direction is given in (Rock, 1997), and an example is described in Chapter 2.2.
A second and central theme of our work deals with the use of multiple cues. A human observer is normally gifted with a visual apparatus that can process binocular and monocular information as well as motion, color and form (see e.g. Zeki, 1993, for a deliberation on this multifunctional structure). Hence, it is reasonable to conclude that a machine vision system capable of functioning in a real environment analogously should be able to capitalize on whatever information the world offers and also to integrate this information. If we expect to achieve even a fraction of human performance in an unknown world we should consider systems that have similar capabilities.
The third major theme of the thesis concerns the systems perspective. Much of the traditional work in computer vision uses a reductionist approach and addresses specific visual problems in isolation, such as reconstruction of the scene from stereo pairs of images, segmentation of an image into sets of regions optimizing some criterion, or recognition of faces or objects seen against a simple background. Such approaches provide us with valuable insights into the limits of machine perception and the feasibility of algorithms. However, this work often touches only superficially on the questions of which processes or system (possibly a human interpreter) the information is going to, or even where it emanates from. Therefore, it does not directly tell us how to design systems that can see in the sense of a human observer or even a seeing robot. In this work, which forms part of the longstanding work on active vision within the CVAP group (see e.g. IJCV, Vol 17:2, Eklundh, 1996), we are interested in processes that work in an integrated system of the latter type, and ultimately in a system that functions in a closed loop with the environment. Admittedly, not all the work we report here has this emphasis, but the perspective is such. In particular, the work forms an attempt to support the general claim that from a systems perspective many of
the difficult problems arising in machine vision can be solved, and that the algorithms required sometimes are both simple and coarse, and exactly because of that also robust. The ESPRIT projects VAP I and II formed a step towards demonstrating this principle (see e.g. Crowley and Christensen, 1995, especially the introduction). The work described here continues in the same spirit.
1.1 Outline of our approach
The main theme in this thesis is figure-ground segmentation. We approach this problem by using multiple cues, especially 3-D cues, in order to utilize the fact that the world is rich in information.
We also apply a systems approach to these problems, in the spirit of earlier work by Pahlavan (1993) and Uhlin (1996) in our laboratory, who focused on systems operating in real time in interaction with the environment. Our goals have been:
1. To demonstrate that robust figure-ground segmentation can be achieved through the use of 3-D cues.
2. To demonstrate that integration of multiple cues is essential in this context.