
In our model of curve grammars, the second kind of rule is given by the midpoint distribution $\mu_{X \to YZ}$, and the third kind of rule is trivial (see Chapter 2).

–  –  –

While human vision crucially relies on global context to resolve local ambiguity (Bar [2004]), computer vision algorithms often use a pipeline that makes hard low-level decisions about image interpretation and then feeds this output into higher-level analysis. Algorithms will be more accurate and less brittle if they can avoid making such hard decisions, as advocated in Amit and Trouvé [2007], Jin and Geman [2006], and Felzenszwalb and Huttenlocher [2003].

For example, the visual chalkboard will contain various stray marks. Rather than filtering these out with some sort of quality threshold, we would prefer to keep them as possibilities, try to assemble an overall interpretation of the board, and then discount any stray marks that do not participate in that interpretation. This seems much more fruitful than filtering out stray marks (along with some genuine letters) and then having to be very forgiving of words that are missing some of their letters altogether.

This requires us to combine and resolve information at different levels. Grammatical methods provide us with powerful inference algorithms for determining the most likely decomposition of a scene under a given compositional model. Since important local ambiguities will lead to different global decompositions, this is exactly what is needed: the overall likelihood of the decomposition is a common currency that allows us to negotiate between fitting the local data well, and explaining the local data in a way that allows a good decomposition of the rest of the image.
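To make the "common currency" point concrete, here is a toy sketch. The numbers are invented for illustration, and it assumes the overall score is simply the product of a local appearance score and the probability of the word-level interpretation each local choice leads to (added in log space); a real grammar would obtain the global term by parsing rather than from a lookup table.

```python
import math

# Hypothetical local scores: how well each label fits one ambiguous chalk mark.
local_prob = {"i": 0.4, "stray": 0.6}

# Hypothetical global scores: probability of the word-level interpretation each
# local choice leads to (e.g. reading "hit" if the mark is an 'i', "ht" if stray).
global_prob = {"i": 0.05, "stray": 0.001}


def best_label(local, global_):
    """Pick the label maximizing the joint log-likelihood of local and global evidence."""
    def score(label):
        return math.log(local[label]) + math.log(global_[label])
    return max(local, key=score)


print(max(local_prob, key=local_prob.get))  # -> stray (hard local decision)
print(best_label(local_prob, global_prob))  # -> i (joint decision)
```

A hard low-level decision would discard the mark as stray, while the joint score keeps the reading that makes the rest of the word sensible.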

1.2.2 Modeling Clutter with Object Sub-parts

We would like to build specific and accurate models of clutter. For instance, for the visual chalkboard, it would be helpful to have a model for stray chalk marks, rather than a model for arbitrary unexplained patches; otherwise we will be tempted to explain stray chalk marks as some letter, possibly a lower-case ’i’. If we try to set our threshold high enough that we don’t do this, we might start labeling some genuine i’s as background. If we instead have a model for chalk marks, we can explain stray chalk marks and i’s as particular sorts of chalk marks, and differentiate them based on context and appearance.

Jin and Geman [2006] suggest modeling clutter in the background with sub-parts of the objects of interest. Since objects in the background are still objects, and are often related to the objects of interest, this might allow us to build a much stronger background model in many cases. In addition, by modeling clutter with sub-parts, we are less likely to hallucinate whole objects when we see only sub-parts. It is therefore especially important to have a cheap way to explain clutter that closely resembles sub-parts of the objects of interest.

With such a system, we might even be able to ignore rather subtle clutter, such as some stray letters, or even words, from a previous lecture that was not completely erased. Clutter words would not be part of a line of text, and would thus be identifiable as clutter in the parsed output, where they would be excluded from the main body of text.

1.2.3 Whole Scene Parsing

It is useful to demand whole scene parses, since doing so avoids the need to fine-tune detection thresholds and decision boundaries (Amit and Trouvé [2007]). Consider the example of the visual chalkboard. Instead of having to set a filter on chalk marks to remove stray ones, we simply explain them and discount them, since they are not part of any larger structure, such as a word, that we find interesting.

1.3 Hierarchical Decomposition and Rich Description

It would be very useful if vision algorithms could achieve richer understanding of scenes, and produce richer descriptions of images. A rich understanding of a scene requires an understanding of the relationship between objects. Consider an image containing a person and two objects, where the person is pointing at one of the objects. This is an important piece of information, and it cannot easily be described by a list of the objects in the image.

Some scenes contain important objects that are nothing more than a particular grouping of other objects: a crowd is just a collection of people. Moreover, the nature of the collective object is determined partly by the relationship between its elements. A crowd and a marching band are two very different objects, but this difference cannot be expressed in a simple listing of objects. How can vision algorithms achieve this level of understanding, or even represent it?

One straightforward and general framework for rich description is a labeled hierarchical decomposition of a scene (Zhu and Mumford [2006]). This takes the form of a tree, where:

• The root node describes the entire image.

• Each node is labeled with the name of an object, and an area of the image described, which may be approximate.

• The children of a node describe sub-parts of the parent, and have areas inside the parent node’s area.

We can explicitly encode such a description in an XML-like language. Since grammatical methods produce and work with such hierarchical decompositions, they give a natural model for such structures.
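A minimal sketch of such a description, using Python's standard-library `xml.etree.ElementTree`. The `Node` class, the `<node label=... box=...>` schema, and the use of axis-aligned boxes as the "approximate area" are all assumptions made for illustration; the code enforces the third condition above, that each child's area lies inside its parent's.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a labeled hierarchical decomposition (hypothetical schema)."""
    label: str
    box: tuple  # approximate image area as (x0, y0, x1, y1); boxes are an assumption
    children: list = field(default_factory=list)


def contains(outer, inner):
    """True if box `inner` lies inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def to_xml(node):
    """Serialize a node and its descendants, checking that each child lies inside its parent."""
    elem = ET.Element("node", label=node.label,
                      box=" ".join(str(v) for v in node.box))
    for child in node.children:
        if not contains(node.box, child.box):
            raise ValueError(f"{child.label!r} is not inside {node.label!r}")
        elem.append(to_xml(child))
    return elem


# The root node describes the entire image; children describe sub-parts.
scene = Node("image", (0, 0, 640, 480), [
    Node("person", (100, 50, 250, 450), [Node("face", (140, 60, 210, 140))]),
    Node("poster", (400, 100, 600, 300)),
])
print(ET.tostring(to_xml(scene), encoding="unicode"))
```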

A marching band and a crowd would differ in the label assigned to the relevant node, but the children of those nodes would be similar, i.e., individual people.

In the example of the visual chalkboard, rich description would produce more useful output than simple text. For instance, in transcribing a series of boards as lecture notes, we would like to be able to label non-textual regions as figures and include them verbatim, or render them as a collection of lines. Figures could also include text labels, for which we would want to know both their relationship to the figure and their textual content; a hierarchical format such as XML would best represent the logical structure involved.

1.3.1 Training on rich annotations

We would also like to train grammars simultaneously on an image and a hand-made rich hierarchical description of that image. This ensures that our trained grammars will produce semantically meaningful decompositions of new images: the decomposition will have a structure similar to that produced by a human, and we will be able to transfer labels onto the nodes of the decomposition.

This will make it more feasible to output meaningful rich descriptions on a wide range of data. Stochastic grammatical methods allow us to put soft constraints on descriptions (“it is unlikely that a person will have an apple for a face”), which will be more flexible and less brittle than hard constraints (“a person’s face can have eyes, nose, etc., but not fruit”).

We can thus favor more realistic descriptions of scenes while still producing useful output on very unrealistic scenes (such as Magritte’s “The Son of Man”).
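The difference between soft and hard constraints can be seen in a toy rule table (the probabilities are hypothetical, not a trained grammar). Under a hard constraint a forbidden filler has probability zero, so the scene cannot be parsed at all; under a soft constraint it is merely penalized.

```python
import math

# Hypothetical rule probabilities for what may fill the "face" part of a person.
# Soft constraint: an unusual filler is unlikely but not impossible.
soft_face_rules = {"face": 0.999, "apple": 0.001}
# Hard constraint: anything not explicitly allowed is forbidden.
hard_face_rules = {"face": 1.0}


def log_score(rules, filler):
    """Log-probability of using `filler` for the face; -inf if the rules forbid it."""
    p = rules.get(filler, 0.0)
    return math.log(p) if p > 0 else -math.inf


# A realistic scene is strongly preferred under the soft rules...
print(log_score(soft_face_rules, "face") > log_score(soft_face_rules, "apple"))  # True
# ...but a Magritte-like scene still gets a finite score, so it can be parsed.
print(log_score(soft_face_rules, "apple"))   # about -6.91
# Under the hard rules the same scene has no valid parse at all.
print(log_score(hard_face_rules, "apple"))   # -inf
```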

Note that we may still gain information from rich descriptions without a specified correspondence to particular parts of the image. Knowing the correct structure and guessing at the exact correspondence is no harder than guessing both the correct structure and the correspondence. Such descriptions would be less labor-intensive to produce, so we might be able to train on larger datasets.

In general, supervised learning is easier than unsupervised learning. In computational linguistics, this means that learning a grammar for natural language is much more tractable given samples of natural language that have been parsed. (These are called bracketed samples.) It is likely that training on rich annotations would also make learning visual grammars much easier.

1.3.2 Some Problems with Rich Description, and Solutions

For a number of reasons, dealing with rich descriptions is more complicated than dealing with simpler descriptions. Grammatical methods and hierarchical decomposition give ways around some of these problems. Two important issues are Class/Subclass Ambiguity and Questionable Parts.

First, it is worth noting that hierarchical decomposition degrades gracefully into a flat description, since we can always decompose the root node into a list of objects. Hierarchical decomposition encourages us, but does not force us, to explain how any two parts of our annotation are related, which makes for more useful descriptions.

• Decomposition provides a reasonable and elegant solution to the Questionable Parts problem. David Marr explained it thus:

–  –  –

In any annotation system rich enough that we might simultaneously label a wheel and a car, or eyes and a face, in the same image, there is an arbitrary choice of how many things to label. Forcing these descriptions to be consistent is very difficult (Russell et al. [2008]). This is especially pronounced with agglomerations, like crowds of people.

There is no point at which a group of people meaningfully becomes a “crowd”; two people are not considered a crowd, and one hundred people are considered a crowd, but it is impossible to draw a clear line between crowds and not-crowds.

Forcing consistency may even be counter-productive. If we must label every face in a crowd, then a crowd seen from a distance will have an enormous number of face labels, most of which are basically guesses. If we never label faces in a crowd, then our visual search engine may fail to retrieve images of a particular person when they are in a group, even if they are clearly visible.

Hierarchical decomposition describes a scene as a hierarchy of objects, and further decomposition of these objects may be optional; as long as we know something about the object’s appearance, we don’t have to demand that it be broken up into its constituent parts.

• Grammatical methods also naturally address the Class/Subclass Ambiguity problem: descriptions of an object can be general or specific, and the level of specificity is fairly arbitrary. It is clear that we want the query “dancer” in a visual search engine to return images of people dancing, but these same images should also be returned on the query “person”.

Grammatical methods model such ambiguity as OR nodes in an AND-OR structure (Zhu and Mumford [2006]), or by rules of the form

–  –  –

The LabelMe dataset uses a set of labels derived from WordNet (Fellbaum [1998]) that are related via class-subclass relationships. The maintainers of the dataset claim that it requires very little work to map the arbitrary labels provided by users into these more precise and formalized labels (Russell et al. [2008]). Given such techniques, the Class-Subclass ambiguity problem is probably not a fundamental barrier to rich description.
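Once labels are organized into class/subclass relationships, the "dancer"/"person" query becomes straightforward: a query matches any label that is the query itself or one of its transitive subclasses. The hierarchy below is a hypothetical fragment written for illustration, not the actual LabelMe or WordNet label set.

```python
# Hypothetical class/subclass hierarchy, stored as child -> parent.
PARENT = {
    "dancer": "person",
    "movie star": "person",
    "person": "entity",
    "bass (fish)": "fish",
    "bass (instrument)": "instrument",
    "fish": "entity",
    "instrument": "entity",
}


def is_a(label, query):
    """True if `label` equals `query` or is a (transitive) subclass of it."""
    while label is not None:
        if label == query:
            return True
        label = PARENT.get(label)
    return False


def search(images, query):
    """Return ids of images having at least one label that is-a `query`."""
    return [img for img, labels in images.items()
            if any(is_a(lab, query) for lab in labels)]


images = {"img1": ["dancer"], "img2": ["bass (fish)"], "img3": ["bass (instrument)"]}
print(search(images, "person"))  # ['img1']
print(search(images, "fish"))    # ['img2']
```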

–  –  –

The rich descriptions we have described are naturally represented in XML and similar languages, which yields some opportunities:

• XML is reasonably human-readable and human-writable. (Comparable formats like YAML are even more so.) This means that rich photo-tagging could be done by many people, and also that many users could benefit from a visual search engine that accepts structured queries. For example, photos inside an image could be recursively described as such, allowing us to separate the images containing an actual movie star from those containing a poster depicting that movie star. As another example, having a notion of classes and subclasses would allow us to search for BASS FISH and receive only pictures of bass fish, and not pictures of the instrument or of other fish.

• Existing technology such as XPath allows computer programs to perform efficient and flexible searches on XML documents. This means that fairly complex image-sorting tasks could potentially be automated. This would be valuable, because some image-sorting tasks, such as content moderation, are reported to be very psychologically damaging when performed by humans.

• One particular XML-based file format is Scalable Vector Graphics (SVG). If we can learn rich visual grammars, we could hope to recover artistic information from images, approximating a drawing as a collection of strokes and fills in an SVG document.

Ultimately, we might hope to learn artistic concepts such as drawing style or font, which would greatly expand the power of graphics programs.
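As a sketch of the structured-query idea, the limited XPath subset in Python's standard-library `xml.etree.ElementTree` is enough to separate an image of an actual movie star from an image of a poster depicting one. The `<node label=...>` schema and the label names are assumptions; a full XPath 1.0 implementation (for example lxml) would allow an ancestor-axis query instead of the identity comparison used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical rich descriptions of two images, using an assumed <node label=...> schema.
DOCS = {
    "img1": '<node label="image"><node label="movie star"/></node>',
    "img2": ('<node label="image"><node label="poster">'
             '<node label="movie star"/></node></node>'),
}


def shows_actual_star(xml_text):
    """True if some 'movie star' node is not nested inside a 'poster' node."""
    root = ET.fromstring(xml_text)
    stars = root.findall(".//node[@label='movie star']")
    # ElementTree's XPath subset has no ancestor axis, so collect the depicted
    # stars separately and compare by object identity.
    depicted = root.findall(".//node[@label='poster']//node[@label='movie star']")
    return any(all(s is not d for d in depicted) for s in stars)


print([img for img, doc in DOCS.items() if shows_actual_star(doc)])  # ['img1']
```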

–  –  –

Grammatical methods offer a very powerful and general way to make vision algorithms more robust: if we can integrate different statistical models in a modular fashion, especially models trained in different contexts, then our system will be more robust than any single model.

Grammatical methods are well-suited to integrating any statistical model that depends on the qualities of image objects and the relationships between them. When we can map objects and relationships between domains (for example, mapping pictures of text to actual text), this allows us to import already-trained statistical models from very different domains.

Consider transcribing a lecture from the visual chalkboard. The system will better recover from misidentifying letters if it uses higher-level knowledge about the lecture’s language and contents. In particular, we can build grammars that integrate such tried-and-true models as the n-gram model of letters (Manning and Schütze [1999]), the n-gram model of words (Manning and Schütze [1999]), a stochastic grammar model of phrase and sentence structure (Manning and Schütze [1999]), and topic models of word choice in the subject of the lecture (Blei et al. [2003]). All of these models can be trained on large corpora of text, rather than on smaller datasets of images.
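The idea of importing a text-trained model can be sketched with the simplest item on that list: a letter bigram model with add-one smoothing, used to rescore candidate readings of a chalk word. The tiny corpus and the visual likelihoods are made up, and rescoring a short candidate list is a simplification; in the grammatical setting the letter model would be combined with the visual model during parsing.

```python
import math
from collections import Counter


def train_bigrams(corpus):
    """Count letter bigrams in a text corpus, using '^' and '$' as word boundaries."""
    counts, context = Counter(), Counter()
    for word in corpus.split():
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a, b] += 1
            context[a] += 1
    return counts, context


def log_prob(word, counts, context, vocab_size=27):
    """Add-one smoothed bigram log-probability of a word.

    vocab_size=27 assumes lowercase a-z text: 26 letters plus the end marker '$'
    are the possible next symbols.
    """
    padded = "^" + word + "$"
    return sum(math.log((counts[a, b] + 1) / (context[a] + vocab_size))
               for a, b in zip(padded, padded[1:]))


def rescore(visual, counts, context):
    """Pick the candidate maximizing visual log-likelihood plus letter-model log-probability."""
    return max(visual, key=lambda w: visual[w] + log_prob(w, counts, context))


# Made-up training text and made-up visual likelihoods for one chalk word.
corpus = "the graph has a vertex and an edge each graph edge joins two vertices"
visual = {"graph": math.log(0.45), "qraph": math.log(0.55)}

counts, context = train_bigrams(corpus)
print(max(visual, key=visual.get))        # qraph (vision alone)
print(rescore(visual, counts, context))   # graph (vision + letter model)
```

Here the visual model alone slightly prefers "qraph", but the letter model, which has never seen a word starting with "q" or the bigram "qr", tips the decision to "graph".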

–  –  –

Grammatical models are typically context-free, which is fundamentally a matter of making independence assumptions. We argue that independence assumptions can increase the effective amount of training data available.
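A toy count illustrates the claim. Suppose we want to learn the appearance of the letter "e" from labeled chalkboard words (the word list is made up). If appearance is modeled separately in every (word, position) context, the samples are spread thin; under a context-free assumption, every "e" trains the same model. This is only an illustration of the pooling effect, not a formal version of the argument.

```python
from collections import Counter

# Hypothetical training transcriptions; each letter instance has an image sample.
words = ["edge", "vertex", "tree", "edge", "degree", "vertices"]

# Without independence assumptions: appearance is modeled separately for every
# (word, position) context, so the samples of 'e' are split across contexts.
per_context = Counter((w, i) for w in words for i, ch in enumerate(w) if ch == "e")

# Context-free assumption: every 'e' is drawn from the same distribution,
# so all instances pool into a single training set.
pooled = sum(ch == "e" for w in words for ch in w)

print(len(per_context), max(per_context.values()), pooled)  # 11 2 13
```

With these six words, 13 instances of "e" are spread over 11 distinct contexts with at most 2 samples each, whereas the context-free model trains on all 13.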

Remark 1.4.
