In our model of curve grammars, the second kind of rule is given by the midpoint distribution µ_{X→YZ}, and the third kind of rule is trivial (see Chapter 2).
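As a minimal illustration of how such a rule might be applied (assuming, purely for this sketch, a Gaussian midpoint distribution expressed in a coordinate frame normalized by the segment's endpoints; the actual distributions are defined in Chapter 2):

import numpy as np

def sample_midpoint(p, q, mu, sigma, rng):
    # Apply a rule X -> Y Z to the curve segment (p, q): sample a midpoint
    # in a frame where p maps to (0, 0) and q maps to (1, 0), so that the
    # distribution is invariant to translation, rotation, and scale.
    p, q = np.asarray(p, float), np.asarray(q, float)
    u = q - p                        # frame x-axis
    v = np.array([-u[1], u[0]])      # frame y-axis (u rotated 90 degrees)
    z = rng.normal(mu, sigma)        # midpoint in normalized coordinates
    return p + z[0] * u + z[1] * v   # map back to image coordinates

rng = np.random.default_rng(0)
m = sample_midpoint((0, 0), (2, 0), mu=(0.5, 0.3), sigma=(0.05, 0.05), rng=rng)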
While human vision crucially relies on global context to resolve local ambiguity (Bar), computer vision algorithms often use a pipeline that makes hard low-level decisions about image interpretation and then feeds this output into higher-level analysis. Algorithms will be more accurate and less brittle if they can avoid making such hard decisions, as advocated in Amit and Trouvé, Jin and Geman, and Felzenszwalb and Huttenlocher.
For example, in the visual chalkboard, there will be various stray marks on the board. We would prefer not to filter these out with some sort of quality threshold, but instead mark them as possibilities, try to assemble an overall interpretation of the board, and then discount any stray marks that do not participate in the interpretation. This seems much more fruitful than filtering out stray marks, along with some genuine letters, and then having to be very forgiving of words actually missing some of their letters altogether.
This requires us to combine and resolve information at different levels. Grammatical methods provide us with powerful inference algorithms for determining the most likely decomposition of a scene under a given compositional model. Since important local ambiguities will lead to different global decompositions, this is exactly what is needed: the overall likelihood of the decomposition is a common currency that allows us to negotiate between fitting the local data well, and explaining the local data in a way that allows a good decomposition of the rest of the image.
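To make the inference concrete, here is a minimal sketch of a Viterbi-style CKY parser over a one-dimensional sequence of low-level hypotheses; the grammar, symbols, and scores are hypothetical, and the parsers used later operate on richer structures:

import math
from collections import defaultdict

def viterbi_parse(obs_scores, rules, root):
    # obs_scores[i] maps a symbol to log P(low-level evidence i | symbol);
    # rules is a list of (A, B, C, log_prob) for binary rules A -> B C.
    n = len(obs_scores)
    best = defaultdict(lambda: -math.inf)    # (i, j, symbol) -> best log prob
    for i, scores in enumerate(obs_scores):  # spans of length one
        for sym, lp in scores.items():
            best[(i, i + 1, sym)] = lp
    for length in range(2, n + 1):           # longer spans, bottom up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):        # split point
                for a, b, c, lp in rules:
                    s = lp + best[(i, k, b)] + best[(k, j, c)]
                    if s > best[(i, j, a)]:
                        best[(i, j, a)] = s
    return best[(0, n, root)]                # score of the best whole-scene parse

The returned score is exactly the common currency described above: a locally worse explanation of one span can win if it permits a better decomposition of the rest.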
1.2.2 Modeling Clutter with Object Sub-parts
We would like to build specific and accurate models of clutter. For instance, for the visual chalkboard, it would be helpful to have a model for stray chalk marks, rather than a model for arbitrary unexplained patches; otherwise we will be tempted to explain stray chalk marks as some letter, possibly a lower-case ‘i’. If we try to set our threshold high enough that we don’t do this, we might start labeling some genuine i’s as background. If we instead have a model for chalk marks, we can explain stray chalk marks and i’s as particular sorts of chalk marks, and differentiate them based on context and appearance.
Jin and Geman suggest modeling clutter in the background with sub-parts of the objects of interest. Since objects in the background are still objects, and are often related to the objects of interest, this can yield a much stronger background model in many cases. In addition, by modeling clutter with sub-parts, we are less likely to hallucinate whole objects when we see sub-parts; clutter that closely resembles sub-parts of the objects of interest is exactly the clutter we most need a cheap way to explain.
With such a system, we might even be able to ignore rather subtle clutter, such as some stray letters, or even words, from a previous lecture that was not completely erased. Clutter words would not be part of a line of text, and would thus be identifiable as clutter in the parsed output, where they would be excluded from the main body of text.
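As a toy illustration of this "explain, then discount" strategy, consider a single detected stroke that could be a stray mark or a letter; all models, priors, and numbers below are hypothetical:

import math

def explain(ll_mark, ll_letter, lp_mark, lp_letter):
    # Compare the two explanations of the same stroke by posterior score.
    # The sub-part (stray mark) model and the letter model assign similar
    # appearance likelihoods, so context, entering through the prior, decides.
    return "letter i" if ll_letter + lp_letter > ll_mark + lp_mark else "stray mark"

ll = -2.0  # log-likelihood of the stroke's appearance under either model
print(explain(ll, ll, math.log(0.9), math.log(0.001)))  # isolated: stray mark
print(explain(ll, ll, math.log(0.2), math.log(0.6)))    # inside a word: letter i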
1.2.3 Whole Scene Parsing
It is useful to demand whole scene parses, since this avoids the need to fine-tune detection thresholds and decision boundaries (Amit and Trouvé). Consider the example of the visual chalkboard. Instead of having to set a filter on chalk marks to screen out stray ones, we simply explain them and discount them, since they are not part of any larger structure, such as a word, that we find interesting.
1.3 Hierarchical Decomposition and Rich Description
It would be very useful if vision algorithms could achieve richer understanding of scenes, and produce richer descriptions of images. A rich understanding of a scene requires an understanding of the relationship between objects. Consider an image containing a person and two objects, where the person is pointing at one of the objects. This is an important piece of information, and it cannot easily be described by a list of the objects in the image.
Some scenes contain important objects that are nothing more than a particular grouping of other objects: a crowd is just a collection of people. Moreover, the nature of the collective object is determined partly by the relationship between its elements. A crowd and a marching band are two very different objects, but this difference cannot be expressed in a simple listing of objects. How can vision algorithms achieve this level of understanding, or even represent it?
One straightforward and general framework for rich description is a labeled hierarchical decomposition of a scene (Zhu and Mumford). This takes the form of a tree, where:
• The root node describes the entire image.
• Each node is labeled with the name of an object and with the area of the image it describes, which may be approximate.
• The children of a node describe sub-parts of the parent, and have areas inside the parent node’s area.
We can explicitly encode such a description in an XML-like language. Since grammatical methods produce and work with such hierarchical decompositions, they give a natural model for such structures.
A marching band and a crowd would differ in the label assigned to the relevant node, but the children of those nodes would be similar, i.e., individual people.
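Concretely, such a decomposition might be encoded as follows (the tags, attributes, and coordinates are illustrative, not a fixed schema):

<scene width="640" height="480">
  <crowd bbox="40,60,420,300">
    <person bbox="45,80,120,290">
      <face bbox="70,85,105,120"/>
    </person>
    <person bbox="130,75,210,295"/>
    <!-- further people may be left undecomposed -->
  </crowd>
  <building bbox="430,10,630,470"/>
</scene>

A marching band would carry a different label on the top node, while the person nodes beneath it would look much the same.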
In the example of the visual chalkboard, rich description would produce more useful output than simple text. For instance, in transcribing a series of boards as lecture notes, we would like to be able to label non-textual regions as figures and include them verbatim, or render them as a collection of lines. Figures could also include text labels, for which we would want to know both their relationship to the figure and their textual content; a hierarchical format such as XML would best represent the logical structure involved.
1.3.1 Training on Rich Annotations
We would also like to train grammars simultaneously on an image and a hand-made rich hierarchical description of that image. This ensures that our trained grammars will produce semantically meaningful decompositions of new images: the decomposition will have a structure similar to that produced by a human, and we will be able to transfer labels onto the nodes of the decomposition.
This will make it more feasible to output meaningful rich descriptions on a wide range of data. Stochastic grammatical methods allow us to put soft constraints on descriptions (“it is unlikely that a person will have an apple for a face”), which will be more flexible and less brittle than hard constraints (“a person’s face can have eyes, nose, etc., but not fruit”).
We can thus favor more realistic descriptions of scenes while still producing useful output on very unrealistic scenes (such as Magritte’s “The Son of Man”).
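In a stochastic grammar, such soft constraints are just rule probabilities. A minimal sketch, with hypothetical rules and numbers:

import math

# A face made of an apple is heavily penalized, not forbidden, so a
# Magritte-like scene still receives a valid (if low-probability) parse.
rule_log_prob = {
    ("FACE", ("EYES", "NOSE", "MOUTH")): math.log(0.95),
    ("FACE", ("APPLE",)):                math.log(1e-4),
}

def parse_log_prob(rules_used):
    # total log probability of the grammar rules used in a parse tree
    return sum(rule_log_prob[r] for r in rules_used)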
Note that we may still gain information from rich descriptions without a specified correspondence to particular parts of the image. Knowing the correct structure and guessing at the exact correspondence is no harder than guessing both the correct structure and the correspondence. Such descriptions would be less labor-intensive to produce, so we might be able to train on larger datasets.
In general, supervised learning is easier than unsupervised learning. In computational linguistics, this means that learning a grammar for natural language is much more tractable given samples of natural language that have been parsed. (These are called bracketed samples.) It is likely that training on rich annotations would also make learning visual grammars much easier.
1.3.2 Some Problems with Rich Description, and Solutions
For a number of reasons, dealing with rich descriptions is more complicated than dealing with simpler descriptions. Grammatical methods and hierarchical decomposition give ways around some of these problems. Two important issues are Class/Subclass Ambiguity and Questionable Parts.
First, it is worth noting that hierarchical decomposition degrades gracefully into a flat description, since we can always decompose the root node into a list of objects. Hierarchical decomposition presses, but does not force, us to explain how any two parts of our annotation are related, making for more useful description.
• Decomposition provides a reasonable and elegant solution to the Questionable Parts problem. In any annotation system rich enough that we might simultaneously label a wheel and a car, or eyes and a face, in the same image, there is an arbitrary choice of how many things to label. Forcing these descriptions to be consistent is very difficult (Russell et al.). This is especially pronounced with agglomerations, like crowds of people.
There is no point at which a group of people meaningfully becomes a “crowd”; two people are not considered a crowd, and one hundred people are considered a crowd, but it is impossible to draw a clear line between crowds and not-crowds.
Forcing consistency may even be counter-productive. If we must label every face in a crowd, then a crowd seen from a distance will have an enormous number of face labels, most of which are basically guesses. If we never label faces in a crowd, then our visual search engine may fail to retrieve images of a particular person when they are in a group, even if they are clearly visible.
Hierarchical decomposition describes a scene as a hierarchy of objects, and further decomposition of these objects may be optional; as long as we know something about the object’s appearance, we don’t have to demand that it be broken up into its constituent parts.
• Grammatical methods also naturally address the Class/Subclass Ambiguity problem:
descriptions of an object can be general or specific, and the level of specificity is fairly arbitrary. It is clear that we want the query “dancer” in a visual search engine to return images of people dancing, but these same images should also be returned on the query “person”.
Grammatical methods model such ambiguity as OR nodes in an AND-OR structure (Zhu and Mumford), or by class-to-subclass rules of the form PERSON → DANCER, as in the sketch below.
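A schematic fragment of such an AND-OR structure, with hypothetical symbols and probabilities:

# Each nonterminal maps to a list of alternatives (OR choices); within an
# alternative, the listed symbols are the parts of a decomposition (AND).
grammar = {
    "PERSON": [(["DANCER"], 0.05),       # OR: subclass rules PERSON -> DANCER
               (["PEDESTRIAN"], 0.95)],  #     and PERSON -> PEDESTRIAN
    "DANCER": [(["HEAD", "TORSO", "RAISED-ARMS", "LEGS"], 1.0)],  # AND: parts
}
# A query for "person" accepts any derivation rooted at PERSON, so an image
# parsed via PERSON -> DANCER is returned for both "dancer" and "person".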
The LabelMe dataset uses a set of labels derived from WordNet (Fellbaum) that are related via class-subclass relationships. The maintainers of the dataset claim that it requires very little work to map the arbitrary labels provided by users into these more precise and formalized labels (Russell et al.). Given such techniques, the Class/Subclass Ambiguity problem is probably not a fundamental barrier to rich description.
The rich descriptions we have described are naturally represented in XML and similar languages, which yields some opportunities:
• XML is reasonably human-readable and human-writable. (Comparable formats like YAML are even more so.) This means that rich photo-tagging could be done by many people, and also that many users could benefit from a visual search engine that accepts structured queries. For example, photos appearing inside an image could be recursively described as such, allowing us to separate the images containing an actual movie star from those containing a poster depicting that movie star. As another example, having a notion of classes and subclasses would allow us to search for BASS FISH and receive only pictures of bass fish, and not pictures of the instrument or of other fish.
• Existing technology such as XPath allows computer programs to do efficient and flexible searches on XML documents (a sketch follows below). This means that fairly complex image-sorting tasks could potentially be automated. This could be very valuable, because some image-sorting tasks, such as content moderation, are reported to be very psychologically damaging when performed by humans.
• One particular XML-based file format is the Scalable Vector Graphics (SVG) format. If we can learn rich visual grammars, we could hope to recover artistic information from images, so that we could approximate a drawing as a collection of strokes and fills in an SVG document.
Ultimately, we might hope to learn artistic concepts such as drawing style or font, which would greatly expand the power of graphics programs.
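Returning to the XPath point above, here is a minimal sketch of structured search over a scene description, using Python's standard xml.etree.ElementTree module (which supports a subset of XPath); the file name and schema are hypothetical, following the earlier crowd example:

import xml.etree.ElementTree as ET

doc = ET.parse("scene.xml").getroot()

people = doc.findall(".//person")  # every person node, at any depth
# People depicted inside posters, rather than physically present:
in_poster = {p for poster in doc.findall(".//poster")
               for p in poster.iter("person")}
present = [p for p in people if p not in in_poster]

This is exactly the movie-star query from the first bullet: the hierarchical structure, not the object list, distinguishes a person from a poster of that person.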
Grammatical methods offer a very powerful and general way to make vision algorithms more robust: if we can integrate different statistical models in a modular fashion, especially models trained in different contexts, then our system will be more robust than any single model.
Grammatical methods are well-suited to integrating any statistical model that depends on the qualities of image objects and the relationships between them. When we can map objects and relationships between domains (for example, mapping pictures of text to actual text), this allows us to import already-trained statistical models from very diﬀerent domains.
Consider transcribing a lecture from the visual chalkboard. The system will better recover from misidentifying letters if it uses higher-level knowledge about the lecture’s language and contents. In particular, we can build grammars that integrate such tried-and-true models as the n-gram model of letters (Manning and Schütze), the n-gram model of words (Manning and Schütze), a stochastic grammar model of phrase and sentence structure (Manning and Schütze), and topic models of word choice in the subject of the lecture (Blei et al.). All of these models can be trained on large corpora of text, rather than on smaller datasets of images.
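As a minimal sketch of one such integration, suppose we have per-position letter scores from the image and a bigram letter model trained on a text corpus (both hypothetical here); Viterbi decoding then picks the jointly most likely transcription:

import math

def decode(obs, bigram_lp, alphabet):
    # obs[i][c] = log P(image evidence at position i | letter c);
    # bigram_lp[(p, c)] = log P(next letter c | previous letter p),
    # estimated from text rather than from images.
    best = {c: obs[0][c] for c in alphabet}
    back = []
    for scores in obs[1:]:
        new, ptr = {}, {}
        for c in alphabet:
            p = max(alphabet, key=lambda q: best[q] + bigram_lp[(q, c)])
            new[c] = best[p] + bigram_lp[(p, c)] + scores[c]
            ptr[c] = p
        best = new
        back.append(ptr)
    last = max(alphabet, key=lambda c: best[c])  # best final letter
    seq = [last]
    for ptr in reversed(back):                   # follow back-pointers
        seq.append(ptr[seq[-1]])
    return "".join(reversed(seq))

A weak detection of ‘i’ after a confident “straw” is thus rescued by the language model even when the image evidence alone is ambiguous.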
Grammatical models are typically context-free, and context-freeness is fundamentally a set of independence assumptions. We argue that independence assumptions can increase the effective amount of training data available.
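To see why, consider maximum-likelihood estimation for a probabilistic context-free grammar, which reduces to counting and normalizing rule occurrences; the tree encoding and symbols below are illustrative:

from collections import Counter

def estimate(trees):
    # Relative-frequency estimation of rule probabilities from parsed
    # examples. Trees are (label, children) pairs; leaves have no children.
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        label, children = node
        if children:
            rule = (label, tuple(c[0] for c in children))
            rule_counts[rule] += 1
            lhs_counts[label] += 1
            for child in children:
                visit(child)
    for t in trees:
        visit(t)
    return {r: n / lhs_counts[r[0]] for r, n in rule_counts.items()}

# Because of context-freeness, every occurrence of a nonterminal in every
# tree is a training example for that nonterminal's rules: two copies of
# one small tree already yield six observations of WORD rules.
t = ("WORD", [("LETTER", []),
              ("WORD", [("LETTER", []),
                        ("WORD", [("LETTER", [])])])])
probs = estimate([t, t])  # WORD -> LETTER WORD: 2/3, WORD -> LETTER: 1/3

Because rule probabilities are conditioned only on the parent symbol, every node of every training tree contributes a count, so even a modest corpus of parsed scenes provides many examples per parameter.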