# Grammatical Methods in Computer Vision

A dissertation submitted to the Faculty of the Division of the Physical Sciences, The University of Chicago.

**Remark 1.4.1** (Context-freeness is an independence assumption). Grammatical methods in general are characterized by 1) a method for decomposing novel data in a hierarchical fashion, and 2) an explanation for each level of the hierarchy in terms of previously seen data. For example, in a standard CFG, we will have both a parse tree over a sentence and a set of labels for the nodes of the parse tree. If the CFG has been automatically generated (i.e., no human has assigned semantic meaning to grammar elements), then the label's meaning will be derived solely from which parts of the training data have received that label.

For generic grammatical methods, we can define context-freeness by specifying that the contents of a node N (all nodes below N in the hierarchy) are independent of the context in which N appears (all nodes not below N in the hierarchy) given the data stored at N (its label and any attributes).

**Remark 1.4.2** (Independence assumptions yield more effective data). Independence assumptions let you effectively multiply the amount of data you have. Consider the following simple problem: classify 0-1 vectors into two classes given a set of labeled examples.

Consider the two extremes. If we assume that each coordinate of the vector is independent, we get a Naive Bayes classifier, where we learn the distribution over each coordinate for each class. This can be done with a small number of training examples.

The other extreme is assuming total dependence: the vectors are arbitrary elements of a set, and the individual coordinate values do not mean anything. Then the maximum likelihood classifier (Paranoid Bayes?) is given by counting how many times a specific vector showed up in each class, and picking the class where it showed up more often.

We would have to guess on any novel vector.

If the independence assumption is valid for the data, then the Naive Bayes classifier acts like the Paranoid Bayes classifier trained on a data set that is exponentially larger.

(Specifically, for each class, we generate all possible vectors that can be made by taking the first coordinate of a random example from the class, then taking the second coordinate from an independently chosen random example from the class, and so on.)

Even when the independence assumption does not apply, context-free grammars are useful. This can be seen in the classic nonsense sentence “Colorless green ideas sleep furiously”. The supposed weakness of context-free grammars, that the context-free assumption is unrealistic, is actually a strength, because it allows us to parse and react to very unrealistic data.
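The contrast between the two classifiers can be made concrete with a minimal sketch. The Bernoulli class distributions, sample sizes, and function names below are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two classes of 8-bit vectors, where each coordinate
# is 1 with a class-specific probability (0.3 for class 0, 0.7 for class 1).
d = 8
p = {0: np.full(d, 0.3), 1: np.full(d, 0.7)}
train = {c: (rng.random((50, d)) < p[c]).astype(int) for c in (0, 1)}

def naive_bayes(x):
    # Assume coordinates are independent: learn a per-coordinate Bernoulli
    # for each class (with add-one smoothing) and pick the likelier class.
    score = {}
    for c, X in train.items():
        q = (X.sum(axis=0) + 1) / (len(X) + 2)
        score[c] = np.sum(x * np.log(q) + (1 - x) * np.log(1 - q))
    return max(score, key=score.get)

def paranoid_bayes(x):
    # Assume total dependence: just count exact occurrences of x in each
    # class. On a never-before-seen vector we can only guess (return None).
    counts = {c: int((X == x).all(axis=1).sum()) for c, X in train.items()}
    if max(counts.values()) == 0:
        return None
    return max(counts, key=counts.get)
```

With 8 coordinates there are only 256 possible vectors, so Paranoid Bayes is not hopeless here; but as the dimension grows, it needs exponentially more data before novel vectors become rare, while Naive Bayes does not.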

** 1.5 Grammatical Vision is Difficult**

Little concrete progress has been made with visual grammars. The general frameworks have few practical results, and people with practical results don’t seem to have a general framework.

We are also hindered because the closest equivalent to linguistics is the cognitive science of vision, which is not as well-stocked with intermediate hypotheses and sanity checks. For example, in linguistics, utterances can be decomposed into phonemes. There is no agreement as to how a visual signal can be decomposed into canonical parts.

** 1.6 Grammar induction is difficult**

Grammatical methods in vision are inspired by grammatical models like probabilistic context-free grammars in linguistics. In computational linguistics, the problem of learning a grammar from unlabeled examples (grammar induction) is very difficult.

There are many negative theoretical results (Lee [1996]), the most fundamental of them being Gold's Theorem:

**Theorem 1.6.1** (Gold's Theorem). Consider an algorithm A trying to learn a language L ⊆ Σ∗. A is given an infinite sequence of words w1, w2, · · · ∈ L which contains every word in L at least once. After each wi, A outputs some Li. A is said to learn L in the limit if, for some n, L = Ln = Ln+1 = Ln+2 = ....

Then, if a class of languages L contains all languages of finitely many strings, and at least one infinite language (in particular, if L is the set of context-free languages), no algorithm can learn all languages in L in the limit.

Note that this result applies to all algorithms, regardless of their complexity.
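The positive half of the theorem, that finite languages alone are learnable, is witnessed by a tiny learner: conjecture exactly the set of words seen so far. This is a standard illustration, not from the text; on a finite language its conjectures stabilize, while on an infinite language it changes its mind infinitely often, and Gold's argument shows no algorithm can avoid this trade-off:

```python
def finite_language_learner(text):
    # After each word, conjecture that the language is exactly the set
    # of words seen so far.
    seen = set()
    for w in text:
        seen.add(w)
        yield frozenset(seen)

# On any presentation of a finite language, the conjectures stabilize once
# every word has appeared, so the learner identifies L in the limit.
guesses = list(finite_language_learner(["a", "ab", "a", "ab", "a"]))
```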

This is a fundamental theoretical barrier to grammar induction. There is some hope that we can defeat it by adopting a Bayesian approach, and only hoping to learn sufficiently simple grammars. This has been attempted several times, but no solution is known for the associated Bayesian learning problem, and a heuristic search strategy must be used (Cook [1976], Stolcke [1994], Nevill-Manning and Witten [1997]).

It is clear that the problem of learning context-free grammars applies directly to curve grammars, since part of our curve model actually is a PCFG (see Chapter 2). The learning problem might be easier if visual grammars are simpler in structure than string grammars for natural languages; we do not know whether this is the case.

** 1.7 Visual Grammars are hard for additional reasons**

The continuous nature of visual signals means that we will rarely if ever see exact agreement between two visual objects at the pixel level. Compare a well-defined linguistic object (such as a word) to a visual part such as an eye; an eye will look slightly different in every image because of natural variation and photometric effects.

Additionally, because language has well-defined symbols, in the case of written language, or agreed-upon phonemes, in the case of spoken language, it is possible in computational linguistics to check whether two letters or two words are the same. Working with visual grammars is thus analogous to trying to solve the problems of computational linguistics simultaneously with the problems of speech recognition.

Additionally, hand-labeled training data is more readily available in computational linguistics. For example, there are datasets of sentences which come with ground-truth grammatical parses. Such information is generally unavailable in vision.

Without certain knowledge of correspondence between samples, generalization is difficult. We are required to infer correspondences between samples in an unsupervised manner, which generally leads to an unsupervised learning problem, such as clustering, or to a correspondence problem.

** 1.7.2 Dealing with Scale**

Computer vision usually tries to be scale invariant, since a small object closer to the camera naturally looks like a large object further from the camera. A doubled copy of an image is hopefully semantically the same as the original image.

We thus have two problems: when an object is too small, its internal structure may be lost. Generally, fine internal structural elements first become texture (discernible en masse but not individually) and then disappear completely. The larger object may still, however, be recognizable by shape.

When an object is too large, we are faced with the task of parsing internal structure of which we were previously unaware.

These difficulties can probably be dealt with. The first demands that we come up with a simple appearance model for every object, so that we can detect it without detecting any of its subparts. Decreased recall is probably OK, as smaller things are actually more difficult to see.

The second demands that every object be detectable at large scales, even those which have no sub-parts in our grammar. We can model this with simple infinitely recursive grammars; sufficiently simple ones can hopefully model almost anything, while avoiding a bias toward either of:

• A “larger” model with more internal structure

• A “smaller” model with less internal structure

** 1.7.3 Plane grammars do not naturally admit efficient parsing**

Efficient algorithms for parsing strings with context-free grammars rely on dynamic programming, and ultimately on the fact that a string has only quadratically many contiguous substrings. The elements of visual grammars naturally live in the image plane, and thus do not have a linear order. A visual object need not even occupy a single contiguous region, in the case of occlusion. Therefore, a parsing algorithm might in principle have to consider any subset of the image elements as a node in the parse, leading to an exponential runtime.
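The gap between the string case and the plane case can be made concrete by counting candidate parse nodes:

```python
def num_substrings(n):
    # Contiguous substrings of a length-n string: one per (start, end)
    # pair with start <= end, i.e. n * (n + 1) / 2 of them.
    return n * (n + 1) // 2

def num_subsets(n):
    # Arbitrary nonempty subsets of n image elements: 2^n - 1 of them.
    return 2 ** n - 1
```

For n = 40, a string parser need only consider 820 contiguous substrings, while a plane parser that allowed arbitrary subsets would face about 10^12 candidates.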

** 1.8 Contributions**

Our first contribution is the formulation of a class of shape grammars, a probabilistic model of curve formation that allows for both continuous variation and structural variation. We describe shape grammars in Chapter 2 and derive an EM-based training algorithm for shape grammars in Chapter 3. We demonstrate the effectiveness of shape grammars for modeling human silhouettes, and also demonstrate their effectiveness in classifying curves by shape.

Our second contribution is a general method for heuristically speeding up a large class of dynamic programming algorithms, given in Chapter 4. We provide a general framework for discussing coarse-to-fine search strategies, and provide proofs of correctness. Our method can also be used with inadmissible heuristics.

Our third contribution is local inference, a method for correcting parses with unintended reuse, given in Chapter 4. We define an energy minimization problem over low-level assertions that allows us to detect shapes in cluttered images.

Our fourth contribution is an algorithm for doing approximate context-free parsing of long strings in linear time, given in Chapter 6. We define a notion of approximate parsing in terms of restricted families of decompositions, and construct small families which can approximate arbitrary parses.

Chapters 2, 3, and 6 are based on joint work with my advisor, Pedro Felzenszwalb.

We are building probabilistic models of shape. For now, we consider only models of shape that deal with the location in the plane of a fixed number of (ordered) landmarks z1, ..., zn, which in our case will define a curve outlining some shape. Some existing approaches to this problem are: Markov models (Basri et al. [1998], Grenander et al. [1991]), the Procrustes distance and its probabilistic counterpart, the Watson distribution (Dryden and Mardia [1998]), and active shape models (Cootes et al. [1995]).

We would like to show that our models do as well as or better than these in explaining the geometric variability of various shape classes. The easiest way to compare two generative probabilistic models is to examine samples from them, and subjectively assess their similarity to real data.

This is inherently a very qualitative evaluation.

Our main goal in building shape models is to allow us to do inference on new curves. We can calculate the likelihood that a shape class generated a particular curve. We start by considering models defined by perturbations of a single input curve.

When comparing to other probabilistic models, we can evaluate them in a more quantitative way using the (negative) log-likelihood, which is defined as

H(X, q) = −(1/n) Σi log q(xi).

If X = {x1, ..., xn} is unseen data, and a large enough sample to be statistically useful, then H(X, q) is a good measure of how well a model explains unseen data. Smaller values of H(X, q) suggest that q is a better model. One problematic aspect of the log-likelihood is that it is only meaningful if q is a correctly normalized probability distribution. Often it is easier to compute q̃(x) = Z · q(x) for some unknown but fixed constant Z. Due to the unsatisfactory nature of the competing models we consider, we defer quantitative measurement to Chapter 3.

Figure 2.1: The problem of drift makes a Markov model unappealing. Random samples from this model are too similar locally and too dissimilar globally. These shapes were generated by changing each length l by a multiplicative factor of 1 + N(0, σ), and changing each angle θ by adding π · N(0, σ). Here σ is the value listed underneath the shape.

Figure 2.2: The shape on the left is the original, and the other curves have been produced by sampling from the hierarchical curve models of Felzenszwalb and Schwartz [2007]. The model produces curves which have more perceptual similarity than the Markov model.
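Computing H(X, q) is straightforward once q is a normalized density. The following sketch compares two toy Gaussian models on held-out data; the Gaussians and function names are illustrative assumptions, not models from this thesis:

```python
import numpy as np

def neg_log_likelihood(X, q):
    # H(X, q) = -(1/n) * sum_i log q(x_i); smaller values mean that q
    # explains the held-out sample X better.
    return -np.mean([np.log(q(x)) for x in X])

def gaussian_pdf(mu, sigma):
    # A correctly normalized density, as H(X, q) requires.
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)                       # held-out data from N(0, 1)
h_good = neg_log_likelihood(X, gaussian_pdf(0, 1))  # the true model
h_bad = neg_log_likelihood(X, gaussian_pdf(5, 1))   # a badly misplaced model
```

As expected, h_good comes out smaller than h_bad; note that the comparison would be meaningless if either density were unnormalized, which is exactly the problem with q̃(x) = Z · q(x).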

The most straightforward model for defining a distribution over curve shapes is a Markov-type model. A curve is a sequence of line segments ℓ1, ..., ℓk. We can describe a curve template by giving the length li of each ℓi, and the angle θi between ℓi and ℓi+1. If we perturb each li and θi slightly, we get a curve which differs from the original. Some samples from this are shown in Figure 2.1.
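The perturbation just described can be sketched in a few lines. This is a minimal illustration, not the thesis code; it uses a turning-angle parameterization (each angle is the turn applied before its segment), which differs slightly from the between-segment angles in the text:

```python
import numpy as np

def sample_markov_curve(lengths, turns, sigma, rng):
    """Sample a perturbed curve from the Markov-type model: each segment
    length is scaled by 1 + N(0, sigma) and each turning angle is offset
    by pi * N(0, sigma), then the segments are chained into points."""
    l = np.asarray(lengths, float) * (1 + sigma * rng.standard_normal(len(lengths)))
    t = np.asarray(turns, float) + np.pi * sigma * rng.standard_normal(len(turns))
    pts, heading = [np.zeros(2)], 0.0
    for li, ti in zip(l, t):
        heading += ti  # angle errors accumulate: this is the source of drift
        pts.append(pts[-1] + li * np.array([np.cos(heading), np.sin(heading)]))
    return np.vstack(pts)

rng = np.random.default_rng(0)
straight = sample_markov_curve(np.ones(30), np.zeros(30), 0.0, rng)  # the template itself
wiggly = sample_markov_curve(np.ones(30), np.zeros(30), 0.2, rng)    # a sigma = 0.2 sample
```

Because the heading is a running sum of perturbed angles, the endpoint of a long sample wanders far from the template's endpoint, which is the drift phenomenon of Figure 2.1.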

The major weakness of the Markov model is drift: the small perturbations accumulate, and the overall shape of the curve will vary greatly. A straight line has some probability of curling into a tight spiral. Consider a shape like a hand: a hand has fingers that protrude. This means that there are two points (namely the two points where a finger meets the rest of the hand) that are far apart along the curve but that we always expect to be very close together in the plane. A Markov perturbation of the shape is likely to pull these points far apart.

To defeat this problem, hierarchical curve models were introduced in Felzenszwalb and Schwartz [2007]. There, the following curve model is given:

• Given a model curve C, decompose C hierarchically by repeatedly cutting it in half, in a balanced but otherwise arbitrary fashion.

• Suppose our first decomposition is C = DE. We perturb C into C′ by first perturbing the midpoint of C slightly. We then rotate and scale the curves D and E to align the end of D and the beginning of E with the perturbed midpoint.

• We then recursively apply the same process to each subcurve D and E.

It is easy to see that this defeats the problem of drift, because the overall shape of the curve is determined in a constant number of perturbations of the template curve. If our perturbations are smaller for larger curves, then we will leave the overall shape very similar while allowing significant local variation. Some samples from this model are shown in Figure 2.2.
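The recursive scheme above can be sketched as follows. This is an illustrative reconstruction under stated assumptions (a polyline input, midpoint noise scaled by the endpoint distance, and a similarity transform for the alignment step), not the model of Felzenszwalb and Schwartz [2007] itself:

```python
import numpy as np

def perturb_hierarchical(curve, sigma, rng):
    """Recursively perturb a polyline (n x 2 array of points): jitter the
    midpoint, rigidly align each half to the perturbed endpoints, recurse."""
    curve = np.asarray(curve, float)
    n = len(curve)
    if n <= 2:
        return curve.copy()
    m = n // 2
    # Scale the midpoint noise by the endpoint distance, so that larger
    # subcurves receive relatively smaller perturbations.
    span = np.linalg.norm(curve[-1] - curve[0])
    new_mid = curve[m] + sigma * span * rng.standard_normal(2)
    left = align(perturb_hierarchical(curve[:m + 1], sigma, rng), curve[0], new_mid)
    right = align(perturb_hierarchical(curve[m:], sigma, rng), new_mid, curve[-1])
    return np.vstack([left[:-1], right])

def align(c, p, q):
    # Similarity transform (rotate + scale + translate) taking the endpoints
    # of polyline c onto p and q; assumes c's endpoints are distinct.
    a, b = c[0], c[-1]
    u, v = b - a, np.asarray(q) - np.asarray(p)
    # Complex division/multiplication implements planar rotation + scaling.
    z = (v[0] + 1j * v[1]) / (u[0] + 1j * u[1])
    return (c - a) @ np.array([[z.real, z.imag], [-z.imag, z.real]]) + p
```

With sigma = 0 the procedure reproduces the template exactly, and for any sigma the curve's endpoints stay fixed: the global shape is set by the first few midpoint choices, which is why drift does not occur.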