# Grammatical Methods in Computer Vision

A dissertation submitted to the Faculty of the Division of the Physical Sciences, The University of Chicago.

Finding the optimal such grammar is a hard, unsolved problem even for standard CFGs, as discussed in Section 1.6.1. In particular, we do not know of any optimal way to search over grammatical structures, so we propose to start from the initial grammars described in Section 2.7 and proceed by greedily applying transformations such as replacement (discussed in Section 5.4), merging (discussed in Section 5.5), and rule deletion. We could also try a slightly more flexible search algorithm, such as the beam search used in Stolcke [1994].

In light of Stolcke [1994], we should choose an initial structure and a set of transformations such that any grammatical structure can be reached from the initial one.

To assess such a search strategy, a simple task would be to build a shape grammar by hand, take samples from it, and try to recover the grammar from the samples. It is unlikely that we will recover exactly the same grammar, so we would instead try to show that we can improve some measure of closeness, for instance the cross-entropy on unseen samples.

for some binary encoding function enc(·) : {G} → {0, 1}*. enc must be uniquely decodable, i.e., enc(N, R, S) = enc(N′, R′, S′) =⇒ N = N′, R = R′, S = S′.

The encoding function enc should be as concise as possible, so that simple grammars are rewarded for their simplicity. We choose to represent N, R, and S as

where the iX denote some consistent numbering of the nonterminals. We assume that every nonterminal X ∈ N has the rule [X → ] ∈ R(X), and thus do not explicitly encode this information.

Each component of this representation can be encoded in binary:

• |N| is represented as a standard binary integer, taking up kN = ⌈log2 |N|⌉ bits.

• Each (iX, iY, iZ ) is represented as three binary integers, each zero-padded to take up exactly kN bits.

• Each (iY, iZ ) is represented as two binary integers, each zero-padded to take up exactly kN bits.

We thus use kN(1 + 3|R| + 2|S|) bits. We also need 2⌈log2 kN⌉ + 2 bits to specify how to break the stream of bits into “words” of kN bits. (We simply need to encode kN itself, and to be able to specify where the encoding of kN stops and the rest of the encoding begins. One relatively simple way to do this is to encode kN in binary, prefacing every digit with a 1, and then use 00 to denote the end of kN.) Thus our total description length is

|enc(G)| = kN(1 + 3|R| + 2|S|) + 2⌈log2 kN⌉ + 2 bits.
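As a sanity check, this description length can be tallied directly. The sketch below is ours, not the thesis's; it uses ⌈log2 ·⌉ for the integer bit counts:

```python
from math import ceil, log2

def description_length(num_nonterminals, num_rules, num_midpoints):
    """Total bits to encode a grammar under the scheme above.

    kN bits index one nonterminal; |N| itself costs kN bits, each binary
    rule (iX, iY, iZ) costs 3*kN bits, each pair (iY, iZ) costs 2*kN bits,
    and kN is sent with a self-delimiting prefix of 2*ceil(log2 kN) + 2 bits.
    """
    kN = max(1, ceil(log2(num_nonterminals)))
    payload = kN * (1 + 3 * num_rules + 2 * num_midpoints)
    header = 2 * ceil(log2(kN)) + 2
    return payload + header
```

For example, a grammar with 8 nonterminals, 10 binary rules, and 5 pairs costs 3 · (1 + 30 + 10) + 6 = 129 bits.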

Given a grammar G, we wish to create a grammar that still explains our training data but is simpler. We would also like to create grammars that have reusable parts, as explained in Bernstein and Amit [2005]. This can be done using the operation of replacing Y with X: whenever Y is on the right-hand side of a rule, we replace it with X. Whenever Y is on the left-hand side of a rule, we delete that rule. This may lead to a merging of two rules [Z1 → X Z2], [Z1 → Y Z2], in which case we combine the multinomial probabilities and average the midpoint distributions.
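The replacement operation can be sketched concretely. The rule representation below (a dict from (left-hand side, right-hand side) pairs to probabilities) is a hypothetical simplification of ours; in particular, the averaging of midpoint distributions is omitted:

```python
def replace(rules, Y, X):
    """Replace nonterminal Y with X.

    Substitutes Y by X on right-hand sides, drops rules whose
    left-hand side is Y, and merges duplicate rules by combining
    their multinomial probabilities.
    """
    merged = {}
    for (lhs, rhs), p in rules.items():
        if lhs == Y:
            continue  # rules with Y on the left are deleted
        new_rhs = tuple(X if sym == Y else sym for sym in rhs)
        key = (lhs, new_rhs)
        merged[key] = merged.get(key, 0.0) + p  # combine probabilities
    return merged
```

For instance, [Z1 → X Z2] with probability 0.4 and [Z1 → Y Z2] with probability 0.2 merge into a single rule [Z1 → X Z2] with probability 0.6.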

Why does this operation create reusable parts? Ideally, before the replacement, Y and X are nonterminals that are used to parse similar curves (i.e., they have similar yields), but which ﬁt diﬀerently into the larger grammar. After the replacement, these curves will be parsed only with X, and X will ﬁt into the larger grammar in two diﬀerent ways. X has thus become a reusable part.

Note that replacement is asymmetric. This could be an advantage: since approximate similarity is not transitive, we would like to replace nonterminals X1, ..., Xk with whichever Xi is most representative.

This transformation should be applied when the nonterminals X and Y have similar internal structure. Intuitively, this means that any curve generated by X can be generated by Y with a similar probability, and vice versa.

Suppose that we have two nonterminals X and Y, and we want to decide whether we should replace Y with X. We want a simple model that still explains our data, which we formalize as maximizing the posterior probability of the chosen grammar.

Performing the replacement will simplify the grammar, which will be reﬂected in the MDL prior over structure. In order to make sure that we are still explaining the data, we need to consider the ratio of the likelihood under the current grammar to the likelihood under the grammar after performing the replacement.

Consider the quantity

log [P(C | G′) / P(C | G)] + |enc(G)| − |enc(G′)|,

where G′ is the grammar after the replacement.

The last two terms correspond to the reward for simplifying the grammar. The first term measures how much better or worse the new grammar is at explaining the data.

We can make the simplifying assumption or approximation (used in Stolcke [1994]) that the distribution T′ ∼ T | C, G′ is arrived at by sampling T ∼ T | C, G and performing the operation Replace(Y, X) on each parse tree; we then have

We can also arrive at a similar but slightly diﬀerent formula by trying to empirically estimate the KL divergence between the inside distributions of X and Y. The inside distribution of X is the distribution over curves that comes from sampling in the standard way. We call this distribution the inside distribution because its probabilities are calculated in the ﬁrst pass of the Inside-Outside Algorithm, as explained in Section 2.6.

The Kullback-Leibler divergence is a popular choice for measuring the distance between two distributions P and Q. It is defined as

KL(P ‖ Q) = Σ_x P(x) log [P(x) / Q(x)].

Let X be a nonterminal. For every subcurve C of any training curve, we can use reweighting to consider C as a sample from X’s inside distribution. In this case, P, Q are the inside distributions of X and Y, respectively. This approximation is especially appealing because the Inside-Outside algorithm gives us a very efficient way to simultaneously compute PX(Ca[i : j]) for every nonterminal X and every subcurve Ca[i : j] of every training curve Ca. Specifically, we can compute all these in O(kn³|N|) time, where k is the number of training curves, and n is the maximum length of a training curve.
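A sketch of this empirical estimate, assuming the inside probabilities have already been tabulated (here as hypothetical dicts mapping subcurve keys to probabilities):

```python
from math import log

def empirical_kl(inside_X, inside_Y):
    """Estimate KL(P_X || P_Y) from inside probabilities on training subcurves.

    Each subcurve C is treated, with weight proportional to P_X(C), as a
    reweighted sample from X's inside distribution.
    """
    total = sum(inside_X.values())
    kl = 0.0
    for c, px in inside_X.items():
        if px == 0.0:
            continue
        py = inside_Y.get(c, 1e-300)  # guard against zero probability under Y
        kl += (px / total) * log(px / py)
    return kl
```

The estimate is zero when the two inside distributions agree on every subcurve, and positive when they diverge.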

When dealing with the initial grammar, this approximation should be reasonable, since our initial grammars don’t generate any curves that are too dissimilar from our training curves. We hope that generalization will take place as the grammar is simpliﬁed; thus, the approximation will get worse as we simplify the grammar.

Given training data C1, ..., Cn, we can evaluate any transformation that takes G to G′ by examining the posterior

which is computable via parsing. This gives a general search strategy: simply pick the transformation that increases the value of the posterior the most.

We would like to show that grammars can capture an exponential amount of variability through factoring, as discussed in Section 5.1.

The easiest task would be to correctly apply merging (Section 5.5) to the n-armed shapes of Section 5.1, so that we can generate all of the shapes in the class after having seen a small number of them. We can make this easier by having diﬀerent arm variants in each of the n arm slots. We can also start on an easier version by training on bracketed samples.

A harder merging task is to use merging to get accurate models of the Romer and Weizmann Horse datasets.

**5.5.2 Creating Choice with Merging**

We want to create grammars that have choice, as described in Section 5.1.1. This can be done by the operation of merging X and Y: we replace every instance of Y on either the left- or right-hand side of a rule with X. This may lead to a merging of two rules [X → Z1 Z2], [Y → Z1 Z2] (and some other cases). When this happens, we combine the multinomial probabilities and average the midpoint distributions of the rules.

As a heuristic to guide merging, we plan to apply the approach of the last section, but to use the outside probability instead of the inside probability. If two nonterminals X and Y have similar outside probabilities on every subcurve, then X and Y can be said to have similar outer structure, since the grammar is placing them in similar contexts. Intuitively, this means that X and Y are interchangeable, so that for any curve in which X is used to explain a subcurve, there is another equally likely curve where that subcurve has been replaced with a subcurve that can be explained by Y.

**5.6 How Do We Identify Natural Constituents?**

**5.6.1 Goal: Learning Constituents**

The EM algorithm with a sparsifying prior will hopefully be able to recover grammatical structure. Therefore, if we start from a grammar with choice (see Section 3.7), we hope that doing EM iterations will give us a grammar with fewer choices that correspond to natural decompositions of shapes. Note that this method can only work with many training examples, and cannot be used to identify natural decompositions of a single shape.

Easy tasks for this would be trying to ﬁnd constituents in stars and polygons, and the n-armed shape. We could also try to get intuitively reasonable constituents for silhouettes of people and horses.

In Basri et al. [1998], Richards et al. [1985], Kimia et al. [2003], it is argued that some shapes have natural constituents, and some shapes do not. Some cues that we believe would give us some clues to structure:

• Large-scale curvature. If we smooth the points of a curve with a Gaussian, then we can measure curvature at diﬀerent levels. We feel that large-scale curvature is likely to occur only at constituent boundaries.
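This cue can be sketched as follows, assuming a closed curve given as a list of 2D points. The smoothing is a circular Gaussian convolution and curvature is measured by discrete turning angles; both helpers are our own illustration, not the thesis's implementation:

```python
from math import atan2, exp, pi

def smooth_closed(points, sigma):
    """Gaussian-smooth a closed curve (list of (x, y)) by circular
    convolution with a truncated Gaussian kernel; larger sigma
    corresponds to larger-scale structure."""
    n = len(points)
    radius = max(1, int(3 * sigma))
    ks = range(-radius, radius + 1)
    weights = [exp(-(k * k) / (2 * sigma * sigma)) for k in ks]
    total = sum(weights)
    out = []
    for i in range(n):
        x = sum(w * points[(i + k) % n][0] for k, w in zip(ks, weights))
        y = sum(w * points[(i + k) % n][1] for k, w in zip(ks, weights))
        out.append((x / total, y / total))
    return out

def turning_angles(points):
    """Discrete curvature: the exterior (turning) angle at each vertex
    of a closed polygonal curve."""
    n = len(points)
    angles = []
    for i in range(n):
        x0, y0 = points[i - 1]
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        d = atan2(y2 - y1, x2 - x1) - atan2(y1 - y0, x1 - x0)
        while d > pi:
            d -= 2 * pi
        while d < -pi:
            d += 2 * pi
        angles.append(d)
    return angles
```

Points where the turning angle of the smoothed curve remains large would then be candidate constituent boundaries.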

We use explicit correspondences to learn the statistically best set of constituents when building a grammar from a single example. (This experiment is closely related to an experiment in Felzenszwalb [2003].) Specifically, since the points on each curve from the hand-annotated Romer dataset correspond semantically, we can ask which decomposition of the curves yields the model which gives the highest log-likelihood to the curves seen. This can be done with a simple dynamic program:
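One natural formulation of such a dynamic program is sketched below. Here `seg_cost(i, j)` is a hypothetical stand-in for the negative log-likelihood of the section between corresponding points i and j; in the actual experiment this cost would come from fitted Watson distributions:

```python
def best_decomposition(n, seg_cost, penalty=0.0):
    """Choose breakpoints 0 = b0 < b1 < ... < bm = n minimizing the sum
    of seg_cost(b_{k-1}, b_k), plus an optional per-constituent penalty
    to discourage trivially fine decompositions.

    Returns (total cost, list of breakpoints)."""
    best = [0.0] + [float("inf")] * n  # best[j]: cheapest decomposition of [0, j)
    back = [0] * (n + 1)               # back[j]: previous breakpoint
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + seg_cost(i, j) + penalty
            if c < best[j]:
                best[j], back[j] = c, i
    cuts = [n]
    while cuts[-1] != 0:
        cuts.append(back[cuts[-1]])
    return best[n], cuts[::-1]
```

With a convex cost like (j − i)² the program prefers many short constituents; a per-constituent penalty pushes it back toward fewer, longer ones.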

Minimizing the sum of negative log-likelihoods of the Watson distribution is simply maximum likelihood estimation for the Watson distribution, which is explained in Subsection 3.1.5.

The constituents selected by the algorithm are shown in Figure 5.5. The constituents that seemed most intuitive to us are shown in Figure 5.6.

**Figure 5.5: Finding optimal constituents.**

This figure was generated by the experiment experiments/6.structure/constituents.

**Figure 5.6: Hand-picked constituents.**

## CHAPTER 6

## APPROXIMATE PARSING OF ONE-DIMENSIONAL SIGNALS

Recall that the problem of parsing curves with a shape grammar is closely related to the problem of parsing strings with a weighted context-free grammar. The running time of context-free parsing with a standard algorithm like CKY is cubic in the length of the string, and it is not known how to do substantially better in practice. This makes parsing prohibitively slow for long strings, and prevents the use of context-free grammars in modeling long one-dimensional signals, which are instead commonly modeled with Markov models.

The main contribution of this chapter is to introduce an approach for eﬃcient context-free parsing of very long strings. This would allow fast curve parsing with our shape grammars, and also the use of context-free models for other problems that involve long one-dimensional signals.

**6.1 CKY Parsing**

We now recall the CKY algorithm for parsing strings with a weighted context-free grammar in Chomsky Normal Form, which is shown in Algorithm 7. Let s be a string of length n.

Let (X, R) be a weighted context-free grammar, with X a set of symbols and R a set of weighted rules. Let w(X → λ) denote the weight of the rule (X → λ) ∈ R.

Let [i, j) be the interval starting with i and ending with j − 1. The algorithm works by computing a cost COST [X, [i, j)] for each symbol X and each interval [i, j), corresponding to the cost of parsing the substring s[i]... s[j − 1] with the nonterminal X.

The O(n³) runtime of the CKY algorithm arises because there are O(n²) substrings, and many substrings can be decomposed in O(n) ways. If we could limit the number of constituents and decompositions, we could speed up the algorithm. This brings us to the next section, where we define a framework for considering restricted families of constituents and decompositions.

**Algorithm 7** CKY parsing. Returns cost of best parse.

```
Input: string s, context-free grammar (X, R)
for X ∈ X do
    for i = 0 to n − 1 do
        COST[X, [i, i + 1)] ← w(X → s[i])
    end for
end for
for ℓ = 2 to n do
    for i = 0 to n − ℓ do
        for X ∈ X do
            COST[X, [i, i + ℓ)] ← ∞
        end for
        for j = i + 1 to i + ℓ − 1 do
            for (X → Y Z) ∈ R do
                COST[X, [i, i + ℓ)] ← min{COST[X, [i, i + ℓ)],
                    w(X → Y Z) + COST[Y, [i, j)] + COST[Z, [j, i + ℓ)]}
            end for
        end for
    end for
end for
return COST[S, [0, n)]
```
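A runnable sketch of weighted CKY as described above; the dictionary-based cost table and the rule representation are our choices, not the thesis's:

```python
def cky(s, lexical, binary, start):
    """Weighted CKY over string s.

    lexical: dict (X, terminal) -> weight, for rules X -> terminal
    binary:  list of (X, Y, Z, weight), for rules X -> Y Z
    Returns the minimum total weight of a parse of s rooted at `start`.
    """
    INF = float("inf")
    n = len(s)
    symbols = set()
    for (X, _t) in lexical:
        symbols.add(X)
    for (X, Y, Z, _w) in binary:
        symbols.update((X, Y, Z))
    cost = {}  # (X, i, j) -> best cost of parsing s[i:j] with X
    for i, a in enumerate(s):
        for X in symbols:
            cost[(X, i, i + 1)] = lexical.get((X, a), INF)
    for length in range(2, n + 1):          # interval length ℓ
        for i in range(n - length + 1):
            j = i + length
            for X in symbols:
                cost[(X, i, j)] = INF
            for k in range(i + 1, j):        # split point
                for (X, Y, Z, w) in binary:
                    c = w + cost[(Y, i, k)] + cost[(Z, k, j)]
                    if c < cost[(X, i, j)]:
                        cost[(X, i, j)] = c
    return cost[(start, 0, n)]
```

For example, with a toy grammar S → S S (weight 1) and S → 'a' (weight 0), parsing "aaa" costs 2: one binary rule per internal node.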

Definition 6.2.1. Let s be a string of length n. A decomposition family for s is a pair (I, D), where I is a set of intervals [i, j), and D is a set of decompositions of the form

[i, j) → [i, k) + [k, j),

where [i, j), [i, k), [k, j) ∈ I. I must contain the interval [0, n).

The most trivial decomposition family is the set of all intervals [i, j) such that 0 ≤ i < j ≤ n, together with all possible decompositions of each [i, j) into [i, k) + [k, j). We will call this the complete decomposition family.
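For concreteness, the complete decomposition family can be enumerated directly; |I| = n(n + 1)/2 while |D| grows cubically in n, which is exactly the source of CKY's O(n³) cost:

```python
def complete_family(n):
    """The complete decomposition family for a string of length n:
    all intervals [i, j) with 0 <= i < j <= n, and every split
    [i, j) -> [i, k) + [k, j)."""
    I = [(i, j) for i in range(n) for j in range(i + 1, n + 1)]
    D = [((i, j), (i, k), (k, j)) for (i, j) in I for k in range(i + 1, j)]
    return I, D
```

Restricted families would keep the required interval [0, n) but prune most of I and D.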

We would like small decomposition families that can approximate arbitrary parses reasonably well. We capture this notion with θ-flexibility:

**Figure 6.1: With a θ-flexible decomposition family, any parse can be adjusted to an allowable parse by moving the midpoint of each binary rule slightly.**

On the left, a parse before adjustment. On the right, after the adjustment. Vertical lines denote allowable midpoint choices.