# Grammatical Methods in Computer Vision

*A dissertation submitted to the Faculty of the Division of the Physical Sciences, The University of Chicago*

The Watson distribution can be thought of as a probabilistic version of the Procrustes distance, which, for shapes represented as centered, unit-norm complex vectors $z$ and $w$, can be deﬁned as

$$d_P(z, w)^2 = 1 - |\langle z, w \rangle|^2.$$

We have also experimented with the nonparametric distribution given by the Parzen windows method (Parzen [1962]). We want our grammar to be invariant to simultaneous translation, scale, and rotation of the points p, q, r. Therefore, we translate, scale, and rotate $\mathbb{R}^2$ with a map $\varphi$ such that $\varphi(p) = (0, 0)$ and $\varphi(r) = (1, 0)$, and represent q via the coordinates $\tilde{q} = \varphi(q)$.
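Concretely, if we identify $\mathbb{R}^2$ with the complex plane, the map $\varphi$ reduces to a single complex division. A minimal sketch (the function name `bookstein` is ours, chosen because these are Bookstein-style coordinates):

```python
def bookstein(p, q, r):
    """Map the plane so that p -> (0,0) and r -> (1,0), and return the
    image of q, working with points as complex numbers.  This quotients
    out translation, rotation, and scale simultaneously."""
    p, q, r = complex(*p), complex(*q), complex(*r)
    if r == p:
        raise ValueError("p and r must be distinct")
    return (q - p) / (r - p)

# The representation is invariant to similarity transforms: a translated,
# scaled copy of the triangle yields the same coordinate for q.
z1 = bookstein((0, 0), (1, 2), (2, 0))
z2 = bookstein((5, 5), (7, 9), (9, 5))
```

Here `z1` and `z2` agree because the second triangle is a translated and scaled copy of the first.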

In the context of kernel density estimators, the parameter h is called the bandwidth.

It speciﬁes how much smoothing to apply to the density estimation. Selecting a suitable bandwidth is often problematic.
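A minimal one-dimensional sketch of a Parzen-window estimate with a Gaussian kernel (the function name and data are illustrative) shows the role of $h$:

```python
import math

def parzen_density(x, samples, h):
    """Parzen-window (kernel) density estimate with a Gaussian kernel.
    h is the bandwidth: it controls how much the samples are smoothed."""
    n = len(samples)
    k = sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
    return k / (n * h * math.sqrt(2 * math.pi))

data = [0.0, 0.1, -0.1, 2.0]
# A small bandwidth keeps the clusters separate, so the density is tiny
# between them...
low = parzen_density(1.0, data, 0.1)
# ...while a large bandwidth smooths them into one broad bump.
high = parzen_density(1.0, data, 2.0)
```

Choosing `h` too small yields a spiky estimate that memorizes the samples; too large, and genuine structure is smoothed away.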

It is worth noting that this nonparametric distribution is just a mixture of Gaussians, and thus there is no solid distinction between a mixture of multiple copies of the rule X → Y Z with Gaussian midpoint distributions, and a single copy with a mixture of Gaussians midpoint distribution. We will prefer the second, since the resulting grammar has a simpler structure, as discussed in Chapter 5.
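Sampling a midpoint under such a mixture-of-Gaussians distribution just means picking a component by weight and then sampling from its Gaussian. A sketch with an illustrative two-dimensional midpoint representation (isotropic components; the function name is ours):

```python
import random

def sample_mixture_midpoint(components):
    """Sample a midpoint from a mixture of isotropic 2-D Gaussians.
    components: list of (weight, (mx, my), sigma), weights summing to 1.
    A rule with this midpoint distribution is equivalent to several
    copies of the rule, each with a single Gaussian midpoint."""
    r = random.random()
    acc = 0.0
    for w, (mx, my), sigma in components:
        acc += w
        if r <= acc:
            return (random.gauss(mx, sigma), random.gauss(my, sigma))
    # Guard against floating-point shortfall in the cumulative weights.
    w, (mx, my), sigma = components[-1]
    return (random.gauss(mx, sigma), random.gauss(my, sigma))

pt = sample_mixture_midpoint([(0.7, (0.0, 0.0), 0.05),
                              (0.3, (1.0, 0.0), 0.05)])
```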

**2.9 Dealing with Variation in Length**

As mentioned before, our grammar models longer curves with rules of the form X → XX, and shorter curves with rules of the form [X → ℓ], which replace a nonterminal with a single line segment. When building our initial models, it is very important to assign reasonable probabilities to these rules, because we want to use these rules in parsing, but prevent pathological parses. For example, we don't want to parse an input curve in such a way that many high-level nonterminals X use the rule [X → ℓ] while some low-level nonterminal Y uses the rule Y → Y Y many times. Since the balance between diﬀerent rules is governed by a multinomial distribution, we need a reasonable prior on multinomial distributions that prevents pathological parses.

Let X be a nonterminal. When there is no prior reason to believe that any $\rho_X([X \to \lambda])$ is larger than the others, it is natural to use a Dirichlet prior with each $\alpha_i$ the same. For shape grammars, this is not entirely appropriate: since rules of the form [X → Y Z] and [X → ℓ] serve diﬀerent roles in the grammar, we expect their probabilities to take on diﬀerent values.

In particular, if the nonterminal X is meant to model long curves, it is unlikely that a long curve will have only two sample points, especially if we are parsing a curve that has many sample points. Therefore, we would expect the probability of [X → ℓ] to be low.

Let C be a curve. We deﬁne the scale of a subcurve C′ of C to be s(C′) = |C′|/|C|. We can then specify an ideal scale $s_X$ for the nonterminal X. Let k = |R(X)|. Our prior distribution on $\rho_X$ is then $\mathrm{Dir}(\alpha_1, \ldots, \alpha_k)$, where

$$\alpha_1 = k \cdot \alpha_{dir} \cdot e^{-\omega s_X}, \qquad \alpha_i = k \cdot \alpha_{dir} \cdot \frac{1 - e^{-\omega s_X}}{k - 1}, \quad 2 \le i \le k.$$

Here $\alpha_1$ corresponds to the rule [X → ℓ]. Suitable values of $\alpha_{dir}$ are probably signiﬁcantly less than one, in light of Johnson et al. [2007]. A suitable value for $\omega$ is more complicated, and depends on the values of other model parameters.
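A sketch of this prior construction, with purely illustrative values for $\alpha_{dir}$ and $\omega$:

```python
import math

def scale_prior_alphas(k, s_X, alpha_dir=0.1, omega=5.0):
    """Dirichlet parameters for the rule distribution of a nonterminal
    with ideal scale s_X.  alpha_1 belongs to the shortening rule
    [X -> ell]; the remaining mass is split evenly among the other
    k - 1 rules.  alpha_dir and omega are illustrative values only."""
    stop = k * alpha_dir * math.exp(-omega * s_X)
    other = k * alpha_dir * (1.0 - math.exp(-omega * s_X)) / (k - 1)
    return [stop] + [other] * (k - 1)

# A nonterminal meant for long subcurves (s_X near 1) puts little prior
# mass on the shortening rule; a low-level one (s_X near 0) favors it.
alphas = scale_prior_alphas(k=4, s_X=0.9)
```

Note that the total concentration $\sum_i \alpha_i = k\,\alpha_{dir}$ is independent of $s_X$; the ideal scale only shifts mass between the shortening rule and the rest.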

This prior is suitable because it produces grammars biased towards balanced parses. To see this, consider a grammar G whose parameters are set according to this prior. (This will be true of our initial grammar, as we describe in Section 2.7.) Let C be a curve of length n, and let T be a G-parse of C. T will then specify nonterminals $X_1, \ldots, X_n$ to parse each line segment of C. Thus, the likelihood $P(C, T \mid G)$ will contain factors $\rho_{X_i}([X_i \to \ell])$ for $i = 1, \ldots, n$. If $\rho_{X_i}([X_i \to \ell])$ is proportional to $e^{-\omega s_{X_i}}$, then these factors contribute $e^{-\omega \sum_i s_{X_i}}$ in total. Since the $X_i$ combine to parse C, the scales $s_{X_i}$ will hopefully sum to approximately one. In this case the probability will be highest when $\sum_i s_{X_i}$ is lowest, which occurs when all the $s_{X_i}$ are equal. The stated Dirichlet prior thus pushes G to favor balanced parse trees.

Given a grammar model and a curve, we want to calculate how likely the curve is under the model, and we would like to extract the most likely correspondence between the points of the curve and the parts of the model. We call this process parsing, because it is extremely similar to CKY parsing with standard PCFGs. We defer the technical details of parsing until a later chapter.
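Since our parsing is essentially CKY, the underlying dynamic program can be sketched on an ordinary toy PCFG in Chomsky normal form. The grammar, probabilities, and function name below are illustrative only; for curve grammars, a geometric midpoint factor would multiply into each binary rule's score.

```python
import math

# Toy PCFG in Chomsky normal form.  Each nonterminal maps to a list of
# (rhs, log-probability) pairs, where rhs is either a terminal string or
# a pair of nonterminals.
RULES = {
    "S": [(("A", "B"), math.log(1.0))],
    "A": [("a", math.log(0.7)), (("A", "A"), math.log(0.3))],
    "B": [("b", math.log(1.0))],
}

def cky_best(tokens, start="S"):
    """Viterbi CKY: best log-probability of `start` deriving `tokens`."""
    n = len(tokens)
    best = {}  # (i, j, X) -> best log-prob of X deriving tokens[i:j]
    for i, tok in enumerate(tokens):
        for X, rhss in RULES.items():
            for rhs, lp in rhss:
                if rhs == tok:
                    key = (i, i + 1, X)
                    best[key] = max(best.get(key, -math.inf), lp)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for X, rhss in RULES.items():
                for rhs, lp in rhss:
                    if not isinstance(rhs, tuple):
                        continue
                    Y, Z = rhs
                    for k in range(i + 1, j):
                        cand = (lp + best.get((i, k, Y), -math.inf)
                                   + best.get((k, j, Z), -math.inf))
                        if cand > best.get((i, j, X), -math.inf):
                            best[(i, j, X)] = cand
    return best.get((0, n, start), -math.inf)
```

Recovering the best parse tree (and hence the point correspondence) only requires storing backpointers alongside the `best` table.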

We examine parsing in this section by looking at human silhouettes from a dataset generated by Romer Rosales. We have two versions of this dataset: one where a silhouette has been automatically extracted by subtracting the background, and the curve simpliﬁed slightly by the discretization method discussed in Section 2.3; and a hand-annotated version, where we have marked by hand the position of 27 corresponding points on the body in each image.

Throughout this section, we examine the task of building a grammatical model from one curve, and using it to parse another curve. We demonstrate the accuracy of these parses by showing the correspondence between the points of the two curves induced by the most likely parse.

The easiest parsing task is to recover a one-to-one correspondence between two curves that have the same number of points, which we demonstrate in Figure 2.15. Because there are no missing or extra points, this is straightforward. The two curves are both from the hand-annotated Romer dataset. The grammar is built using a hand-picked set of constituents.

A somewhat harder task is to parse a curve that is longer than the curve on which our model is based. We wish to recover a reasonable correspondence between the points of the coarse model curve and a subset of the points of the ﬁner input curve. We do this by building a grammar model from the coarse curve and using it to parse the ﬁner curve. The grammar must have lengthening rules, but doesn't need shortening rules. This demonstrates that we can model longer curves than were used to build the grammar. This is shown in Figure 2.16, where a model built from a hand-annotated curve is used to parse a curve from the ground-truth Romer dataset. We successfully recover a very reasonable correspondence.

**Figure 2.15: Recovering a One-to-one Correspondence**

We also wish to recover a reasonable correspondence by building a grammar from a longer (i.e., ﬁner) model curve and parsing a shorter (i.e., coarser) input curve. This demonstrates that we can model shorter curves than were used to build the grammar. This task is generally harder than the prior task.

Here we build a grammar from a ground-truth Romer curve, and try to parse one of the (much shorter) hand-annotated Romer curves. We can safely assume that every point in the parsed curve has a corresponding one in the example curve, which is the reverse of the previous experiments. In order to do this successfully, the grammar needs shortening rules, but not lengthening rules.

In Figure 2.17, we show the results of parsing a shorter curve with very little geometric variation. The ﬁne details are not quite right, but overall the correspondence is reasonable.

In Figure 2.18, we show the results of parsing a shorter curve with signiﬁcant geometric variation. In this case, the correspondence found is completely incorrect.

**Figure 2.16: Parsing a Longer Curve**

**Figure 2.17: Parsing a shorter curve with very little geometric variation.**

Our models are very sensitive to the initial decomposition speciﬁed for the model curve.

The two experiments above used a balanced but otherwise arbitrary decomposition of the model curve. By using a more carefully chosen decomposition, we get much better parses, which can be seen in Figures 2.19 and 2.20. In Figure 2.19, the ﬁne details are now exactly what we would expect, in contrast to Figure 2.17. In Figure 2.20, even though we have the same signiﬁcant geometric variation as in Figure 2.18, we now ﬁnd a very good correspondence.

**Figure 2.18: Parsing a shorter curve with signiﬁcant geometric variation.**

**Figure 2.19: Parsing a shorter curve with very little geometric variation.**

**Figure 2.20: Parsing a shorter curve with signiﬁcant geometric variation.**

Thus, even on fairly straightforward parsing tasks, our models are extremely sensitive to the particular hierarchical structure that we impose on the curve. The shortening rules only allow the parser to chop oﬀ constituents; it is therefore important to have good constituents. We address this in Chapter 3 by using multiple decompositions and learning which constituents allow us to best explain our data.

## PARAMETER LEARNING FOR GRAMMATICAL MODELS

In this chapter, we describe a mathematically rigorous way to tune our model parameters using the EM algorithm. We give the details of the algorithm in Section 3.1. In the remainder of the chapter, we show samples from various grammatical models after training, in order to demonstrate that we have learned a correct and interesting model. Finally, in Section 3.8, we show that samples from standard models fail to model the dataset we are using.

**3.1 Learning Grammar Parameters with the EM Algorithm**

Let $C_1, \ldots, C_n$ be independent samples from a grammar G, for which we know the structure, but not the parameters. We would like to set the parameters of G such that the posterior $P(G \mid C_1, \ldots, C_n)$ is maximized. We can write the posterior as

$$P(G \mid C_1, \ldots, C_n) \propto P(G) \prod_{i=1}^{n} P(C_i \mid G) = P(G) \sum_{T_1, \ldots, T_n} \prod_{i=1}^{n} P(C_i, T_i \mid G),$$

where this last sum is taken over all collections of one parse tree for each curve.

Unfortunately, it is not known how to maximize this posterior, or even the likelihood.

The likelihood is known to have many local maxima in the case of context-free grammars for natural languages (Charniak [1996]). Therefore, we are forced to use approximation techniques. We use the Expectation-Maximization algorithm (Dempster et al. [1977]), which is a standard approach to ﬁnding maximum likelihood or maximum a posteriori estimates in the case where important variables are unobserved. In our case, the unobserved variables are the parse trees {Ti }.

Let C = (C1,..., Cn ), and let T be the latent variable (T1,..., Tn ).

We then iteratively ﬁnd better and better grammars $G^{(t)}$:

$$G^{(t+1)} = \arg\max_G Q^{(t)}(G), \qquad Q^{(t)}(G) = \mathbb{E}_{\mathcal{T} \sim P(\mathcal{T} \mid \mathcal{C},\, G^{(t)})}\!\left[ \log P(\mathcal{C}, \mathcal{T} \mid G) \right] + \log P(G).$$

We want to ﬁnd the grammar parameters that maximize $Q^{(t)}(G)$. We have two classes of parameters: the multinomial distributions $\rho_X$, and the midpoint distributions $\mu_{X \to YZ}$. We will consider each in turn.

A multinomial distribution is a distribution over a ﬁnite alphabet $a_1, \ldots, a_k$, where $P(a_i) = p_i$. Learning a multinomial distribution from independent samples $x_1, \ldots, x_n$ is commonly done by maximizing, under a Dirichlet prior $\mathrm{Dir}(\alpha_1, \ldots, \alpha_k)$, the posterior probability

$$P(p \mid x_1, \ldots, x_n) \propto \prod_{i=1}^{k} p_i^{c_i + \alpha_i - 1}, \qquad \hat{p}_i = \frac{c_i + \alpha_i - 1}{\sum_{j=1}^{k} (c_j + \alpha_j - 1)},$$

where $c_i$ is the number of samples $x_j$ equal to $a_i$.

Note that results still hold when the counts ci are real rather than integer. This is because the proofs rely solely on the log-likelihood having the form ci log pi. In this case, the ci are referred to as “soft counts”.
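The resulting MAP update with soft counts is essentially a one-liner. In the sketch below, the clamping at zero is our guard for the $\alpha_i < 1$ case, where the unconstrained formula can go negative and the maximizer sits on the boundary:

```python
def map_multinomial(soft_counts, alphas):
    """MAP estimate of a multinomial under a Dirichlet(alphas) prior:
    p_i proportional to max(c_i + alpha_i - 1, 0).  The counts may be
    fractional "soft counts", e.g. expected rule counts from EM."""
    raw = [max(c + a - 1.0, 0.0) for c, a in zip(soft_counts, alphas)]
    total = sum(raw)
    return [r / total for r in raw]

# With alphas < 1 the prior sparsifies: a rule with a small soft count
# is pushed all the way to zero probability.
p = map_multinomial([5.0, 0.4, 4.6], [0.5, 0.5, 0.5])
```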

When the αi are greater than one, this prior smooths the estimated distribution p towards the uniform distribution. When the αi are less than one, this prior makes the distribution sparser than the maximum likelihood estimate.

When estimating the production probabilities in a grammar, a sparsifying prior is generally recommended. One intuition for this is that we are trying to recover a relatively sparse set of rules from the set of all possible rules; in order to avoid missing a rule, we must give some weight to (and thus generate observations of) rules that will ultimately be discarded.

In Johnson et al. [2007], a Dirichlet prior with $\alpha_i = 1/1000$ was found to work well for training context-free grammars for natural languages, although values of $\alpha_i$ between 1/100 and 1/10,000 were found to work about as well.

In our case, $\rho_X([X \to \lambda])$ is a multinomial distribution, and we have soft counts

$$c_\lambda = \sum_{a, i, j} \mathrm{Count}_a\!\left( X_{ij} \to \lambda \right).$$

First, we choose priors. Let X → Y Z be a rule of G. We set $\mu_{X \to YZ}$ equal to a Watson distribution with random mode $\mu$ and random concentration $\kappa$, where

$$\mu \sim \mathrm{Watson}(\mu_0, \alpha_{mp}\,\kappa), \qquad \kappa \sim \mathrm{Gamma}(\sigma, \theta).$$

Thus, the Count values are the soft counts we use to weight the $z_{a,ijk}$. For simplicity, we will simply say that we have observations $z_1, \ldots, z_n$ with corresponding weights $c_1, \ldots, c_n$.

Let $\gamma(\kappa; \sigma, \theta) = \log \mathrm{Gamma}(\kappa; \sigma, \theta)$, where

$$\mathrm{Gamma}(\kappa; \sigma, \theta) = \frac{1}{\Gamma(\sigma)\,\theta^{\sigma}}\, \kappa^{\sigma - 1} e^{-\kappa / \theta}.$$

Examining this solution, we see that our estimate of the midpoint is similar to the maximum likelihood estimate of a Watson distribution, but is adapted to contain αmp artiﬁcial observations at the prior mode.

The posterior estimate of the concentration will be at a maximum when all $z_i$ are identical and equal to $\mu_0$. In this case, $\lambda_{max}$ will be $n + \alpha_{mp}$, and $\hat{\kappa} = \bar{\kappa}\,\frac{\sigma + n}{\sigma}$. Thus, our concentration will increase essentially linearly with the number of examples, but will never be inﬁnite.

In general, the zi and µ0 will be more spread out, causing A to have a lower dominant eigenvalue, and lowering the concentration.
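A pure-Python sketch of the computation described here, assuming (as the text indicates) that the mode estimate is the dominant eigenvector of a weighted scatter matrix containing $\alpha_{mp}$ pseudo-observations at $\mu_0$. The function names are ours, and power iteration stands in for a proper Hermitian eigensolver:

```python
def scatter_matrix(zs, weights, mu0, alpha_mp):
    """A = sum_i c_i z_i z_i^* + alpha_mp * mu0 mu0^*  (Hermitian).
    Vectors are unit-norm lists of complex numbers."""
    d = len(mu0)
    A = [[0j] * d for _ in range(d)]
    for z, c in list(zip(zs, weights)) + [(mu0, alpha_mp)]:
        for i in range(d):
            for j in range(d):
                A[i][j] += c * z[i] * z[j].conjugate()
    return A

def dominant_eigenvector(A, iters=200):
    """Power iteration; returns (dominant eigenvalue, unit eigenvector)."""
    d = len(A)
    v = [1.0 + 0j] * d
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = sum(abs(x) ** 2 for x in w) ** 0.5
        v = [x / lam for x in w]
    return lam, v

# With both observations exactly at mu0 (unit weights), the dominant
# eigenvalue is n + alpha_mp, matching the concentrated case above.
mu0 = [1.0 + 0j, 0j]
A = scatter_matrix([mu0, mu0], [1.0, 1.0], mu0, alpha_mp=0.5)
lam, v = dominant_eigenvector(A)
```

Spreading the $z_i$ away from $\mu_0$ lowers the dominant eigenvalue of A, and with it the estimated concentration, as the text notes.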

**3.1.6 Examples of Learning Watson Distributions**

In Figures 3.1 through 3.5, we demonstrate that we can learn the mode and concentration parameters of Watson distributions from training data. In each experiment, we select a random triangle (by using a Gaussian in Bookstein coordinates). We then draw 20 samples from the Watson distribution centered at this triangle (using 30 for the concentration parameter of the Watson). We then reestimate the Watson distribution from the samples. This is a less noisy version of the learning task that EM faces when reﬁtting the midpoint distributions of a grammar from 20 samples.

**Figure 3.1: Experimenting with the Watson distribution, part 1.** In the ﬁrst row, the original triangle T is on the left, and the mode of the estimated Watson distribution is on the right. Subsequent rows are samples from Watson(T, 30.00). The estimated concentration was 63.48.

We also show the results of ﬁtting a Watson distribution with diﬀering numbers of samples in Table 3.6. This demonstrates that the estimated concentration approaches the correct value as the number of samples increases.

**Figure 3.2: Experimenting with the Watson distribution, part 2.**