# THE UNIVERSITY OF CHICAGO: Grammatical Methods in Computer Vision. A dissertation submitted to the Faculty of the Division of the Physical Sciences in ...

In the first row, the original triangle T on the left, and the mode of the estimated Watson distribution on the right. Subsequent rows are samples from Watson(T, 30.00). The estimated concentration was 82.09.

**Figure 3.3: Experimenting with the Watson distribution, part 3.**

In the first row, the original triangle T on the left, and the mode of the estimated Watson distribution on the right. Subsequent rows are samples from Watson(T, 30.00). The estimated concentration was 69.92.

**Figure 3.4: Experimenting with the Watson distribution, part 4.**

In the first row, the original triangle T on the left, and the mode of the estimated Watson distribution on the right. Subsequent rows are samples from Watson(T, 30.00). The estimated concentration was 77.10.

**Figure 3.5: Experimenting with the Watson distribution, part 5.**

In the first row, the original triangle T on the left, and the mode of the estimated Watson distribution on the right. Subsequent rows are samples from Watson(T, 30.00). The estimated concentration was 79.39.

Samples    Estimated concentration
3          87.55
10         79.70
30         62.15
100        42.28
300        32.47
1000       29.89

**Figure 3.6: Fitting the Watson distribution with different numbers of samples.**
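The estimates above converge toward the true concentration of 30 as the number of samples grows. The mode-estimation step can be sketched in the 3-D case: for a Watson distribution (density proportional to exp(κ(μᵀx)²)), the maximum likelihood estimate of the mode μ is the leading eigenvector of the sample scatter matrix. A minimal pure-Python sketch, using power iteration on unit vectors in R³ rather than the thesis's actual shape-space representation of triangles; the helper names are illustrative:

```python
import math
import random

def scatter_matrix(samples):
    """Average outer product (1/n) * sum of x x^T over unit vectors x."""
    n = len(samples)
    S = [[0.0] * 3 for _ in range(3)]
    for x in samples:
        for i in range(3):
            for j in range(3):
                S[i][j] += x[i] * x[j] / n
    return S

def watson_mode(samples, iters=200):
    """Estimate the Watson mode as the leading eigenvector of the
    scatter matrix, found by power iteration."""
    S = scatter_matrix(samples)
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in w)) or 1.0
        v = [c / norm for c in w]
    return v
```

Because the Watson density depends on x only through (μᵀx)², the distribution is axial: x and -x are equally likely, which is why the scatter matrix (rather than the sample mean) is the right summary statistic.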

For each experiment in this section, we build an initial grammar from a single example curve (shown in Figure 3.7), train it on the training curves (shown in Figure 3.8), and then evaluate it by computing the cross-entropy on the validation curves (shown in Figure 3.9).
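The evaluation step can be sketched generically: cross-entropy here is the average negative log-likelihood of the held-out curves under the trained model. The `model_logprob` interface below is a hypothetical stand-in, not the thesis's actual API:

```python
import math

def cross_entropy(model_logprob, curves):
    """Average negative log-probability (in nats) of held-out curves
    under a model; lower is better."""
    return -sum(model_logprob(c) for c in curves) / len(curves)

# Toy model assigning probability 1/8 to every curve:
h = cross_entropy(lambda c: math.log(1.0 / 8.0), ["c1", "c2", "c3"])
```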

We show samples from each model in order to demonstrate what has been learned. We want models to generate reasonable-looking human silhouettes, and ideally to generate some that differ significantly from any training example. For instance, we would like the model to learn that limbs can move independently of one another.

For all but one of the experiments, we repeat the experiment with three different starting structures, which are shown in Figures 3.10, 3.11, and 3.12 as decompositions of the example curve.

Unless otherwise noted, these experiments use Watson distributions and a Gamma(σ = 10^6, κ = 1000) prior on the concentration of the Watson distributions.

**Figure 3.8: Training data for learning experiments.**

**Figure 3.9: Validation data for learning experiments.**

**Figure 3.10: One of three starting structures used in learning experiments. Each bold curve is a symbol in the grammar, and each division of a subcurve corresponds to one production of the grammar.**

**Figure 3.11: One of three starting structures used in learning experiments. Each bold curve is a symbol in the grammar, and each division of a subcurve corresponds to one production of the grammar.**

**Figure 3.12: One of three starting structures used in learning experiments. Each bold curve is a symbol in the grammar, and each division of a subcurve corresponds to one production of the grammar.**

3.3 Simple Tuning of a Hand-Built Grammar

First, we use EM to train an initial grammar that has no choices: each symbol is on the left-hand side of a single rule, and the only randomness in the model is in the midpoint distributions. Our grammar is produced from a hand-chosen decomposition. Samples from the grammar after various numbers of rounds of training are shown below. Since we are using unimodal midpoint distributions, and our grammar has no choice, all the samples are slight variations on one particular mean shape. The mean shape chosen is a very reasonable one, although the arms are longer than is ideal.
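Sampling from such a no-choice grammar amounts to recursive midpoint placement between endpoints. A sketch using complex numbers for points; the Gaussian jitter is a stand-in for the Watson midpoint distributions, and the relative midpoint coordinate z = 0.5 + 0.35j is an arbitrary illustrative choice:

```python
import random

def sample_curve(a, b, depth, z=0.5 + 0.35j, noise=0.02):
    """Recursively sample a curve between complex endpoints a and b.
    Each level places a midpoint at a + z*(b - a), jittered by Gaussian
    noise (a simplified stand-in for a Watson midpoint distribution)."""
    if depth == 0:
        return [a, b]
    zz = z + complex(random.gauss(0, noise), random.gauss(0, noise))
    m = a + zz * (b - a)
    left = sample_curve(a, m, depth - 1, z, noise)
    right = sample_curve(m, b, depth - 1, z, noise)
    return left + right[1:]  # drop the duplicated midpoint
```

Because the midpoint is specified relative to the segment's endpoints, every sample is a slight deformation of one mean shape, exactly as observed in the samples from this experiment.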

We repeat this experiment three times with different initial grammatical structures, to show how the performance of the algorithm depends on our initial structure. Samples from the models learned are shown in Figures 3.13, 3.14, and 3.15. There is only a small difference between the samples from different structures.

                           Structure 1     Structure 2     Structure 3
Simple Tuning              21679.145584    14944.299535    15952.025953
Multiple Midpoints          7112.552734     4620.838965     5732.380757
Multi-level Correlations    8451.596984     6095.411852     7368.234718
Full Grammar                2807.826409

**Table 3.1: Cross-entropy scores for experiments. Each column represents a different starting structure.**

**Figure 3.13: Samples from grammar after 30 rounds of EM, using a grammar based on structure 1. From the simple tuning experiment.**

**Figure 3.14: Samples from grammar after 30 rounds of EM, using a grammar based on structure 2. From the simple tuning experiment.**

**Figure 3.15: Samples from grammar after 30 rounds of EM, using a grammar based on structure 3. From the simple tuning experiment.**

3.4 Experimenting with Different Priors over Concentration

In Figures 3.16, 3.17, and 3.18, we show the outcome of training with Watson distributions and a Gamma(σ, κ = 100) prior on the concentration of the Watson distributions, for σ = 1, 1000, and 10^6. The samples become more realistic as we increase the weight of the prior.

**Figure 3.16: Samples from grammar after 30 rounds of EM, using σ = 1.**

**Figure 3.17: Samples from grammar after 30 rounds of EM, using σ = 1000.**

**Figure 3.18: Samples from grammar after 30 rounds of EM, using σ = 10^6.**

3.5 Multiple Midpoints, and Curves of Constant Length

In the next experiment, we enrich the grammar by adding several copies (in this case, five) of each rule, with perturbed midpoints. We randomly perturb the mode of the midpoint distribution for each copy of the rule, in order to allow the different copies to be used to explain different data. (If we used the same midpoint for each copy of the rule, parses could use the rules interchangeably, and EM would never break this symmetry.) In Figures 3.19, 3.20, and 3.21, we show samples from the grammar built using each of the three starting structures. The samples show significantly more variation than in the previous experiment, but they also include many ill-formed and unlikely shapes. This is because the grammar cannot model correlations between levels: for example, it cannot pick the location of the elbow based on the location of the hand, which means that the arm is often ill-formed and self-intersecting.
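The rule-duplication step can be sketched as follows. The grammar encoding (a dict mapping each symbol to a list of (children, midpoint) rules, with complex midpoints) is an illustrative simplification of the thesis's representation:

```python
import random

def add_rule_copies(grammar, copies=5, jitter=0.05):
    """Replace each rule with several copies whose midpoint modes are
    randomly perturbed, so that EM can specialize the copies to
    different data instead of using them interchangeably."""
    out = {}
    for sym, rules in grammar.items():
        new_rules = []
        for children, mid in rules:
            for _ in range(copies):
                m = mid + complex(random.gauss(0, jitter),
                                  random.gauss(0, jitter))
                new_rules.append((children, m))
        out[sym] = new_rules
    return out
```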

We repeat this experiment three times with different initial grammatical structures, to show how the performance of the algorithm depends on our initial structure. We show the log-likelihoods of the validation data after training in Table 3.1.

**Figure 3.19: Samples from grammar after 30 rounds of EM, using structure 1. From the multiple tuning experiment.**

**Figure 3.20: Samples from grammar after 30 rounds of EM, using structure 2. From the multiple tuning experiment.**

**Figure 3.21: Samples from grammar after 30 rounds of EM, using structure 3. From the multiple tuning experiment.**

3.6 Multiple Midpoints, Multi-Level Correlations

In this experiment, we enrich the grammar by adding several copies of each nonterminal, each of which has several copies of the original rule with perturbed midpoints. If our original rule was X → Y Z, then we have five copies each of X, Y, and Z, and each Xi has five rules of the form Xi → Yj Zk, where j and k are chosen independently at random. The midpoint distribution for each copy of the rule has its mode randomly perturbed to allow the different copies of a rule to explain different data.
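The construction of nonterminal copies can be sketched as follows. Midpoint perturbation is omitted for brevity, and the encoding (one rule X → Y Z per symbol, with copies indexed by integers) is an illustrative simplification:

```python
import random

def add_nonterminal_copies(rules, k=5):
    """Make k copies of each nonterminal X; copy (X, i) gets k rules
    (X, i) -> (Y, j) (Z, m) with j and m chosen independently at
    random, so EM can learn correlations between levels by
    specializing the copies."""
    out = {}
    for x, (y, z) in rules.items():
        for i in range(k):
            out[(x, i)] = [((y, random.randrange(k)),
                            (z, random.randrange(k)))
                           for _ in range(k)]
    return out
```

Because each copy of a parent nonterminal commits to particular copies of its children, the choice of rule at a higher level now constrains the choices made below it, which is exactly the correlation structure this experiment adds.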

In Figures 3.22, 3.23, and 3.24, we show samples from the grammar built using each of the three starting structures. The silhouettes look much better than in the previous experiment.

This is because we have learned correlations between levels, i.e., the choice of rule at a higher level influences the choice of rule at a lower level. For example, knowing where we have placed the point at the end of the arm should give us information about where to place the points of the elbow. Ideally, we would have five different variants of the nonterminal corresponding to the arm, each with its own substructure; this would result in cleanly articulated parts.

We repeat this experiment three times with different initial grammatical structures, to show how the performance of the algorithm depends on our initial structure. We show the log-likelihoods of the validation data after training in Table 3.1.

**Figure 3.22: Samples from grammar after 30 rounds of EM, using structure 1. From the experiment with multi-level correlations.**

**Figure 3.23: Samples from grammar after 30 rounds of EM, using structure 2. From the experiment with multi-level correlations.**

**Figure 3.24: Samples from grammar after 30 rounds of EM, using structure 3. From the experiment with multi-level correlations.**

3.7 Training the Full Grammar

In this experiment, we build a grammar from the example curve that allows all possible decompositions. The grammar has one symbol for every subcurve of the example curve, and one production for every decomposition of a subcurve into two smaller subcurves. We then train using EM. Below we show samples from the grammar after training. Note that there is no longer a choice of initial structure, as we are generating a rule from every possible decomposition of the example curve. We show the log-likelihood of the validation data in Table 3.1.
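The "all decompositions" construction is easy to enumerate. A sketch for an open curve with n points, indexing each subcurve by its endpoint pair (the thesis's curves may be closed, in which case subcurves would wrap around):

```python
def full_grammar(n):
    """Build the all-decompositions grammar over an open curve with n
    points: one symbol per subcurve (i, j) with i < j, and one
    production (i, j) -> (i, k) (k, j) for every interior split point
    i < k < j. Single segments get no productions (terminals)."""
    rules = {}
    for i in range(n):
        for j in range(i + 1, n):
            rules[(i, j)] = [((i, k), (k, j)) for k in range(i + 1, j)]
    return rules

g = full_grammar(5)
```

The number of symbols grows quadratically in n and the number of productions cubically, which is why this grammar removes the need to choose an initial structure at the cost of a much larger model.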

**Figure 3.25: Samples from grammar after 30 rounds of EM. From the full tuning experiment.**

3.8.1 Comparing to Independent Gaussian Distributions

In Figures 3.26, 3.27, and 3.28, we show samples from a model using an independent Gaussian distribution for each point, trained on the same dataset as the grammatical models.

Respectively, they show samples from the maximum likelihood model, samples from the maximum likelihood model with its variance artificially decreased, and samples from the maximum likelihood model with its variance set to zero. It is plain to see that these models are unable to capture much interesting structure, especially when compared to our models.
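This baseline can be sketched as follows. Coordinates are 1-D here for brevity (real curve points are 2-D), and the function names are illustrative, not from the thesis:

```python
import math
import random

def fit_independent_gaussians(curves):
    """Fit an independent Gaussian to each point position across the
    training curves (maximum likelihood: per-point mean and variance)."""
    n = len(curves[0])
    model = []
    for i in range(n):
        xs = [c[i] for c in curves]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        model.append((mean, var))
    return model

def sample(model, scale=1.0):
    """Sample a curve; scale < 1 artificially shrinks the variance,
    and scale = 0 returns the per-point means."""
    return [random.gauss(m, scale * math.sqrt(v)) for m, v in model]
```

Because each point is sampled independently, nothing ties neighboring points together, which is why these samples fail to look like coherent silhouettes.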

We also show (in Figure 3.30) samples from the model of Cootes et al. [1995], trained on hand shapes, which is based on non-independent (correlated) Gaussians.

3.8.2 Comparing to Independent Nonparametric Distributions

In Figure 3.29, we see samples from a model using independent nonparametric distributions for each point. The independence assumption is clearly not a good fit for modeling these curves.

We demonstrate the effectiveness of our models and training procedure by classifying shapes from the Swedish leaves dataset. For each class, we build a grammar from the first example curve, and train it on all the examples in the training set. Our grammar includes a shortening rule for each symbol X, and, for every symbol that is not on the left-hand side of any other rule, a lengthening rule X → XX.
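The resulting classification rule is simple: assign a test shape to the class whose trained grammar assigns it the highest likelihood. A sketch, where `class_loglik` maps each class name to a log-likelihood function (an assumed interface, not the thesis's API):

```python
def classify(curve, class_loglik):
    """Assign `curve` to the class whose trained model gives it the
    highest log-likelihood (one model per class, as described above)."""
    return max(class_loglik, key=lambda name: class_loglik[name](curve))

# Toy stand-in models peaked at different values:
models = {"leaf_a": lambda x: -abs(x - 1.0),
          "leaf_b": lambda x: -abs(x - 5.0)}
```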

The performance achieved is shown in Table 3.2. We also show samples from the models learned in Appendix A.

**Figure 3.26: Samples from a model using independent Gaussians. We are using the maximum likelihood estimate of the parameters.**

**Table 3.2: Classification results on the Swedish leaves dataset.**

**Figure 3.27: Samples from a model using independent Gaussians. The variance of the Gaussians has been decreased to show the means better.**

**Figure 3.28: Samples from a model using independent Gaussians. The variance of the Gaussians has been set to zero to show the means better.**

**Figure 3.29: Samples from a model using independent nonparametric distributions for each point.**

**Figure 3.30: Samples from the model of Cootes et al. [1995], from which this figure is taken.**

Given a shape model and an image containing a similar shape, we would like to be able to find the shape in the image. This is done by searching for a shape that is rated highly by the shape model, and for which evidence exists in the image. Related work has been done in Felzenszwalb and Schwartz [2007], Jin and Geman [2006], Felzenszwalb and McAllester [2010], and Felzenszwalb [2005], each concerned with the problem of finding objects in the plane with recursive costs defined by a hierarchical model.

The rest of this chapter is organized as follows: first, we describe filtration parsing, an admissible heuristic method for speeding up a wide array of dynamic programming algorithms, which we have used to speed up parsing. We then describe our formulation of object detection with grammatical models. Finally, we identify unintended reuse as an important obstacle to using grammatical methods for object detection, and describe our approach to solving this problem.

**4.1 Speeding up Detection with Filtration Parsing**

In this section, we describe filtration parsing, a new algorithm that allows us to speed up certain dynamic programming problems using a technique akin to branch and bound. Similar techniques have been studied before in Felzenszwalb and McAllester [2007] and Raphael [2001].

**4.1.1 Lightest Derivation Problems**

We now discuss filtration parsing, a technique for speeding up parsing by eliminating unlikely parses from consideration. It is similar in nature to Felzenszwalb and McAllester [2007], but it can be applied in a more general context. We first require some preliminaries.

Let T1 and T2 be binary trees, where u is a leaf node of T1 and v is the root node of T2. We define a new tree T1 ∘u T2 to be the tree resulting from deleting u and attaching T2 in its place.

A lightest derivation problem (Felzenszwalb and McAllester [2007]) is a tuple (S, R), where S is a set of statements and R is a set of weighted inference rules. We require that S contain ⊥, a special goal statement. We will denote the weight of a rule R ∈ R by w(R).
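A small lightest derivation problem can be solved directly with a Dijkstra-style priority queue that settles statements in order of increasing derivation weight. This is background machinery for the definition above, not the filtration algorithm itself, and it assumes nonnegative rule weights:

```python
import heapq

def lightest_derivation(rules, goal):
    """Bottom-up solver for a lightest derivation problem. `rules` is
    a list of (antecedents, conclusion, weight) triples; a rule with
    no antecedents is an axiom. A derivation's weight is the sum of
    the weights of the rules it uses. Returns the weight of the
    lightest derivation of `goal`, or None if it is not derivable."""
    best = {}  # settled statements -> lightest derivation weight
    queue = [(w, c) for ants, c, w in rules if not ants]
    heapq.heapify(queue)
    while queue:
        w, s = heapq.heappop(queue)
        if s in best:
            continue  # already settled with a lighter derivation
        best[s] = w
        if s == goal:
            return w
        for ants, c, rw in rules:
            # a rule fires once the last of its antecedents is settled
            if ants and s in ants and c not in best and all(a in best for a in ants):
                heapq.heappush(queue, (rw + sum(best[a] for a in ants), c))
    return best.get(goal)
```

With nonnegative weights, a statement's weight is final when it is popped, for the same reason Dijkstra's algorithm is correct; this is the invariant that heuristic methods like the ones discussed in this section exploit.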

The goal is to find the lightest (i.e., lowest-cost) derivation of the goal statement using the rules in R. Our inference rules are of the following form:
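Following Felzenszwalb and McAllester [2007], a rule with antecedents A1, ..., An, conclusion C, and weight w can be written as an inference-rule schema (this reproduces the general shape from that paper; the exact notation used here is an assumption):

```latex
\frac{A_1 \quad A_2 \quad \cdots \quad A_n}{C} \quad (w)
```

The weight of a derivation is then the sum of the weights of all rule applications it contains.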