# «THE UNIVERSITY OF CHICAGO GRAMMATICAL METHODS IN COMPUTER VISION A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN ...»

**show that a hand-built grammar can exhibit interesting variation, such as:**

** Figure 2.7: The rules for one of our hand-built grammars.**

The shaded part illustrates a reusable part because it occurs on the right hand side of two diﬀerent rules.

• Small geometric deformation (Figure 2.9)

• Articulating parts of an object (ﬁngers on the simpliﬁed hand in Figure 2.10)

• The presence or absence of a part (thumb on the simpliﬁed hand in Figure 2.10). This is achieved by giving the nonterminal which generates the thumb the rule Xthumb →.

• A choice between two diﬀerent variations on a part (pants vs. skirt in 2.11). This is achieved by having rules Xlower → Yskirt Zskirt and Xlower → Ypants Zpants, and having Yskirt, Zskirt based on a decomposition of the skirt-wearing shape, and Ypants, Zpants based on a decomposition of the pants-wearing shape.

• Shared parts that occur in diﬀerent contexts (ﬁngers on the simpliﬁed hand in Figure 2.10) Figure 2.8: The curves used to initialize our hand-built grammars.

** Figure 2.9: Samples from a grammar showing slight geometric deformation.**

** Figure 2.10: Samples from a grammar showing articulating parts, reusable parts, and a part that can be absent.**

** Figure 2.11: Samples from a grammar showing a choice between two diﬀerent variants on a part, corresponding to a choice between pants and skirt.**

We can then deﬁne P (C; G, p, q) to be the probability that we arrive at C by starting with Sp,q and repeatedly applying a random concrete substitution rule according to Psub.

For a ﬁxed grammar G and a given curve C with points C[0],..., C[n], there are potentially many ways for this sampling process to produce C. In this section, we formalize this by deﬁning parse trees.

For the sake of simplicity, we will assume in this section that all curves are open, unless otherwise speciﬁed. We deal with the technicalities of closed curves in Subsection 2.6.5.

and attaching T2 in its place.

In order to diﬀerentiate between abstract and concrete substitutions we will write the former as [X → Y Z] or [X → ] and the latter as Xik → Yij Zjk or Xii+1 → ii+1.

Deﬁnition 2.6.3. A G-parse tree is a binary tree T that speciﬁes a set of concrete substitution rules that take a placed symbol Xij to some placed curvilinear form λ.

**T has three kinds of nodes:**

• Unexpanded nodes, which are labeled with a placed symbol Xij, where 0 ≤ i j ≤ n and X ∈ N. Unexpanded nodes are always leaves.

• Lexical nodes, which are labeled with a concrete substitution rule

where [X → ] ∈ R(X), X ∈ N, 0 ≤ i n. Lexical nodes are always leaves.

• Binary nodes, which are labeled with a concrete substitution rule

where X, Y, Z ∈ N, [X → Y Z] ∈ R(X), and 0 ≤ i j k ≤ n. Binary nodes always have two children. The ﬁrst must be either of the form Yij → λ or of the form Yij.

The second must be either of the form Zjk → λ or of the form Zjk.

For brevity, we will refer to G-parse trees simply as parse trees.

**To simplify notation, we will deﬁne the symbol of a node (denoted sym(v)) as:**

• sym( Xii+1 → ii+1 ) = Xii+1

• sym( Xik → Yij Zjk ) = Xik We have deﬁned parse trees to represent derivations according to the substitution rules of G. For X ∈ N, 0 ≤ i, j ≤ n, the set of all G-parse trees T such that sym(root(T )) = Xij is in one-to-one correspondence with the set of all derivations of placed curvilinear forms from the placed symbol Xij. Unexpanded nodes correspond to incomplete derivations, while lexical and binary nodes correspond to applications of lexical and binary substitution rules.

The constraints on the children of binary nodes require that our derivation then operate on the placed symbols produced by the substitution rule.

Deﬁnition 2.6.4. The weight of a parse tree T is deﬁned to be

Note that this product omits the unexpanded nodes of T.

The weight of a parse tree multiplies the probability of each concrete substitution in the parse tree. We call it a weight rather than a probability because two diﬀerent parse trees are not always mutually exclusive events (in particular, if one is a subset of the other), and thus WG (T ) does not sum to one over the set of all parse trees. Trees with unexpanded nodes correspond to sets of trees, and the weight of the tree is the combined weight of the set.

Observation 2.6.5. Let T1 be a parse tree with an unexpanded leaf node Xij, and let T2 be a parse tree with root node of the form Xij → λ. If we deﬁne

Proof. Let T ∈ IC (Xij ). The root node is either of the form Xii+1 → ii+1 or Xij → G Yik Zkj. In the former case, the root is a lexical node, and cannot have any children.

In the latter case, since T is an inner parse tree, the root node cannot be a leaf node, and it is constrained to have exactly two children, which must be of the form Yik → λY and Zkj → λZ, since T has no unexpanded nodes. Every leaf node of either subtree is also a leaf node of T, and thus lexical. Therefore the subtrees headed by these nodes are also inner parse trees, and reside in the speciﬁed sets.

Proposition 2.6.8. The set IC (Xij ) is in one-to-one correspondence with the set of derivaG tions of C[i : j] from the placed symbol Xij according to the substitution rules of G. Moreover, for T ∈ IC (Xij ), WG (T ) gives the probability of the derivation.

G Proof. Since inner parse trees cannot have unexpanded nodes, they correspond to complete derivations of a subcurve from the placed symbol Xij. Since each step of the derivation is chosen independently from the others, the probability of the derivation represented by the parse tree is the product of the probabilities of the individual substitutions, which is given by WG (T ).

The probability that we derive a curve C[i : j] from a placed symbol Xij is just the sum of the probabilities of any particular derivation, taken over all possible derivations. This

**probability is called the inside probability:**

**We can compute the inside probability eﬃciently:**

Observation 2.6.9. We can compute all inside probabilities in time O(|R|n3 ) by using the

**following relations:**

Deﬁnition 2.6.10. An outer parse tree of C is a parse tree T with sym(root(T )) = S0n in which one special leaf is an unexpanded node Xij, and all other leaf nodes are lexical nodes corresponding to segments of C. Let OC (Xij ) be the set of all outer parse trees which have G unexpanded node Xij.

Proof. Let T ∈ OC (Xij ). The unexpanded node Xij may be the root of T, in which case G Xij must be S0n, and we are in the ﬁrst case.

Otherwise, Xij is not the root of T, and has a parent which is either Zhj → Yhi Xij, or Zik → Xij Yjk. Let us restrict to the second case; the ﬁrst case is strictly analogous.

If we remove the subtree rooted at Yjk and replace Zik → Xij Yjk with an unexpanded

**node Zik, we get a tree Tout which is a valid outer tree for Zik :**

• Tout has a single unexpanded node Zik

Furthermore, the subtree TY rooted at Yjk satisﬁes sym(root(TY )) = Yjk, and is an inner parse tree, since all remaining leaf nodes are lexical nodes. This completes the proof.

** Figure 2.12: The two ways of forming an outer parse tree for X.**

Here shaded ovals are unexpanded nodes, shaded triangles are outer parse trees, and unshaded triangles are inner parse trees.

**We deﬁne the outside weight as:**

Observation 2.6.12. Given the inside probabilities, we can compute the outside weights in time O(|R|n3 ) by using the following relations (note that we can do this with dynamic programming because we can compute the values in a ﬁxed order, based on the length of

**the intervals):**

We can think of a closed curve as an open curve whose endpoints are the same. However, we will prefer to think of it in a diﬀerent way. We wish to preserve the symmetry of closed curves by considering any rotation of the curve to be equivalent. We can then deﬁne an “opening” operation, which takes a closed curve C[0],..., C[n − 1], and an integer k, and produces the open curve C[k], C[k + 1],..., C[n − 1], C[0],..., C[k] whose endpoints are the same.

When given a closed curve, we will assume that any of the n possible opening operations are equally likely, and thus have probability n. Alternatively, we can imagine that the closed curve has had one of the n possible rotations applied to it before we see it, with each rotation being equally likely.

Given a closed curve, we could then apply the preceding machinery to every possible opening of the curve. However, this is grossly ineﬃcient, since diﬀerent openings of the curve still share many of the same subcurves. Here, we describe a way to take maximum advantage of this sharing.

** Remark 2.6.**

13. For closed curve grammars, the top-level midpoint distributions µS→XY (deﬁned in Section 2.4) cannot be proper distributions if our grammar is going to be invariant under similarity transformations (translation, scaling, and rotation).

The symbol S will be realized as a placed symbol Sp,p, and the placed rule will be of the form Sp,p → Xp,q Yq,p. Then, any choice of midpoint q is equally good, since we can map the pair (p, q) to a pair (p, q ) via a similarity transformation. This is a problem because we cannot deﬁne a uniform distribution on all of R2 !

We handle this by setting µS→XY (·) = 1 in our formulas. We justify this by thinking of each curve as being a representative of its equivalence class under similarity transformations.

The probability that we see a parse tree is therefore P (C, T | G) = n WG (T ). In order to describe the parse trees, we must allow placed symbols Xij, where j i. In this case, Xij will be understood to be a placed symbol that ultimately expands to be all the indices between 0 and n that are after i or before j. To simplify matters, we introduce clockwise

**intervals:**

These are the points which are outside of the clockwise interval between i and j, excluding i and j.

**We can then adapt Observations 2.6.9 and 2.6.12 for closed curves as follows:**

Proposition 2.6.14. We can recursively compute inside weights for closed curves in time

**O(|R|n3 ):**

The current work is based on Felzenszwalb and Schwartz [2007], in which grammar-like structures were constructed from single training curves. We follow that approach here.

A grammar GC for a curve C should produce C, and curves which look like C. We are not trying to achieve structural variation yet, so the only variation that GC needs to account

**for is:**

• Slight geometric deformations of C

• Curves like C that have more or fewer sample points.

Geometric deformation is modeled with a midpoint distribution that is deﬁned in terms of C. We discuss our models in Section 2.8. Curves with fewer sample points are modeled by including, for every nonterminal X ∈ N, the rule [X → ]. We initially set the probability of this rule based on the scale of X, as described in Section 2.9.

The nonterminals of our grammar will correspond to subcurves of C. We model curves with more sample points than C by allowing any nonterminal X that arises from a subcurve

**of length 1 to have two rules:**

Let C be a curve of length n. A generic parse tree for C is a parse tree that does not refer to any particular grammar. This will just be a binary tree T where each node is labeled by an interval (i, j), 0 ≤ i j ≤ n, and

1. An arbitrary parse, chosen to make the tree as balanced as possible. This approach corresponds most closely to that of Felzenszwalb and Schwartz [2007].

2. A parse that has been chosen to respect the natural part structure of C. This approach would require us to identify natural constituents of curves. There has been some work on this question, e.g. Richards et al. [1985].

3. We can choose multiple parses, and create a grammar that can parse C in multiple ways.

Our favored approach is number 3. We want to create a small grammar that can parse C in as many ways as possible, so that the EM retraining can discover the “correct” subgrammar.

When sampling from our grammar, we place a midpoint q according to the midpoint distribution µX→Y Z (q; p, r), where Xp,r is the placed symbol we are currently replacing, and X → Y Z is the substitution rule we are using.

When building a grammatical model from a single curve C, we base the distribution µX (i,k) →X (i,j) X (j,k) on the triangle C[i], C[j], C[k], with C[i], C[j], C[k] playing the roles of p, q, and r respectively.

We have experimented with two diﬀerent approaches to modeling the shape of triangles, the Watson distribution and a non-parametric distribution. Unless otherwise speciﬁed, midpoint distributions will always be modeled with the Watson distribution.

where µ is the mode of the distribution, and κ 0 is a concentration parameter. This distribution is invariant to rotation since, if y = eiθ z, |y ∗ µ| = |z ∗ µ|. More details on the Watson distribution can be found in Dryden and Mardia [1998].