# THE UNIVERSITY OF CHICAGO: GRAMMATICAL METHODS IN COMPUTER VISION. A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN ...

One approach to solving this problem, used in Jin and Geman [2006], is to fix various low-level assertions that seem likely, and eliminate incompatible low-level assertions from consideration. In our case, this approach does not yield good results.

**4.4 State space**

We assume that there is a single object of interest against a background, and that its boundary is the curve we wish to detect. Let P be the set of pixels in the input image, and let E ⊂ P be the pixels marked as edges by some edge detector, such as the Canny edge detector (Canny [1986]). We wish to assign every non-edge pixel to be either figure or ground, i.e., inside the object of interest or not inside the object of interest. This will be formulated as an energy minimization problem. We will represent the labeling as a function L(p), where L(p) = 1 when p is inside the object, and L(p) = 0 when p is outside of the object.
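As a concrete sketch of this setup (the gradient-threshold edge detector below is a crude stand-in for a real Canny detector, and the threshold value is an assumption for illustration):

```python
import numpy as np

def edge_and_label_maps(img, thresh=0.3):
    """Split pixels into an edge set E and non-edge pixels carrying a
    figure/ground label L (L(p) = 1 inside the object, 0 outside)."""
    gy, gx = np.gradient(img.astype(float))  # image gradients
    mag = np.hypot(gx, gy)                   # gradient magnitude
    E = mag > thresh * mag.max()   # crude stand-in for a Canny edge map
    L = np.zeros(img.shape, dtype=int)       # initial labeling: all ground
    return E, L

# toy image: a bright square on a dark background
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
E, L = edge_and_label_maps(img)
```

The energy minimization below then operates only on L at the non-edge pixels, with the edge pixels of E carrying the orientation variables introduced next.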

Because we are trying to prevent edges from being used multiple times, we would in particular like to prevent or penalize the use of an edge pixel in opposite directions. We thus add variables to our energy minimization problem to represent the orientation of each edge pixel. We constrain this orientation to be orthogonal to the gradient at the edge pixel, so each edge pixel has two choices of orientation, represented as a unit vector D(p). We want to assign orientations so that the object of interest has its boundary traced out clockwise. We also want the orientation assignment to be consistent with the output of our global parsing algorithm.
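Concretely, the two allowed values of D(p) at an edge pixel are the two unit vectors obtained by rotating the gradient by 90 degrees (a minimal sketch):

```python
import numpy as np

def orientation_choices(gx, gy):
    """Return the two unit vectors orthogonal to the gradient (gx, gy);
    these are the only allowed values of D(p) at this edge pixel."""
    norm = np.hypot(gx, gy)
    if norm == 0:
        raise ValueError("pixel has no gradient, so no edge orientation")
    t = np.array([-gy, gx]) / norm   # gradient rotated by 90 degrees
    return t, -t

d1, d2 = orientation_choices(1.0, 0.0)   # vertical edge: D is (0, 1) or (0, -1)
```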

We use the output of our local interpretation procedure to modify global inference in two ways:

• A line segment that goes over an edge pixel with the wrong orientation is assigned the negative of its normal cost (corresponding to the assumption that it will be used twice in the parse: once with the correct orientation, and once with the reverse orientation). The first part of Figure 4.2 depicts a problematic parse that will be penalized appropriately by this heuristic.

• A line segment is disallowed if too high a fraction of its pixels are classified as interior, or if too high a fraction are classified as exterior. (Here interior and exterior refer to the output of our local inference procedure.) This is meant to eliminate short-circuiting, in which the optimization procedure double-counts an edge by taking a round-about way from its head to its tail. Unless the curve encompasses the whole object, this test generally prevents short-circuiting. The second part of Figure 4.2 shows an example of short-circuiting.
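The two modifications above can be sketched as follows; the function name, the averaging used to decide "wrong orientation" in the first rule, and the cutoff `max_frac` in the second are all illustrative choices, not taken from the text:

```python
import numpy as np

def adjust_segment(cost, pixel_dirs, assigned_dirs, interior_mask,
                   max_frac=0.8):
    """Apply the two local-inference modifications to one line segment.

    cost:          the segment's normal data cost
    pixel_dirs:    (N, 2) unit direction of the segment at its edge pixels
    assigned_dirs: (N, 2) orientations D(p) chosen by local inference
    interior_mask: (N,) boolean, True where local inference says interior
    Returns (new_cost, allowed)."""
    # Rule 1: a segment running against the assigned orientations
    # is assigned the negative of its normal cost.
    agree = np.einsum('ij,ij->i', pixel_dirs, assigned_dirs)
    if np.mean(agree) < 0:
        cost = -cost
    # Rule 2: disallow segments that are mostly interior or mostly exterior.
    frac_in = interior_mask.mean()
    allowed = (frac_in <= max_frac) and (1 - frac_in <= max_frac)
    return cost, allowed
```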

We formulate our energy minimization problem as the sum of four terms:

$$U(L, D) = U_{seg}(L) + U_{or}(D) + U_{int}(L, D) + U_{yd}(D),$$

where $U_{seg}$ enforces coherence in the segmentation, $U_{or}$ enforces coherence in the orientation assignment, $U_{int}$ enforces consistency between the segmentation and the orientation assignment, and $U_{yd}$ enforces consistency between the orientation assignment and the previous curve selected by our global parsing algorithm. We now discuss each term in turn.

**Figure 4.2: Motivation for local constraints.** Grey regions represent areas where there is evidence for edges.

This energy function is very simple: it is just a sum over all pairs of adjacent non-edge pixels, where the cost is −α if the two pixels have the same label, and α if they have different labels, for some constant α > 0. Written out,

$$U_{seg}(L) = \alpha \sum_{\substack{p \sim q \\ p, q \notin E}} \bigl(1 - 2 \cdot \mathbf{1}[L(p) = L(q)]\bigr),$$

where $p \sim q$ ranges over pairs of adjacent non-edge pixels.

This energy function is a standard model from statistical physics called the Ising model.

It is commonly used in computer vision to denoise images (Besag [1993], Geman and Geman [1984]). Here we are using it to penalize non-edge boundaries between figure and ground, which should be rare. When they are necessary, this energy term will push non-edge boundaries to be short and simple.
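A direct sketch of this term, summing over horizontal and vertical neighbor pairs of non-edge pixels:

```python
import numpy as np

def U_seg(L, E, alpha=1.0):
    """Ising-style smoothness term: each pair of adjacent non-edge pixels
    contributes -alpha if the labels agree, +alpha if they differ."""
    total = 0.0
    H, W = L.shape
    for dy, dx in ((0, 1), (1, 0)):          # right and down neighbors
        a = L[:H - dy, :W - dx]
        b = L[dy:, dx:]
        ok = ~E[:H - dy, :W - dx] & ~E[dy:, dx:]   # both pixels non-edge
        agree = (a == b)
        total += alpha * float(np.sum(np.where(agree, -1.0, 1.0) * ok))
    return total
```

Note that a uniform labeling attains the minimum, so this term only acts as a tie-breaker favoring short, simple figure/ground boundaries.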

$$U_{or}(D) = -\sum_{\substack{p, q \in E \\ p \neq q}} w_{p,q}\, M(p)\, M(q)\, \bigl(D(p) \cdot D(q)\bigr),$$

where M(p) is the gradient magnitude at pixel p and · is the dot product between vectors.

This energy function pushes our orientation assignment to vary smoothly, as nearby edge pixels are penalized if their orientations do not point in the same direction. We weight this penalty by the gradient magnitude, so that stronger edges are given more influence. The function $w_{p,q}$ is a weighting that specifies how influence dies off over the distance between p and q. We have used

$$w_{p,q} = e^{-(x^2 + y^2)/4w^2}\, (1 - \beta|y|),$$

where x is the component of q − p perpendicular to the gradient at p, y is the component of q − p parallel to the gradient at p, and w is a width parameter. Note that $w_{p,q} = w_{q,p}$. The (1 − β|y|) factor captures the fact that parallel edges sufficiently far apart are more likely to go in opposite directions, because they are likely to be the two sides of a part of the object.
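This weighting can be sketched as follows, assuming the exponent has the form $-(x^2+y^2)/4w^2$ with w a width parameter; the values of w and β below are illustrative:

```python
import numpy as np

def pair_weight(p, q, grad_p, w=2.0, beta=0.05):
    """Influence of edge pixel q on edge pixel p, computed in coordinates
    aligned with the gradient at p. Values of w and beta are illustrative."""
    g = np.asarray(grad_p, dtype=float)
    g = g / np.linalg.norm(g)
    d = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)
    y = float(d @ g)                         # component parallel to the gradient
    x = float(d @ np.array([-g[1], g[0]]))   # component perpendicular to it
    return float(np.exp(-(x**2 + y**2) / (4 * w**2)) * (1 - beta * abs(y)))
```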

$$U_{int}(L, D) = \sum_{p \in P \setminus E}\; \sum_{q \in E} M(q)\, \bigl(1 - 2L(p)\bigr)\, \sin \theta\bigl(q - p,\, D(q)\bigr),$$

where M(q) is the gradient magnitude at q, and θ(u, v) is the angle between vectors u and v.

This energy function is motivated by the observation that the angle between q − p and D(q) tends to be close to 90 degrees when p is in the interior of the object and D(q) is going clockwise around the object. When p is outside the object and D(q) is going clockwise around the object, the angle between q − p and D(q) tends to be close to 270 degrees. This is shown in Figure 4.3. Multiplying the sine of this angle by (1 − 2L(p)), which is −1 in the interior and 1 in the exterior, yields a quantity that is negative if L(p) and D(q) are consistent with one another, and thus consistency is rewarded when we minimize the energy.
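The sine of this angle can be computed with a 2-D cross product; here is a sketch of one (p, q) pair's contribution. The sign convention below assumes standard right-handed coordinates, so it may flip in image coordinates:

```python
import numpy as np

def uint_contrib(Lp, p, q, Dq, Mq):
    """Contribution of non-edge pixel p and edge pixel q:
    M(q) * (1 - 2 L(p)) * sin(theta(q - p, D(q)))."""
    d = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)
    d = d / np.linalg.norm(d)
    sin_theta = d[0] * Dq[1] - d[1] * Dq[0]   # 2-D cross product = sin(angle)
    return Mq * (1 - 2 * Lp) * sin_theta

# Interior p (Lp = 1) with the angle near 90 degrees, and exterior p (Lp = 0)
# with the angle near 270 degrees, both produce negative (rewarded) energy.
```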

$$U_{yd}(D) = -\sum_{p \in E} D(p) \cdot T(p),$$

where T(p) is the direction of the line segment of our initial parse passing through p (taken to be zero at pixels the parse does not cover), and · is the dot product of vectors.

This energy function simply rewards orientation assignments which go in the same direction as the line segments that make up our initial parse.
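A sketch of this reward, where `parse_dirs` holds the (assumed) unit direction of the initial parse's segment at each edge pixel and `covered` marks the pixels the parse passes over:

```python
import numpy as np

def U_yd(D, parse_dirs, covered):
    """Negative energy (reward) for D(p) agreeing with the direction of the
    initial parse. D, parse_dirs: (N, 2) arrays of unit vectors;
    covered: (N,) boolean mask of pixels on the parse."""
    dots = np.einsum('ij,ij->i', D[covered], parse_dirs[covered])
    return -float(dots.sum())

D = np.array([[1.0, 0.0], [0.0, 1.0]])
energy = U_yd(D, D, np.array([True, True]))   # perfect agreement
```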

**Figure 4.3: The angle between q − p and D(q) should be close to 90 degrees for interior p and close to 270 degrees for exterior p.**

In our experimental results, we show the final parse chosen. Recall that we parse first, do local inference, and then reparse with modified data costs. We also show the segmentation into figure and ground, and the orientation assignment selected by our local inference procedure.

**Figure 4.4: Output of detection algorithm.**

Final detection shown top left, segmentation shown top right, orientation assignments shown in four bottom images. The top left of the orientation images gives a key to the orientation labels.

The orientation assignment is shown in four different images corresponding to four different pairs of opposite orientations. One orientation is shown in black, and its opposite is shown in white, while other orientations are shown in grey. The orientation depicted is shown by the squares in the top left of each image.

These results demonstrate that our local inference procedure, combined with our global parsing procedure, allows us to locate human silhouettes in clutter-free images.

**Figure 4.5: Output of detection algorithm.**

**Figure 4.6: Output of detection algorithm.**

**Figure 4.7: Output of detection algorithm.**

**Figure 4.8: Output of detection algorithm.**

**Figure 4.9: Output of detection algorithm.**

**Figure 4.10: Output of detection algorithm.**

**Figure 4.11: Output of detection algorithm.**

**Figure 4.12: Output of detection algorithm.**

**Figure 4.13: Output of detection algorithm.**

Natural object categories such as cars and people exhibit two kinds of variation: continuous or “plastic” deformations and discontinuous structural variations. Therefore, object models which allow for both will make vision algorithms much more powerful. The potential for a satisfying account of large structural variation is one of the most intriguing possibilities of grammatical methods.

One of the simplest structural variations is occlusion: part of an object may not be visible, usually because something between the object and the camera is occluding it. Occlusion has been well understood in computer vision for a long time, and models can be made robust to it, e.g., the Hausdorff distance in Huttenlocher et al. [1993].

Another common way that objects exhibit structural variation is by having optional parts: a dog may or may not have a tail, a person may or may not have a hat. Occlusion models are capable of recognizing such objects with or without their optional parts, but they do not accurately model optional parts. An optional part is a particular subset of the object that is likely to not appear, while occlusion allows any not-too-large subset of the model to disappear.

The usefulness of more general structural variation can be seen in Figure 5.1. Here, the human eye notices a large similarity between the two shapes A1 and A2, but many curve models would see very little similarity.

**Figure 5.1: If A1 is the original curve, which other curve is most similar to it?** Figure adapted from Basri et al. [1998].

We might intuitively describe the second shape as

A2 = “Take A1, snap off the right appendage, and reattach it beneath the left appendage.” (5.1.1)

This highlights several important points:

The description 5.1.1 of A2 is very short in English, and might be even shorter in a specialized curve model encoding. Description length is a good proxy for the conditional probability of observing A2 given that it is a distortion of A1 (Bienenstock et al. [1997]).
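In information-theoretic terms, a prefix code of length DL bits corresponds to a probability of about $2^{-\mathrm{DL}}$, so a plausible formal reading of this claim (the exact coding scheme is not specified here) is

$$P(A_2 \mid A_1) \approx 2^{-\mathrm{DL}(A_2 \mid A_1)},$$

where $\mathrm{DL}(A_2 \mid A_1)$ is the length in bits of the shortest description of A2 in terms of A1.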

Structural variation is a fundamental problem in modeling visual objects. In the absence of a practical model of structural variation, we must model variation as continuous deformation. Then, any model that declares A1 and A2 to be similar will think that A3 or A4 is even more similar to A1.

Structural variation cannot be modeled without a semantically meaningful decomposition of the original curve, like that seen in Figure 5.2. Description 5.1.1 crucially relies on “the right appendage” making sense to the listener. Thus, perceptually simple structural variation must respect the perceived structure of the original curve. Contrast Figure 5.1 with Figure 5.3, where a similar transformation has been applied with no regard to the perceived structure of the original curve.

**Figure 5.2: The original shape from Figure 5.1, decomposed into semantically meaningful parts.** We argue that this decomposition explains why the variation in Figure 5.1 is less semantically different than the variation in Figure 5.3. Adapted from Basri et al. [1998].

**Figure 5.3: Two shapes which are not perceptually very similar, although they are related by a transformation as simple as that in Figure 5.1.** The problem is that the transformation does not respect the perceived structure of the original. Adapted from Basri et al. [1998].

Mixture models are a class of models that do not suffer from the continuous-deformation problem of Figure 5.1. However, if there are multiple independent structural variations possible, it is unlikely that we will see every combination of their forms. Consider the shape grammar $G_n$ that generates shapes with n arms, each of which can take either of two forms, A or B, where A is pointy and B is rectangular. We show four shapes possible under this grammar in Figure 5.4. A classic mixture model will not be able to generalize in this scenario without exponentially many training examples, since there are $2^n$ possible shapes. If we instead have mixture models at the level of individual structural variations, then our model is a grammatical model in the style of Section 1.1.
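The combinatorics here can be made concrete with a toy enumeration (the string encoding of arm forms is purely illustrative):

```python
from itertools import product

def shapes(n):
    """All shapes generated by G_n: each of the n arms independently
    takes form 'A' (pointy) or 'B' (rectangular)."""
    return [''.join(arms) for arms in product('AB', repeat=n)]

# 2**n distinct shapes, even though the grammar makes only n binary choices;
# a classic mixture model would need one component per shape.
four_arm_shapes = shapes(4)
```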

We want to learn a grammar that explains our training data well, but we also want a simple grammar, because simpler models exhibit better generalization. We can reward simple grammars by using a prior based on the Minimum Description Length, described in the next section. We can then attempt to maximize the posterior probability of G given the training data and this prior.
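In symbols, writing $\mathcal{D}$ for the training data and $\mathrm{DL}(G)$ for the grammar's description length in bits (both symbols introduced here for illustration; the precise prior is given in the next section), the objective this sets up is the standard MAP criterion

$$G^{*} = \arg\max_{G}\; P(G \mid \mathcal{D}) = \arg\max_{G}\; P(\mathcal{D} \mid G)\, P(G), \qquad P(G) \propto 2^{-\mathrm{DL}(G)}.$$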