Chapter 6: Classification Last revised: September 17, 2010
Chapter 6. Classification
Chapter author: Jess Hemerly
Table of Contents
6.2 Classification Theory
6.2.1 What is Classification?
6.2.2 Purpose of a Classification
6.2.3 Classification is Principled
6.2.4 Spectrum of Classification
6.3 Faceted Classification
6.3.1 What are Facets?
6.3.2 Faceted Classification as a Controlled Vocabulary
6.3.3 Facets in Information Retrieval
6.3.4 Designing a Faceted Classification for the Web
6.4 Social/Distributed Classification
6.4.1 What is Tagging?
6.4.2 Folksonomy versus Tagsonomy
6.4.3 Tagsonomies and Personal Information Management
6.5 Computational Classification
6.5.1 What is Computational Classification?
6.5.2 Machine Learning
6.5.4 Discriminant Approaches
6.1 Overview
Imagine how kitchen items are organized in a brick-and-mortar department store (think WalMart or Macy’s) that sells a variety of goods, from clothing to furniture. Within the store, the kitchen goods will be grouped together in a few aisles or on a single floor. Signs above the aisles or on the department store directory serve as descriptions pointing you to the section of the store that contains the items fitting that description. Within the kitchen area, you may see blenders grouped together on one shelf or a section of shelves; wooden spoons, tongs, and spatulas arranged by type and hung neatly; and rows of dishes unpacked and laid out in place settings to help you imagine how the different styles might look on your kitchen table. In this scenario, the department store comprises any number of kinds of items, grouped together in a location; the kitchen section comprises only kitchen supplies; and each shelf or area comprises a specific kind of kitchen item.
Next, imagine you’re shopping for kitchen items in a specialty kitchen store, like Williams-Sonoma or a wholesale kitchen supply store. As you walk in the door, you immediately find that the store sells one grouping of items: things to be used in your kitchen. Because this store is devoted only to this one type, or class, of items, the arrangement is somewhat different.
The specialization allows more variety within each class, expanding classes across aisles and displays. Instead of one or a few aisles devoted to kitchen items, you find one or a few aisles devoted to utensils, to blenders, to coffee makers, to knives, etc. Items within classes may be grouped by brand, size, or price, with labels on the shelf describing these specific attributes.
Because the contents of the store come from a narrow or specialized class, the selection and its organization differ from the organization in a department store. Here, the set of instances is more refined, catering to a specific clientele looking for a more specialized selection of goods.
Now, think about how you might shop for kitchen items online at either a department store or a specialty store’s website. Online, there are many different ways to locate items. You can enter a query and search for a generic term, like “knife,” a more specific term like “paring knife,” or a very specific term, like “Wüsthof Classic 9cm Paring Knife.” On many websites, search results will display a list of related terms or descriptors somewhere on the page, especially if you have entered a generic term. These descriptors are usually things like price, brand, and type, allowing you to browse and narrow down your results based on different characteristics you desire in the item. Maybe you didn’t know you wanted a Wüsthof Classic 9cm Paring Knife, but by narrowing down your search results using the characteristics, called facets, you end up discovering that this is just the knife you need. If you entered a very specific term, your item may not be available, but the system may suggest similar items that you might find just as desirable based on your terms and objects that have been assigned to categories sharing certain characteristics. The system takes your query and matches it to items available in the database to come up with recommendations that might work for you.
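The narrowing process described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular store's website works: the catalog, the facet names (`type`, `brand`, `price_band`), and every value except the Wüsthof paring knife named in the text are hypothetical.

```python
from collections import Counter

# Hypothetical catalog: all items, brands, and price bands are invented
# for illustration, apart from the paring knife mentioned in the text.
CATALOG = [
    {"name": "Wüsthof Classic 9cm Paring Knife", "type": "paring knife",
     "brand": "Wüsthof", "price_band": "mid"},
    {"name": "Wüsthof Classic 20cm Chef's Knife", "type": "chef's knife",
     "brand": "Wüsthof", "price_band": "high"},
    {"name": "Acme Paring Knife", "type": "paring knife",
     "brand": "Acme", "price_band": "low"},
    {"name": "Acme Wooden Spoon", "type": "utensil",
     "brand": "Acme", "price_band": "low"},
]


def search(catalog, query):
    """Return items whose name or type contains the query (case-insensitive)."""
    q = query.lower()
    return [item for item in catalog
            if q in item["name"].lower() or q in item["type"].lower()]


def facet_counts(items, facet):
    """Count how many of the given items carry each value of one facet."""
    return Counter(item[facet] for item in items)


def narrow(items, selected):
    """Keep only the items that match every selected facet value."""
    return [item for item in items
            if all(item[facet] == value for facet, value in selected.items())]


# A generic query returns several items; the facet counts are what a site
# would display beside the results so the shopper can narrow them down.
results = search(CATALOG, "knife")
brand_counts = facet_counts(results, "brand")
chosen = narrow(results, {"brand": "Wüsthof", "type": "paring knife"})
```

Each facet is an independent way of slicing the same result set, so the shopper can apply them in any order and still arrive at the same item.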
Let’s turn to the classification of the books in a kitchen or cooking store. Perhaps they’d be organized at the topmost level by topic or type: cookbooks, equipment, and techniques. Within the class cookbooks, there are a few ways the subclasses could be arranged. They could be organized by cultural cuisine, like French, Indian, and Chinese; by main ingredient (fish, poultry, vegetarian); or alphabetically by author or title, within the class cookbooks overall or within each of the subclasses.
But looking for books on topics related to food and cooking in a library that uses the Dewey Decimal System is another story. You might first want to look under “700 – Arts and recreation,” since cooking is referred to as “the culinary arts” and many people enjoy cooking (and eating) as a recreational activity. But you wouldn’t find it there. Instead, food-related books live under “600 – Technology” in “640 Home economics & family living.” This classification may have made sense when the system was established, but “home economics” has become a dated term, and cooking is more than an activity relegated to the home and family living space along with child rearing and sewing, which were historically considered “women’s work.”
Now think about your own kitchen. How have you grouped items in your kitchen? Silverware is usually kept together in a single drawer, often separated by type with a silverware organizer. Pots may be stored in the same cabinet, baking items on the same shelf, and coffee next to or very near to your coffee maker. Items you use frequently may be in more accessible areas (top drawers, lower cabinets, and shelves) than items you use infrequently, which end up on high shelves or pushed to the back of cabinets. Containers in your freezer or fridge may be labeled, or tagged, with dates and names of items, in the same format. If you have a roommate or roommates, things could be labeled in a different format depending on who put them away: item name, item name and date, date only, and so on. Or, worse, maybe you just have a collection of unlabeled mystery containers and wish you had taken the time to label things when you put them away. And maybe your kitchen isn’t organized at all, and every time you want to cook anything that involves the stove you spend some time searching for a specific item you need for the job. In a disorganized or unorganized kitchen, finding items is often difficult.
Finally, let’s turn to an example tied to technology. Let’s say you have an online collection of recipes and you want to figure out, without going through them individually, which ones are similar. For example, which of these recipes are vegetarian or contain ingredients in common? We can use computational classification to analyze these recipes, mine them for terms and combinations of terms (ingredients in this case), and cluster them based on the similarity of their term distributions. If the algorithm doesn’t start with a set of recipe categories, its approach is called “unsupervised” machine learning. In contrast, you might have created a set of recipe categories and used an algorithm that sorts recipes into them. Or you could run the document analysis and then use a service like Mechanical Turk to have people read through the sorted documents and apply metadata. Both of these latter approaches are examples of “supervised” learning.
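To make the recipe example concrete, here is a small Python sketch of both approaches. It is only an illustration under stated assumptions: the four recipes and their ingredient lists are invented, cosine similarity over ingredient counts stands in for "similarity of term distributions," and the threshold-based grouping and nearest-example labeling are simple stand-ins for real clustering and classification algorithms. The function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical recipes: each is reduced to a bag of ingredient terms.
RECIPES = {
    "lentil soup": "lentils onion carrot garlic cumin",
    "dal": "lentils onion garlic cumin turmeric",
    "roast chicken": "chicken butter garlic thyme lemon",
    "chicken piccata": "chicken butter lemon capers flour",
}


def vectorize(text):
    """Turn a recipe's text into a term-count vector."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(recipes, threshold=0.5):
    """Unsupervised: group recipes with no categories given in advance.

    A recipe joins (and merges) every existing group that contains at
    least one recipe whose similarity to it meets the threshold.
    """
    vectors = {name: vectorize(text) for name, text in recipes.items()}
    clusters = []
    for name, vec in vectors.items():
        linked = [c for c in clusters
                  if any(cosine(vec, vectors[m]) >= threshold for m in c)]
        merged = {name}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


def classify(text, labeled_examples):
    """Supervised: assign the category of the most similar labeled example."""
    vec = vectorize(text)
    best = max(labeled_examples,
               key=lambda example: cosine(vec, vectorize(example)))
    return labeled_examples[best]


groups = cluster(RECIPES)

# Categories chosen by a person ahead of time (hypothetical labels).
LABELED = {
    "lentils onion carrot garlic cumin": "vegetarian",
    "chicken butter garlic thyme lemon": "poultry",
}
label = classify("chicken lemon rice", LABELED)
```

The difference between the two functions mirrors the distinction in the text: `cluster` discovers groups from the data alone, while `classify` sorts a new recipe into categories that someone defined beforehand.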
What does all this talk of kitchens have to do with classification? Everything. Each of the above examples includes concepts that we will discuss in this chapter, both in summary and in depth within its topics, from search, browsing, and retrieval to rules for arrangement and tagging.
Classification is, in a sense, applied categorization, but while categories are equivalence classes (sets of materials, things, and processes that we treat as the same), a classification is a system of categories, called classes, ordered using a predetermined set of principles. The terms “classification” and “categorization” are often used interchangeably, but they are not the same. Having a set of categories is not sufficient to create a classification. A classification must be principled so that we know where to place new items and entities in accordance with our system.
We apply principles to a set of instances or entities (concepts, objects, tasks, activities, etc.) within a domain in an attempt to sort and group the things that go together. In the kitchen examples, principles dictate how and why kitchen items are put in certain places or arranged in certain ways. But as we saw in Chapter 3, there are many different kinds of properties that can be used to describe things, and organizing principles can be based on any of them, not just those that are intrinsic or inherent. For example, credit bureaus classify borrowers by analyzing their purchasing and repayment history and assigning credit scores; insurance companies use accident and citation records to classify drivers and compute rate quotes.
The fundamental purpose of a classification is to help us make sense of relationships between concepts or objects within a domain or a set. A classification provides a reference model or a “semantic road map” to these concepts or objects within a domain, improving learning and communication.
When we talk about a classification being “principled,” we use three key terms: lawful, systematic, and arbitrary. A classification is lawful because it follows a set of defined principles that determine the structure of categories and relationships; it’s systematic because the principles must be followed; and since it is designed by people or machines with a specific perspective for a specific purpose, a classification scheme is also arbitrary.
Classifications have perspectives and purposes, whether they are task-oriented or, at a higher level, serve individual, cultural, or institutional purposes. A classification may be as structured and widely used as the Dewey Decimal System in libraries, or as individual as a personal system for labeling genres in one’s music collection. Decisions must be made about what characteristics will be used to define the classes, and it is in these decisions that perspectives emerge. For example, a superordinate, or parent, class of “beer” could be divided into its first subdivision in a number of ways: color (light, dark), yeast (wild, lager, ale), style (lager, stout, porter), etc. The characteristics we choose for classification determine the shape of the ontology tree: the number of branches, sub-branches, and leaves at the endpoints. A different choice about classes and arrangements of those classes will create a very different ontology.
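The effect of choosing one characteristic before another can be shown with a short Python sketch. The four beers and their attribute values are hypothetical placeholders, and `build_tree` is an illustrative helper rather than an established method; the point is only that the same instances produce differently shaped trees depending on the order of the characteristics.

```python
from collections import defaultdict

# Hypothetical instances described by the characteristics named in the text.
BEERS = [
    {"name": "Beer A", "color": "dark", "yeast": "ale", "style": "stout"},
    {"name": "Beer B", "color": "dark", "yeast": "ale", "style": "porter"},
    {"name": "Beer C", "color": "light", "yeast": "lager", "style": "lager"},
    {"name": "Beer D", "color": "dark", "yeast": "lager", "style": "lager"},
]


def build_tree(items, characteristics):
    """Nest items under each characteristic in turn.

    The first characteristic defines the first subdivision, the second
    defines the sub-branches, and so on; leaves are sorted lists of names.
    """
    if not characteristics:
        return sorted(item["name"] for item in items)
    first, rest = characteristics[0], characteristics[1:]
    groups = defaultdict(list)
    for item in items:
        groups[item[first]].append(item)
    return {value: build_tree(groups[value], rest) for value in sorted(groups)}


# Same beers, two different first subdivisions, two different trees.
by_color = build_tree(BEERS, ["color", "yeast"])
by_yeast = build_tree(BEERS, ["yeast", "color"])
```

Dividing by color first yields a "dark" branch with two sub-branches and a "light" branch with one; dividing by yeast first puts the branching under "lager" instead.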
We classify largely to find things more easily later. That is, classification is as much for the organization of information as it is for retrieval. The development of the Dewey Decimal System provided multiple libraries a single standard of organization so that things could be found the same way in different libraries. In short, classification standards for books allow people to learn only one system to find books in many libraries. Without such a standard, we would need to learn a new system for every library or, worse, for every subject or type of publication.
Further, every system makes distinctions, either implicitly or explicitly, between “standard” and “nonstandard” ways of understanding things. These are often accompanied by the value judgments of “good” and “bad,” respectively. The politics of classification often show themselves in the labels or descriptors used to identify the class or its characteristics. For example, in the United States, people who have given up their job searches are not classified as “unemployed” even though in the literal sense of the word that is what they are. Here, the government has made a conscious decision to define the term in such a way as to exclude a group of entities from a class because it lowers the unemployment rate. How things are assigned to classes within a classification can even be politically motivated, as we’ll see with the example of NASA’s risk classification for space flight. A group of people may resist classifying an item in a certain way because of the implications or actions required by placing an item in that class.
An even more striking example can be found in the ethnic classifications of the United States Census and the classes in which census respondents have been forced to place themselves. We’ll discuss this in greater detail below.
Second, decisions must also be made about what to classify. That is, do we design our classification system based on characteristics of a given set of items, or do we design it from a philosophical standpoint with universal classes intended to classify all knowledge? A justification for our order and selection of classes is known as warrant, and it takes several forms. In the case of the Library of Congress Classification (LCC), the collection of books within the Library of Congress is the literary warrant. The taxonomic classification system used to classify all living organisms relies on scientific warrant.
Third, we must decide if we want the classification to be flexible based on new information. Let’s return to the census example. The terms for ethnic background have changed dramatically over the years, as census administrators adjust classifications to align with changes in what is and is not culturally acceptable. The census benefits from a 10-year delay between surveys and has the ability to adjust these classifications according to new information about the cultural climate regarding race identification (see Figure 6.1).
Figure 6.1: US Census Race Classification Changes