«Chapter 6. Classification Chapter author: Jess Hemerly jhemerly Table of Contents 6.1 Overview ...»
In more recent iterations of the census, we have seen increasingly fine granularity in classifying Asian populations. The term “Hispanic” has also changed over the years, and in the 2010 census actually functioned to exclude a number of nationalities from the descriptor.
Politically, we see tension between the terms “Hispanic” and “Latino.”
Finally, we must decide how we want to build our classification. If we want to use a top-down hierarchical structure, with superordinate and subordinate classes nested in a structure, our classification will be enumerative; that is, it will list all possible entities and the relationships between them according to literary or subject-specific warrant. It can be highly enumerative or less so, but either way it will be a top-down structure consisting of mutually exclusive classes.
Enumerative classifications are limited by their nested structure in expressing relationships: items are collocated by a hierarchical order of classes and subclasses, and every entity belongs to only one subdivision.
Faceted classifications are an alternative to the strict hierarchies of enumerative systems and, as mentioned above, are especially useful in web user interfaces for things like online shopping, where users may want to consider multi-dimensional characteristics and where it is unreasonable to assume a strict hierarchical ordering of the dimensions. For example, if we have a collection of shirts in various styles, colors, brands, and prices, it makes sense to sort them using these dimensions in any order. Items are grouped orthogonally—that is, they are mutually independent—according to certain characteristics and users can select these characteristics to narrow down a selection of entities that meet a number of criteria. A faceted classification essentially enumerates all possible classes into which a set of concepts or entities can be sorted, but those classes only exist if there is a need to sort entities by a specific characteristic. A faceted classification is like a controlled vocabulary of “concepts and their associated labels that can be used, in association with a notation and a prescribed citation order, to synthesize the classes that will populate the classification scheme” (Jacob, 2004, p. 525).
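The shirt example can be sketched in a few lines of Python. This is a minimal illustration, not a description of how any particular online store works: the catalog, the facet names, and the helper functions `filter_by_facets` and `facet_counts` are all hypothetical. It shows that selecting facet values in a different order yields the same set of items, which is what it means for facets to be orthogonal.

```python
from typing import Any

# Hypothetical catalog: each shirt is described by independent facets.
SHIRTS = [
    {"id": 1, "style": "polo", "color": "blue", "brand": "Acme", "price": 25},
    {"id": 2, "style": "oxford", "color": "white", "brand": "Acme", "price": 40},
    {"id": 3, "style": "polo", "color": "white", "brand": "Birch", "price": 30},
    {"id": 4, "style": "tee", "color": "blue", "brand": "Birch", "price": 15},
]


def filter_by_facets(items: list[dict[str, Any]], **selected: Any) -> list[dict[str, Any]]:
    """Keep only the items whose values match every selected facet."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items: list[dict[str, Any]], facet: str) -> dict[Any, int]:
    """Count how many items carry each value of one facet."""
    counts: dict[Any, int] = {}
    for item in items:
        counts[item[facet]] = counts.get(item[facet], 0) + 1
    return counts


# Facets can be applied in either order and narrow to the same selection.
color_then_style = filter_by_facets(filter_by_facets(SHIRTS, color="white"), style="polo")
style_then_color = filter_by_facets(filter_by_facets(SHIRTS, style="polo"), color="white")
```

A class such as "white polo shirts" is never stored anywhere in this sketch. It comes into existence only when a user combines the two facet values, which mirrors the point that faceted classes exist only when there is a need to sort by a specific characteristic.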
The history of faceted classification is rooted in the colon classification theory of S.R. Ranganathan, an Indian mathematician working as a librarian. Ranganathan sought to organize all the world’s ideas for the purpose of library cataloging using a single classification and notation. He established a set of five and only five facets applied to all knowledge: Personality, the type of thing; Matter, the constitutional matter of the thing; Energy, the action or activity of the thing; Space, where the thing occurs; and Time, when the thing occurs. In the notation, the facets are separated by colons, with values representing different characteristics drawn from a table and shown in the P:M:E:S:T format (Ranganathan, 1967, pp. 5-7). Today, the types of facets include enumerative (mutually exclusive); Boolean (yes or no); hierarchical or taxonomic (logical containment); and spectrum (a range of numerical values). The criteria for selecting facets include orthogonality, semantic balance, coverage, scalability, concreteness, and normativity. We will cover these in more depth later in this chapter, using examples from current web user interfaces.
Chapter 6: Classification. Last revised: September 17, 2010
So far, we’ve discussed systems of classification established by people or organizations with some institutional authority to create them. But a classification can be a highly personal form of information management too. Think again about your kitchen. You very likely have rules for where things go and why, and this system of organization allows you to more easily find things when you need to use them. Your principled system may be arbitrary, constrained by the size of your apartment or by the space granted to you in a shared cupboard. You may even organize based on activities for which you use things, like baking, snacking, and serving. But here we are talking about management of physical things that can only be in one place at a time.
It is useful to contrast this case with the management of “information things” that can be classified in many “places” at once, like a digital picture, music file, or a digital document. How might we manage and classify bits?
With the rise of social media sites and tools in recent years, a new form of classification has emerged: social and distributed classification, the best-known form of which is tagging. Tagging is generally not a principled practice. Users tend to apply terms to photographs, news clips, or other entities, both textual and multimedia, that help them find and share things with others. Tagging usually falls short of classification due to a lack of vocabulary control and a tendency for users to tag intuitively. Pictures of trees on Flickr can appear tagged with “tree” or “trees” depending on the user’s whim. And, if we remember the vocabulary problem (section 2.5.2), one photographer’s “tree” is another’s “oak.” This disparity in the descriptors people use to categorize similar things makes many systems that depend on tagging for information retrieval little more than tag soup. In an unstructured free-for-all tagging system, tagging is not classification; it is simply categorization or even description.
Thomas Vander Wal coined the term “folksonomy”—combining “folk” and “taxonomy”—to describe a collection of descriptors, often listed by popularity—use frequency—on the home page of a social tagging site such as Delicious. Folksonomies are often displayed in the form of a tag cloud, where the frequency with which the tag is used throughout the site determines the size of the text in the tag cloud. Similar tags are clustered, but folksonomies are not principled; they are emergent, created through bottom-up aggregation of user tags.
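The relationship between use frequency and text size in a tag cloud can be sketched as follows. This is only an illustration under stated assumptions: the function name `tag_cloud_sizes`, the linear scaling, and the 10 to 32 point range are choices made for the example, and real tagging sites may scale sizes quite differently.

```python
from collections import Counter


def tag_cloud_sizes(tags: list[str], min_pt: int = 10, max_pt: int = 32) -> dict[str, int]:
    """Map each tag to a font size scaled linearly by how often it was applied."""
    counts = Counter(tags)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for tag, n in counts.items():
        # If every tag is equally frequent, all tags get the smallest size.
        scale = (n - low) / span if span else 0.0
        sizes[tag] = round(min_pt + scale * (max_pt - min_pt))
    return sizes


# Hypothetical tags applied by users across a photo-sharing site.
applied_tags = ["tree", "tree", "tree", "trees", "oak", "oak", "forest"]
sizes = tag_cloud_sizes(applied_tags)
```

Note that "tree" and "trees" are counted as separate tags here. Nothing in the aggregation reconciles them, which is one reason an emergent folksonomy is not a principled classification.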
Users and communities can generate a set of principles to govern their tagging practices in order to harness distributed and social tagging to develop a useful classification system. Such a system of distributed or personal classification through the use of tagging is a tagsonomy, a principled evolution of the folksonomy. Tagsonomies can overcome the strict limitations of hierarchical classifications and users can adopt conventions to encode hierarchical and derivational relationships. Looking back at the kitchen example in the beginning of this chapter, the way you may (or may not) label items in your fridge is a basic example of a tagsonomy.
Social media systems can also be designed to push users toward tags that align with popular usage, systematically encouraging principles and thus classification. Social media systems can also include functionality that “bundles” tags, essentially building their own classification of user tags in order to enhance information retrieval. We’ll explore this in greater detail later in the chapter.
Classification need not be performed directly by humans. Automatic indexing derives keywords from a document and provides access to all of those words. More complex systems take indexing a step further and build controlled vocabularies based on the keywords in the documents. Building on these controlled vocabularies, automatic classification aims to group similar documents using either a fully automatic clustering method or a predetermined classification scheme and documents already indexed according to that scheme.
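The first step, automatic indexing, can be illustrated with a small inverted index that maps each keyword to the documents containing it. This is a minimal sketch, not a production indexer: the stopword list, the sample documents, and the function name `build_index` are assumptions made for the example, and no stemming or vocabulary control is applied.

```python
import re
from collections import defaultdict

# Illustrative stopword list only; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Derive keywords from each document and record where each one occurs."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                index[word].add(doc_id)
    return dict(index)


# Hypothetical three-document collection.
DOCS = {
    "d1": "The oak is a tree",
    "d2": "Classification of trees in the forest",
    "d3": "An oak forest",
}
index = build_index(DOCS)
```

Because the index is built from the words exactly as they appear, "tree" and "trees" end up as separate entries. Building a controlled vocabulary on top of such keywords is the step that would bring them together.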
Clustering allows us to perform automatic classification based on predetermined rules and guidelines that the machine will execute during document analysis. Computer scientist C.J. Van Rijsbergen’s cluster hypothesis states that “closely associated documents tend to be relevant to the same requests” (Van Rijsbergen, 1979, p. 30).
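To make the idea of rule-driven clustering concrete, the sketch below groups documents whose keyword sets overlap strongly. It is an illustration only and is not Van Rijsbergen's own method: the Jaccard similarity measure, the 0.3 threshold, the single-link grouping rule, and the sample keyword sets are all assumptions chosen for the example.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of keywords two documents have in common, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs: dict[str, set[str]], threshold: float = 0.3) -> list[set[str]]:
    """Single-link grouping: a document joins any cluster holding a similar document."""
    clusters: list[set[str]] = []
    for doc_id, keywords in docs.items():
        # Find every existing cluster that contains a sufficiently similar document.
        linked = [
            c for c in clusters
            if any(jaccard(keywords, docs[other]) >= threshold for other in c)
        ]
        # Merge the new document with all linked clusters into one cluster.
        merged = {doc_id}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


# Hypothetical documents, each already reduced to a set of keywords.
DOCS = {
    "d1": {"oak", "tree", "leaf"},
    "d2": {"oak", "tree", "forest"},
    "d3": {"shirt", "color", "brand"},
    "d4": {"shirt", "price", "brand"},
}
groups = cluster(DOCS)
```

The predetermined rule here is the similarity threshold. Raising it produces more, smaller clusters, and lowering it merges documents that share only a few keywords.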
Classes can be structured in one of two ways. First, a class can be intellectually formulated. That is, it’s structured through manual assignment, as in library classification, or automatic assignment, as in the Library of Congress’s CHESHIRE for which UC Berkeley professor Ray Larson developed entry vocabulary modules for clustering classification. Second, a class can be derived automatically from a collection of things in one of three ways: hierarchic clustering, agglomerative clustering, and hybrid methods, like query clustering.
For the purposes of automatic classification, data consists of objects described by one of four types of descriptors. These descriptors are similar in dimension to Ranganathan’s facets: multistate attributes (e.g., color); binary-state attributes (e.g., keywords); numerical attributes (e.g., a hardness scale or weighted keywords); and, when the objects are themselves classes, probability distributions.
To summarize, while classification and categorization are closely linked, they are not synonymous. Classification is, in a sense, the formalization and implementation of categorization. A classification can be hierarchical, faceted, social, or automatic, but in order for it to truly be a classification, there must be predetermined principles that serve as authority control for the organization of entities.
The following sections of the chapter will dive into each of these highlighted areas in more detail, including examples and applications of ontologies, faceted classifications, tagsonomies, and computational classifications.
6.2 Classification Theory
6.2.1 What is Classification?
As Louise Gruenberg wrote in “Faceted Classification, Facet Analysis, and the Web”:
“Classification is a higher order thinking skill requiring the fusion of the naturalist’s eye for relationships…with the logician’s desire for structured order…the mathematician’s compulsion to achieve consistent, predictable results…and the linguist’s interest in explicit and tacit expressions of meaning” (Gruenberg, 2002, para. 1).
As mentioned in section 6.1, a classification is a system of categories, called classes, ordered using a predetermined set of principles. The act of placing items into these classes is classification, and it can be performed by people or, thanks to advances in fields like natural language processing, data mining, and the semantic web, by machines. Classifications can be applied to a narrow set of concepts or entities, like kitchen supplies or beer, or to much broader sets of concepts and entities, such as Aristotle’s attempt to classify all beings or Dewey’s system to classify all knowledge for the purpose of finding it in a library. Classifications can also be applied to documents and data as well as to concepts and actions; that is, an entity’s placement in a class may require that a certain action be taken.
6.2.2 Purpose of a Classification
A classification serves as a reference model (a semantic roadmap) to individual domains and the relationships therein. This roadmap then enables us to better understand concepts and relationships between entities in a given domain. It also allows us to organize things in a way that will make them easier to locate. A classification in a home kitchen allows us to find what we need when cooking or baking quickly and easily. The way things are classified in a department store helps us to find specific domains of objects among many. A specialty store’s classification helps us to find specific objects within subclasses among many others. And the classification used in an online store allows Internet shoppers to locate and narrow down sets of matching items.
The four different forms of kitchen-related classification all relate to a single, specific domain of objects, but classifications can also be designed to classify all knowledge. In 1873, Melvil Dewey invented the Dewey Decimal Classification (DDC) as a scheme for classifying works in a general collection containing diverse subjects—essentially, collections of general knowledge. The first edition of DDC appeared in print in 1876, and it is currently the most widely used library classification scheme in the world’s public libraries, modified on a regular basis to “continually keep up with recorded knowledge” (OCLC, p. 1).
In contrast, Herbert Putnam created the Library of Congress Classification (LCC) in 1897. It was meant not to catalog all the world’s knowledge but to provide a practical way to organize and later locate items within the Library of Congress’s collection. It has since been adopted by research and academic libraries, particularly in the United States, but most public and smaller libraries tend toward the Dewey Decimal Classification (DDC). Subject divisions in LCC are broad; for example, A contains “General Works,” M “Music,” and K “Law.”
Contrast the widely used DDC and LCC systems with a collection of United States government documents distributed through the Federal Depository Library Program. New York University’s Bobst Library is a “selective depository,” receiving 55% of all documents distributed from the government to participating libraries. These materials come in various forms: pamphlets, books, booklets, newsletters, CD-ROMs, and microfilm. Because the content is highly specialized, and because new subjects and pieces are added to the existing collection regularly, these documents, housed on the sixth floor of the NYU library, are organized by their own classification and hand-numbered as they come into the library. The documents are included in the library’s general catalog, BobCat, but users visiting the sixth floor are met with special indexes and librarians who know the collection quite well.
On the other hand, searching for information online with search engines like Google has drastically changed the way we expect to see search results returned. Keyword search returns links whose relevance (remember the concepts of recall and precision) depends on the algorithm powering the search engine. But search results are returned as links to the actual digital documents, not numbers that point to locations on shelves. Furthermore, unlike books on library shelves, documents online can exist in many different places. Because users are becoming so used to the methods of online searching, some believe that systems like DDC and LCC are in danger of being abandoned for less rigorous Google-like organization of libraries using a classification called BISAC.