«Chapter 6. Classification Chapter author: Jess Hemerly jhemerly Table of Contents 6.1 Overview ...»
As we learned in Chapter 2 (“Identity and Identification”) and Chapter 3 (“Describing Instances”), people use different names for the same things and the same names for different things. So, too, do people apply different tags, and thus tagging can be distracting and deficient in information retrieval. Things may be tagged insufficiently or even with terms that don’t actually describe anything about the item at all. Thus, tagging suffers from the vocabulary problem.
Tagging is rather subjective, as it’s usually more of a quick and dirty task than a structured one. However, social media sites are beginning to build in mechanisms that add some structure to the tagging task. For example, on the social networking site Facebook, users can indicate that a specific person is in an uploaded picture by clicking on the faces of people in photographs, typing the person’s name, and then selecting the person from a list of Facebook friends. Because the system offers the user his or her full list of friends to choose from, the names are formatted the way they appear on a user’s profile, thereby creating a structured way to identify, describe, and connect people to photographs.
6.4.2 Folksonomy versus Tagsonomy
Tagging by nature is an organizational free-for-all because it’s a highly subjective practice and is frequently done on the fly. A personal photo collection on Flickr may have pictures of trees tagged with the terms “woods,” “trees,” “forest,” and “forests,” or just one of these terms, or any combination of them. When retrieving photos from the collection, the variation in tags may make it difficult to find all pictures of trees. Without some rules, tagging is hardly a classification and more closely resembles categorization. This is what Thomas Vander Wal first referred to as “folksonomy” in 2004. It is simply a collection of tags users assign as descriptors, not a principled set of classes. But if a user were to choose one and only one term to use for all pictures of trees, as well as to decide whether the plural or singular form would be used, we’d see the beginning of a tagsonomy. We can thus overcome the vocabulary problem in tagging by creating a controlled vocabulary for the tags that will be used for various descriptive purposes.
Because tags are so much like facets, a tagsonomy can classify examples along multiple dimensions. We can tag with the location where the photo was taken, the event where it was taken either by general name or our own personal label—HawaiiVacation2010, for example—or by its public name, like SXSW. But just like categorization isn’t classification without the creation and application of guiding principles, tagging isn’t a tagsonomy without rules to dictate the dimensions along which tagging occurs, the granularity with which things are tagged, and the naming conventions that help form a controlled vocabulary for a set of entities. Will we use all plural, all singular, or a mix of plural and singular forms of nouns? And will we include spaces
between multiple-word tags if the site allows them? Making these decisions and applying them in the tagging process constitutes the difference between tags and a tagsonomy.
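The rules described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the names `CONTROLLED_VOCABULARY`, `normalize_tag`, and `normalize_tags` are hypothetical, and the particular conventions chosen (lowercase, no spaces, the singular term "tree" as the preferred tag) are assumptions made for the example.

```python
# Hypothetical controlled vocabulary: every variant maps to one preferred tag.
CONTROLLED_VOCABULARY = {
    "woods": "tree",
    "trees": "tree",
    "forest": "tree",
    "forests": "tree",
}


def normalize_tag(raw_tag):
    """Apply the naming conventions, then look up the preferred term."""
    # Assumed conventions: tags are lowercase and contain no spaces.
    cleaned = "".join(raw_tag.lower().split())
    # Variants collapse to the preferred term; unknown tags pass through.
    return CONTROLLED_VOCABULARY.get(cleaned, cleaned)


def normalize_tags(raw_tags):
    """Normalize each tag, dropping duplicates while keeping first-seen order."""
    normalized = []
    for raw_tag in raw_tags:
        tag = normalize_tag(raw_tag)
        if tag not in normalized:
            normalized.append(tag)
    return normalized


print(normalize_tags(["Woods", "forests", "Hawaii Vacation 2010"]))
```

Applied to every photo as it is uploaded, a function like this turns free-form tags into a small, predictable set of terms, which is the step that separates a tagsonomy from a loose collection of tags.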
6.4.3 Tagsonomies and Personal Information Management
As we mentioned in 6.4.1, a tagsonomy can be useful in both information organization and retrieval. But tagsonomies play an especially useful role in organizing personal collections of entities and objects, or even tasks and activities. A tagsonomy allows users to classify new entities as they are added, assigning them to classes based on a principled system of tagging.
This reflects the main IO/IR tradeoff between the up-front costs of information organization and the long-term benefits for information retrieval: a tagsonomy that’s consistently applied to tagging all of one’s music or one’s photos on Flickr takes effort at the start but makes it much easier to find things later.
Let’s return to Flickr as our example here. There are two main levels of metadata at Flickr. First, as files are uploaded, machine-generated metadata like exposure and date taken is added automatically. Second, users are able to add tags to their photographs that allow them to describe the subject, the context, and more. Flickr displays these tags on the photo page and automatically generates an index of all tags a user has applied to uploaded photographs. For people, Flickr allows a user to tag the photo with the name of another Flickr user if that person appears in a photo, but since not everyone uses Flickr, a user may want to create another system to keep track of who appears in what photographs. This can happen on multiple levels. First, a user could use broad tags of “family” and “friend” to group more generally. A user could then tag family photos with specific identifiers like “mom,” “dad,” “cousin,” and “brother” and friend photos with tags like “best friend” and “girlfriend.” In order to maximize one’s ability to find all pictures of specific people within a collection, a user could come up with a system for tagging individual names, such as a short form like “joeb” or a full name. Each level of granularity defines a different class or subclass, and the principles that dictate how we tag are what enable us to find things more easily later on.
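To make the idea of a tag index concrete, here is a small sketch of how tags at different levels of granularity could support retrieval. It does not describe how Flickr works internally; the photo file names, the tags, and the functions `build_tag_index` and `find_photos` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical personal collection: photo name -> tags applied by the user.
PHOTOS = {
    "img_001.jpg": {"family", "mom", "hawaiivacation2010"},
    "img_002.jpg": {"family", "brother"},
    "img_003.jpg": {"friend", "joeb"},
}


def build_tag_index(photos):
    """Build an index from each tag to the set of photos that carry it."""
    index = defaultdict(set)
    for photo, tags in photos.items():
        for tag in tags:
            index[tag].add(photo)
    return index


def find_photos(index, *tags):
    """Return the photos that carry every one of the given tags."""
    if not tags:
        return set()
    matches = set(index.get(tags[0], set()))
    for tag in tags[1:]:
        matches &= index.get(tag, set())
    return matches


index = build_tag_index(PHOTOS)
print(sorted(find_photos(index, "family")))         # broad class
print(sorted(find_photos(index, "family", "mom")))  # narrower subclass
```

Searching on "family" alone returns the broad class, while adding "mom" narrows the result to a subclass. That narrowing works only if the tags were applied consistently in the first place.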
Task-based tagging also aids personal information organization, allowing us to build relationships between activities and domains. We can call this a taxonomy of tasks, or a taskonomy. Where a taxonomy organizes entities based on similarity of content or composition, a taskonomy organizes based on “activity structure” (Dougherty and Keller, 1982, pp. 763-774).
Some people tend to organize their work areas in externally objective ways, such as by subject or topic (think of books going from a library shelf to a desk). They may then shift to more task-oriented organization as they complete a task, like writing a paper or creating a class presentation. Entities are no longer piled together objectively but are grouped according to resources necessary for a given task. These entities may retain some semblance of the objective organization, but the intention has shifted to something highly subjective: getting work done.
When a task is completed—a paper has been finished and turned in—the entities are then reorganized according to the original externally objective plan—i.e., books returned to their spots on library shelves. Of course, some people never organize at all, and everything ends up in the purgatory that is a “potential project” pile.
Taskonomies are interesting to think about when the work being done is not knowledge work but a skilled trade. Think about how you organize your own kitchen. Many people keep baking sheets and other items for oven use in a drawer under the oven, with pots and pans for stovetop use hung near the stove. This simple arrangement is a common example of a taskonomy. A cook’s taskonomy might look something like the one in Figure 6.2.
Figure 6.2: A Cook’s Taskonomy
Looking at the relationship between tasks and tools in this way can help a cook determine the best way to organize tools in a kitchen. Cutting items would necessarily be kept together near a prep area; having to run across the kitchen to another area where a poultry knife is kept with, say, chicken broth would be detrimental to the cook’s workflow. It would make far more sense to have all of the items for the task of cutting in a single area.
At an even more specific task-based level, think about the way you might prepare to make dinner. If following a recipe, many cooks like to pull all ingredients from their storage places and keep them close by in the prep area. This is similar to the idea of an activity-based “pile” mentioned above. After the meal has been prepared, items are returned to their original places, or “filed.” This piling and filing is an effective way to arrange items for a task at hand.
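The piling and filing described above can be sketched as a simple mapping from tasks to tools. Figure 6.2 is not reproduced here, so the tasks and tools in `COOK_TASKONOMY` are illustrative assumptions, as are the function names `pile_for` and `file_away`.

```python
# Hypothetical taskonomy: each task maps to the tools it requires.
COOK_TASKONOMY = {
    "cutting": ["chef's knife", "poultry knife", "cutting board"],
    "baking": ["baking sheet", "oven mitts"],
    "stovetop cooking": ["saucepan", "frying pan"],
}


def pile_for(tasks):
    """'Pile': gather the tools for the given tasks, without duplicates."""
    pile = []
    for task in tasks:
        for tool in COOK_TASKONOMY.get(task, []):
            if tool not in pile:
                pile.append(tool)
    return pile


def file_away(tool):
    """'File': return the task area where a tool is stored, or None."""
    for task, tools in COOK_TASKONOMY.items():
        if tool in tools:
            return task
    return None


print(pile_for(["cutting", "baking"]))
print(file_away("saucepan"))
```

Storing each tool under exactly one task keeps the example small. A real kitchen would need a rule for tools that serve several tasks.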
As with tagging content for metadata purposes, tagging tasks is a helpful way to build structure into something as simple as a to-do list. Items that are necessary for a given task can be tagged with a predetermined tag for that project or task in order to better organize all of the related items. A legal secretary could organize documents for an upcoming hearing by applying tags developed through a taskonomy so that all of the requisite electronic documents are easier to find. For collaborative work, assigning tags to tasks allows all collaborators to get a high-level view of the work to be done and who is best suited to perform a task. A taskonomy is also a useful way to achieve a high-level summary of what people do with certain items for user research and can then lead to more efficiency in design. Taskonomies, then, are excellent tools for helping to understand how things are done.
‐ 19 ‐ Chapter 6: Classification Last revised: September 17, 2010
6.5 Computational Classification
6.5.1 What is Computational Classification?
As we’ve seen, people can usually assign things to existing categories or create a new system of categories to design a classification. Knowledge experts have historically performed the task of classification, and these knowledge experts developed the major classifications we still use today, including Library of Congress Classification and Dewey Decimal. Even the scientific taxonomy was developed and refined by knowledge experts over time.
But it can be too costly in terms of time or effort to perform this manual assignment, especially when approaching a new domain or set of specialized documents. The cost increases when a set of documents changes or grows regularly. And when the value of the classification depends on it being done in a timely manner, such as filtering of news or email messages or clustering of search results, manual classification is impractical, because the results are useful only if things are classified immediately.
Sometimes computational classification is fully automated, performed entirely by machines. Other times, people (sometimes data scientists, sometimes ordinary people working through services like Amazon’s Mechanical Turk) assist the machines by refining results or preorganizing. Text analysis programs can index documents to help determine their similarity and, thus, what documents belong in a set. Banks automatically classify us when determining credit risk based on our credit scores.
Advances in natural language processing—where machines and computers can use human language instead of only machine language as inputs and outputs—coupled with the expansion of the field of data science have allowed us to empower machines with the capability to classify objects, entities, and, specifically, documents through text classification. In the case of library science, for example, assigning the terms from a controlled vocabulary like Library of Congress subject headings is text classification because each term can be thought of as a category. But this work can be aided with automatic text classification, allowing new entities to be matched to the appropriate headings based on analysis of the document text or document metadata. This doesn’t remove the librarian entirely; it simply aids the librarian in the work of classifying new materials, especially digital ones.
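One simple way to match a document to candidate headings is to compare word counts. The sketch below uses cosine similarity over term counts, which is a common technique but is an assumption here, since the text does not name a method. The entries in `HEADINGS` are hypothetical stand-ins, not real Library of Congress records, and the function names are invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical headings, each described by a few representative terms.
HEADINGS = {
    "Cooking": "cooking recipes kitchen food baking",
    "Photography": "photography camera photographs exposure lens",
}


def term_counts(text):
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_headings(document_text):
    """Return (heading, score) pairs, best match first."""
    doc = term_counts(document_text)
    scores = [
        (heading, cosine_similarity(doc, term_counts(description)))
        for heading, description in HEADINGS.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for heading, score in rank_headings("A guide to camera exposure for outdoor photographs"):
    print(heading, round(score, 2))
```

Returning a ranked list rather than a single answer fits the workflow described above: the program suggests headings and the librarian makes the final assignment.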
Of course, text classification is not perfect, but we don’t compare automated approaches to “perfect” classification, only to that which can be done by people. Because text classification processes can be applied to an incredible variety of domains, and because of the increasing number of documents in digital form, text classification is a growing and important field.
6.5.2 Machine Learning
Machine learning is a process by which a computer, usually through the use of a complex algorithm, builds a text classification from a set of documents by “learning” the general categories that groups of documents share in common. Machine learning happens in two major ways: supervised and unsupervised. With supervised machine learning, we give the machine the categories we expect a set of items to fit into and the machine learns to give us the output we desire based on the input we provide.
A familiar example of supervised machine learning is filtering, a form of automated text classification. Text classification assumes a system of categories and labeled instances so that we can train a system to assign new entities or occurrences to the appropriate classes. Take, for example, your email inbox. At the simplest level, incoming messages are classified by your mail server or program as SPAM or NOT SPAM. Those messages which are SPAM are filtered to a SPAM folder, while those that are NOT SPAM head to your inbox. The spam filter looks for different characteristics within emails, such as nonsensical phrases, odd URLs and email addresses in the sender field, or key terms like “pharmaceutical” or “beneficiary.” The machine performs these tasks without any human help. Sometimes, however, things we want to receive end up in the spam folder and we have to go and look for them there. We may miss an important message because the computer mistook it for spam. We then mark the message NOT SPAM, teaching the computer that messages like this one are meant for the inbox, not the spam folder. Likewise, spam may sneak past the filter and end up in the inbox. We then have to mark the item as spam, teaching the computer that items like this one are meant for the spam folder.
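The feedback loop described above can be sketched with a small naive Bayes classifier. Naive Bayes is an assumption chosen for illustration, since the text does not say which algorithm a mail program uses, and real filters weigh many more signals than words. The class name `SpamFilter` and its methods are hypothetical.

```python
import math
import re
from collections import Counter


class SpamFilter:
    """A toy word-based naive Bayes filter with two labels."""

    LABELS = ("spam", "not_spam")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        """Learn from one labeled message, e.g. when a user marks it."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def classify(self, text):
        """Return the label with the higher log-probability score."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in self.LABELS:
            # Smoothed prior: how common this label is among trained messages.
            score = math.log(
                (self.message_counts[label] + 1) / (total_messages + len(self.LABELS))
            )
            total_words = sum(self.word_counts[label].values())
            # Add-one smoothing; the extra 1 reserves room for unseen words.
            denominator = total_words + len(vocabulary) + 1
            for word in self.tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


spam_filter = SpamFilter()
spam_filter.train("claim your pharmaceutical discount now", "spam")
spam_filter.train("you are the beneficiary of a large sum", "spam")
spam_filter.train("meeting notes for the classification chapter", "not_spam")
spam_filter.train("lunch on friday with the team", "not_spam")
print(spam_filter.classify("pharmaceutical beneficiary discount"))
```

Marking a misfiled message corresponds to calling `train` again with the corrected label. Each correction shifts the word counts, so similar messages are more likely to be filed correctly afterwards.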
We can also further filter the items coming into our inboxes into specific categories depending on things like sender email address or subject line. While you could tag each message once it has already reached your inbox, defining filters saves you the work and allows you to automatically organize your email inbox. Here, we provide the machine with a set of parameters and the machine then does the work of classifying our messages based on the message text.
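User-defined filters of this kind amount to an ordered list of rules. The following sketch assumes a simplified message with only a sender and a subject; the `Message` class, the rules in `FILTER_RULES`, the folder names, and the `example` domain are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str


# Hypothetical rules: (field to inspect, text to look for, destination folder).
FILTER_RULES = [
    ("sender", "@university.example", "School"),
    ("subject", "invoice", "Bills"),
]


def file_message(message, rules=FILTER_RULES, default="Inbox"):
    """Return the folder of the first rule that matches, else the default."""
    for field, needle, folder in rules:
        if needle.lower() in getattr(message, field).lower():
            return folder
    return default


print(file_message(Message("registrar@university.example", "Fall enrollment")))
print(file_message(Message("billing@shop.example", "Your invoice for March")))
print(file_message(Message("friend@mail.example", "Dinner on Saturday?")))
```

Unlike the spam filter, nothing is learned here: the categories and the criteria are both fixed by the user, and the machine only applies them.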
With unsupervised machine learning, the machine receives input but does not receive categories. Instead, the machine finds patterns in the data, and we must then make sense of those patterns and attach meaning to them. The goal is to build representations of the input that can later be turned into a useful and reusable classification. We’ll explore this in greater depth in 6.5.3.
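As a rough illustration of finding patterns without predefined categories, the sketch below groups short documents by word overlap. It uses a simple greedy approach with Jaccard similarity, chosen for brevity and not drawn from the text; the function names, the sample documents, and the threshold of 0.2 are assumptions.

```python
import re


def words(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.2):
    """Greedily group documents; each group is a list of document indices."""
    clusters = []
    for i, document in enumerate(documents):
        doc_words = words(document)
        best_cluster, best_score = None, threshold
        for members in clusters:
            # Compare against the first member as the group's representative.
            score = jaccard(doc_words, words(documents[members[0]]))
            if score >= best_score:
                best_cluster, best_score = members, score
        if best_cluster is None:
            clusters.append([i])  # nothing similar enough: start a new group
        else:
            best_cluster.append(i)
    return clusters


DOCUMENTS = [
    "baking bread in the oven",
    "camera exposure settings",
    "bread baking tips for the oven",
    "exposure and camera lens settings",
]
print(cluster(DOCUMENTS))
```

The result is a set of unnamed groups. Deciding that one group is about baking and another about photography is the sense-making step that remains with people.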