Chapter 6. Classification
Chapter author: Jess Hemerly
Last revised: September 17, 2010
6.5.3 Clustering

A special method of computational classification, clustering aims to identify things that are most alike, whether keywords or documents, and group them together into appropriate classes. Scott Spangler and Jeffrey Kreulen (2008) define clustering as “an algorithmic attempt to automatically group documents into thematic categories” (p. 13). It is a fully automatable method of unsupervised machine learning that works on any collection of text.
Clustering relies on the cluster hypothesis: “closely associated documents tend to be relevant to the same requests” (van Rijsbergen, 1979, p. 30). On Flickr, keyword clustering works hand in hand with social tagging to group the similar tags that users apply to photos. The three most frequently used tags in a cluster, as grouped by the system, serve as the cluster’s name. For example, clusters related to the keyword “lava” include “Hawaii, volcano, ocean,” “Iceland, landscape, nature,” and “Etna, Sicily, Sicilia.” Cluster analysis is an effective way to group similar documents into thematic categories, but it cannot tell you what those documents actually mean. Thus, to build a taxonomy around meaning and not just keyword similarity, people must go through the analysis results and develop a classification that fits the set of documents. Clusters are, however, effective at excluding documents from groupings. In the end, while clustering is a form of unsupervised machine learning, it relies on the data analyst or information scientist to make sense of the clusters and impose a relevant classification scheme.
These examples are known as polythetic clustering: a set of terms defines each cluster.
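Flickr has not published the algorithm behind its tag clusters, but the general idea is easy to sketch. The Python below is an illustrative assumption, not Flickr’s actual method: it greedily groups photos whose tag sets overlap (Jaccard similarity against an assumed threshold) and names each cluster after its three most frequent tags. All names and data are made up.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster_photos(photos, threshold=0.3):
    """Greedily assign each photo's tag list to the most similar
    existing cluster, or start a new cluster if none is close enough."""
    clusters = []  # each cluster: {"members": [...], "tags": Counter}
    for tags in photos:
        tagset = set(tags)
        best, best_sim = None, threshold
        for c in clusters:
            sim = jaccard(tagset, set(c["tags"]))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            best = {"members": [], "tags": Counter()}
            clusters.append(best)
        best["members"].append(tags)
        best["tags"].update(tags)
    # name each cluster by its top three tags, as in "Hawaii, volcano, ocean"
    return [([t for t, _ in c["tags"].most_common(3)], c["members"])
            for c in clusters]

photos = [
    ["lava", "hawaii", "volcano", "ocean"],
    ["lava", "hawaii", "volcano", "ocean"],
    ["lava", "iceland", "landscape", "nature"],
    ["iceland", "landscape", "nature"],
]
```

Running `cluster_photos(photos)` yields two clusters named by their top tags. As the Flickr example shows, the names alone do not say what the clusters mean; that interpretive step is the analyst’s.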
Thinking back to chapter 5, we see a similarity: polythetic categories consist of multiple membership features. A monothetic category, in contrast, is defined by one and only one feature. Thus, another approach to clustering is monothetic clustering, where a single feature defines cluster membership. Mark Sanderson and Bruce Croft found that, when paired with other sources, monothetic clustering can be used to build hierarchical classifications. They used the WordNet ontology to determine hierarchical relationships between terms extracted from item descriptions, as well as hypernym/hyponym relationships signaled by key phrases like “such as,” “and others,” and “is part of.” This estimation of relationships between words and concepts automatically mined from documents is known as subsumption.
Sanderson and Croft (1999, p. 207) used five principles of design for their experiment:
1. Terms for the hierarchy were to be extracted from the documents and had to best reflect the topics covered within them;
2. Their organization would be such that a parent term would refer to a more general concept than its child, in other words, the parent’s concept subsumes the child’s;
3. The child would cover a related subtopic of the parent;
4. Forming a strict hierarchy, where every child had only one parent, was not considered important; therefore, the structure could be more like a directed acyclic graph;
5. And finally, ambiguous terms would be expected to have separate entries in the hierarchy, one for each sense appearing in the documents.
The researchers chose a set of 500 documents and extracted words and phrases from them. They then compared every term to every other term in order to find subsumption relationships. The approximately 200 pairs they identified were then automatically organized into a concept hierarchy (Sanderson and Croft, 1999, p. 209).
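The subsumption test itself is simple to state: term x subsumes term y when the documents containing y are (almost) a subset of the documents containing x; Sanderson and Croft operationalized this as P(x|y) ≥ 0.8 and P(y|x) < 1. The following Python sketch applies that test to a toy collection (the function names and data are illustrative, not from their experiment):

```python
def subsumption_pairs(doc_terms, threshold=0.8):
    """Return (parent, child) pairs where the parent subsumes the child,
    i.e. P(parent | child) >= threshold and P(child | parent) < 1."""
    # invert the collection: term -> set of document ids containing it
    docs_with = {}
    for doc_id, terms in enumerate(doc_terms):
        for t in set(terms):
            docs_with.setdefault(t, set()).add(doc_id)
    pairs = []
    terms = list(docs_with)
    for x in terms:
        for y in terms:
            if x == y:
                continue
            dx, dy = docs_with[x], docs_with[y]
            p_x_given_y = len(dx & dy) / len(dy)
            p_y_given_x = len(dx & dy) / len(dx)
            if p_x_given_y >= threshold and p_y_given_x < 1:
                pairs.append((x, y))  # x is the more general term
    return pairs

docs = [
    ["animal", "dog"],
    ["animal", "cat"],
    ["animal", "dog", "cat"],
    ["animal"],
]
```

Here “animal” subsumes both “dog” and “cat,” because nearly every document mentioning dogs or cats also mentions animals but not vice versa; pairs like these are what get assembled into the concept hierarchy.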
6.5.4 Discriminant Approaches

With discriminant approaches, instead of having the machine do the work of identifying groups and categories, we first impose upon the machine a list of categories and have it match documents or entities to those categories. The use of computational classification with Library of Congress headings in section 6.5 is an example of a discriminant approach. The computer is fed parameters that are then matched against the documents in order to classify them. Here, the knowledge worker’s task is to create the classification, and the computer fits documents into it automatically.
Gracenote’s CDDB and MusicID are examples of discriminant approaches to classification. CDDB contains metadata about millions of CDs and tracks; it matches the lengths of the tracks on a CD against the track lengths in its database to determine which album the CD is. In this case, the list of categories can be thought of as the information in the database, and the track information submitted by the user is the input to be matched. CDDB then fills in the metadata, classifying the tracks as part of an album. You’ll notice that CDDB does not work when you insert a mix CD made by a friend: the combined track lengths match no single album in the database, so there is nothing to classify the tracks against.
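The track-length lookup can be sketched in a few lines of Python. The album names, lengths, and tolerance below are illustrative assumptions, not real CDDB data or Gracenote’s actual matching rule:

```python
def match_album(track_lengths, database, tolerance=2):
    """CDDB-style discriminant matching: a disc matches an album only if
    it has the same number of tracks and every track length lines up
    within a small tolerance (in seconds)."""
    for album, lengths in database.items():
        if len(lengths) == len(track_lengths) and all(
            abs(a - b) <= tolerance for a, b in zip(lengths, track_lengths)
        ):
            return album
    return None  # e.g. a mix CD: no single album fits the length profile

# hypothetical albums with made-up track lengths
database = {
    "Album A": [259, 185, 182],
    "Album B": [545, 574, 337],
}
```

A disc read as `[260, 184, 183]` matches “Album A” within tolerance, while a mix of tracks drawn from both albums matches nothing, which is exactly why CDDB fails on mix CDs.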
But CDDB also contains waveform fingerprint information, which powers Gracenote’s MusicID iPhone application. When you hold your phone up to a speaker through which a song is playing, MusicID analyzes the audio, matches its waveform fingerprint against tracks in the database, and returns possible matches to help you identify the song. MusicID would not work without a pre-populated database of fingerprints to check against. The popular smartphone application Shazam works the same way but uses a different database of fingerprints.
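At the core of such systems is an inverted index from fingerprint hashes to tracks. The sketch below is deliberately a toy: real systems such as MusicID and Shazam hash time-frequency landmarks and weigh timing evidence, and the hash values and track names here are made up.

```python
from collections import Counter

def identify(sample_hashes, fingerprint_db, min_votes=3):
    """Each hash extracted from the noisy sample 'votes' for every track
    that contains it; the track with the most votes wins, provided it
    clears a minimum-evidence threshold."""
    votes = Counter()
    for h in sample_hashes:
        for track in fingerprint_db.get(h, []):
            votes[track] += 1
    if votes:
        track, count = votes.most_common(1)[0]
        if count >= min_votes:
            return track
    return None  # not enough evidence to name a track

# hypothetical inverted index: hash value -> tracks containing it
fingerprint_db = {
    0xA1: ["song-1"], 0xB2: ["song-1", "song-2"], 0xC3: ["song-1"],
    0xD4: ["song-2"], 0xE5: ["song-2"],
}
```

This also shows why a pre-populated database is indispensable: with no indexed fingerprints to vote for, every query comes back empty.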
Music recommendation systems work in a similar fashion, and each service has its own algorithm for matching a user’s listening habits to categories built from the listening habits of others. Last.fm combines listening data with community tags to build stations around given artists or tags. Users can mark tracks as loved, or click a button to ban a track from ever being played for them again. In this way, the music is classified and the set of documents, in this case songs, is pared down to fit a user’s preferences. This is not based on acoustic fingerprinting; instead, recommendation relies on analyzing and comparing the tags users apply to music on Last.fm.
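Last.fm’s actual algorithm is proprietary and also weighs listening data, but a pure tag-overlap recommender that respects loved and banned tracks can be sketched as follows (all names and data are illustrative assumptions):

```python
from collections import Counter

def recommend(user_loved, user_banned, catalog, top_n=3):
    """Build a tag profile from the user's loved tracks, then score every
    unheard, unbanned track in the catalog by how many profile tags it
    shares, returning the best matches."""
    profile = Counter()
    for track in user_loved:
        profile.update(catalog[track])
    scored = []
    for track, tags in catalog.items():
        if track in user_loved or track in user_banned:
            continue  # banned tracks never come back; loved ones are known
        score = sum(profile[t] for t in tags)
        if score > 0:
            scored.append((score, track))
    scored.sort(reverse=True)
    return [track for _, track in scored[:top_n]]

# hypothetical catalog: track -> community tags
catalog = {
    "track-a": ["shoegaze", "dream pop"],
    "track-b": ["shoegaze", "noise"],
    "track-c": ["country"],
    "track-d": ["dream pop", "ambient"],
}
```

A user who loves “track-a” and has banned “track-b” is recommended “track-d” (it shares the “dream pop” tag), while the banned track and the tag-disjoint “track-c” are pared away, just as the passage describes.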
• Batley, Sue. (2005). Classification in Theory and Practice. Oxford, UK: Chandos Publishing.
• Dougherty, Janet W.D. and Charles M. Keller. (1982). Taskonomy: A Practical Approach to Knowledge Structures. American Ethnologist, 9(4), pp. 763-774.
• Getty Vocabulary Program. (1988). Art & Architecture Thesaurus (AAT). Los Angeles: J. Paul Getty Trust, Vocabulary Program. (http://www.getty.edu/research/conducting_research/vocabularies/aat/about.html)
• Gruenberg, Louise. Faceted Classification for the Web…
• Jacob, Elin. (2004). Classification and Categorization: A Difference that Makes a Difference. Library Trends, 52(3), pp. 515-540.
• Lavallee, Andrew (2007). Discord Over Dewey. Wall Street Journal Online, July 20, 2007.
• Murphy, J. (2003, July 22). NASA Team Dismissed Foam Strike. CBS News. Retrieved from http://www.cbsnews.com/stories/2003/07/10/tech/main562542.shtml
• Nunberg, G. (2009). Google's Book Search: A Disaster for Scholars. The Chronicle of Higher Education. Retrieved from http://chronicle.com/article/Googles-Book-SearchA/48245/
• OCLC.org. (2010). Dewey Services. Retrieved April 4, 2010, from http://www.oclc.org/dewey/
• Ranganathan, S. R. (1967). Hidden Roots of Classification. Information Storage and Retrieval, 3, pp. 399-410.
• Sanderson, Mark and W. Bruce Croft. (1999). Deriving Concept Hierarchies from Text. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 206-213.
• Spangler, Scott and Jeffrey Kreulen. (2008). Mining the Talk: Unlocking the Business Value in Unstructured Information. Upper Saddle River, NJ: IBM Press/Pearson plc.
• Svenonius, Elaine. (2000). The Intellectual Foundations of Information Organization. Cambridge, MA: MIT Press.
• Van Rijsbergen, C.J. (1979). Information Retrieval. Newton, MA: Butterworth-Heinemann.