Watch the Story Unfold with TextWheel: Visualization of Large-Scale News Streams
WEIWEI CUI and HUAMIN QU, Hong Kong University of Science and Technology
HONG ZHOU, Shenzhen University
WENBIN ZHANG and STEVE SKIENA, State University of New York at Stony Brook
Keyword-based searching and clustering of news articles have been widely used for news analysis. However, news articles usually have other attributes such as source, author, date and time, length, and sentiment, which should be taken into account. In addition, news articles and keywords have complicated macro/micro relations, which include relations between news articles (i.e., macro relation), relations between keywords (i.e., micro relation), and relations between news articles and keywords (i.e., macro-micro relation). These macro/micro relations are time varying and pose special challenges for news analysis.
In this article we present a visual analytics system for news streams that brings multiple attributes of the news articles and the macro/micro relations between news streams and keywords into one coherent analytical context, while conveying the dynamic nature of news streams. We introduce a new visualization primitive called TextWheel, which consists of one or multiple keyword wheels, a document transportation belt, and a dynamic system that connects the wheels and belt. By observing the TextWheel and its content changes, interesting patterns can be detected. We use our system to analyze several news corpora related to major companies, and the results demonstrate the high potential of our method.
Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General; I.7.5 [Document and Text Processing]: Document Capture—Analysis
General Terms: Documentation, Design
Additional Key Words and Phrases: Document analysis, text visualization, macro-micro relation
ACM Reference Format:
Cui, W., Qu, H., Zhou, H., Zhang, W., and Skiena, S. 2012. Watch the story unfold with TextWheel: Visualization of large-scale news streams. ACM Trans. Intell. Syst. Technol. 3, 2, Article 20 (February 2012), 17 pages.
DOI = 10.1145/2089094.2089096 http://doi.acm.org/10.1145/2089094.2089096
1. INTRODUCTION
News articles are a major source of information. For topics such as major companies, famous people, and major events, the volume of news articles is enormous and reading them one by one becomes impossible. News visualization turns news streams into visual forms and shows them to users so they can use their prior knowledge and high-bandwidth visual processing capacity to gain insight into the data.
This work is supported by HK RGC grant GRF 619309.
Authors’ addresses: W. Cui (corresponding author) and H. Qu, Department of Computer Science and Technology, Hong Kong University of Science and Technology, Hong Kong; email: firstname.lastname@example.org; H. Zhou, Department of Computer Science, Shenzhen University, Hong Kong; W. Zhang and S. Skiena, Department of Computer Science, State University of New York at Stony Brook, Stony Brook, NY.
Fig. 1. Visualization of news streams with the TextWheel to reveal multiple attributes of news articles and the macro/micro relations between articles and keywords.
There are three major issues facing news stream visualization. First, news streams are time-varying high-dimensional data. It is a classic hard problem to develop visual encoding schemes for keywords and articles to show their multivariate and time-varying nature. Second, there may exist complex macro/micro relations between keywords and articles. At the micro level, keywords (e.g., Bill Gates and Microsoft) have various relations. At the macro level, text articles may also be related (e.g., dealing with the same topic). Meanwhile, each article contains multiple keywords and each keyword likely appears in many articles; these relations may change with time.
These complex macro/micro relations may be very useful for text analysis. However, it is very difficult to encode them visually. Third, text data can be extremely large, which poses special challenges for the scalability of visual encoding schemes.
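To make the three kinds of relations concrete, the following minimal sketch derives them from a toy mapping of articles to keywords. The input format, the function name `build_relations`, the toy data, and the use of simple shared-keyword and co-occurrence counts as relation strengths are all assumptions made for illustration; they are not necessarily the measures used in our system.

```python
from collections import defaultdict
from itertools import combinations


def build_relations(article_keywords):
    """Derive macro, micro, and macro-micro relations from a mapping
    article id -> set of keywords (a hypothetical input format)."""
    # Macro-micro relation: which articles mention each keyword.
    keyword_articles = defaultdict(set)
    for article, keywords in article_keywords.items():
        for kw in keywords:
            keyword_articles[kw].add(article)

    # Micro relation: how often two keywords co-occur in the same article.
    micro = defaultdict(int)
    for keywords in article_keywords.values():
        for a, b in combinations(sorted(keywords), 2):
            micro[(a, b)] += 1

    # Macro relation: how many keywords two articles share.
    macro = {}
    for a, b in combinations(sorted(article_keywords), 2):
        shared = article_keywords[a] & article_keywords[b]
        if shared:
            macro[(a, b)] = len(shared)

    return dict(keyword_articles), dict(micro), macro


# Toy data, invented for illustration only.
articles = {
    "a1": {"Microsoft", "Bill Gates"},
    "a2": {"Microsoft", "Vista"},
    "a3": {"Bill Gates", "Microsoft", "Vista"},
}
keyword_articles, micro, macro = build_relations(articles)
```

Running the same function on the articles of each time window separately would give one set of relations per window, which is one simple way to expose how the relations change over time.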
In this article we develop a visual analytics system for news streams and address the aforementioned issues facing news visualization. We focus on the multiple attributes of news articles and the dynamic relations between articles and keywords.
Our visualization system is built on top of a text mining preprocessing stage and consists of three components: significance trend chart generation, entity encoding, and relation encoding. During the preprocessing stage, keywords are extracted with various attributes (e.g., sentiment and frequency), and the relations between keywords and news articles are established. With the help of modern text mining techniques, we may find a number of interesting or important keywords as candidates for further exploration. In turn, our visualization system can help users verify the keywords picked by an automatic algorithm and gain further insight into the content of the news streams. After preprocessing, all the sentiment and word frequency information is collected and summarized as a line chart, which we call the significance trend chart, to provide an overview of the sentiment evolution over time. At the same time, we encode each article as a concise glyph: multiple high-level attributes of the article are first extracted and then mapped to different visual channels of the glyph to provide a succinct overview of the article.
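As a rough illustration of how sentiment and frequency information can be summarized into the points of a line chart, the sketch below aggregates keyword mentions by day. The record layout `(day, sentiment, frequency)`, the function name `trend_series`, and the frequency-weighted mean are assumptions chosen for this example; the actual aggregation behind the significance trend chart may differ.

```python
from collections import defaultdict
from datetime import date


def trend_series(mentions):
    """Summarize (day, sentiment, frequency) records into one point per day.

    Each point is (day, total frequency, frequency-weighted mean sentiment).
    The weighting scheme is an assumption made for illustration.
    """
    freq = defaultdict(int)
    weighted = defaultdict(float)
    for day, sentiment, count in mentions:
        if count <= 0:
            continue  # ignore empty records so the mean is always defined
        freq[day] += count
        weighted[day] += sentiment * count
    return [(day, freq[day], weighted[day] / freq[day]) for day in sorted(freq)]


# Toy records, invented for illustration only.
mentions = [
    (date(2005, 1, 3), 0.5, 2),
    (date(2005, 1, 3), -0.5, 2),
    (date(2005, 1, 4), 1.0, 1),
    (date(2005, 1, 4), 0.0, 3),
]
series = trend_series(mentions)
```

Each tuple in `series` can be plotted directly, with the day on the horizontal axis and either the frequency or the mean sentiment on the vertical axis.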
ACM Transactions on Intelligent Systems and Technology, Vol. 3, No. 2, Article 20, Publication date: February 2012.
To deal with the complex macro/micro relations existing among keywords and articles, we introduce a novel visual primitive called TextWheel which consists of one or multiple keyword wheels, a document transportation belt, and chains to connect the belt and wheels (see Figure 1). By observing the TextWheel and its content changes, interesting patterns can be detected. Designed after some familiar objects in our life, the TextWheel is intuitive to use. We also provide a set of interaction tools to help users better use our system to analyze large-scale news corpora. The major contributions of this article are as follows.
— We present a visual analytics system for users to visually summarize, organize, abstract, and analyze news streams and other text documents. Our system brings multiple attributes of text documents, dynamic relations between text documents and keywords, micro relations among keywords, and macro relations among documents into one coherent analytical context.
— We design text glyphs to visually encode keywords and documents. The glyph can intuitively and succinctly summarize various nominal and categorical features of keywords and documents.
— We develop a novel visualization primitive called TextWheel representing the complicated macro/micro relations among keywords and documents.
2. RELATED WORK
Text visualization has received considerable research interest in recent years. Various visualization approaches have been proposed to help people effectively analyze and understand large document archives. These techniques can be generally grouped into two categories: document visualization and semantic visualization.
2.1. Document Visualization
Document visualization techniques concentrate on visualizing large document corpora and illustrating relations between documents. Most of the techniques are designed to analyze similarity relations among documents by transforming each document into a feature vector (usually a keyword vector) and then measuring distances between these vectors. Based on the resulting similarity, the documents are clustered and then displayed either in a hierarchical way [Granitzer et al. 2004] or in an unstructured way [Wise 1999]. For example, the SPIRE system [Wise 1999] displayed documents as dots on a 2D plane and clustered them based on their keywords. Documents that are close in the high-dimensional space will also be close on the 2D plane. In 2005, Hetzler et al. further extended this system to the temporal domain and used it to analyze dynamic document flows. InfoSky [Granitzer et al. 2004] uses a similar approach to visualize hierarchical document collections in a planar space. Users can zoom in and out, just like using a telescope, to explore the documents at different levels of detail. However, InfoSky requires that the documents have already been organized into a hierarchical structure. Different from these two approaches, HiPP [Paulovich and Minghim 2008] can automatically cluster documents into a multi-level structure and allow users to dynamically change the hierarchical structure during the exploration.
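The common first step of these techniques, turning each document into a keyword vector and comparing the vectors, can be sketched as follows. This is a minimal illustration that assumes raw term counts over a fixed vocabulary and cosine similarity; the function names, vocabulary, and toy documents are hypothetical, and the cited systems use their own features and distance measures.

```python
import math
from collections import Counter


def keyword_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["merck", "drug", "verizon", "network"]
doc1 = keyword_vector("Merck drug recall drug", vocabulary)
doc2 = keyword_vector("Merck drug", vocabulary)
doc3 = keyword_vector("Verizon network", vocabulary)

sim12 = cosine_similarity(doc1, doc2)  # documents sharing keywords
sim13 = cosine_similarity(doc1, doc3)  # documents with no shared keywords
```

A clustering or 2D layout step would then operate on the pairwise similarities, so that documents with high similarity end up close together on screen.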
2.2. Semantic Visualization
Semantic visualization techniques focus on revealing and analyzing the semantic patterns in documents. Based on the different types of patterns they want to explore, these techniques can be categorized into three groups: literal patterns, keyword patterns, and temporal patterns.
Literal Patterns. Word Tree [Wattenberg and Viégas 2008] uses trie-like structures to explore word collocation patterns in real documents. In the tree structure, nodes represent words with size encoding the word frequency. An edge linking two nodes indicates that these two words are concatenated in the original document. Phrase Nets [van Ham et al. 2009], which can be considered a follow-up work of Word Tree, also uses links to indicate collocation relations. However, it connects words as a graph instead of a tree to convey more sophisticated relations. Mao et al. developed a technique to visualize the sequential semantic progress in documents. They used statistical methods to identify patterns within the input document data, and then fitted the patterns to a curve. Oelke et al. applied text fingerprinting on the input text document and provided users with a loopback framework to evaluate and improve the visualization results.
Keyword Patterns. BlogPulse [Glance et al. 2004] monitors over 5.5 million Web blogs and records more than 450K posts every day. To assist users, it provides a Web search interface (http://www.blogpulse.com/) for keywords. It uses a line chart to show the keyword frequency so that users can see how one or multiple keywords appear and disappear over time. Wong et al. combined the strength of text mining techniques and 3D bar charts to visualize the association rules within multiple items. They used the x-axis to display rules and the y-axis to list items. Then, an association rule can be visualized as a 3D bar standing on the x-y plane. The Jigsaw system [Stasko et al. 2008] provides a visual interface for users to search different keywords in a large collection of documents. It focuses on the relations between different types of keywords, such as people, organizations, and places.
Temporal Patterns. Temporal patterns are also a very active research field in document analysis. Some papers try to track the trend of text flows and identify their changes over time. For example, Wong et al. proposed a method to visualize dynamic data streams by using animated scatterplots. Allan et al. also developed a technique to identify evolving stories and new stories by analyzing a growing set of news articles. ThemeRiver [Havre et al. 2000] uses stacked area charts to visualize a collection of documents according to their themes. In the chart, each color stripe represents a theme and is curved smoothly to make it look like a river. Some other papers focus on the cluster or entity relation evolutions in document streams. For example, Erten et al. combined graph visualization and clustering techniques to analyze how the coauthor relations evolve over time in scientific literature. TextPool [Albrecht-Buehler et al. 2005] clusters document contents on the screen and uses carefully designed animations to help users understand the content changes. Compared with these previous approaches, our article addresses a quite different problem, that is, how to visualize the macro/micro relations widely existing in news streams and other text data.
3. DATA COLLECTION AND PROCESSING
3.1. Data Collection
All experiments in this article were conducted on a 1.5 gigabyte corpus of 333,289 news articles published between 2004 and 2006, with an interesting pedigree. Each experiment is run over a subset of documents selected from this corpus on the basis of a single keyword, say Merck or Verizon, and the resulting set of several thousand articles is visualized. We believe this procedure is quite representative of typical tasks concerning corpus understanding.
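The subset selection step can be sketched as a simple keyword filter. The sketch assumes that articles are available as dictionaries with a "text" field and that a case-insensitive whole-word match is an adequate selection criterion; both the record layout and the function name `select_by_keyword` are hypothetical, and the toy articles are invented for illustration.

```python
import re


def select_by_keyword(articles, keyword):
    """Return the articles whose text mentions `keyword` as a whole word.

    `articles` is assumed to be a list of dicts with a "text" field;
    this record layout is hypothetical.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [article for article in articles if pattern.search(article["text"])]


# Toy corpus, invented for illustration only.
corpus = [
    {"id": 1, "text": "Merck shares fell after the announcement."},
    {"id": 2, "text": "Verizon expands its network."},
    {"id": 3, "text": "Analysts compare Merck with its rivals."},
]
merck_subset = select_by_keyword(corpus, "Merck")
```

Matching on word boundaries avoids selecting articles in which the keyword merely appears inside a longer word.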
3.2. Named Entity Recognition
Named entity recognition is a natural language processing problem where one seeks to detect every named entity mentioned in a document. It serves as our feature extraction system for documents, identifying the topics of likely interest.
Named entity recognition is a well-studied problem with an extensive literature (e.g., Chieu and Ng and McDonald). We primarily employ rule-based