
Watch the Story Unfold with TextWheel: Visualization of Large-Scale News Streams

WEIWEI CUI and HUAMIN QU, Hong Kong University of Science and Technology

HONG ZHOU, Shenzhen University


WENBIN ZHANG and STEVE SKIENA, State University of New York at Stony Brook

Keyword-based searching and clustering of news articles have been widely used for news analysis. However, news articles usually have other attributes such as source, author, date and time, length, and sentiment which should be taken into account. In addition, news articles and keywords have complicated macro/micro relations, which include relations between news articles (i.e., macro relation), relations between keywords (i.e., micro relation), and relations between news articles and keywords (i.e., macro-micro relation). These macro/micro relations are time varying and pose special challenges for news analysis.

In this article we present a visual analytics system for news streams which can bring multiple attributes of the news articles and the macro/micro relations between news streams and keywords into one coherent analytical context, all the while conveying the dynamic natures of news streams. We introduce a new visualization primitive called TextWheel which consists of one or multiple keyword wheels, a document transportation belt, and a dynamic system which connects the wheels and belt. By observing the TextWheel and its content changes, some interesting patterns can be detected. We use our system to analyze several news corpora related to some major companies and the results demonstrate the high potential of our method.

Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General; I.7.5 [Document and Text Processing]: Document Capture—Analysis

General Terms: Documentation, Design

Additional Key Words and Phrases: Document analysis, text visualization, macro-micro relation

ACM Reference Format:

Cui, W., Qu, H., Zhou, H., Zhang, W., and Skiena, S. 2012. Watch the story unfold with TextWheel: Visualization of large-scale news streams. ACM Trans. Intell. Syst. Technol. 3, 2, Article 20 (February 2012), 17 pages.

DOI = 10.1145/2089094.2089096 http://doi.acm.org/10.1145/2089094.2089096

1. INTRODUCTION

News articles are a major source of information. For topics such as major companies, famous people, and major events, the volume of news articles is enormous and reading them one by one becomes impossible. News visualization turns news streams into visual forms and shows them to users so they can use their prior knowledge and high-bandwidth visual processing capacity to gain insight into the data.

This work is supported by HK RGC grant GRF 619309.

Authors’ addresses: W. Cui (corresponding author) and H. Qu, Department of Computer Science and Technology, Hong Kong University of Science and Technology, Hong Kong; email: weiwei@cse.ust.hk; H. Zhou, Department of Computer Science, Shenzhen University, Hong Kong; W. Zhang and S. Skiena, Department of Computer Science, State University of New York at Stony Brook, Stony Brook, NY.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from the Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2012 ACM 2157-6904/2012/02-ART20 $10.00 DOI 10.1145/2089094.2089096 http://doi.acm.org/10.1145/2089094.2089096

There are three major issues facing news stream visualization. First, news streams are time-varying high-dimensional data. It is a classic hard problem to develop visual encoding schemes for keywords and articles to show their multivariate and time-varying nature. Second, there may exist complex macro/micro relations between keywords and articles. At the micro level, keywords (e.g., Bill Gates and Microsoft) have various relations. At the macro level, text articles may also be related (e.g., dealing with the same topic). Meanwhile, each article contains multiple keywords and each keyword likely appears in many articles; these relations may change with time. These complex macro/micro relations may be very useful for text analysis. However, it is very difficult to visually encode these macro/micro relations. Third, text data can be extremely large and this poses special challenges for the scalability of visual encoding schemes.

Fig. 1. Visualization of news streams with the TextWheel to reveal multiple attributes of news articles and the macro/micro relations between articles and keywords.

In this article we develop a visual analytics system for news streams and address the aforementioned issues facing news visualization. We focus on the multiple attributes of news articles and the dynamic relations between articles and keywords.

Our visualization system is built on top of a text mining preprocessing stage and consists of three components: significance trend chart generation, entity encoding, and relation encoding. During the preprocessing stage, keywords are extracted with various attributes (e.g., sentiment and frequency), and the relations between keywords and news articles are established. With the help of modern text mining techniques, we may find a number of interesting or important keywords as candidates for further exploration. In return, our visualization system can help users verify the keywords picked by an automatic algorithm and gain further insight into the content of the news streams. After that, all the sentiment and word frequency information is collected and summarized as a line chart, which we call the significance trend chart, to provide an overview of the sentiment evolution over time. At the same time, we encode each article as a concise glyph: the article's high-level attributes are first extracted and then mapped to different visual channels of the glyph to provide a succinct overview of the article.
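As a rough illustration of the kind of summary that could feed a significance trend chart, the sketch below bins articles by month and reports, for one keyword, the total mention count and the mean article sentiment per bin. The record fields (`date`, `keywords`, `sentiment`), the `trend_series` helper, the monthly binning, and the use of a plain mean are all assumptions made for illustration; the text above does not specify how the chart values are actually computed. The three sample articles are toy data, not taken from the corpus.

```python
from collections import defaultdict
from datetime import date


def trend_series(articles, keyword):
    """Return {(year, month): (mention_count, mean_sentiment)} for one keyword.

    Each article is a dict with hypothetical fields:
      'date'      -> datetime.date of publication
      'keywords'  -> dict mapping keyword to its count in the article
      'sentiment' -> float score for the article (scale is an assumption)
    """
    mentions = defaultdict(int)
    sentiment_sum = defaultdict(float)
    article_count = defaultdict(int)
    for art in articles:
        n = art["keywords"].get(keyword, 0)
        if n == 0:
            continue  # article does not mention the keyword
        month = (art["date"].year, art["date"].month)
        mentions[month] += n
        sentiment_sum[month] += art["sentiment"]
        article_count[month] += 1
    # One point per month, in chronological order.
    return {
        m: (mentions[m], sentiment_sum[m] / article_count[m])
        for m in sorted(mentions)
    }


# Toy records for illustration only.
articles = [
    {"date": date(2004, 9, 30), "keywords": {"Merck": 3}, "sentiment": -0.5},
    {"date": date(2004, 9, 12), "keywords": {"Merck": 1, "Verizon": 2}, "sentiment": -0.25},
    {"date": date(2004, 10, 2), "keywords": {"Merck": 2}, "sentiment": 0.25},
]
series = trend_series(articles, "Merck")
```

Each entry of `series` corresponds to one point on a line chart, with the month on the horizontal axis; a real system would likely use finer time bins and a weighting scheme of its own.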

To deal with the complex macro/micro relations existing among keywords and articles, we introduce a novel visual primitive called TextWheel which consists of one or multiple keyword wheels, a document transportation belt, and chains to connect the belt and wheels (see Figure 1). By observing the TextWheel and its content changes, interesting patterns can be detected. Designed after some familiar objects in our life, the TextWheel is intuitive to use. We also provide a set of interaction tools to help users better use our system to analyze large-scale news corpora.

ACM Transactions on Intelligent Systems and Technology, Vol. 3, No. 2, Article 20, Publication date: February 2012.

The major contributions of this article are as follows.

— We present a visual analytics system for users to visually summarize, organize, abstract, and analyze news streams and other text documents. Our system brings multiple attributes of text documents, dynamic relations between text documents and keywords, micro relations among keywords, and macro relations among documents into one coherent analytical context.

— We design text glyphs to visually encode keywords and documents. The glyph can intuitively and succinctly summarize various nominal and categorical features of keywords and documents.

— We develop a novel visualization primitive called TextWheel representing the complicated macro/micro relations among keywords and documents.

2. RELATED WORK

Text visualization has received considerable research interest in recent years. Various visualization approaches have been proposed to help people effectively analyze and understand large document archives. These techniques can be generally grouped into two categories: document visualization and semantic visualization.

2.1. Document Visualization

Document visualization techniques concentrate on visualizing large document corpora and illustrating relations between documents. Most of the techniques are designed to analyze similarity relations among documents by transforming each document into a feature vector (usually a keyword vector) and then measuring distances between these vectors. Based on the resulting similarity, these documents are clustered and then displayed either in a hierarchical way [Granitzer et al. 2004] or in an unstructured way [Wise 1999]. For example, the SPIRE system [Wise 1999] displayed documents as dots on a 2D plane and clustered them based on their keywords. Documents that are close in the high-dimensional space will also be close on the 2D plane. Hetzler et al. [2005] further extended this system to the temporal domain and used it to analyze dynamic document flows. InfoSky [Granitzer et al. 2004] uses a similar approach to visualize hierarchical document collections on a plane. Users can zoom in and out, just like using a telescope, to explore the documents at different levels of detail. However, InfoSky requires that the documents have already been organized into a hierarchical structure. Different from these two approaches, HiPP [Paulovich and Minghim 2008] can automatically cluster documents into a multi-level structure and allow users to dynamically change the hierarchical structure during the exploration.
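The feature-vector step shared by these techniques can be sketched as follows. This is a minimal illustration, assuming a plain bag-of-words keyword vector and cosine distance; the cited systems may use different features, weighting (e.g., TF-IDF), or distance measures, and the helper names and sample strings are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Bag-of-words vector: lowercase token -> count (no stemming, no stop words)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty document as unrelated to everything
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance in [0, 1] for count vectors; smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)


# Toy documents for illustration only.
d1 = keyword_vector("merck recalls drug")
d2 = keyword_vector("merck drug trial")
d3 = keyword_vector("verizon network upgrade")
sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
```

Here `d1` and `d2` share two of their three keywords and so end up closer to each other than either is to `d3`; a clustering or 2D projection step would then operate on the pairwise distances.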

2.2. Semantic Visualization

Semantic visualization techniques focus on revealing and analyzing the semantic patterns in documents. Based on the different types of patterns they want to explore, these techniques can be categorized into three groups: literal patterns, keyword patterns, and temporal patterns.

Literal Patterns. Word Tree [Wattenberg and Viégas 2008] uses trie-like structures to explore word collocation patterns in real documents. In the tree structure, nodes represent words with size encoding the word frequency. An edge linking two nodes indicates that these two words appear consecutively in the original document. Phrase Nets [van Ham et al. 2009], which can be considered a follow-up work to Word Tree, also uses links to indicate collocation relations. However, it connects words as a graph instead of a tree to convey more sophisticated relations. Mao et al. [2007] developed a technique to visualize the sequential semantic progress in documents. They used statistical methods to identify patterns within the input document data, and then fitted the patterns to a curve. Oelke et al. [2008] applied text fingerprinting on the input text document and provided users with a loopback framework to evaluate and improve the visualization results.
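To make the trie-like structure mentioned for Word Tree concrete, the sketch below builds a small tree of the word sequences that follow a chosen root word, with a count on each node that a renderer could map to node size. It is an illustrative data structure only, not the Word Tree implementation; the `build_word_tree` helper, the whitespace tokenization, and the sample sentences are assumptions.

```python
def build_word_tree(sentences, root_word):
    """Build a trie of the word sequences that follow root_word.

    Each node is a dict {'count': int, 'children': {word: node}}.
    A node's count is how many occurrences pass through that word,
    so frequent continuations get larger counts.
    """
    root = {"count": 0, "children": {}}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word != root_word:
                continue
            root["count"] += 1
            node = root
            # Walk down the trie along the words that follow this occurrence.
            for nxt in words[i + 1:]:
                node = node["children"].setdefault(
                    nxt, {"count": 0, "children": {}}
                )
                node["count"] += 1
    return root


# Toy sentences for illustration only.
sentences = ["the wheel turns", "the wheel stops", "the belt moves"]
tree = build_word_tree(sentences, "the")
```

In this example the branch for "wheel" is shared by two sentences and splits into "turns" and "stops", which is the collocation pattern a tree layout would expose.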

Keyword Patterns. BlogPulse [Glance et al. 2004] monitors over 5.5 million Web blogs and records more than 450K posts every day. To assist users, it provides a Web search interface (http://www.blogpulse.com/) for keywords. It uses a line chart to show the keyword frequency so that users can see how one or multiple keywords appear and disappear over time. Wong et al. [1999] combined the strength of text mining techniques and 3D bar charts to visualize the association rules within multiple items. They used the x-axis to display rules and the y-axis to list items. Then, an association rule can be visualized as a 3D bar standing on the x-y plane. The Jigsaw system [Stasko et al. 2008] provides a visual interface for users to search different keywords in a large collection of documents. It focuses on the relations between different types of keywords, such as people, organizations, and places.

Temporal Patterns. Temporal patterns are also a very active research field in document analysis. Some papers try to track the trend of text flows and identify their changes over time. For example, Wong et al. [2003] proposed a method to visualize dynamic data streams by using animated scatterplots. Allan et al. [2005] also developed a technique to identify evolving stories and new stories by analyzing a growing set of news articles. ThemeRiver [Havre et al. 2000] uses stacked area charts to visualize a collection of documents according to their themes. In the chart, each color stripe represents a theme and is curved smoothly to make it look like a river. Some other papers focus on the cluster or entity relation evolutions in document streams. For example, Erten et al. [2004] combined graph visualization and clustering techniques to analyze how the coauthor relations evolve over time in scientific literature. TextPool [Albrecht-Buehler et al. 2005] clusters document contents on the screen and uses carefully designed animations to help users understand the content changes. Compared with these previous approaches, our article addresses a quite different problem, that is, how to visualize the macro/micro relations widely existing in news streams and other text data.


3.1. Data Collection

All experiments in this article were conducted on a 1.5 gigabyte corpus of 333,289 news articles published between 2004 and 2006, with an interesting pedigree. Our experiments are run over subsets of documents selected from this corpus on the basis of a single keyword, say Merck or Verizon, and the resulting set of several thousand articles is visualized. We believe this procedure is quite representative of typical tasks concerning corpus understanding.

3.2. Named Entity Recognition

Named entity recognition is a natural language processing problem where one seeks to detect every named entity mentioned in a document. It serves as our feature extraction system for documents, identifying the topics of likely potential interest.

Named entity recognition is a well-studied problem with an extensive literature (e.g., Chieu and Ng [2002] and McDonald [1993]). We primarily employ rule-based
