
«Keyword-based searching and clustering of news articles have been widely used for news analysis. However, news articles usually have other attributes ...»


techniques that are not vastly different from those in the literature, although they require substantial engineering to achieve good performance.

Here we present the most important phases of our named entity recognition procedure, namely part-of-speech tagging, syntactic tagging, proper noun classification, rules processing, alias expansion and geographic normalization.

Part-of-Speech Tagging. To extract proper noun phrases from text, we tag each input word with an appropriate part-of-speech tag (noun, verb, adjective, etc.) on the basis of statistical prediction and local rules. Part-of-speech tagging requires that the input text be properly partitioned into sentences, which can be readily done using cues from capitalization and punctuation. We employ a popular part-of-speech (POS) tagger [Brill 1994] in our analysis. Such taggers are based on large vocabularies and collections of tagging rules, and require a substantial amount of time to initialize internal data structures.
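The sentence partitioning mentioned above can be sketched with a single regular expression. This is only an illustration of the capitalization-and-punctuation cue, not the segmenter used in the paper; the function name `split_sentences` is hypothetical.

```python
import re

# Split where sentence-ending punctuation is followed by whitespace
# and then an uppercase letter (the capitalization cue).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Return a list of sentences found in `text` (naive heuristic)."""
    return _BOUNDARY.split(text.strip())

print(split_sentences("The belt moved. Users watched it. then nothing"))
```

A heuristic this simple breaks on abbreviations such as "U.S. Census", which is one reason a production pipeline needs additional rules.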

Syntactic Tagging. Here we employ regular-expression patterns to mark up certain classes of important text features such as dates, numbers, and unit-tagged quantities.
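A minimal sketch of this kind of pattern-based markup follows. The paper does not list its expressions, so the three patterns, the unit list, and the name `tag_features` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, ordered from most to least specific.
PATTERNS = [
    ("DATE", re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b")),
    ("QUANTITY", re.compile(r"\b\d+(?:\.\d+)?\s?(?:km|kg|mph|percent)\b")),
    ("NUMBER", re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?\b")),
]

def tag_features(text):
    """Return (label, matched_text) pairs in order of appearance.

    Earlier patterns win when two matches overlap.
    """
    spans = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Keep the match only if it overlaps nothing already tagged.
            if all(m.end() <= s or m.start() >= e for s, e, _, _ in spans):
                spans.append((m.start(), m.end(), label, m.group()))
    return [(label, value) for _, _, label, value in sorted(spans)]

print(tag_features("On March 3, 2011 the convoy travelled 42 km with 1,200 people."))
```

Ordering the patterns by specificity keeps the digits inside a date or a quantity from being tagged again as bare numbers.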

Proper Noun Phrase Classification. Each proper noun phrase in a text belongs to some semantic class, such as person, city, or company. In this phase of our pipeline, we employ gazetteers (e.g., popular first and last names provided by the U.S. Census Bureau; see footnote 1) and a Bayesian method [Mitchell 1997] to classify each identified phrase.

Rules Processing. Compound entities are difficult to handle correctly. For example, the entity name State University of New York, Stony Brook spans both a comma and an uncapitalized word that is not a proper noun. By contrast, the phrase China, Japan, and Korea refers to three separate entities. Our solution was to implement a small set (about 60) of hand-crafted rules to handle such exceptions properly.

Alias Expansion. A single entity is often described by several different proper noun phrases, for example, President Kennedy, John Kennedy, John F. Kennedy, and JFK, even in the same document. We identify several common classes of aliasing and take appropriate steps to unify such representations into a single set.

Geographic Normalization. Geographic names can be ambiguous. For example, Albany is both the capital of New York State and a similarly-sized city in Georgia. However, auxiliary information can be useful in resolving this ambiguity. We developed a geographic normalization routine that identifies where places are mentioned, resolves any ambiguity using population and location information, and replaces the name with a normalized, unambiguous representation.
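The disambiguation step can be sketched as a lookup that prefers the candidate nearest to other places in the document and falls back to the most populous candidate. The gazetteer below is a hypothetical two-entry table with rough placeholder figures, and `normalize_place` is an illustrative name, not the authors' routine.

```python
# Hypothetical gazetteer: name -> list of (normalized name, population, (lat, lon)).
# Populations and coordinates are rough placeholder values for illustration.
GAZETTEER = {
    "Albany": [
        ("Albany, New York, USA", 98000, (42.65, -73.75)),
        ("Albany, Georgia, USA", 77000, (31.58, -84.16)),
    ],
}

def normalize_place(name, context_coords=None):
    """Return an unambiguous name for `name`.

    If `context_coords` (lat, lon of another place mentioned nearby) is
    given, pick the closest candidate; otherwise pick the most populous.
    Unknown names are returned unchanged.
    """
    candidates = GAZETTEER.get(name)
    if not candidates:
        return name
    if context_coords is not None:
        lat, lon = context_coords

        def squared_distance(candidate):
            c_lat, c_lon = candidate[2]
            return (c_lat - lat) ** 2 + (c_lon - lon) ** 2

        return min(candidates, key=squared_distance)[0]
    return max(candidates, key=lambda c: c[1])[0]

print(normalize_place("Albany"))                   # population fallback
print(normalize_place("Albany", (33.75, -84.39)))  # context near Atlanta
```

Squared differences in degrees are a crude stand-in for real geographic distance, but they are enough to separate candidates that are hundreds of kilometers apart.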

3.3. Sentiment Analysis

Sentiment analysis of natural language texts is a large and growing field, surveyed in Pang and Lee [2008]. Previous work falls naturally into two groups. The first concerns techniques for automatically generating sentiment lexicons. The second concerns systems that analyze sentiment (on a global or local basis) for entire documents.

Newspapers and blogs express opinions about news entities (people, places, things) while reporting on recent events. We have developed a method [Bautin et al. 2010; Lloyd et al. 2005] that assigns scores indicating positive or negative opinion to each distinct

1 http://www.census.gov/geneology/names/names files.html

–  –  –

where $E[X] = n_a n_b / N$, $N$ is the number of sentences in the corpus, $n_a$ is the number of occurrences of entity $a$, and $n_b$ is the number of occurrences of entity $b$, as the juxtaposition score for a pair of entities.
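As a small illustration of the quantities named here, the sketch below computes the expected number of co-occurrences $n_a n_b / N$ alongside the observed count. It assumes that an occurrence means "a sentence mentioning the entity" and that $X$ counts sentences mentioning both entities. It does not compute the juxtaposition score itself, and `cooccurrence_counts` is a hypothetical name.

```python
def cooccurrence_counts(sentences, a, b):
    """Return (observed, expected) co-occurrence counts for entities a and b.

    `sentences` is a list of sets, each holding the entities mentioned
    in one sentence. Expected = n_a * n_b / N, the count expected if the
    two entities appeared independently of each other.
    """
    n = len(sentences)
    n_a = sum(1 for s in sentences if a in s)
    n_b = sum(1 for s in sentences if b in s)
    observed = sum(1 for s in sentences if a in s and b in s)
    expected = n_a * n_b / n
    return observed, expected

corpus = [{"A", "B"}, {"A"}, {"B"}, {"C"}]
print(cooccurrence_counts(corpus, "A", "B"))
```

An observed count well above the expected one suggests the two entities are mentioned together more often than chance would predict.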


After data collection and processing, sentiment and word frequency information is available for the whole corpus. Inspecting it keyword by keyword is impractical because the document corpus contains so many keywords. In this section, we introduce the significance trend chart, which is inspired by entropy and information theory, to analyze and visually summarize changes in sentiment and word frequency along the whole document stream. According to information theory, an object that contains more exclusive information is more significant. Following this idea, we design a novel method to measure the significance value of a document in the document sequence. We define a document as more significant if it has more exclusive sentiment and word frequency information than its neighboring (preceding or succeeding) documents in the document stream.

–  –  –

ACM Transactions on Intelligent Systems and Technology, Vol. 3, No. 2, Article 20, Publication date: February 2012.

Watch the Story Unfold with TextWheel: Visualization of Large-Scale News Streams 20:7

$$H(X;Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\,\log\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$$

where $X$ has $n$ possible values $\{x_1, \ldots, x_n\}$, $Y$ has $m$ possible values $\{y_1, \ldots, y_m\}$, $p(x, y)$ is the joint probability function, and $p(x)$ and $p(y)$ are the marginal probabilities of $X$ and $Y$, respectively. $H(X;Y)$ equals zero if and only if $X$ and $Y$ are independent, which means $X$ and $Y$ share no information.

The last concept is conditional entropy. The conditional entropy $H(X|Y)$ of a random variable $X$ given another random variable $Y$ measures the amount of information contained in $X$ but not in $Y$. It can be calculated as

$$H(X|Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\,\log\frac{p(x_i, y_j)}{p(y_j)} = H(X) - H(X;Y)$$

From the definition, we can see that $H(X|Y)$ is nonnegative. It equals zero if and only if $Y$ completely determines $X$ (for example, when $X$ and $Y$ are identical). On the other hand, it attains its maximum value $H(X)$ when $X$ and $Y$ are independent (i.e., $H(X;Y) = 0$).
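The properties stated above can be checked numerically. The sketch below uses the standard textbook definitions of entropy, mutual information, and conditional entropy over a small joint probability table. The base-2 logarithm and the function names are assumptions, since the text does not fix them.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def mutual_information(joint):
    """H(X;Y) for a joint table where joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def conditional_entropy(joint):
    """H(X|Y): information in X that is not already in Y."""
    py = [sum(col) for col in zip(*joint)]
    return -sum(
        pxy * math.log2(pxy / py[j])
        for row in joint
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), conditional_entropy(independent))
print(mutual_information(identical), conditional_entropy(identical))
```

For the independent table the mutual information is zero and the conditional entropy equals $H(X)$; for the identical table the conditional entropy vanishes, matching the two extreme cases described in the text.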

–  –  –

5. SYSTEM OVERVIEW

Figure 1 shows the interface of our TextWheel visualization system with all its main visual components: a significance trend chart, a document transportation belt, and one or more keyword wheels. These three components give users three levels of detail in the document stream.

The line chart at the bottom is the significance trend chart. It provides the highest-level view by depicting the sentiment changes extracted from a document stream. The x-axis encodes time and the y-axis encodes the significance value of the documents.

On the upper right, a U-shaped document transportation belt shows users a small portion of the documents in the whole stream. Each glyph on the belt represents one or more documents. When users interact with our system, the document glyphs are transported along the belt. When a new document glyph enters the focus region (the semicircular part of the belt), it is highlighted in red. We also put a sliding bar on the significance trend chart to show the location of the highlighted document glyph in the whole document stream. The sliding bar keeps the transportation belt and the significance trend chart synchronized. For example, when users interact with the transportation belt (e.g., speeding up the transportation or reversing its direction), the bar shows which part of the data stream the belt is currently displaying. Conversely, users can drag the sliding bar on the chart to a chosen time in the document stream, and the belt automatically rolls forward or backward to show the documents at that time. We also encode the macro relations between documents on the chart: arcs drawn from the sliding bar point out, across the whole document stream, the documents most related to the highlighted document.

On the upper left are one or more wheels that show the keywords users are interested in. To encode the micro relations between keywords, we place keywords uniformly on a circle and draw lines between related keywords. The keyword wheels also interact with all the documents in the focus area (the semicircular part of the transportation belt) through chains that connect them.

6. TEXTWHEEL

Our TextWheel system consists of several visual components, all inspired by everyday objects that users are very familiar with. For example, our keyword wheel is inspired by Ferris wheels, and news articles move along the transportation belt just like baggage moving along a conveyor belt. Because it draws on everyday experience, the system's metaphors are obvious and it is intuitive to use. In this section, we introduce the design details of each component.

6.1. Entity Encoding with Glyphs

There are two major entities in our data: keywords and documents. Both have multiple attributes.

Keyword Glyph. Because keywords are widely used in Web pages, some visual encoding schemes are already well established. For example, the Google Visualization API allows users to use size and color to encode various attributes of keywords. Usually, the font size of a keyword represents the frequency with which the keyword appears in a document. We adopt this scheme in our system. Figure 2(a) shows the word cloud generated by the Google Visualization API. These keywords are later arranged into a circular frame called a keyword wheel (see Figure 2(b)) in our system.


Fig. 2. Glyphs: (a) Keyword cloud generated by the Google Visualization API; (b) keyword wheel; (c) the document glyph for one article.

Document Glyph. We adopt a simple rectangular glyph in our current system, though more complicated glyphs can also be used. The width of the rectangle encodes the average number of keywords, while the height indicates the article length. The color of the glyph encodes the average sentiment expressed in an article. Figure 2(c) illustrates the document glyphs used in our system.
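One way to realize this encoding is sketched below. The scaling constants, the red-to-green palette, the sentiment range of [-1, 1], and the names `DocumentGlyph`, `sentiment_to_color`, and `make_glyph` are all assumptions, since the paper does not specify them.

```python
from dataclasses import dataclass

@dataclass
class DocumentGlyph:
    width: float    # encodes the keyword count
    height: float   # encodes the article length
    color: tuple    # (r, g, b), encodes the average sentiment

def sentiment_to_color(sentiment):
    """Map a sentiment in [-1, 1] to RGB: red = negative, green = positive."""
    s = max(-1.0, min(1.0, sentiment))  # clamp out-of-range values
    fade = int(round(255 * (1 - abs(s))))
    return (fade, 255, fade) if s >= 0 else (255, fade, fade)

def make_glyph(keyword_count, article_length, avg_sentiment,
               max_keywords=50, max_length=2000, max_px=40):
    """Build a rectangular glyph; sizes are capped at `max_px` pixels."""
    width = max_px * min(keyword_count, max_keywords) / max_keywords
    height = max_px * min(article_length, max_length) / max_length
    return DocumentGlyph(width, height, sentiment_to_color(avg_sentiment))

print(make_glyph(25, 1000, 0.0))
```

Neutral articles come out white under this palette, so strongly opinionated documents stand out on the belt.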

6.2. Document Transportation Belts

Layout. We adopt a U-shaped layout for our belt because it provides more space for laying out documents and better conveys the "endless" feeling of news streams, considering that there are usually many more documents than keywords. In addition, the U-turn naturally divides the document belt into three parts and gives a nice focus+overview view. The curved part serves as a focus region, in which the keywords are connected to the documents that fall into this region, while the top and bottom straight parts give users a bigger picture of what has just left and what is coming.

Speed. The speed of the belt is highly controllable. We can directly drag the sliding bar on the significance trend chart to roll the transportation belt forward or backward to any time point of interest. We can also set a uniform or automatically computed speed for the transportation belt, so that the belt rolls automatically. If we find something interesting, we can stop the belt or lower the speed to allow more time for inspection.

Order of documents. The documents can enter the transportation belt in different orders. By default, they are arranged by time. Other orders are also possible. For example, documents from the same source can be grouped together.
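The two orderings mentioned here amount to different sort keys. In the sketch below, the document fields `time` and `source` and the function `order_documents` are hypothetical names chosen for illustration.

```python
from operator import itemgetter

def order_documents(docs, by="time"):
    """Return documents in the order they should enter the belt.

    by="time"   : chronological order (the default described above)
    by="source" : group by source, chronological within each source
    """
    if by == "time":
        return sorted(docs, key=itemgetter("time"))
    if by == "source":
        return sorted(docs, key=itemgetter("source", "time"))
    raise ValueError(f"unknown ordering: {by!r}")

docs = [
    {"id": 1, "time": 3, "source": "B"},
    {"id": 2, "time": 1, "source": "B"},
    {"id": 3, "time": 2, "source": "A"},
]
print([d["id"] for d in order_documents(docs)])
print([d["id"] for d in order_documents(docs, by="source")])
```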

6.3. Keyword Wheels

The keyword glyphs are placed on the keyword wheels. The relation between two keywords is encoded by simply connecting them with a line. Meanwhile, the sentiment change of the keywords is naturally revealed by the rotation of the wheel.

Keyword selection. Keywords are selected according to user interest or recommended during data preprocessing. Every keyword glyph on the same wheel is assigned a unique background color. We therefore recommend that users choose no more than 20 keywords for one wheel; otherwise, the colors may become difficult to differentiate. Users who need to show more than 20 keywords can put them on different wheels.
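The 20-keyword guideline can be enforced by splitting a longer keyword list across wheels. This is a simple chunking sketch; `split_into_wheels` is a hypothetical helper, and the paper does not say how keywords are distributed when several wheels are used.

```python
def split_into_wheels(keywords, max_per_wheel=20):
    """Split `keywords` into consecutive groups of at most `max_per_wheel`."""
    return [
        keywords[i:i + max_per_wheel]
        for i in range(0, len(keywords), max_per_wheel)
    ]

wheels = split_into_wheels([f"kw{i}" for i in range(45)])
print([len(w) for w in wheels])
```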

Keyword position. Keywords are positioned uniformly on the circular frame of the keyword wheel. Their positions are computed based on their inter-relations. We use a greedy algorithm to position the keywords: the keyword appearing most frequently in the documents is put at the center of the circular segment falling into the focus window, and the other keywords are then positioned accordingly.
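The paper does not spell out the greedy procedure, so the sketch below is one plausible reading rather than the authors' algorithm. It starts from the most frequent keyword, repeatedly appends the unplaced keyword most related to the last one placed, and then spaces the resulting order uniformly around the circle. The relation table, tie-breaking rule, and function names are assumptions.

```python
import math

def greedy_order(freq, relation):
    """Order keywords greedily by relatedness, starting from the most frequent.

    freq:     {keyword: frequency}
    relation: {(kw1, kw2): strength}, looked up symmetrically
    """
    def rel(a, b):
        return relation.get((a, b), relation.get((b, a), 0.0))

    remaining = set(freq)
    current = max(remaining, key=lambda k: (freq[k], k))
    order = [current]
    remaining.remove(current)
    while remaining:
        # Ties are broken alphabetically to keep the result deterministic.
        current = max(remaining, key=lambda k: (rel(order[-1], k), k))
        order.append(current)
        remaining.remove(current)
    return order

def wheel_positions(order, radius=1.0, focus_angle=0.0):
    """Place keywords uniformly on a circle; the first sits at `focus_angle`."""
    step = 2 * math.pi / len(order)
    return {
        kw: (radius * math.cos(focus_angle + i * step),
             radius * math.sin(focus_angle + i * step))
        for i, kw in enumerate(order)
    }

freq = {"war": 9, "peace": 4, "oil": 6, "trade": 2}
relation = {("war", "peace"): 0.8, ("war", "oil"): 0.5,
            ("oil", "trade"): 0.9, ("peace", "trade"): 0.1}
order = greedy_order(freq, relation)
print(order)
print(wheel_positions(order))
```

Here `focus_angle` stands in for the center of the circular segment inside the focus window, so the most frequent keyword lands there first.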

Keyword update. As the documents in the focus window change gradually, some keywords may become more or less frequent or even disappear from the documents in the focus window. When a keyword becomes more frequent, its glyph becomes bigger, and vice versa. When a keyword disappears from the documents in the focus window, it is still kept on the wheel, but its background color disappears so that it does not distract users.

6.4. Dynamic System

We further connect the keyword wheel and document transportation belt with chains to form a dynamic system. A keyword is likely to appear in multiple documents, each of which may express a different sentiment toward it. Once a document enters the focus window, chains connect the document to all its related keywords on the keyword wheel. The width, color, and opacity of a chain can encode various attributes of the relation between the document and the keyword. For example, we can use the width to encode the strength of the sentiment and the color to indicate which keyword the sentiment is about. There are two hubs between the keyword wheel and the transportation belt, and every chain first goes to one of them. All the chains indicating positive sentiments go to the lower hub, while all the chains indicating negative sentiments go to the upper hub. At each hub, all the chains with sentiments toward the same keyword (i.e., all the chains with the same color) are bundled together. The bundled chains then connect to the keyword wheel and drag it to rotate. We assume that both sets of bundled chains exert attractive forces on the wheel, and that the force values are proportional to the bundle size. Therefore, if the sum of negative sentiments toward all the keywords is stronger than the sum of positive sentiments, the wheel rotates clockwise in our system, and vice versa.
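The bundling and rotation rule can be sketched as follows. Signed sentiment values stand in for chain strength, and the names `bundle_chains` and `rotation_direction` are hypothetical. The force model is reduced to comparing the two totals, which is the only behavior the text specifies.

```python
from collections import defaultdict

def bundle_chains(chains):
    """Group chains by hub (sentiment polarity) and by keyword.

    chains: list of (keyword, sentiment) pairs; the sign of `sentiment`
    selects the hub and its magnitude adds to the bundle size.
    """
    hubs = {"positive": defaultdict(float), "negative": defaultdict(float)}
    for keyword, sentiment in chains:
        if sentiment > 0:
            hubs["positive"][keyword] += sentiment
        elif sentiment < 0:
            hubs["negative"][keyword] += -sentiment
    return hubs

def rotation_direction(chains):
    """Clockwise when negative sentiment dominates, counterclockwise otherwise."""
    hubs = bundle_chains(chains)
    positive = sum(hubs["positive"].values())
    negative = sum(hubs["negative"].values())
    if negative > positive:
        return "clockwise"
    if positive > negative:
        return "counterclockwise"
    return "none"

chains = [("war", -0.75), ("war", -0.25), ("peace", 0.5)]
print(dict(bundle_chains(chains)["negative"]))
print(rotation_direction(chains))
```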

6.5. User Interaction

We provide a set of interaction tools to help users better use our system to deal with a very large number of news articles. Some of these tools are summarized as follows.
