Information Graphics: An Untapped Resource for Digital Libraries

Sandra Carberry, Dept. of Computer Science, University of Delaware, Newark, DE 19716 USA, carberry@cis.udel.edu
Stephanie Elzer, Dept. of Computer Science, Millersville University, Millersville, PA USA, elzer@cs.millersville.edu
Seniz Demir, Dept. of Computer Science, University of Delaware, Newark, DE 19716 USA, demir@cis.udel.edu

ABSTRACT

Information graphics are non-pictorial graphics, such as bar charts and line graphs, that depict attributes of entities and relations among entities. Most information graphics appearing in popular media have a communicative goal or intended message; consequently, information graphics constitute a form of language. This paper argues that information graphics are a valuable knowledge resource that should be retrievable from a digital library, and that such graphics should be taken into account when summarizing a multimodal document for subsequent indexing and retrieval. But to accomplish this, the information graphic must be understood and its message recognized. The paper presents our Bayesian system for recognizing the primary message of one kind of information graphic (simple bar charts) and discusses the potential role of an information graphic's message in indexing graphics and summarizing multimodal documents.

[Figure 1: Graphic from U.S. News and World Report (a bar chart of GDP per capita, 2001, in thousands of dollars, by country, with the bar for the U.S. highlighted)]

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms: Algorithms

Keywords: Summarization, graphics, multimedia, Bayesian reasoning

1. INTRODUCTION

Information graphics are non-pictorial graphics, such as bar charts and line graphs, that depict attributes of entities and relations among entities.
Although much attention has been devoted to the summarization and categorization of text, and to effective methods for retrieving textual documents relevant to an individual's needs, relatively little attention has been given to the information graphics that appear in documents. This paper addresses the inclusion of information graphics in digital libraries, along with their role in the summarization of multimodal documents. Section 2 presents a corpus study that explores how information graphics are used in multimodal documents. Section 3 argues (1) that information graphics are an important knowledge resource in their own right and should be retrievable from a digital library, and (2) that information graphics should be taken into consideration when summarizing and indexing multimodal documents.

Although some information graphics are intended only to display data, the majority of information graphics that appear in newspapers, magazines, and formal reports are intended to convey a message. For example, the information graphic in Figure 1 conveys the message that the U.S. ranked third (among the countries listed) in GDP (gross domestic product) per capita in 2001. [Footnote 2: In the original graphic, the bar for the United States was annotated. Here we have highlighted it in order to later show the XML that our system produces when bars are colored differently. We have also placed the dependent axis label alongside the dependent axis, instead of at the top of the graph, since our vision system is currently limited to standard placement of axis labels.] Similarly, the graphic in Figure 2 conveys the message that there was a substantial increase in Delaware bankruptcy personal filings in 2001, compared with the preceding decreasing trend from 1998 to 2000. A graphic's primary or core message constitutes a brief summary of the graphic and captures its major contribution to
the overall communicative goal of a multimodal document. Consequently, developing a methodology for recognizing this message is the first step toward exploiting information graphics. Section 4 presents our implemented and evaluated Bayesian network for identifying the primary message conveyed by one kind of information graphic, simple bar charts. Section 5 discusses how the message recognized by our system can form the basis for summarizing an information graphic, for indexing and retrieving it from a digital library, and for constructing a richer summary of multimodal documents.

[SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008.]

2. MULTIMODAL DOCUMENTS: A CORPUS ANALYSIS

Information graphics are an important component of many documents. In some cases, an information graphic stands alone and constitutes the entire document, as was the case for the graphic in Figure 2. In most cases, however, information graphics are part of a multimodal document consisting of both text and graphics.

We conducted a corpus study whose primary goal was to determine the extent to which the message conveyed by an information graphic in a multimodal document is also conveyed by the document's text. We analyzed 100 randomly selected graphics from our collected corpus of information graphics, along with the articles in which they appeared. The selected articles were taken from magazines (such as Newsweek, Business Week, Fortune, and Time) and from local and national newspapers. The articles varied in length: 27% were very short (a half page or less), 20% were short (one magazine-length page), 22% were moderate in length (two magazine-length pages), and 31% were long (more than two magazine-length pages). The graphics also varied in type: 33% were simple bar charts, 37% were simple line graphs, 10% were grouped bar charts, 16% were multiple line graphs (graphs consisting of multiple lines), and 4% were pie charts.

We examined the text of each article and determined to what extent the text repeated the message conveyed by the information graphic. Table 1 displays the results. In 39% of the instances, the text was judged to fully or mostly convey the message of the information graphic. However, in 26% of the instances, the text conveyed only a little of the graphic's message.

Table 1: Analysis of Text of Articles Containing Graphics

Category     | Judgment                    | #
Category-1:  | Fully conveys message       | 22
Category-2:  | Mostly conveys message      | 17
Category-3:  | Conveys a little of message | 26
Category-4:  | Does not convey message     | 35

[Figure 2: Graphic from Wilmington News Journal (a bar chart of Delaware bankruptcy personal filings, 1998-2001)]

[Figure 3: Graphic from Business Week ("GM's Money Machine": percentage of net earnings coming from finance unit, by quarter from 2002 through mid-2003)]

An example is the graphic shown in Figure 3, which appeared in a Business Week article entitled "For GM, Mortgages are the Motor". The graphic's message is that there was a substantial increase in the percentage of GM's net earnings produced by its finance unit in the second quarter of 2003, in contrast with the preceding decreasing trend from the third quarter of 2002 to the first quarter of 2003. However, the pieces of text most closely related to the graphic were the following:

"So where did the other $818 million in second-quarter profits come from? Try General Motors Acceptance Corp., GM's lending arm."

"Until then, GM will struggle and remain dependent on its finance arm."

"If not, GM would face lean times once GMAC is no longer minting money."

None of these text segments achieves the graphic's primary communicative goal. However, since the text segments talk about $818 million in second-quarter profits coming from GM's finance unit and about the finance unit making a lot of money ("minting money"), we judged the text to at least convey the high profitability of the finance unit and thus classified the text as conveying a little of the graphic's message.

Most surprising was the observation that in 35% of the instances in our analyzed corpus, the text failed to convey any of the graphic's message. An example is the grouped bar chart shown in Figure 4, which is taken from a Newsweek article entitled "Microsoft's Cultural Revolution"; this graphic conveys that the percentage of pirated software in China is much higher than in the world as a whole, and that the decrease in pirated software in 2002 compared with 1994 was larger in the world than in China. Although the text is about Microsoft's efforts in China and the problem of pirated software, the closest that the text comes to capturing the graphic's message is the following statement: "Ninety percent of Microsoft products used in China are pirated."

[Figure 4: Graphic from Newsweek (a grouped bar chart of the percentage of software in use which is pirated: China, 97% in 1994 and 92% in 2002; the world, 49% in 1994 and 39% in 2002)]

[Figure 5: Graphic from Newsweek (a line graph of median income, in thousands of 2001 dollars, of white women and black women, 1948-2001)]

But the text does not compare pirating in China with pirating in the world as a whole, nor does it compare the situation in 2002 with that in 1994. Thus we classified the text of this graphic as failing to convey the graphic's message.
3. INFORMATION GRAPHICS AND DIGITAL LIBRARIES

Information graphics are themselves an important informational resource that should be stored in a digital library and be as accessible to humans as text documents. An individual might access an information graphic for the knowledge that can be gleaned from it, and for its use in planning, problem-solving, and decision-making. For example, a graphic such as the one in Figure 2 might be used by social service agencies in forecasting needed programs for the next year, or by legislators in arguing for or against bankruptcy legislation. Alternatively, graphics might be accessed for use in writing reports or proposals. They could also be used in educational settings for teaching students good analysis techniques.

Our further analysis of these multimodal documents has led us to conclude that graphics in multimodal documents generally have a communicative goal that, along with the communicative goals of the text segments, contributes to accomplishing the discourse purpose [11] of the overall article. For example, Figure 5 illustrates a graphic from a Newsweek article entitled "The Black Gender Gap"; the graphic conveys that the income of black women has risen dramatically over the last decade and has reached the level of white women. Although the text notes that the earnings of college-educated black women exceed both the median for all women and the median for all black working men, the text does not compare the earnings of all black women with those of all white women. Yet this comparison is more important (than those in the text) to achieving the overall communicative goal of this portion of the article, namely convincing the reader that there has been a "monumental shifting of the sands" with regard to the achievements of black women. This example illustrates how authors distribute their communicative goals between the text and the graphics. We hypothesize that the communicative goals captured by information graphics are particularly central to the purpose of a document, since the graphic designer has chosen to draw attention to them via a graphic. Moreover, as shown in Section 2, information graphics have communicative goals or intended messages that contribute to achieving the discourse purpose of a multimodal document but are often not captured by the document's text. Thus the summarization of a multimodal document should take its information graphics into account.

We hypothesize that the core message of an information graphic (the primary overall message that the graphic conveys) can serve as the basis for an effective brief summary of the graphic. In a retrieval setting, this summary could be used to index the graphic and enable intelligent retrieval. In a question-answering setting, the summary might be used to answer a question directly, or to determine whether the graphic should be analyzed in more detail as a possible source for answering the question. We further hypothesize that, since the core message conveyed by an information graphic captures its communicative goal, a graphic's summary based on this core message represents a good starting point for analyzing the contribution of the graphic to the overall discourse purpose of a document, and for taking the graphic's contribution into account in constructing a rich summary of a multimodal document. The next section presents the methodology underlying our implemented and evaluated system for recognizing the core message conveyed by one kind of information graphic, a simple bar chart.

4. RECOGNIZING THE CONVEYED MESSAGE

We contend that a graphic's primary or core message constitutes a brief summary of the graphic and captures its major contribution to the overall communicative goal of a multimodal document. Therefore, our first step toward exploiting information graphics within digital libraries has been to develop a methodology for recognizing the message conveyed by a simple bar chart.

Our message recognition system assumes as input an XML representation of the graphic that specifies its axes, the bars, their heights, colors, labels, any special annotations, the caption, etc. Producing this representation is the responsibility of a Visual Extraction Module (VEM) [3]. This module currently handles only electronic images produced with a given set of fonts. In addition, the VEM currently assumes standard placement of labels and axis headings. Work is underway to remove these restrictions. But even with these restrictions removed, the VEM can assume that it is dealing with a simple bar chart, and thus the problem of recognizing the entities in a graphic is much more constrained than typical computer vision problems. The VEM produces an XML representation for the graphic shown in Figure 1; the measurements in the XML may not match the bar chart in Figure 1, since the chart has been resized for display purposes (see footnote 3). [XML listing not reproduced: it specifies each vertical-axis label and, for each bar, its position, height in axis units, whether it is specially colored, and its label.]

Language research has been based on a theory of speech acts [17, 18], where a speech act is the act of making an utterance with the intention of communicating. Speech act theory posits that a speaker executes a speech act whose intended meaning the listener is expected to deduce, and that the listener deduces the utterance's intended meaning by reasoning about the communicative signals present in the utterance and the mutual beliefs of speaker and hearer [18, 10]. Our research draws on the AutoBrief project's work on generating information graphics [13, 14, 9].
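The information that the VEM must extract from a bar chart can be pictured as a small data structure. The sketch below is ours (the field names and the heights are illustrative; they are not the paper's XML schema or the actual values of Figure 1): each bar carries its label, its height in dependent-axis units, a flag for special coloring, and any annotation, alongside the chart's caption and axis label.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Bar:
    label: str                        # label beneath the bar, e.g. "Norway"
    height: float                     # height in dependent-axis units
    highlighted: bool = False         # colored differently from the other bars?
    annotation: Optional[str] = None  # any special annotation on the bar

@dataclass
class BarChart:
    caption: str
    dependent_axis_label: str
    bars: List[Bar] = field(default_factory=list)

# A fragment of Figure 1 in this form (heights are illustrative placeholders):
gdp = BarChart(
    caption="GDP Per Capita, 2001 (in thousands)",
    dependent_axis_label="thousands of dollars",
    bars=[
        Bar("Luxembourg", 44.1),
        Bar("U.S.", 36.0, highlighted=True),  # the highlighted bar of Figure 1
        Bar("Japan", 28.0),
    ],
)
salient = [b.label for b in gdp.bars if b.highlighted]
```

Downstream modules (Section 4.2) read salience off exactly this kind of flag; here `salient` holds only the U.S. bar.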
The AutoBrief group proposed that speech act theory could be extended to the generation of graphical presentations. Given a desired communicative goal, AutoBrief employed a two-phase graphics generation process. First, an algorithm mapped communicative goals into a set of perceptual and cognitive tasks that the graphic should support. By perceptual tasks we mean tasks that can be performed by simply viewing the graphic, such as determining which of two bars is taller in a bar chart; by cognitive tasks we mean tasks that require a mental computation, such as interpolating between labelled values on the dependent axis to compute the exact value of a point in a graphic. A fundamental hypothesis of the AutoBrief project was that graphic designers construct graphics that make important tasks (tasks that the viewer is intended to perform) as easy as possible. Thus the second step in AutoBrief's graph construction process used a constraint satisfaction algorithm to design a graphic that facilitated these important tasks as much as possible, subject to the constraints imposed by competing tasks.

We are inverting the process. Given an information graphic, we want to recognize its intended message or communicative goal by reasoning about the communicative signals present in the graphic. Thus while AutoBrief extended speech act theory to the generation of information graphics, our work extends speech act theory to the understanding and summarization of information graphics. To recognize an information graphic's message, we make recourse to the plan inference techniques that have been used in language understanding to infer the communicative goal of a sentence in a text or an utterance in a dialogue. Following the lead of Charniak and Goldman, who used a probabilistic framework to model plan inference for language understanding [2], we have developed a Bayesian network to infer the message conveyed by one type of information graphic, simple bar charts.
4.1 A Bayesian Inference System

We view information graphics that appear in popular media as a form of language with a communicative intention. [Footnote 3: Although hand-coded XML was used to train and test the message recognition system, since it was developed in parallel with the VEM, a variety of examples (such as the graphic in Figure 1) have been run through the complete system, with XML produced from the graphic by the VEM and sent to the message recognition system.]

The top level of the Bayesian network represents the 12 categories of messages that we have identified for simple bar charts: Get-Rank, Increasing-Trend, Decreasing-Trend, Maximum, Contrast-Point-with-Trend, Relative-Difference-between-Entities, Relative-Difference-with-Degree, Other (Present Data), Rank-of-All, Stable-Trend, Change-Trend, and Minimum.

The next level of the network captures the possible instantiations of the general message categories for a given graphic. For example, if a graphic has three bars, then the children of the Get-Rank node might be Get-Rank(_label1, _bar1), Get-Rank(_label2, _bar2), and Get-Rank(_label3, _bar3). We use operators to capture how communicative goals and perceptual tasks can be decomposed into a set of simpler subgoals or subtasks. Figure 6 displays the operator for the goal Get-Rank(_label, _bar).

Figure 6: A Sample Operator
Get-Rank(_label, _bar)
1. Perceive-if-bars-are-sorted
2. Perceive-bar(_label, _bar)
3. Perceive-rank(_bar)

Figure 7: A Piece of Network Structure
Get-Rank(_label, _bar), with children Perceive-if-bars-are-sorted, Perceive-bar(_label, _bar), and Perceive-rank(_bar)

The operator states that, to achieve the goal of the viewer getting the rank of the bar associated with a given label in a simple bar chart, the viewer must perceive whether the bars are sorted according to height, then perceive (i.e., find) the bar associated with the specified label, and finally perceive the rank of that bar with respect to bar height.
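The operator of Figure 6 is easy to render executable. In this sketch (the representation is ours, not the system's), an operator maps a goal pattern to its ordered subgoals, with `_`-prefixed names standing for the parameters to be instantiated for a concrete graphic:

```python
# Each operator decomposes a goal into ordered subgoals; parameters written
# "_label" and "_bar" are variables filled in per bar of the chart.
OPERATORS = {
    "Get-Rank(_label,_bar)": [
        "Perceive-if-bars-are-sorted",
        "Perceive-bar(_label,_bar)",
        "Perceive-rank(_bar)",
    ],
}

def instantiate(pattern, label, bar):
    """Fill an operator pattern's variables for one bar of the chart."""
    return pattern.replace("_label", label).replace("_bar", bar)

subgoals = [instantiate(s, "U.S.", "bar3")
            for s in OPERATORS["Get-Rank(_label,_bar)"]]
```

For the bar labelled "U.S.", this yields the three instantiated subtasks that become children of the Get-Rank node, as in Figure 7.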
Subgoals in operators are either primitive perceptual tasks or have associated operators that decompose them further. The operators determine the structure of our Bayesian network, in that the subgoals in an operator become children of their goal node in the Bayesian network. For example, Figure 7 displays the piece of the Bayesian network produced by the Get-Rank operator.

The entire network is built dynamically for a new graphic. A node capturing the top-level message category is entered into the network, along with nodes capturing a set of low-level perceptual tasks. Ideally, this would include every possible instantiation of each low-level perceptual task; for example, the parameter _bar in the perceptual task Get-Label(_bar) could be instantiated with any of the bars that appear in a graphic. However, memory limitations restrict the size of the network and force us to include only instantiated perceptual tasks that are suggested by the graphic. The instantiations that produce perceptual tasks of lowest effort, together with any salient instantiations (see Section 4.2), form the set of suggested low-level perceptual tasks that are initially entered into the network. Chaining via the operators then adds nodes until a link is established to the top level; as new nodes are added, their subgoals (as captured in the plan operators) are also added, so that the network also expands downwards. Once the network is constructed, evidence nodes are added, as discussed in the next section.

4.2 Three Kinds of Communicative Signals

Bayesian networks need evidence for guiding the construction of a hypothesis. In natural language understanding, the observed evidence would include features of the utterance and the context in which it was made. For information graphics, the evidence consists of the communicative signals present in the graphic.
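The way operators dictate network structure can be sketched as a graph expansion. The toy operator table below stands in for the system's full set, and only the downward expansion from the top level is shown (the implemented system also chains upward from the suggested low-level tasks):

```python
# Toy operator table: goal -> subgoals.  Goals absent from the table are
# primitive perceptual tasks and are not expanded further.
OPERATORS = {
    "Message-Category": ["Get-Rank"],
    "Get-Rank": ["Perceive-if-bars-are-sorted", "Perceive-bar", "Perceive-rank"],
}

def expand(top_goal):
    """Grow (parent, child) edges downward from the top-level node; each
    subgoal of an operator becomes a child of its goal node (Section 4.1)."""
    edges, frontier = [], [top_goal]
    while frontier:
        goal = frontier.pop()
        for subgoal in OPERATORS.get(goal, []):
            edges.append((goal, subgoal))
            frontier.append(subgoal)
    return edges

network_edges = expand("Message-Category")
```

The resulting edge list reproduces the Figure 7 fragment beneath a single top-level category node.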
We have identified three kinds of communicative signals that appear in simple bar charts: the relative effort of perceptual tasks that the viewer might perform on the graphic, salient elements in the graphic, and signals from the verbs and adjectives in the caption.

We have adopted the AutoBrief hypothesis that the graphic designer constructs a graphic that makes intended tasks as easy as possible. Thus the relative difficulty of different perceptual tasks serves as a signal about which tasks the viewer was intended to perform in deciphering the graphic's intended message. This correlates with Larkin and Simon's observation [15] that graphics that are informationally equivalent are not necessarily computationally equivalent; that is, it might be possible to infer the same information from two different graphics, but it might be much easier to do so in one graphic than in the other. As a very simple example, consider comparing the heights of two bars in a simple bar chart. If the bars are adjacent to one another and significantly different in height, the task will be easy; however, if the two bars do not differ much in height, are not annotated with their values, and are separated by many intervening bars, the task will be much more difficult. Similarly, if the bars in a bar chart are ordered by height, then perceiving the rank of a particular bar (see subgoal 3 in Figure 6) will be far easier than if the bars are unsorted. We constructed a set of rules for estimating the effort involved in performing different perceptual tasks in a simple bar chart. These rules have been validated by eye-tracking experiments and are presented in [5].

The second kind of communicative signal is salience. An entity in a graphic can be made salient in a variety of ways. For example, its bar might be colored differently from the other bars in the graphic, as is the bar for the U.S. in Figure 1, or the bar's label might be mentioned in the caption.
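The effort signal can be caricatured with a rule like the one below. The weights are invented for illustration; they are not the validated effort rules of [5].

```python
def compare_height_effort(heights, i, j, annotated=False):
    """Rough effort score for judging which of bars i and j is taller:
    cheap when values are annotated, or when the bars are adjacent and
    clearly different; expensive when similar bars are far apart.
    (Weights are invented for illustration.)"""
    if annotated:
        return 1                          # just read the two printed values
    effort = 1 + (abs(i - j) - 1)         # each intervening bar adds effort
    if abs(heights[i] - heights[j]) < 0.05 * max(heights):
        effort += 5                       # near-equal heights are hard to judge
    return effort

heights = [10, 40, 12, 30, 11]
easy = compare_height_effort(heights, 0, 1)   # adjacent, clearly different
hard = compare_height_effort(heights, 0, 4)   # far apart, nearly equal
```

Low scores mark tasks the designer has made easy, and hence tasks the viewer is more plausibly intended to perform.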
To determine whether the labels on any bars appear as part of the caption, a caption processing module [4] extracts nouns from the caption found in the XML representation of the graphic using a part-of-speech tagger, matches the nouns against the labels on the bars, and augments the XML representation to indicate those bars whose label matches a noun in the caption. By analyzing the augmented XML representation, the message recognition module can determine which entities have been made salient by virtue of their special color, special annotation, reference in the caption, etc.

The third kind of communicative signal is the presence in the caption of a verb or adjective that suggests a particular category of message. [Footnote 4: In [4] we present a corpus study showing that (1) captions are often very general or uninformative, and (2) even when captions convey something about the graphic's intended message, the caption is often ill-formed or requires extensive analogical reasoning. Thus we have chosen to perform a shallow analysis of captions that extracts communicative signals but does not attempt to understand the caption.] For example, the verb "lag" might suggest a message about some entity being a minimum, or about some entity falling behind some other entity in value. Using WordNet and a thesaurus, we identified classes of verbs and adjectives that are similar in meaning and might suggest some general category of message. Our caption processing module uses a part-of-speech tagger and a stemmer to identify the presence of one of our identified verb or adjective classes in the caption, and the XML representation is augmented to reflect this.

The identified communicative signals must be entered into the Bayesian network as evidence that can influence the system's hypothesis about the graphic's message.
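A stripped-down version of the caption-salience check looks as follows. The module of [4] uses a real part-of-speech tagger and stemmer; bare token matching stands in for both here.

```python
def bars_salient_from_caption(caption, bar_labels):
    """Return the bar labels that appear as tokens of the caption -- the cue
    the caption processing module records in the augmented XML.  Plain
    token matching is a simplification of the tagger-plus-stemmer pipeline."""
    normalize = lambda s: s.strip('.,;:!?"\'').lower()
    tokens = {normalize(t) for t in caption.split()}
    return [lab for lab in bar_labels if normalize(lab) in tokens]

hits = bars_salient_from_caption("Can Japan catch Luxembourg?",
                                 ["Luxembourg", "Switzerland", "Japan"])
```

The matched labels are then marked in the graphic's representation so that the corresponding instantiated perceptual tasks are treated as salient.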
Nodes indicating the effort involved in performing a particular perceptual task, and nodes capturing whether a parameter of a particular perceptual task is salient in the graphic, are attached to the low-level perceptual task nodes in the Bayesian network. Nodes reflecting the presence or absence of one of our identified verb and adjective classes are attached to the message category node at the top of the Bayesian network, since verbs and adjectives serve to signal a general category of message.

4.3 Implementation and Evaluation

Associated with each child node in a Bayesian network is a conditional probability table that captures the probability of each value of the child node given the value of its parent node. In our network, the parent nodes are always goals or tasks, and thus the value of a parent node is always that it either is (or is not) part of what the viewer is intended to do in recognizing the graphic's message. Our conditional probability tables are computed from our corpus of graphics. We implemented our Bayesian network for hypothesizing an information graphic's message using the Netica software tools [16] for Bayesian reasoning. The current system is limited to simple bar charts, but we believe that our methodology will be applicable to more complex graphics.

Using a corpus of 110 simple bar charts that had previously been annotated with their messages, we evaluated our approach using leave-one-out cross validation, in which each graphic is selected once as the test graphic and the other 109 graphics are used to compute the conditional probability tables for the Bayesian network. For each test graphic, the system was credited with success if its top-rated hypothesis matched the message assigned to the graphic by the human coders and the probability that the system assigned to the hypothesis exceeded 50%. The system's overall success rate was 79.1%.
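The evaluation protocol is ordinary leave-one-out cross validation. In the sketch below, `train` and `predict` are our placeholders (a majority-class baseline) standing in for computing the conditional probability tables and querying the Bayesian network:

```python
from collections import Counter

def leave_one_out_success(corpus, train, predict):
    """Hold out each (graphic, gold_message) pair once; a prediction counts
    as a success only if it matches the annotation AND its probability
    exceeds 50%, as in Section 4.3."""
    hits = 0
    for i, (graphic, gold) in enumerate(corpus):
        model = train([ex for j, ex in enumerate(corpus) if j != i])
        hypothesis, prob = predict(model, graphic)
        if hypothesis == gold and prob > 0.5:
            hits += 1
    return hits / len(corpus)

# Majority-class baseline standing in for the real network:
def train(examples):
    message, count = Counter(gold for _, gold in examples).most_common(1)[0]
    return message, count / len(examples)

def predict(model, graphic):
    return model   # ignore the graphic; always guess the majority message

corpus = [("g1", "Rising-Trend"), ("g2", "Rising-Trend"),
          ("g3", "Rising-Trend"), ("g4", "Change-Trend")]
rate = leave_one_out_success(corpus, train, predict)
```

On this toy corpus the baseline scores 0.75, which illustrates why the paper compares its 79.1% against the most-common-category baseline rather than chance.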
This can be compared with a baseline of always choosing the most common message category, rising trend, which occurred 23.6% of the time; however, our system's task was more difficult than just selecting the message category, since it also had to determine the instantiation of the parameters. For example, rather than just hypothesizing a Change-Trend message category, our system had to identify the two contrasting trends and the point at which the trend changed.

In order to determine whether the intentions being inferred by our system would meet the approval of users, we performed a second evaluation. This evaluation consisted of a survey in which we asked human subjects to examine a set of bar charts and rate a posited primary intention for each bar chart. Seventeen undergraduate students took part in the survey, which contained twenty-seven bar charts. For each bar chart, the participants were asked to answer a set of questions, including their level of agreement with the stated primary intended message of the graphic (strongly agree, agree, not sure, disagree, strongly disagree), with follow-up questions for cases where they did not agree with the stated message. For twenty of the twenty-seven bar charts, the statement matched the hypothesized intention of our system. For the remaining seven bar charts, we proposed messages that did not match the hypotheses of our system.

When calculating the results of the survey, we assigned numeric values to the scale in the first question: an answer of "strongly agree" was counted as a four, and "strongly disagree" as a zero.

[Figure 8: A Variation of a Graphic from USA Today ("The sound of sales": total albums sold in first quarter, in millions, 1998-2002)]

For the twenty graphs where the proposed message matched the output of our system, we expected the majority of participants to agree with the proposed message.
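Aggregating such ratings is straightforward. The sketch below scores responses on the survey's 0-4 scale and adds a normal-approximation 95% confidence interval; the 1.96 factor is our assumption, since the paper does not state how its intervals were computed.

```python
from statistics import mean, stdev

SCALE = {"strongly agree": 4, "agree": 3, "not sure": 2,
         "disagree": 1, "strongly disagree": 0}

def summarize(responses):
    """Mean, sample standard deviation, and 95% CI half-width for a list of
    Likert responses scored 0-4 (normal approximation assumed)."""
    scores = [SCALE[r] for r in responses]
    m, s = mean(scores), stdev(scores)
    return m, s, 1.96 * s / len(scores) ** 0.5

m, s, ci = summarize(["strongly agree", "agree", "agree", "agree"])
```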
This was indeed the case: the average agreement rating for the twenty graphs was 3.33 (a value between "agree" and "strongly agree" on our scale), with a standard deviation of 1.02 and a 95% confidence interval of .108. On the other hand, for the seven graphs where the proposed message did not match the output of our system, the average agreement rating was only 1.19, with a standard deviation of 1.46 and a 95% confidence interval of .261. The results of this survey demonstrate (1) that viewers of information graphics do tend to form a consensus regarding the intended message of a graphic, and (2) that using the intentions recognized by our system as the basis of a summary of a graphic should produce summaries that would be satisfactory to a majority of users.

5. EXAMPLES AND RELEVANCE TO DIGITAL LIBRARIES

Our current research has focused on developing a methodology for recognizing the primary message conveyed by an information graphic. This work has several applications within digital libraries. The first is to provide a representation of the core message of an information graphic for indexing and retrieval of the graphic. Consider the graphic in Figure 8. A graphic stored using the caption "The sound of sales", or even the dependent axis label "Total albums sold in first quarter in millions", could have a variety of messages, such as conveying the distribution of record album sales among major recording studios, or contrasting record album sales by rock artists with those by jazz singers. Our system hypothesizes that the graphic in Figure 8 conveys a changing trend in record album sales, with sales increasing from 1998 to 2000 and then decreasing from 2000 to 2002.
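Such a hypothesis can be carried as a structured record and queried at retrieval time. The record below mirrors the Change-trend logical form; the matching predicate is our sketch of how an index could use it, not the system's implementation:

```python
# Change-trend(increasing, 1998, 2000, decreasing, 2002, total albums sold
# in first quarter in millions), with the second trend running 2000-2002.
message = {
    "category": "Change-trend",
    "trends": [("increasing", 1998, 2000), ("decreasing", 2000, 2002)],
    "measure": "total albums sold in first quarter in millions",
}

def retrievable_for(msg, *, span=None, term=None):
    """Would a query about a year span or a term retrieve this graphic?"""
    start, end = msg["trends"][0][1], msg["trends"][-1][2]
    if span is not None and not (start <= span[0] and span[1] <= end):
        return False
    if term is not None and term not in msg["measure"] and term != msg["category"]:
        return False
    return True
```

A query about album sales in 1999-2001 or about trends in "albums" would match this record, while a query about GDP would not.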
The logical representation of this message is

Change-trend(increasing, 1998, 2000, decreasing, 2002, total albums sold in first quarter in millions)

Indexing the graphic using this message would enable retrieval of the graphic if one wanted to know about album sales during the period 1998-2002, about trends in album sales, or about trends that changed between 1998 and 2002. Furthermore, the probability assigned to the graphic's message reflects the system's confidence in its hypothesis and thus how clearly the graphic conveys the posited message. The associated probability might therefore be used as one criterion in ranking alternative graphics that are suggested for retrieval.

The author of a multimodal document ostensibly viewed certain communicative goals as important enough to warrant expending effort on designing graphics to convey them. This suggests that the messages conveyed by information graphics should be taken into account in summarizing the document. For example, consider the information graphic in Figure 9, which appeared in a Business Week article entitled "THE START OF A DOT-COMBACK?".

[Figure 9: Graphic from Business Week, "CLICK, CLICK, KA-CHING": U.S. E-Tail Sales (billions of dollars), '98-'03]

Our system hypothesizes that the graphic's message is that there has been a rising trend in U.S. E-tail sales between '98 and '03. The graphic's message is not conveyed in the article, but the message represents a communicative goal which, together with the discourse purposes of the text segments, contributes to the overall discourse purpose of the article -- namely, that dot-com companies are now focusing on results and that prospects for infusions of venture capital are improving. Thus the graphic plays a central role in the article, and its conveyed message should be captured in the document's summary.

The third use of the messages produced by our system is as an indication of the focus or topic of a document. Consider the graphic shown in Figure 10.

[Figure 10: Graphic from Business Week, U.S. Credit Cards in Circulation in 2003 (millions), for Visa, Mastercard, Discover, American Express, and Diner's Club]

The highlighting of the bar associated with American Express is an important communicative signal in this graphic; without the highlighting, the message inferred for the graphic by our system is that the graphic conveys the relative rank of the different credit card companies in terms of U.S. credit cards in circulation in 2003. But with American Express highlighted, our system hypothesizes that the graphic's message is that American Express ranks fourth among the five credit card companies shown with respect to U.S. credit cards in circulation in 2003. Consequently, although the text of the article discusses a variety of credit card companies (for example, it states that AmEx charges merchants higher fees than do Visa and MasterCard), the focus of the graphic is clearly on American Express, which suggests that the article is about American Express. This is in fact the case. We hypothesize that the messages conveyed by information graphics in a multimodal document can be helpful in determining the focus of an article and thus useful in constructing a good summary.

Our future work includes not only extending our message recognition system for graph summarization to more complex graphics, such as line graphs and grouped bar charts, but also investigating 1) the use of our inferred messages in indexing and retrieving information graphics in a digital library and 2) the utilization of the messages conveyed by information graphics in constructing good summaries of multimodal documents.

6. RELATED WORK

Within information retrieval research, the work most closely related to our project involves the indexing and retrieval of images. Bradshaw [1] notes that work on image retrieval has progressed from systems that retrieve images based on low-level features such as color, texture, and shape ([6, 21, 19, 12], among others) to systems that attempt to classify and reason about the semantics of the images being processed. These include systems that attempt to classify images according to attributes such as indoor/outdoor, city/landscape, and man-made/artificial, often presenting these classifications as occurring along semantic axes (see [22], for example) or as probabilistic labels (see [23, 1] for examples). Srihari, Zhang, and Rao have examined the possibility of combining text-based indexing techniques for the caption and any accompanying text with image-based techniques [20]. They argue that text-based methods alone are ineffective, and provide the example of a search for pictures of Clinton and Gore, which returned 941 images. After eliminating graphics and spurious images, 547 images remained, but upon manual inspection only 76 of those actually contained pictures of Clinton or Gore! They demonstrate, however, that when combined with image-based retrieval techniques, the collateral text can provide a rich source of evidence for improving the information retrieval process. Their work is similar to ours in that they are attempting to use multiple, disparate sources of evidence.
However, the work done in image retrieval is concerned with the semantics of images (what is physically represented in the image and the relationships between those objects, such as "Clinton at the White House" or "a bee on a flower"), whereas we are concerned with recognizing the communicative goal or discourse purpose of an information graphic -- that is, the message that the graphic is intended to convey.

Very little research has been concerned with summarizing information graphics. Yu et al. [24] used pattern recognition techniques to summarize interesting features of time series data generated by a gas turbine engine, but this was automatically generated data without a communicative intention. Futrelle and Nikolakis [8] developed a constraint grammar formalism for parsing vector-based visual displays and producing structured representations of the elements comprising the display. The goal of Futrelle's current research is to produce a summary graphic that captures the content of one or more graphics from a document [7]. However, the end result will itself be a graphic.

7. CONCLUSION

Information graphics are an important component of many documents, yet little consideration has been given to making them available in a digital library or to taking them into account when summarizing a multimodal document. This paper has presented the results of a corpus study which showed that the communicative goal or message of an information graphic is typically not repeated in the article's text. It has also presented our implemented and evaluated Bayesian network for recognizing the message conveyed by an information graphic. This message can form the core of a brief summary of the graphic, both for indexing the graphic in a digital library and for taking constituent information graphics into account when summarizing a multimodal document.
To our knowledge, our research is the first to address the problem of enabling digital libraries to exploit the rich resource of information graphics.

8. ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0534948.

9. REFERENCES

[1] B. Bradshaw. Semantic based image retrieval: a probabilistic approach. In Proc. 8th ACM Int. Conf. on Multimedia, pages 167-176, 2000.
[2] E. Charniak and R. Goldman. A Bayesian model of plan recognition. Artificial Intelligence Journal, 64:53-79, 1993.
[3] D. Chester and S. Elzer. Getting computers to see information graphics so users do not have to. In Proc. 15th Int. Symposium on Methodologies for Intelligent Systems, pages 660-668, 2005.
[4] S. Elzer, S. Carberry, D. Chester, S. Demir, N. Green, I. Zukerman, and K. Trnka. Exploring and exploiting the limited utility of captions in recognizing intention in information graphics. In Proc. 43rd Meeting of the Association for Computational Linguistics, pages 223-230, 2005.
[5] S. Elzer, N. Green, S. Carberry, and J. Hoffman. Incorporating perceptual task effort into the recognition of intention in information graphics. In Proc. of the Third Int. Conf. on Theory and Application of Diagrams, pages 255-270, 2004.
[6] M. Flickner, D. Petkovic, D. Steele, P. Yanker, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, and D. Lee. Query by image and video: The QBIC system. Computer, 28(9):23-32, 1995.
[7] R. Futrelle. Summarization of diagrams in documents. In I. Mani and M. Maybury, editors, Advances in Automated Text Summarization. MIT Press, 1999.
[8] R. Futrelle and N. Nikolakis. Efficient analysis of complex diagrams using constraint-based parsing. In Proc. of the Third Int. Conf. on Document Analysis and Recognition, 1995.
[9] N. Green, G. Carenini, S. Kerpedjiev, J. Mattis, J. Moore, and S. Roth. AutoBrief: An experimental system for the automatic generation of briefings in integrated text and graphics. Int. Journal of Human-Computer Studies, 61(1):32-70, 2004.
[10] H. P. Grice. Utterer's meaning and intentions. Philosophical Review, 68:147-177, 1969.
[11] B. Grosz and C. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204, 1986.
[12] A. Gupta and R. Jain. Visual information retrieval. Communications of the ACM, 40(5):71-79, 1997.
[13] S. Kerpedjiev, G. Carenini, N. Green, J. Moore, and S. Roth. Saying it in graphics: From intentions to visualizations. In Proc. of the IEEE Symposium on Information Visualization, pages 97-101, 1998.
[14] S. Kerpedjiev and S. Roth. Mapping communicative goals into conceptual tasks to generate graphics in discourse. In Proc. of the Int. Conf. on Intelligent User Interfaces, pages 60-67, 2000.
[15] J. Larkin and H. Simon. Why a diagram is (sometimes) worth ten thousand words. Cognitive Science, 11:65-99, 1987.
[16] Norsys Software Corp. Netica, 2005.
[17] J. R. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, London, England, 1970.
[18] J. R. Searle. Indirect speech acts. In P. Cole and J. Morgan, editors, Syntax and Semantics: Speech Acts, volume 3, pages 59-82. Academic Press, Inc., New York, New York, 1975.
[19] J. R. Smith and S.-F. Chang. Querying by color regions using the VisualSEEk content-based visual query system. In M. T. Maybury, editor, Intelligent Multimedia Information Retrieval, pages 23-41. AAAI Press/MIT Press, 1997.
[20] R. K. Srihari, Z. Zhang, and A. Rao. Intelligent indexing and semantic retrieval of multimodal documents. Information Retrieval, 2(2):1-37, 2000.
[21] M. J. Swain. Color indexing. Int. Journal of Computer Vision, 7(1):11-32, 1991.
[22] A. B. Torralba and A. Oliva. Semantic organization of scenes using discriminant structural templates. In Proc. of the Int. Conf. on Computer Vision, 1999.
[23] A. Vailaya, M. Figueiredo, A. K. Jain, and H.-J. Zhang. Image classification for content-based indexing. In Proc. of the IEEE Conf. on Multimedia Computing and Systems, pages 518-523, 1999.
[24] J. Yu, J. Hunter, E. Reiter, and S. Sripada. Recognizing visual patterns to communicate gas turbine time-series data. In Proc. of ES2002, pages 105-118, 2002.