SIGIR 2007 Proceedings
ACM Athena Award Lecture

2007 Athena Lecturer Award
Natural Language and the Information Layer

Karen Spärck Jones
Computer Laboratory, University of Cambridge
William Gates Building, JJ Thomson Avenue, Cambridge CB3 0FD, UK

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Retrieval models; I.2.7 [Artificial Intelligence]: Natural Language Processing – Text analysis.

General Terms
Algorithms, Human Factors, Theory.

Keywords
Athena Lecturer Award, Information Layer.

Copyright is held by the author/owner(s). SIGIR'07, July 23–27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

This talk is in response to two awards: the Association for Computing Machinery's Athena Award, given by the ACM's Committee on Women, on the nomination of the ACM Special Interest Group on Information Retrieval; and the British Computer Society's Lovelace Medal. It is a very great honour to have been given these awards, and I would like to say how much I appreciate this recognition. Thank you, ACM and BCS. I would particularly like to say, and I hope the ACM will not take this amiss, how much I appreciate being the first woman to be awarded the BCS Lovelace Medal.

The awards carry with them the opportunity to give a lecture. I deeply regret not being able to do this live, and in a way to suit each award specifically. But I hope the single video based on this talk will go a little way as a substitute for two proper lectures.

My talk has three parts: on the first phase of natural language processing research and its lessons; on subsequent developments up to the present and their lessons; and on where we are now and what I think are the wider implications for the future.

Words, classification, and retrieval

When I began research in computing nearly fifty years ago, people were very excited about what could be done with computers, challenged by how to do it, and pushing applications that offered new opportunities in dealing with information. One of these areas was NLP – natural language processing – or NLIP – processing information conveyed in natural language. At that time, most researchers thought primarily of the efficiency gains that could be made by emulating or supporting people, e.g. in translation or document retrieval, though earlier visionaries like Warren Weaver had seen how computers could open up quite novel opportunities, and a little later Doug Engelbart would illustrate some of these.

My own research focussed on automatic classification (nowadays called unsupervised machine learning). It was clear that humans rely in NLP on the use of general conceptual classifications of words – like a thesaurus – to resolve ambiguity in language (in research practice, text). Individual words are ambiguous, and so are the structures represented by word strings. These ambiguities are resolved by the fact that discourse is coherent, and effective because concepts and topics are repeated, and repeated enough to get them across. Repeated individual concepts and standard conceptual patterns enable us to select the right meanings for words and structural relationships for word sequences. This very general idea fairly obviously applies to translation, but it also applies to document retrieval: different words in a query and a document may still stand for the same concept and thus be allowed to match.
The question is, where to get the lexical classification and stock of text patterns from. The obvious answer is, from text. If words tend to co-occur in many texts, this similar behaviour suggests they are conceptually related. Thus in principle one should be able to build a thesaurus automatically from a vast text corpus, and analogously to extract repeating conceptual patterns. I was interested in building a thesaurus for translation, but of course had no corpus. I managed to finesse this by exploiting some limited dictionary (not thesaurus) data, for a pilot demonstration.

For retrieval the situation is easier. The collection of documents or texts from which you want to retrieve supplies the corpus from which to build the classification. This is the line we followed, with some success in retrieval performance, particularly when relative word frequency was adequately factored in and eventually captured by word weighting. Documents score not just by the number of words matching, but by the sum of their weights.

All of this was essentially statistical in character. Facts about word occurrences and co-occurrences were used to capture meaning in a way that could be manipulated without needing to know what that meaning was. An unusually frequent word in a document is a good indicator of an important topic in the document, so if the query uses the word, the document is likely to be relevant to your information need. One very simple word weighting formula captures this effectively. If a word occurs in many documents, most of these – other things being equal – are unlikely to be relevant to the user's information need. Weighting by inverse document frequency, idf, relatively favours less common and more discriminating terms. If you also take account of within-document frequency – so-called term frequency, tf – and also modulate by document length as appropriate, especially for full text, you get a robust, easily implemented mechanism – tf × idf-type weighting – that has consistently performed well. If, in addition, you can find out that some documents are relevant to the user's need, as you might in an interactive or relevance feedback environment, you can distinguish between query term occurrences in relevant and non-relevant documents and do a much more discriminating, and hence effective, retrieval job, still very simply.

All of this was established by the research community, through exhaustive experiments supported by theory development, by the mid seventies. But quite apart from being ignored by conventional library and abstracting services, it wasn't thought of as having anything to do with computing proper, which was all about fundamental, generic things like programming languages and compilers, and operating systems, and which developed further, again fundamentally and generically, with motivating and validating theory – say with Scott-Strachey semantics – on the one hand, and with distributed systems and security on the other.
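To make the co-occurrence idea above concrete, here is a minimal sketch, in modern terms and with an invented toy collection, of grouping terms by how often they appear in the same documents. The Dice measure and the threshold are illustrative choices, not the specific procedure of the early experiments.

```python
# A minimal sketch (not the original method) of building term classes from
# co-occurrence in a document collection: terms that appear together in many
# documents are grouped as likely conceptual relatives, in the spirit of an
# automatically built thesaurus.

from collections import defaultdict
from itertools import combinations

documents = [                       # toy stand-in for a real collection
    ["ship", "ocean", "voyage"],
    ["ship", "sea", "voyage", "storm"],
    ["ocean", "sea", "wave"],
    ["engine", "piston", "fuel"],
    ["engine", "fuel", "exhaust"],
]

doc_freq = defaultdict(int)         # number of documents containing each term
co_freq = defaultdict(int)          # number of documents containing each term pair

for doc in documents:
    terms = set(doc)
    for t in terms:
        doc_freq[t] += 1
    for pair in combinations(sorted(terms), 2):
        co_freq[pair] += 1

def dice(a, b):
    """Dice association between two terms; one simple choice of measure."""
    return 2 * co_freq[tuple(sorted((a, b)))] / (doc_freq[a] + doc_freq[b])

THRESHOLD = 0.5                     # illustrative cut-off
for term in sorted(doc_freq):
    related = [t for t in sorted(doc_freq) if t != term and dice(term, t) >= THRESHOLD]
    if related:
        print(term, "->", related)
```

On this toy data, terms such as "ship", "sea" and "voyage" end up grouped together, which is the thesaurus-like behaviour being described; real collections need far more care about frequency effects and class size.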
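The tf × idf-type weighting described above can likewise be sketched in a few lines. This uses one common textbook variant (idf = log(N/n), raw within-document frequency, simple length normalisation) purely for illustration; the exact formulae explored in the experimental literature vary.

```python
# A minimal sketch of tf x idf-type document scoring, as described above.
# The documents and query are invented; the formulae are one common variant.

import math
from collections import Counter

def rank(query_terms, documents):
    """Score and rank documents (lists of tokens) against a list of query terms."""
    N = len(documents)
    # document frequency: the number of documents each term occurs in
    df = Counter(t for doc in documents for t in set(doc))
    scored = []
    for i, doc in enumerate(documents):
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue                      # term absent from the collection
            idf = math.log(N / df[t])         # rarer terms weigh more
            s += (tf[t] / len(doc)) * idf     # length-normalised tf x idf
        scored.append((s, i))
    return sorted(scored, reverse=True)

docs = [["grey", "ship", "on", "the", "ocean"],
        ["the", "engine", "of", "the", "ship"],
        ["fuel", "for", "the", "engine"]]
print(rank(["ship", "ocean"], docs))   # the first document scores highest
```

The relevance feedback refinement mentioned above keeps the same scoring loop but replaces the idf factor with a weight contrasting a term's occurrences in known relevant and non-relevant documents.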
But the retrieval research contained an important lesson. This was not a novel lesson in itself, but it is one too often forgotten when you try to get computers to do things that humans do. Thus, particularly in the early days of automatic classification work, nobody really examined the assumptions on which it was based, i.e. what a classification should be like and what it was for. It is easy to believe that there are natural classes and natural classifications, and that these will be appropriate for whatever you want to use them for, so that the only challenge in automation is how to find them. But this isn't true even in biology, any more than it is true in retrieval, as retrieval researchers found. There are lots of equally plausible ways to group things (even given the same feature descriptions), and the issue is to analyse the purpose for which a classification is intended and to develop one to suit that purpose. In retrieval you want to achieve precision (only relevant documents) and recall (all relevant documents), which are in fact conflicting, and it turned out, contrary to expectation, that precision-promoting tight classes, not hospitable recall-promoting ones, were what was really required. The effectiveness of weighting was geared to the same truth: idf is not an obvious notion, but words are sloppy, and if you don't constrain word matching, things get too sloppy. Learning all of this – what you want to achieve by automation, not what you think you want to achieve by automating humans' beliefs and actions – took much trial and error, and demonstrated the need for fresh thinking about goals, thorough experiment, and a sound evaluation methodology.

So the lesson from the first phase of research was that retrieval isn't what you thought, but when you have figured it out, statistics are a good thing.

Symbols, the world, and language interpretation

Indexing and retrieval, thus statistically characterised, are towards one end of the spectrum of NLIP tasks. Translation, or question-answering, or summarising, appear to require symbolic processing, i.e. syntactic parsing and semantic analysis in discourse interpretation, and manipulation of an explicit meaning representation (and complementarily for generation). By symbolic I mean displaying the meaning of "The girl ate chocolates" as a young female as agent taking into herself some sweet consumables as objects, with every component of the representation actually a formal, well-defined symbol for something in some world. Thus correctly answering a question such as "Did John marry the girl he loved?" will not necessarily be achieved by just matching on the words "John", "marry", "girl", and "loved". Similarly it is hard to see how summarising the mini text "Mary went into the garden and cut some roses and lilies. She arranged them all in a tall, dark brown jug" can be done by some statistical word-based operation: that wouldn't deal with referring expressions like "them" or the greater importance of "jug" than "brown".

While research on retrieval was slowly building up, people in the NLP community, from the 1970s onwards through the eighties and into the nineties, were developing the tools for symbolic processing – parsers, lexicons, representation formalisms and inference mechanisms – that encouraged them to tackle challenging NLIP tasks that seemed to require symbolic processing in order to build and use deep sentence and text meaning representations, and to relate these to world models. For example, given "Jill fell off her horse", to answer the question "Did it run away?" This turned out to be hard, harder than expected, even for elementary tasks like simple natural language queries to databases.

There was also a growing problem about evaluation. In principle one can build a system, deploy it for real, and see whether it works satisfactorily. But as this is very expensive, you would like some means of earlier evaluation.
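The precision and recall measures invoked above are at least simple to state. A minimal sketch, with invented document identifiers and relevance judgements, shows how a tight retrieval favours precision and a hospitable one favours recall:

```python
# A minimal sketch of precision and recall for a retrieved set, measured
# against relevance judgements. The identifiers and judgements are invented.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0   # only relevant?
    recall = len(hits) / len(relevant) if relevant else 0.0        # all relevant?
    return precision, recall

relevant = {"d1", "d4", "d7"}

# Tight, precision-promoting retrieval: few documents, all of them relevant.
print(precision_recall({"d1", "d4"}, relevant))                    # roughly (1.0, 0.67)

# Hospitable, recall-promoting retrieval: everything relevant is there,
# but diluted by non-relevant documents.
print(precision_recall({"d1", "d2", "d3", "d4", "d5", "d6", "d7"}, relevant))  # roughly (0.43, 1.0)
```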
Unfortunately it is extremely difficult to say exactly what a system ought to deliver, in the fine grain, especially when you are trying to do something new and not just replicate an old, manually obtained output (and looking at system-internal products like meaning representations is of very limited value). Thus with information or fact extraction from text, what facts should one be trying to obtain from "Peter patted the dog and stroked the big black cat"? Just "Peter patted a dog; Peter stroked a cat", or "Peter patted a dog at time t1; Peter stroked a cat at time t2" (notice the temporal order is inferred), or "There was a dog; There was a cat; ...", or "A cat was big; A cat was black; ..." (are the big cat and the black cat the same?). There is no limit to what you can inferentially extract if you put your mind to it, but do you want it all, how do you express it, and how do you assess it?

Retrieval research avoided this problem by taking users' relevance assessments for their queries as a gold standard. But relevance is a broad notion, and the retrieved documents are only vehicles for the user's discovery process. With fact extraction for a large information base, or summarisation, the situation is more complex. Facts are specific nuggets, and summaries are new, but information-losing, discourses in their own right, so it is impossible to say in advance whether a user's interest – e.g. "Tell me about Peter" – will only, or best, be satisfied by one system output.

The underlying problem is that tasks like information extraction, or summarising, or translation, are not unitary tasks. Retrieval isn't either, but it has a modest aim: offer the user something they may be able to make something of in pretty well any context. Summarising, on the other hand, is a very complex task with more potential variation, and therefore more need to be designed for different types of situation (Google snippet summaries are useful for some purposes, useless for others). What NLIP research, especially since the early 1990s, has shown in attempting to build and evaluate systems for such challenging tasks as summarising, is that these tasks come in many shapes, and there are consequently many different appropriate strategies for doing them to suit circumstances.

More importantly, it became evident that these tasks in their simpler forms could be done with statistical techniques, or with hybrid statistical-symbolic approaches of a relatively shallow kind. You can do translation, or fact extraction, or summarising statistically quite well enough to meet some needs, and even apply these techniques cross-linguistically. For example, you can learn word selection and ordering from example sources and their summaries to produce telegraphic one-line summaries – and we have plenty of training data for such things. Statistical data is also a vital support for symbolic processing when this is needed, for example about normal parsing preferences.

So the lesson learnt from the second phase of NLIP research was that you can do a lot more with statistics in NLIP than you thought. Statistics can be useful in more ways than might have seemed possible in the early days of NLIP, when statistics, even for a humdrum task like retrieval, were revolutionary.
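As a concrete illustration of the kind of shallow, statistical extractive summarising just described, here is a minimal sketch (not any particular system): score each sentence by the collection frequency of its content words and return the best one as a telegraphic one-line summary. The stop-word list and the toy text are invented for illustration.

```python
# A minimal sketch of frequency-based extractive summarising: the sentence
# whose content words are most frequent overall is returned as the summary.

from collections import Counter
import re

# Illustrative stop-word list; a real system would use a fuller one.
STOP = {"the", "a", "an", "and", "it", "its", "in", "of", "is", "can", "now", "from"}

def summarise(text):
    """Return the sentence whose content words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    freq = Counter(words)
    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)
    return max(sentences, key=score)

text = ("The library digitised its card catalogue last year. "
        "Users can now search the catalogue from home. "
        "The catalogue search handles simple keyword queries. "
        "Coffee is available in the entrance hall.")
print(summarise(text))   # -> "Users can now search the catalogue from home."
```

It picks a reasonable sentence for simple factual prose, but, as noted earlier, a purely word-based method of this sort cannot resolve a referring expression like "them" or judge "jug" more important than "brown".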
Now tf, idf and so forth are embedded in Web search engines (though they took decades to get there), machine learning is the default basis for speech transcription, and largely statistical extractive techniques can provide useful automatic summaries, even if, to meet high-quality needs, the full symbolic apparatus of meaning representations, world models, and inference is essential.

Language and information

What, then, do the key features of the present state of NLIP suggest for the future, and what in particular do they suggest for computing in general, not just NLIP itself? I assume that NLIP will go from strength to strength along its present lines, so I want to focus on its implications for computing in general.

The key feature of the present state is the way that NLIP is seamless. It is seamless both because different tasks, or component functions like answering a single question, merge into one another, and because basic processes, especially statistical ones like estimating relative word significance, apply across the field. This seamlessness is not surprising given that we are dealing with natural language, and that natural language, despite its huge range of specific capabilities and varied uses, is the common vehicle for our communication, and in this communication we always have to deal with discourse's constituent ambiguity. The seamlessness in the utility of statistics, like applying Bayes's Theorem, seems more surprising, but is not, given that we have to cope with ambiguity quickly in real time and so need robust and simple methods for doing this. Relying on frequency – of course in reality by approximation, not actual calculation – is such a robust and simple method. Thus even if you can't rely only on such methods in NLIP, the important point about the experience of the last fifty years, first in retrieval and then in other NLIP tasks, is that you get a lot of mileage in, as it were, finding your feet by exploiting rough and simple processes, either on their own or to leverage more sophisticated ones, for example to suggest likely preferred paths to follow in full-text interpretation.

I believe the consequences of this for computing in general, not just NLIP, are profound. They have to do with the notion of the information layer. This is not a well-defined notion, but that does not make it any less real or important, and it has been around for some time. What I want to suggest is that with the experience of NLIP we can give it more substance. What follows is only sketchy and indicative, being about the future, but I will stick my neck out and say it.

Early on, and putting it very simplistically, the model of computing was that it consisted of a hardware layer at the bottom, then an operating system layer, then maybe a utilities layer, with a final applications layer on top. By application here I mean something like an accounting package. This simple model was further developed to accommodate distributed systems and communications, introducing a communications layer between hardware and operating system. It then appeared, as systems grew more powerful, that it could be appropriate to think of an information layer – not just a bits layer or a data layer in some conventional sense (i.e. of well-defined strings or tuples) – as part of the computing system proper and not just, as in the old library days, as an application. The natural place for the information layer was on top of the operating system, or at least the utilities layer.
It could indeed perhaps absorb the utilities layer in a closer relationship with the operating system. The crucial point about the information layer was that it would have widely useful substantive content, something rather different from the familiar but formal notion of data as something systems ship about.

Now you may say that this is what we have been getting with Web browsers and, especially, all-pervading Web search engines. But this is only half right if one thinks only of browser or search functionality. What really matters is what browsers and engines work over – the stuff, incipient information. My claim is that what is central to the idea of an information layer is the stuff itself. This seems to fly in the face of computing as essentially syntactic manipulation of opaque coded objects. So my argument is as follows. People don't talk in numbers. They talk in language. Wherever end-users appear, which is everywhere, they need a familiar communication vehicle for information, and natural language is what they use. Unfortunately they still have to interpret language to decide what the information is and how to use it. They know how to do that, but can be materially assisted by automatic processes like, in the simple case, text retrieval, but also by other facilities like translation or summarising. One might argue that this isn't any different from conventional computing. But I claim that the code and what's coded are far nearer and more accessible to humans. Words are there for themselves, as their familiar selves, and are not replaced by other codes. We have at the same time to recognise, and allow for, the fact that every language use in a context is different, so there will be an early limit on how resolved and explicit the language processing, and what it manipulates in the information layer, can be.

The idea of a Semantic Web as a universal characterisation of knowledge strikes me as misconceived. There can be no ontology that will work for everything and everyone. There may be many specific ontological horses for particular courses. But the universal means of getting from one to another cannot but be a sort of lightweight rope-way: something that, modest and partial though it may be, does establish links and make it possible to move around. This is what natural language tools like statistical associations provide. People can get a start, and move, from simple words or phrases and the connective relations between them. Natural language is general but leads to particulars.

But it's not quite clear why such an information layer, incorporating the vast language stuff that exists electronically, but with an eye on the user, should be a middle layer rather than an outer layer in computing systems. My point is that there is no reason to suppose that what are actually semantic operations – if only shallow ones – in the layer are only for the immediate benefit of the user. There is no reason to suppose that as computing systems become more powerful and pervasive, they will not find it useful, even necessary, to do the same sort of things themselves. For example, while system security ought to be as tight as a drum with formal coding, in reality it won't be watertight, and it may be desirable for the system to see how it can be helped by whatever it can dig out of the information layer.
For example, it may be possible – though I can only be sketchy and indicative here too, as I am not a security expert – to learn more, and hence plug gaps, in some supposedly secure situation by trawling the information layer for correlates of the words that figure somewhere in the security setup, whether these words have to do with the agents involved, the wrapping of the package, or the contents of the package. I recognise this is very vague, which is not surprising given that I am talking about a future possibility. But I think that people are getting overexcited about images and forgetting the crucial role that language has. You can look at pictures, but you have to use language to talk about them.

So I believe we have a wonderfully rich resource in all the words, the stuff of information, that are becoming available, and we should think imaginatively about what it could be like for computing systems not only to make such stuff available to users, but to exploit it for themselves. It's an exciting opportunity.

Finally, I would like not only to thank you again for the awards and the opportunity to give this lecture, but also to thank my husband, the late Roger Needham, who always encouraged my work.

KSJ, March 2007