Project: Supervised Classification for Sentiment Analysis


"Sentiment classification" and related topics ("opinion detection", "subjectivity analysis") are hot topics in natural language processing right now. A recent monograph by Pang and Lee provides a nice overview of what's going on in this area; a brief article in Communications of the ACM points out that a number of new companies are starting to commercialize relevant pieces of the technology; and this interview with a DC-based text analytics consultant, Seth Grimes, goes into a bit more detail.

Pang and Lee point out quite a few variations on the theme, but the central premise underlying most of this work is that spans of text often convey internal state, such as a positive or negative opinion, an emotional reaction, or an author's perspective on a topic, and that such a state can be thought of as a label for the text. The simplest example (and probably the best researched) would be opinion detection in movie or product reviews: if the text says "I thought this movie was terrible!", we have overt information that a subjective statement is being made (I thought) as well as overt information about the polarity of an opinion (terrible). Significantly more challenging are scenarios where the goal is to label an underlying state in texts that are not overtly subjective, e.g. in a marital counseling session, "The dishwasher broke yesterday" and "My husband broke the dishwasher yesterday" are both statements of fact, but they convey vastly different underlying perspectives on the event.

In this project, your team will do a piece of end-to-end research on sentiment labeling, using supervised learning techniques. This will involve preprocessing corpora, making choices about features to include in text representations, training classifiers, evaluating performance, and writing a project summary.

Note that this project is designed so that good results might actually be publishable in a workshop or even conference paper. What you're being given here is not a textbook problem; rather, it's part of the very much open problem of how to do better sentiment analysis. That has advantages and disadvantages. On the positive side, it's more fun. On the negative side, this is open territory and it's possible that unforeseen problems will crop up with the assignment -- either in how it's formulated, or with the materials I give you, or in system issues. If that happens, let me know and we'll adjust accordingly.


The following corpora constitute separate problems to solve. I would like your team to experiment with at least two of these, ideally more, in order to assess the extent to which an approach that looks promising in one case might or might not work well on a different kind of data.


This project is not about programming up machine learning algorithms. I recommend that you pick a machine learning toolkit that allows you to try out different algorithms, or a set of such toolkits. Some obvious choices to consider include the following:

WEKA. The Weka toolkit is one of the best known and most widely available machine learning packages. It supports a wide range of supervised learning techniques, including most of the ones we have discussed in class. Weka comes with both a graphical user interface and a command-line interface, as well as a Java API. The basic idea in using Weka is to represent your learning problem using an .arff file, within which each instance is represented as a feature vector. (The header of the file identifies the types of the features as well as the feature that constitutes the class being predicted.) Once you've got your data into the .arff format, it's very easy to try out different learning algorithms and/or different parameters for the same algorithm.
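To make the format concrete, here is a hypothetical toy .arff file for a sentiment task; the feature names and instances are invented for illustration (a real file would have many more features, e.g. one per vocabulary item):

```
% Toy sketch of an .arff file: two binary unigram-presence
% features plus the class attribute being predicted.
@relation sentiment

@attribute contains_terrible {0,1}
@attribute contains_great {0,1}
@attribute class {pos,neg}

@data
1,0,neg
0,1,pos
```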

As a quick and easy starting point, you might want to try out Carolyn Rose's TagHelper package. TagHelper is a wrapper around Weka (with its own GUI and command-line interfaces) that builds in many of the feature extraction choices commonly used in text processing, e.g. tokenization, lowercasing, unigrams, bigrams, stemming, part-of-speech tags, etc. The basics are incredibly simple: you put your data into a two-column spreadsheet where the first column is Code (i.e. label) and the second column is Text (i.e. the text being classified), you tell it which feature extraction options you want to use (or just use its defaults), and then you run it. It will do feature extraction (creating .arff files for you) and, by default, will do evaluation on your dataset via (IIRC) 10-fold cross-validation. It's also possible to use separate training and test sets.

MALLET. The MALLET toolkit is another very nice package. It's not as well documented as Weka, but it has some decent "quick start" Web pages with useful examples, fairly readable Java source code (or so I'm told), and an online discussion group that is fairly actively monitored by MALLET team members, who actually seem to be pretty responsive. MALLET overlaps with Weka a bit, but it also supports sequence modeling (conditional random fields, a.k.a. CRFs) and unsupervised topic modeling (Latent Dirichlet Allocation, a.k.a. LDA). And again, you can get your data into a fairly standard format and play with different parameterizations for learning, there's a Java API, etc. (But no GUI.)

Others. There are a variety of other toolkits out there for specific approaches such as maximum entropy modeling, support vector machines, and decision tree learning, and I know LingPipe implements subjectivity and sentiment analysis. It also appears that NLTK offers useful machine learning toolkit functionality (decision tree, maximum entropy, and naive Bayes classifiers, and an interface to Weka), although I'm not particularly familiar with it. There are various other lists of machine learning toolkits out there; probably one of the best to look at would be Hal Daume's list of useful machine learning links and software.

Discussion of machine learning packages is welcome on the class forum, and I'm happy to inquire with my students and former students about their experiences with packages you're considering using.

Bayesian Modeling

Finally, let me observe that there are some interesting Bayesian modeling approaches out there, which might be worth trying if you're particularly ambitious and want to go beyond existing off-the-shelf classifiers. Here are two thoughts, both of which will probably make more sense if you skim a manuscript I'm working on called "Gibbs Sampling for the Uninitiated". (I don't want to post the URL but I'll mail it to the class.)

Other Resources

The core of this project involves trying out different features and seeing what might or might not help for (subsets of) the tasks laid out above. Feature extraction will undoubtedly require coding on your part, but here are some things that might potentially be useful to you.


There are a whole lot of features you might consider using. Certainly unigrams and bigrams (and variations, e.g. stemming them) have been used before; see Greene and Resnik (2009) and Stephan Greene's dissertation, for example.
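As a baseline starting point, unigram and bigram extraction is easy to sketch. The tokenizer below is a naive lowercase/whitespace split chosen purely for illustration; a real pipeline would use a proper tokenizer (and perhaps a stemmer) before counting:

```python
# Minimal sketch of unigram and bigram feature extraction.
from collections import Counter

def extract_ngram_features(text, use_bigrams=True):
    """Return a Counter mapping feature names to counts."""
    tokens = text.lower().split()       # naive tokenization
    features = Counter(tokens)          # unigram counts
    if use_bigrams:
        for a, b in zip(tokens, tokens[1:]):
            features["%s_%s" % (a, b)] += 1
    return features

feats = extract_ngram_features("i thought this movie was terrible !")
```

From a Counter like this it is a short step to emitting a feature-vector line in whatever format your toolkit expects.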

The syntactic features introduced by Greene and Resnik would also be something to consider including, especially since I can provide code to extract them. The basic idea here is that the syntactic form of a sentence carries information about the semantic "framing" being adopted by the author, which can be connected to underlying sentiment. For example, "My husband broke the dishwasher" would give rise to features including break-TRANS and obj-dishwasher, indicating respectively that break was used transitively and that dishwasher appears as an object. (We didn't use triples like break-obj-dishwasher because of data sparseness issues.) In contrast, "The dishwasher broke" would give rise to, among others, the feature break-NOOBJ, indicating that the verb break was used without an overt direct object. Because syntactic transitivity is associated with some highly relevant semantic properties (e.g. causation, intended action, and change-of-state in the object), the transitivity-indicating feature encourages an interpretation of the event that foregrounds the husband's causal role, the fact that the dishwasher was strongly affected by the event, etc. If breaking the dishwasher is an undesirable outcome, then the transitive statement encourages an interpretation of the event connected with negativity toward the husband; the inchoative version (no object) de-emphasizes the properties associated with that interpretation.
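The mapping from syntactic analysis to features can be sketched very simply. The code below is my own illustration, not the actual Greene and Resnik extraction code: it assumes a parser has already told you, for each clause, the verb and its direct object (None if there is no overt object), and it reproduces the feature spellings from the dishwasher example above:

```python
# Hedged sketch of transitivity-based syntactic features; assumes
# (verb, object) pairs have already been produced by a parser.
def syntactic_features(verb, obj):
    """Map a (verb, direct-object) pair to transitivity features."""
    if obj is not None:
        return [verb + "-TRANS", "obj-" + obj]   # transitive use
    return [verb + "-NOOBJ"]                     # no overt object

# "My husband broke the dishwasher" vs. "The dishwasher broke"
transitive = syntactic_features("break", "dishwasher")
inchoative = syntactic_features("break", None)
```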

What are some possible extensions of this idea? Well, Greene and I did not use any external knowledge about the verb. We left it to the machine learning to figure out which features would push the label in which directions; e.g. one might expect break-TRANS to show up more in one kind of document and the same feature with a more positive verb, say rescue-TRANS, to show up in the opposite kind of document. But I think it would be interesting to explore whether the subjectivity lexicon could be used in conjunction with these syntactic features in some way in order to capture generalizations based on verb types (perhaps adding features like negativepolarity-TRANS, positivepolarity-TRANS, etc. in addition to the verb-specific features?).
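One way to realize that backoff idea, sketched under the assumption that you have some polarity lexicon mapping verbs to classes (the two-entry lexicon below is a stand-in, not a real resource):

```python
# Illustrative backoff from verb-specific transitivity features to
# verb-class features via a (toy) polarity lexicon.
POLARITY = {"break": "negativepolarity", "rescue": "positivepolarity"}

def with_polarity_backoff(verb, trans_tag):
    """Emit the verb-specific feature plus a verb-class variant."""
    feats = [verb + "-" + trans_tag]              # e.g. break-TRANS
    if verb in POLARITY:
        feats.append(POLARITY[verb] + "-" + trans_tag)
    return feats

feats = with_polarity_backoff("break", "TRANS")
```

The class-level features should be less sparse than the verb-specific ones, which is the point of the generalization.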

As another thought, it might be interesting to look at the extent to which the author's choice of syntactic frame for the verb differs from its most conventional use. For example, if break is used in an active transitive 5 times as frequently as the passive, based on syntactic analysis in some reference corpus (Penn Treebank?), then its use in the passive seems like it should receive strong weight, while a transitive use might not be telling us anything particularly significant about how the author is framing the situation.
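One simple way to operationalize "how unexpected is this frame for this verb" is surprisal, -log P(frame | verb), estimated from reference-corpus counts. The counts below are invented for illustration, and a real implementation would need smoothing for unseen verb/frame pairs:

```python
# Rough sketch: weight a syntactic-frame feature by how unexpected
# that frame is for the verb under (toy) reference-corpus counts.
import math

REF_COUNTS = {("break", "active-trans"): 500, ("break", "passive"): 100}

def frame_surprisal(verb, frame):
    """-log P(frame | verb) under the reference counts."""
    total = sum(c for (v, _), c in REF_COUNTS.items() if v == verb)
    count = REF_COUNTS.get((verb, frame), 0)
    if count == 0 or total == 0:
        return None  # unseen; would need smoothing in practice
    return -math.log(count / float(total))
```

Under these counts a passive use of break gets a higher weight than the conventional active transitive, which is the intended behavior.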

Those are just some pet ideas I've been thinking about. Yes, I think it would be very cool to see some of you guys try them out. But I'm also very open to creative thinking on your part about other features that might be useful for one or more of the classification problems.


The logical paradigm for evaluation here would be k-fold cross validation. Choosing k is as much art as science, probably with k=10 being most common, but for smallish datasets it would not be uncommon to see k=5. (See notes above under "Corpora" for special treatment of cross-validation for the death penalty corpus.)
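Most toolkits (Weka, TagHelper) will do cross-validation for you, but the bookkeeping is simple enough to sketch; here `train_and_evaluate` stands in for whatever classifier/featureset combination you are testing:

```python
# Bare-bones k-fold split using only the standard library: shuffle
# the instance indices once, deal them into k folds, and hold out
# one fold at a time as the test set.
import random

def kfold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(100, 10))
```

Each instance lands in the test set exactly once, so averaging per-fold accuracy gives an estimate over the whole dataset.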

In terms of evaluation measure, one would generally use simple accuracy: did the classifier's label on the test item match the "ground truth" for that item?
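Accuracy, as described, is just the fraction of test items whose predicted label matches the gold label:

```python
# Accuracy over parallel lists of gold and predicted labels.
def accuracy(gold, predicted):
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / float(len(gold))
```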

That said, an important part of the evaluation is going beyond the numbers to an analysis of why things turned out the way they did. (And a good analysis can be as important as a positive result, in terms of good research, even if it makes it harder to get a paper accepted.) One form of analysis might be an error analysis for an individual classifier/featureset combination, trying to identify generalizations about what it does well or what it does poorly. Another form of analysis might break the errors into false positives and false negatives, or into other buckets, in order to seek insight into what's working, what's not, and how it could be improved.
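As a first step toward that kind of analysis, it helps to bucket predictions rather than just average them. A minimal sketch for a binary task (the "pos"/"neg" labels are just an example):

```python
# Split binary predictions into true/false positives and negatives,
# the usual starting buckets for an error analysis.
from collections import Counter

def error_buckets(gold, predicted, positive="pos"):
    """Return counts of TP, FP, TN, FN over parallel label lists."""
    buckets = Counter()
    for g, p in zip(gold, predicted):
        hit = "T" if g == p else "F"           # correct or not
        sign = "P" if p == positive else "N"   # predicted class
        buckets[hit + sign] += 1
    return buckets

b = error_buckets(["pos", "neg", "pos", "neg"],
                  ["pos", "pos", "neg", "neg"])
```

Reading through the instances that land in each error bucket, looking for generalizations, is where the real analysis happens.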

Optional Related Reading

There's a lot out there, but here are three authors in particular whose work covers a whole lot of the most recent and interesting research.

What the Team Should Turn In

Here's what I expect to be turned in by each group.
  1. A tarball/zipfile of your source code, including enough information for someone to run it.

  2. A writeup including the elements below. Note that you can organize these elements however you want in order to create a coherent, readable description with a logical flow to it. (That said, you might want to look at some of the referenced conference papers to get a general feeling for how research descriptions tend to be structured in this particular space.)

  3. Separately, by e-mail, each person should send me ratings for their team members as described under "Grading", below. Please put "COMPLING2 RATINGS" in the subject line so these messages are easy to spot.


The group will receive a grade-in-common out of 75 points. By now I think you have a decent sense of my criteria. I want to see that you've understood the experimental paradigm.

For the remaining 25 points, each team member should anonymously rate each other team member as follows -- making sure to look at the definitions below to see what the numeric scales are supposed to mean.

Collaboration: 10 means that this person was great to collaborate with and you'd be eager to collaborate with them again, and 1 means you definitely would avoid collaborating with this person again. Give 5 as an average rating for someone who was fine as a collaborator, but for whom you wouldn't feel strongly about either seeking them out or avoiding them as a collaborator in the future.

Contribution: 10 means that this person did their part and more, over and above the call of duty. 1 means that this person really did not contribute what they were supposed to. Give 5 as an average, "they did what they were expected to" rating. Note that this is a subjective rating relative to what a person was expected to contribute. If five people were contributors and each did a fantastic job on their pieces, then all five could conceivably get a rating of 10; you would not slice up the pie 5 ways and give each person a 2 because they each did 20% of the work! It is your job as a group to work out what the expected contributions will be, to make sure everyone is comfortable with the relative sizes of the contributions, and to recalibrate expectations if you discover you need to. Try to keep things as equitable as possible, but if one person's skills mean they could do a total of 10% of the work compared to another person's 15%, and everyone is ok with this, then both contributors can certainly get a score of higher than 5 if they both do their parts and do them well. If you need help breaking up tasks, agreeing on expectations, etc., I would be happy to meet with the group to assist in working these things out.

Effort: A rating of 3 should be average, with 5 as superior effort (whether or not they succeeded) and 1 as a clear lack of effort. A rating below 3 would not be expected if the person's contribution rating was 5 or better. If a person just didn't manage to contribute what they were expected to, but you think they did everything in their power to make it happen, you could give them a top rating for effort even while giving them a low contribution score.

A Final Note

This project is ambitious. It attempts to give you an experience doing something real, not just a textbook exercise. I have not run all facets of this task end-to-end before, particularly for some of the new features I'm suggesting you explore. That means that there might be unanticipated problems, situations where people do not receive inputs they need to get their part done, intra-team politics, interpersonal issues, and who knows what else -- just like in the real world. It also means that something new and interesting might come out of it, which is pretty cool.

Unlike the real world, which is not very forgiving, this is a controlled setting that involves the guidance of an instructor, who can be very forgiving. Remember that the activity is, first and foremost, a collaborative learning activity, with the emphasis on learning. If there are problems or issues of any kind, let me know sooner rather than later, and I will help to get them worked out. Also feel free to use the mailing list or discussion forum: the presence of multiple teams does not mean that you are competing with each other. (I considered adding extra credit for the team with the best results, but I specifically decided against it because I would much rather see a spirit of collaboration not only within teams but at the level of the entire class.)