Assignment 3: Who Hangs Out Together on Wikipedia
Due: Thursday 3/31 (11:59pm)
Pointwise mutual information (PMI) is a function of two events x and y:
\[ \mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} \]
The larger the magnitude of the PMI of x and y, the more information you have about the probability of seeing y having just seen x (and vice versa; PMI is symmetric). If seeing x gives you no information about seeing y, then x and y are independent and the PMI is zero.
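To make the two properties above concrete (zero under independence, and symmetry), here is a minimal sketch. The function name `pmi`, the made-up probabilities, and the choice of the natural logarithm are assumptions for illustration; the assignment does not fix a log base.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information from a joint and two marginal probabilities.

    Uses the natural log; the base is an assumption, not part of the assignment.
    """
    return math.log(p_xy / (p_x * p_y))

# Independent events: p(x, y) equals p(x) * p(y), so the ratio is 1 and PMI is 0.
independent = pmi(0.25 * 0.5, 0.25, 0.5)

# Positively associated events: x and y co-occur more often than chance predicts.
associated = pmi(0.2, 0.25, 0.5)

# Swapping the roles of x and y leaves the value unchanged.
symmetric = pmi(0.2, 0.5, 0.25)
```

Here `independent` is zero, `associated` is positive, and `symmetric` equals `associated`.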
Proper nouns are nouns that refer to distinct entities. In English, they're usually capitalized even when they don't start a sentence. Examples include:
- adam sandler
- adam smith
- big ben
- big bend
- captain jack
- captain james cook
- captain john smith
- captain kangaroo
These are discovered automatically using a parser (but you don't need to worry about the specifics).
In this project you're going to compute the PMI for different entities appearing in the same sentence.
We created files with the proper nouns set off in angle brackets:
〈 joel norman quenneville 〉 ( born 〈 september 〉 15 , 1958 in 〈 windsor 〉 , 〈 ontario 〉 , 〈 canada 〉 ) is the head coach of the 〈 chicago blackhawks 〉 professional ice hockey team .
a grand coalition of 〈 cdu 〉 and 〈 spd 〉 lasted from 1968 to 1972. a new grand coalition lasted from 1992 to 1996. since 1996 , the 〈 cdu 〉 is cooperating with the fdp .
the elementary school in " the 〈 simpsons 〉 " is based on 〈 mccarthy middle school 〉 , which was 〈 chelmsford 〉 's high school before the construction of 〈 chelmsford high school 〉 in 1974. the town hall in the show is based on the 〈 chelmsford public library 〉 ( prior to the recent reconstruction ) .
These files can be found in "/umd-lin/jbg/data/wackypedia/np". Note that an entity is the whole string inside a pair of angle brackets. For example, "joel norman quenneville" is one entity and "mccarthy middle school" is another.
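Pulling the entities out of one sentence might look like the sketch below. The names `extract_entities`, `OPEN`, and `CLOSE` are hypothetical, and the exact bracket characters used in the files are an assumption: the pattern accepts plain ASCII brackets as well as several Unicode angle brackets, so check it against the real data before relying on it.

```python
import re

# Bracket characters to accept. Which ones the files actually use is an
# assumption, so several common variants are listed.
OPEN = "<\u2329\u27e8\u3008"
CLOSE = ">\u232a\u27e9\u3009"

# Capture the text between an opening and a closing bracket, without the
# spaces that pad the entity on either side.
ENTITY = re.compile(f"[{OPEN}]\\s*(.*?)\\s*[{CLOSE}]")

def extract_entities(sentence):
    """Return the entity strings in a sentence, in order of appearance."""
    return ENTITY.findall(sentence)

example = "is the head coach of the 〈 chicago blackhawks 〉 , born in 〈 windsor 〉 , 〈 ontario 〉 ."
entities = extract_entities(example)
```

For the example sentence this yields the three entities "chicago blackhawks", "windsor", and "ontario", each as a whole string.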
In this assignment, you will compute the PMI of pairs of proper nouns that appear together in more than 25 sentences, restricted to entities that each appear in more than 100 sentences. (In other words, if an entity by itself appears in 100 sentences or fewer, we're not interested; if two entities appear together in 25 sentences or fewer, we're also not interested.) Write the code necessary to do this and at least one unit test (you'll likely want to write more!).
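The two thresholds can be checked on a small in-memory example before worrying about the distributed design. The sketch below is not a MapReduce solution: it assumes the counts are already available as dictionaries, estimates each probability as a count divided by the total number of sentences (an assumption), and uses the natural log. All names and the toy counts are made up for illustration.

```python
import math

MIN_ENTITY_SENTENCES = 100  # an entity must appear in MORE than this many sentences
MIN_PAIR_SENTENCES = 25     # a pair must co-occur in MORE than this many sentences

def filtered_pmi(entity_counts, pair_counts, num_sentences):
    """Return {(x, y): PMI} for the pairs that pass both thresholds.

    entity_counts: {entity: number of sentences containing it}
    pair_counts:   {(x, y): number of sentences containing both x and y}
    """
    result = {}
    for (x, y), c_xy in pair_counts.items():
        if c_xy <= MIN_PAIR_SENTENCES:
            continue  # pair is too rare
        c_x, c_y = entity_counts[x], entity_counts[y]
        if c_x <= MIN_ENTITY_SENTENCES or c_y <= MIN_ENTITY_SENTENCES:
            continue  # at least one entity is too rare
        # log( (c_xy / N) / ((c_x / N) * (c_y / N)) ), simplified
        result[(x, y)] = math.log(c_xy * num_sentences / (c_x * c_y))
    return result

def test_filtered_pmi():
    # Toy counts: "c" sits exactly on the entity threshold, ("b", "c") exactly
    # on the pair threshold, so only ("a", "b") should survive.
    entity_counts = {"a": 200, "b": 400, "c": 100}
    pair_counts = {("a", "b"): 80, ("a", "c"): 30, ("b", "c"): 25}
    scores = filtered_pmi(entity_counts, pair_counts, num_sentences=1000)
    assert set(scores) == {("a", "b")}
    # 80 * 1000 / (200 * 400) = 1, so this pair looks independent: PMI is 0.
    assert scores[("a", "b")] == 0.0
```

A test like `test_filtered_pmi` exercises the boundary cases ("more than" versus "at least"), which are easy to get wrong.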
When you're done, answer the following questions:
- What strategy did you use to get the information needed to compute PMI?
- Did you use a combiner? If so, describe what it looked like.
- Did you use a partitioner? If so, describe what it looked like.
- How many entities appear in more than 100 sentences?
- How many entity pairs appear together in more than 25 sentences?
- What are the entities that have the highest PMI with: "seinfeld", "maryland", and "clinton"?
- What is the pair of entities with the largest PMI?
This assignment is due by 11:59pm, Thursday 3/31. Please send us (both Jordan and Yingying) an email with "[CCC Assignment 3]" as the subject. In the body of the email put answers to the questions above and any source code you created or modified.
Hints / Tips
- Think out your design carefully before you begin, starting from the PMI calculation. Make sure that you'll have all the information you'll need there when you need it.
- If a sentence contains entities A, A, B, then A counts as appearing in that sentence once, not twice. Similarly, there is only one co-occurrence of A and B in that sentence, not two.
- If your design requires everything to be in a single reducer, something is wrong.
- If the number of keys you output from your mapper is not linear in the number of observed pairs, then something is wrong.
- Remember there are other ways of getting data to reducers than just key-value pairs.
- Remember that seeking data is expensive but streaming isn't too bad.
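The hint about a sentence containing A, A, B can be sketched as follows. The function name `cooccurrence_pairs` is hypothetical, and sorting each pair so that (A, B) and (B, A) map to the same key is one possible convention, not a requirement of the assignment.

```python
from itertools import combinations

def cooccurrence_pairs(entities):
    """Return each unordered pair of distinct entities in one sentence exactly once."""
    # Deduplicate first so a repeated entity is counted once per sentence,
    # then sort so every pair comes out in one canonical order.
    unique = sorted(set(entities))
    return list(combinations(unique, 2))

pairs = cooccurrence_pairs(["A", "A", "B"])
```

For the sentence A, A, B this produces a single pair, (A, B), matching the hint.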