Assignment 5, Ling 645/CMSC 723, Fall 1997

Assignment 5, Ling 645/CMSC 723, Fall 1997

Assignment 5: Context-free grammars augmented with features


In this assignment you'll run the bottom-up chart parser that Allen
describes in his book.  (Yes, the code comes from Allen himself.)  The main
goals of this assignment are for you to have a chance to explore the
parser, to walk through its operation, and ultimately to get a feeling for
what it's like to do a bit of writing of grammar rules.

I very HIGHLY recommend that you READ THROUGH this entire assignment before
you do anything else.


1.  Create a directory called hw5 and go into it.

     mkdir hw5
     cd hw5

2.  Get the software via anonymous ftp to;
    the file you want is  pub/resnik/ling645/hw5/CourseParser1.2.tar.gz
    (see previous assignment for how to get files by anonymous ftp).  
    Then unzip and untar the file:

     gunzip CourseParser1.2.tar.gz
     tar xvf CourseParser1.2.tar

    This will create a subdirectory called NLPcode.  Go into it:

     cd ParserCode

3.  Get the full path name for the directory you are in using 'pwd'
    (short for "print working directory"):


    For example, the output of 'pwd' for where I am right now is:


4.  Edit file loadFunction and move down to the line
    where it says

      (let ((*PATH* "/u/james/Code/CourseParser/ParserCode")

    Change what's between the double quotes to be the directory you
    got when you said 'pwd'.  For example, in my case, the
    it becomes:

      (let ((*PATH* "/home/research2/resnik/ling645/lisp/hw5/ParserCode")

    Then save the file.  


5.  In your ParserCode directory, go into LISP and load the 
    loadFunction.lisp file:

      (load "loadFunction")

    Then load up the parser:

      (load "LoadParser")

    Don't worry about the various warnings.

6.  Load the lexicon and grammars from Allen, Chapter 4, by 
    executing the following:


    This defines the grammars and lexicons in Chapter 4. 

7.  Now try parsing a simple sentence:

      (bu-parse '(a fish saw the man))

    What you're seeing is a trace of the operation of the parser
    Allen describes in his chapter, including the way it enters 
    constituents into the chart, and then, at the end, all the
    constituents it found.
    To see the "best" parse tree (or parse trees), type


    You'll get the following information, which we'll look at in some
    detail now.

      S57:<S ((INV -) (AGR |3S|) (1 NP51) (2 VP54))> 
      from 0 to 5 from rule -1>
	NP51:<NP ((AGR |3S|) (1 ART44) (2 N45))> 
        from 0 to 2 from rule -2>
	  ART44:<ART ((LEX A) (AGR |3S|) (ROOT A1))> 
          from 0 to 1 from rule NIL
	  N45:<N ((LEX FISH) (ROOT FISH1) (AGR |3S|) (IRREG-PL +))> 
          from 1 to 2 from rule NIL
	VP54:<VP ((VFORM PAST) (AGR |3S|) (1 V46) (2 NP53))> 
        from 2 to 5 from rule -5>
          from 2 to 3 from rule NIL
	  NP53:<NP ((AGR |3S|) (1 ART49) (2 N50))> 
          from 3 to 5 from rule -2>
	    ART49:<ART ((LEX THE) (ROOT THE1) (AGR |3S|))> 
            from 3 to 4 from rule NIL
	    N50:<N ((LEX MAN) (ROOT MAN1) (AGR |3S|))> 
            from 4 to 5 from rule NIL

    This is telling you that the top level constituent is an S
    with the following list of features:

       ((INV -) (AGR |3S|) (1 NP51) (2 VP54))

     Recall that the "INV" feature concerns whether this is an inverted
     sentence (e.g. a wh-question), which it's not, hence the "minus"
     value.  The AGR (agreement) feature here is set to 3rd person
     singular (3s).  And the two sub-constituents used to build
     this S are a noun phrase, labeled NP51, and a verb phrase,
     labeled VP54.  Finally, since a chart item (corresponding to
     a "state" in the Earley parser) identifies not only the rule
     itself, but also the span, in this case from position 0 to position 5.

     What about, say, constituent VP54?  You can see that it's a
     verb phrase spanning positions 2 to 5 ("saw the fish").  You
     can also see that its agreement feature is 3s and its VFORM
     feature (form of the verb), is PAST.  How did the VP get those
     feature values?  When rule 

       VP(...) -> V(...) NP(...)

     was invoked (Allen, p. 96, rule 5), the features inside
     the VP were determined by matching the right hand side of
     the rule against the V and NP constituents.  The AGR and VFORM
     features came from the V, which itself got those features from 
     the lexicon entry for the verb.

     What does that lexicon entry look like?  Well, you can either
     go into subdirectory Grammars and look at file Chapter4.lisp, 
     where you'll see that one entry for "saw"is:
       (saw (v (root SEE1) (VFORM past) (subcat _np) (agr ?a)))

     Or, you could just parse a one-word "sentence" containing
     the word you're interested in:

       (bu-parse '(saw))

     and look at the chart entries that come out, one of which will be:

       V58:<V ((LEX SAW) (ROOT SEE1) (VFORM PAST) (SUBCAT _NP) (AGR ?A4))> 
       from 0 to 1 from rule NIL


8.  The parser isn't doing much  morphology for you, so you must
    give it morphologically pre-analyzed input sometimes.  That is,
    sometimes you'll have to separate out the roots from the suffixes.
    For instance, if you call

     (bu-parse '(The man cries))

      the parser will complain about an unknown word "cries"

    The right call is:

     (bu-parse '(The man cry +s))

    Try it, and you'll see that one of the constituents built
    is the V

      V79:<V ((AGR |3S|) (VFORM PRES) (ROOT CRY1) (SUBCAT _NONE) 
      (1 V73) (2 |+S74|))> 
      from 2 to 4 from rule -LEX1>

    That is, a V was built from V and +S using lexical rule 1
    (Allen, p. 92) -- the feature-based parser is doing the 
    combining of the base form "cry" with morpheme "+s" to 
    get a V constituent corresponding to the following parse tree:

            ((AGR |3S|)
             (VFORM PRES)
             (ROOT CRY1)
             (SUBCAT _NONE))
               /        \
              /          \
             V            +S
           ((LEX CRY) 
            (ROOT CRY1)
            (VFORM BARE)
            (SUBCAT _NONE))

    If you look at the lexicon on Allen p. 93, you'll recall
    that past tenses are built up this way in his system for verbs 
    having regular past tense morphology (e.g. "cry"), since you
    could expect a morphological analyzer to break them up properly
    (e.g. breaking "cries" into "cry +s"), but irregular past tenses
    are handled by separate entries in the lexicon (e.g. the entry
    for "saw" whose VFORM is PAST).


9.  To get a list of all the defined words, type


    That'll give you the following set, which I've re-ordered and
    modified a tiny bit to make it clearer:

     THE A                       ; determiner
     HE                          ; pronoun
     CRY WANT SEE SAW            ; verb
     WERE BE WAS IS              ; form of "to be"
     +ING +S +ED +EN             ; inflectional suffix for verbs
     TO                          ; infinitival "to"
     HAPPY                       ; adjective
     JACK)                       ; name

    Now look at the lexicon entry for "Jack":

       (bu-parse '(jack))

    or equivalently, see the lexicon entry on Allen, p. 93.

       (Jack (name (agr 3s) (root JACK1)))

    Notice that the word "Jack" is defined to be a member
    of the category NAME -- it's not just any old noun.
    Otherwise "the Jack" would be licensed by rule 2 of
    the grammar as a noun phrase!

10.  In the Grammars subdirectory, look at file Chapter4.lisp, and
    in particular, look at the NP rules.  (Equivalently, look
    at the NP rules on Allen, p. 96.)  Do any of the rules 
    say anything about the NAME constituent?  Hmmm.  So what
    will happen when you type the following?

      (bu-parse '(Jack saw the man))

    That's right:  if you take a look at the constituents that
    were created, there's no NP for "Jack", because no grammar
    rule says that NP's can be built from names.  You get the
    VP spanning positions 1 to 4 ("saw the man"), but there's
    no NP from 0 to 1, so you can't get an S from 0 to 5.

    Let's fix that, and in the process learn how to make changes
    to both the grammar and the lexicon!

11.  In the ParserCode directory, create a subdirectory called "new":
     and go into that directory

      mkdir new
      cd new

     Create a new file called "new1.lisp", and put the following 
     into it:

      (setq *grammar-new1* 
	       (np agr)
               (vp vform agr))
		    (head (name)))))

     This is a little mini-grammar containing just a single rule.
     The context-free part of the rule is

        NP -> NAME

     with NAME declared as the head of the constituent,
     and because AGR is a headfeature for NP's, that's
     equivalent to

        (NP (AGR ?a))  ->  (NAME (AGR ?a))

     Also add the following to new1.lisp:

	(setq *lexicon-new1*
	  '((many   (art (agr 3s)  (root MANY1)))
	    (Edgar  (name (agr 3s) (root EDGAR1)))))

     This creates a little mini-lexicon with entries for
     the words "many" and "Edgar".  (Yes, I know the lexicon
     entry for "many" is incorrect!)

     Finally, to the bottom of the file, add the following
     two lines:

        (augment-grammar *grammar-new1*)
        (augment-lexicon *lexicon-new1*)

     The first is a command that says that the new grammar
     should be added to the existing grammar.  The second
     adds the new lexicon entries to the existing lexicon.


12.  Now, get back to the ParserCode directory,  get into LISP,
     and get the Chapter 4 grammar and lexicon all loaded up -- 
     if you're starting fresh, that's done by doing 
     Steps 5 and 6, above.  (If you were working on a system
     with multiple windows and kept LISP alive while editing
     new1.lisp, obviously this part is unnecessary.)

     If you feel like it, execute 


     so you can see the current list of words in the lexicon.

13.  Now load the additional grammar and lexicon entries you've
     created by typing:

         (loadf "new" "new1")

     This tells the parser to look in subdirectory "new"
     and load the file called "new1".  Execute (defined-words)
     again, and this time you should see "Edgar" and "many" on 
     the list!

14.  To see if our lexical entry for "many" is correct, try

       (bu-parse '(many men))

     Notice that we don't get an NP spanning 0 to 2.  Why is that?
     Yup, it's because in our new lexical entry, the AGR feature for 
     "many" is incorrect -- it should have been 3p (third person PLURAL!)
     not 3s.  Let's fix that now.

15.  Go back to file new1.lisp, and change the 3s in the entry for
     "many" to 3p.

16.  Now get back to LISP, grammar and lexicon all loaded up, etc.,
     just like before -- i.e. the same state you were in at the beginning
     of Step 12.
     Note that if you never exited LISP, you can't just do 

       (loadf "new" "new1.lisp")

     to get the changed lexicon entries -- you'll get a warning
     saying that you tried to define rule "-12>" twice.  To 
     get back to the state you were in at the beginning of Step 12,
     you can execute


     again, and that will re-load the original grammar and lexicon,
     wiping out the changes you made using new1.lisp.

17.  Repeat Steps 13 and 14 -- this time the lexicon you loaded should
     contain the CORRECT entry for "many" (since you've just fixed it)
     and the parse should give you an NP spanning 0 to 2.  Congratulations,
     you've just succeeded in writing your first lexicon entry!  
     (A typical collegiate dictionary of English has on the order of
     100,000 entries, so that means you only have 99,999 to go...  :-)

18.  Now let's verify that the new grammar rule also works the way
     we want it to:

      (bu-parse '(Jack saw the man))

     As usual, you can type


     to see the "best" parse tree.  Voila!  Notice that the subject
     of the sentence is an NP with a NAME as its head.  Congratulations,
     you've just succeeded in writing your first grammar rule!
     (Any guesses as to how many more of *those* you have to write?...)


You should do:

  Problem 5.1
  Problem 5.2
  EITHER problem 5.3 OR problem 5.4 
   (if you do both, I'll take the higher score of the two)

Problem 5.1 [30 points]

  In new1.lisp, add lexical entries for the following words:


  The easiest way to do this will probably be to look at similar
  lexical entries in Chapter4.lisp, and to copy and modify them.

  TURN IN: the listing of new1.lisp (or just the new entries),
  along with a dribble of

     (bu-parse '(edgar admire +s the restaurant))
     (bu-parse '(edgar admire +s most restaurant +s))

  If you want, you can wait until after you've done Problem 5.2 
  and turn in the final listing of new1.lisp after you've solved
  both problems.

Problem 5.2 [40 points]

  In new1.lisp, add lexicon and grammar rules so that the 

     (bu-parse '(edgar saw the fish in the restaurant))

  leads to  a complete parse (i.e. a completed S spanning the 
  entire sentence) with "the fish" as the object of the verb
  and "in the restaurant" as an ADJUNCT to the VP.  That is,
  the overall structure should be consistent with:

       (NP Edgar)
       (VP (VP saw
               (NP the fish))
           (PP (PREP in)
               (NP the restaurant))))
  You do NOT need to be able to parse "the fish in the restaurant"           
  as an NP (although you can add the rule to do so if you'd like).

  Here are three things that might help you as you do this.  First,
  remember that since it's a bottom-up parser, you can parse 
  constituents as you debug, rather than the whole sentence, e.g.

     (bu-parse '(the restaurant))

  to make sure that it parses properly as an NP, using

     (bu-parse '(in the restaurant))

  to make sure a PP gets formed, etc.  That way you don't 
  have to focus on the whole sentence at once.  When debugging
  a program, it's always best to work in little pieces, checking
  how things work each step of the way, rather than trying to do
  it all at once.

  Second, if you execute 


  that will enable verbose tracing, showing you when dotted rules
  (incomplete constituents) are added to the chart.  E.g.
  with verbose tracing on, when you do 

      (bu-parse '(a men))

  you will see 

      Adding active arc: 
      <NP ((AGR |3S|) (1 ART603))> 
	      (<ART ((AGR |3S|))>) 
	       (<N ((AGR |3S|))>) from 0 to 1 

   which tells you that the rule 

      NP -> ART . N

   has been introduced into the chart spanning positions 0 to 1,
   with the agreement feature set to 3s. This makes it clear that 
   the features in the ART constituent succeeded in matching the 
   first part of the NP -> ART N rule, and now the rule is looking 
   to be completed by an N with the AGR feature also set to 3s.
   In this case, that NP will never be completed, because the N
   for "men" has AGR feature 3p.

   You can turn off the verbose tracing by executing


   Third, if you use emacs, you can use LISP from within emacs,
   which will let you easily keep a record of what you're doing,
   including the ability to scroll around.  In emacs, type

      M-x shell

   where M is the meta key, or, equivalently 

      ESC x shell

   This will put you in a shell within emacs.  You can then type


   as usual, and get LISP running.  Except now you can do all
   your LISP work within emacs, including scrolling, cutting and
   pasting, and switching buffers (e.g. to edit files!).  To kill
   the buffer where the shell is, you can go into that buffer and 

      ctl-x k

   TURN IN:  the listing of new1.lisp, together with a dribble of:

      (bu-parse '(edgar saw the fish in the restaurant))

Problem 5.3 [20 points]
[Remember, if you do this problem, you DON'T have to do 5.4!]

   a. Why would the VP rule for subcategorized PP's (Allen, p.89) be
      the wrong rule to use for the sentence in the previous problem?
   b. Consider the following data:

        *Sincerity admires John.
        Edgar admires John.
        John admires sincerity.
        *Sincerity admires John.

        Mary fed the cat.
        *The cat fed Mary.

      Describe in words what one might try WITHIN the grammar and 
      lexicon formalism given so far, in order to account for these 
      data.  What, if any, problems or difficulties would be posed
      by working within the current formalism?

      (Assume Mary is human, not a cat!  Ditto for all others
      with human names.)

Problem 5.4 [20 points]
[Remember, if you do this problem, you DON'T have to do 5.3!]

   a. Describe, either in plain English or by writing lexicon/grammar 
      entries by hand, changes you would need to make to 
      the lexicon and/or grammar in order to account for the 
      following data:

        The beer was expensive.
        A beer was expensive.
        Beer was expensive.
        *Restaurant was expensive.
        *Beer were expensive.
        Beers were expensive.
        Beer in the restaurant was expensive.
        *Beer in the restaurant were expensive.
        Beers in the restaurant were expensive.

   b. Based on the data in 5.4a, explain in words the ambiguity of

        Edgar bought the beer.


I'd really encourage you to try playing with the system -- adding
other lexical items you think might be interesting to play with, and
trying adding new grammar rules.  

To that end, I'll offer 10 points extra credit on this assignment for
implementing an answer to 5.4a, illustrated by showing the new grammar
and lexicon.  Your lexicon/grammar changes must lead to the right
results for

   (bu-parse '(the beer was expensive))
   (bu-parse '(a beer was expensive))

and the other 7 examples above, but you need NOT include the dribble
files to prove it.  (Don't waste the paper!)

Return to the course home page.