UGA Logo
StudentPARSing Environment II
SPARSE II Manual (HTML Version), August 2001


Web
SPARSE II
Manual for
SPARSE II
Teaching
SPARSE II
Download
SPARSE II
C. Darwin's
Home Page

0.0 Release Notes

This is the manual for SPARSE II, Version 0.02, August, 2001. You need to read this to use SPARSE II. If you are using the CGI-Web version, just read Section 1.0 through 2.1.4. Please note that this early release may have a few bugs. Please tell me about them so that I can repair them. Feel free to distribute this as you like. I hope you find this useful.

Clayton
cdarwin@uga.edu

Files required to run the SPARSE II stand-alone version for Windows:

  • sparse2.exe - The parse program.
  • rules.txt - Holds your syntax rules (You may rename this file).
  • libpl.dll - Required in the MS-Windows environment.
  • plterm.dll - Required in the MS-Windows environment.

1.0 Introduction

This text is an introduction to SPARSE II, an improved version of the original Student PARSing Environment (SPARSE) developed by Michael Covington in the spring of 2000 for an introductory course in Natural Language Processing (NLP) (Covington 1999). The intended audience is syntax or NLP students unfamiliar with Prolog (the language in which SPARSE II is written). For technical information about SPARSE II, contact the author.

1.1 What is SPARSE II?

In the simplest description, SPARSE II is a parser. Given a set of grammar rules and a test phrase, SPARSE II will tell the user if the phrase parses, or in other words, if the given rules generate the given phrase. For example, if given the following rules:
np --> d,n.
d --> [the].
n --> [cat].
and asked to parse 'the cat,' SPARSE II will reply that the phrase does indeed parse, and in fact is an 'np.' That is, SPARSE II is indicating that the rules generate the phrase 'the cat.' As well, SPARSE II provides the user with a display of the structure:
       np       
       |        
    +------+    
    d      n    
    |      |    
   the    cat 
This, of course, is a very rudimentary example of the abilities of the program. Language structure is far more complex, and so is SPARSE II. In more appropriate terms, SPARSE II is a no-goal, left-corner parser written in Prolog that allows feature unification using GULP notation (Graph Unification Logic Programming, Covington 1994a). It handles left- recursive rules, null and optional constituents, and displays correct parses as trees and indented structures. As well, it is compiled to a stand-alone package which requires no knowledge of Prolog and runs from a floppy disk in the Windows environment (although the Prolog source code can be compiled for any platform). The key feature of SPARSE II, however, is that grammar rules are not integrated into the program: they are supplied by the user. This means that there is no underlying theoretical model in the program; it simply parses what it can with the rules available. The user is free to implement the rules according to any framework provided Definite-Clause Grammar (DCG) rules and feature unification are sufficient for the task (for readers unfamiliar with those terms, they will be explained later).

1.2 What are the Purpose and Goal of SPARSE II

SPARSE II has a twofold purpose, depending on the experience level of the user. At the introductory level, SPARSE II is intended to be used as a pedagogical tool for helping students grasp the complexity of natural-language grammars and to begin developing their own models. It provides a true introduction to NLP, without requiring familiarity with Lisp or Prolog. For the more advanced user, SPARSE II is a stable platform for testing experimental grammars. In particular, the parse algorithm used by SPARSE II returns all possible parses, which is an aid in debugging grammars. As well, rules are written in DCG format, so they are easily transferred to other applications.

The overall goal in writing the program was that it be complex in function, supporting a wide variety of grammar types and applications, but simple in operation, allowing user proficiency in a short period.

1.3 How does SPARSE II Work?

(If you are using the Web version the ideas are the same but the procedure is slightly different. See the instructions on the site itself for exact details.) The general procedure for using SPARSE II is quite simple. The user first creates an ASCII file of DCG rules (see Section 2), and then starts the program and loads the rules into memory. At this point the user can begin entering test phrases. If SPARSE II finds that the test phrase can be generated by the user's DCG rules, the parsed phrase is displayed first as a tree structure and then as an indented structure with full notation and depth of the constituents. At this point SPARSE II will ask if the user wants to look for additional parses (for ambiguous structures). A negative response ends the current parse sequence and begins a new one, while a positive response causes SPARSE II to backtrack and look for other possibilities until such time that none are found.

1.4 What is New About SPARSE II?

The original SPARSE (Covington 1999) does, as one might expect, have purposes and functions similar to those of its successor. However, there are several differences worth noting. SPARSE II uses DCG rules rather than Prolog structures (as SPARSE uses) for writing grammars. Thus students learn DCG notation which is very similar to the traditional phrase-structure rules found in linguistics rather than more unfamiliar Prolog list structures. As well, the SPARSE top-down parser has been replaced with a left-corner parser capable of handling left recursion and optional and/or null constituents, the SPARSE ordered feature- structure unification has been changed to GULP notation which allows random ordering, and the indented tree display used in SPARSE has been augmented with an ASCII-art tree display. For portability, the Prolog code has been compiled to a stand-alone Windows application (or Web CGI script). And finally, because of the more complex nature of the rules allowed by SPARSE II, the random sentence generator has been removed.

2.0 Using SPARSE II

Of course, what remains is how to operate SPARSE II. This will be discussed in two parts: writing the DCG rules, and running the program. (Those using the Web version need not be concerned with the latter.)

2.1.0 Writing Rules for SPARSE II

In order for SPARSE II to parse, it needs a grammar, which the user must supply. This grammar must be saved as a plain ASCII file, so it is best to use an ASCII editor rather than a word processor. Notepad is a good choice for this and comes standard with Windows. The best place to save this rules file is in the same directory as the SPARSE II executable file 'sparse2.exe.' The best name to give it is 'rules.txt' as this is the first choice of SPARSE II when loading, but the file can have any name and be in any directory. (For the Web version, rules are written on-line or pasted to the web page, so the editor used is unimportant.)

2.1.1 General Rule Writing

The rules SPARSE II uses are called Definite-Clause Grammar (or DCG) rules, and they have a very specific format. Violating this format prevents SPARSE II from loading rules properly which will generally cause parses to fail. Fortunately, the format of DCG rules is similar to the traditional way of writing phrase-structure rules in linguistics, but there are some differences. In general, a DCG rule for a non-terminal constituent (one that does not end with a particular word) is as follows:
np --> n,d.
This means that 'np' consists of an 'n' and a 'd.' The arrow is made with two hyphens and a greater-than sign. Now notice these four very important facts:
  1. DCG rules must end with a period (.). This tells SPARSE II that the rule has ended.
  2. Named items are written in lower case, even proper names. Capitals are reserved for variables. This will be discussed below.
  3. Constituents on the right of the arrow are separated by commas.
  4. There is only one constituent on the left of the arrow.

And that is how to write non-terminal rules. Here are a few more examples:

s --> np,vp.
vp --> v,pp.
pp --> p,np.
np --> np,conj,np.
The question now is how to write terminal rules (word rules). Here is the format:
d --> [the].
n --> [cat];[dog];[rat].
The key ideas to remember here are the following:
  1. DCG rules must end with a period (.). This is the second warning.
  2. Words are enclosed in square brackets.
  3. Words are separated by semicolons.
  4. A terminal constituent cannot have the same name as a non-terminal constituent. That is, SPARSE will not allow the following:
    a --> b,c.
    b --> d,e.
    b --> [this];[will];[not];[work].
    because 'b' is both a terminal and non-terminal constituent.

Here are a few more examples of terminal rules:

p --> [with];[in];[on];[over].
v --> [eat].
v --> [sleep].
Notice that it does not matter for terminal rules if words of the same type are listed separately (as with 'v').

2.1.2 Optional Constituents

In our earlier rule we said that 'vp' was made of a 'v' and a 'pp.' This is not always the case. Sometimes there is only a 'v.' We can express this option in two ways. The first is with two rules:
vp --> v.
vp --> v,pp.
The second is by square-bracketing the optional constituent.
vp -- v,[pp].
In this case, they both mean the same thing. However, sometimes it takes a combination of these two forms to get a construction right, so remember that you can do both. Also remember that constituents are separated by commas. Here are some more examples of optional constituents.
vp --> [aux],[neg],v,[np],[pp].
np --> d,n,[pp].
And never do this:
**np --> [d],[n].
**np --> [n].
The first rule means that there can be an invisible 'np' anywhere (an 'np' is made of two things that don't need to be there), and this will make the parser loop if it is combined with a rule like 'np --> np,conj,np.' The second rule will be read as a terminal rule and recorded as a word entry.

2.1.3 Null Constituents

With optional constituents, the structural position is not kept if the option is not kept. However, there are instances where it would be nice to maintain a structural position without a word entry. For example, it might be desirable to see that a determiner could be present even when it is not, as in the following:
       np       
       |        
    +------+    
    d      n    
    |      |    
   null   cats
This can be done in the following manner:
np --> d,n.
d --> [].
n --> [cats].
The first rule, which is non-terminal, requires 'd,' so the structure is maintained. However, the second rule, which is terminal, says that 'd' may not be present. Thus, 'cats' can be, and is, parsed as and 'np.' The null constituent could also be listed in a series of words.
d --> [the];[a];[].
Here is a more complex example involving a null complimentizer as well as null determiners. It requires the addition of this rule:
cc --> [that];[].


                 s                                                
                 |                                                
        +------------------+                                      
        np                 vp                                     
        |                  |                                      
     +------+       +--------------+                              
     d      n       v              cp                             
     |      |       |              |                              
     |      |       |      +----------------+                     
     |      |       |      cc               s                     
     |      |       |      |                |                     
     |      |       |      |         +--------------+             
     |      |       |      |         np             vp            
     |      |       |      |         |              |             
     |      |       |      |      +------+     +---------+        
     |      |       |      |      d      n     v         np       
     |      |       |      |      |      |     |         |        
     |      |       |      |      |      |     |      +------+    
     |      |       |      |      |      |     |      d      n    
     |      |       |      |      |      |     |      |      |    
   null   birds   know   null   null   cats   eat   null   rats   

2.1.4 Feature Unification

The problem with the above rules is that the determiner cannot be left off of any 'np,' only plural ones. This brings up feature unification, which is the process of matching the plural feature of the 'n' to the null 'd.' To do this, SPARSE II uses GUPL notation (see Covington 1994a or 1994b for a thorough discussion of unification using GULP). Examine the following rules:
np(num:X) --> d(num:X),n(num:X).
d(num:pl) --> [the];[].
d(num:sg) --> [a].
n(num:pl) --> [cats].
n(num:sg) --> [cat].
These rules now have features added to make them parse in the correct way. The first rule uses a variable (anything beginning with a capital letter) to say that an 'np' with 'num' of some value 'X' is made of a 'd' and an 'n' with the same value 'X'. This means that the null 'd' will never match with 'cat' because the null 'd' has the feature 'num:pl' not 'num:sg.' So now SPARSE II can match rules more narrowly. Here are some key ideas to remember about feature unification:
  1. A feature consists of a (name:value) pair added to a constituent. The value may be a name or a variable, or even another (name:value) pair.
  2. A name is anything beginning with a lower case letter.
  3. A variable is anything beginning with a capital letter.
  4. A variable only means the same thing in multiple places if it is in the same rule. So all 'X' in the rule above are the same, but if there were another rule with 'X,' it would be a different 'X' altogether.
  5. A constituent can have as many features as it needs, or none. Multiple features are joined within the same parentheses by putting '..' between them (two periods). For example:
    vp(num:X..tense:Y..whatever:Z) --> v(num:X..tense:Y),np(whatever:Z).
  6. Constituent names without features such as 'np' will unify (or match) any like-named constituent with features such as 'np(case:X..num:Y).'
  7. The order of features is unimportant. SPARSE II will find and match them for you.

Here is a silly grammar that matches 'd' to 'n' by number and 'adj' to 'n' by sex (a bit of semantics).

np(num:N..sex:S) --> d(num:N),[adj(sex:S)],n(num:N..sex:S).
n(num:singular..sex:fem) --> [sow];[mare];[hen].
n(num:singular..sex:male) --> [boar];[stallion];[rooster].
n(num:plural..sex:fem) --> [sows];[mares];[hens].
n(num:plural..sex:male) --> [boars];[stallions];[roosters].
d(num:singular) --> [each];[a];[this].
d(num:plural) --> [all];[these].
d(num:plural) --> [];[the].
adj(sex:fem) --> [motherly].
adj(sex:male) --> [fatherly]. 
Again, see the previously mentioned texts for a more thorough discussion of GULP.

2.2.0 Running the SPARSE II Program

(SPARSE II Web users need not read any further) Once you have a file of rules, you will want to test them and determine how much they can do. After all, the real goal in NLP is to do as much as you can with as little as you can. You will find this testing easy with SPARSE II.

2.2.1 Installing SPARSE II

SPARSE II does not require installation. However, if you wish you can copy it to your hard drive and create a link to it using standard Windows procedures. The only concern is that the three files 'sparse2.exe,' 'plterm.dll,' and 'libpl.dll' be in the same directory. It is also helpful if the rules file is located in the same directory so that the path will not need to be typed when loading the file. If you downloaded the file 'sparse2.zip' then you will have to unzip it before you can use it. You can download a free zip program for doing this from WinZip.

2.2.2 Starting SPARSE II

SPARSE II is started by double-clicking on the 'sparse2.exe' file icon in the directory where you have stored the SPARSE files. An error message indicates that one of the DLL files is missing. Once the program has started, enlarge the screen and follow the prompts until you arrive at the initial load prompt, which looks something like this:

At this point you are ready to begin. If you have saved your rules in a file named 'rules.txt' which is in the same directory as SPARSE II, then pressing ENTER at the next prompt will load your rules. If your rule file has another name and/or location, type 'n' and ENTER. You will then be prompted to enter the location of your rules file. Type the standard DOS path to your file:

C:\temp\trash\rules.txt
and your rules will be loaded. If you receive an error message, you misdirected the computer. You will have to try again using the 'l' command (see below).

After the initial load sequence you will arrive at the main menu, which looks something like this:

Now you are ready to go.

2.2.3 Help

Typing 'h' and ENTER displays a short help file that reminds you of a few things. If you have already read this paper, it will not be of much help.

2.2.4 Load

Typing 'l' and ENTER begins a new load sequence. It works the same as the initial sequence described above. An important thing to remember is that SPARSE II will delete all current rules and load whatever rules it finds in your rule file, as fouled up as they might be.

2.2.5 Parse

Simply follow the prompts. Type the phrase you would like to try, press ENTER, and see what happens. You do not have to worry about capitals and punctuation; SPARSE II removes them all. If the parse is successful, a tree will be displayed, such as this:

After you are finished examining the tree, press ENTER and an indented structure with much more information will be displayed:

You will then be asked if you want to try another parse (the clause may have an ambiguous structure according to your rules). The choice is yours. Eventually you will be told that there are no more parses, and the parse sequence will begin again.

2.2.5 View

The view function displays the current rules being used by the parser, 15 at a time.

2.2.6 Quit

Typing 'q' and ENTER is how you get out of program itself. Exiting the program with 'q' tells Windows that things are normal. Sometimes exiting by using the X in the top right corner of the screen causes problems, but it will work if need be (although it may take several tries).

2.2.7 The SWI-XPCE Environment

The window that opens when your program starts is courtesy of the SWI-XPCE GUI program, and it has some handy features that make using SPARSE II easier. First, there is a scroll bar on the right side for viewing what has already left the screen. As well, the up- arrow key will return previously typed entries to the prompt (try the down-arrow key too). Finally, XPCE likes it much better if you quit the program by using 'q' instead of the X.

3.0 References

  • Covington, Michael. (1994a) GULP 3.1: An extension of Prolog for Unification-Based grammar. Research Report AI-1994-06, ftp://ftp.ai.uga.edu/pub/ai.reports/.
  • Covington, Michael. (1994b) Natural language programming for Prolog programmers. Upper Saddle River, NJ: Prentice Hall.
  • Covington, Michael. (1999) SPARSE - Student PARSing Environment. http//:www.ai.uga.edu/~mc.

UGA Arches This site uses SPARSE II CGI scripts written in SWI Prolog by Clayton M. Darwin
and is made available courtesy of the University of Georgia AI Center.