StudentPARSing
Environment II
SPARSE II Manual (HTML Version), August 2001
0.0 Release Notes
This is the manual for SPARSE II, Version 0.02, August, 2001. You need to read this to
use SPARSE II. If you are using the CGI-Web version, just read Section 1.0 through 2.1.4.
Please note that this early release may have a few bugs. Please tell me about them so
that I can repair them. Feel free to distribute this as you like. I hope you find this useful.
Clayton
cdarwin@uga.edu
Files required to run the SPARSE II stand-alone version for Windows:
- sparse2.exe - The parse program.
- rules.txt - Holds your syntax rules (You may rename this file).
- libpl.dll - Required in the MS-Windows environment.
- plterm.dll - Required in the MS-Windows environment.
1.0 Introduction
This text is an introduction to SPARSE II, an improved version of the original
Student PARSing Environment (SPARSE) developed by Michael Covington in the spring of 2000 for
an introductory course in Natural Language Processing (NLP) (Covington 1999). The intended
audience is syntax or NLP students unfamiliar with Prolog (the language in which SPARSE II is
written). For technical information about SPARSE II, contact the author.
1.1 What is SPARSE II?
In the simplest description, SPARSE II is a parser. Given a set of grammar rules and
a test phrase, SPARSE II will tell the user if the phrase parses, or in other words, if the
given rules generate the given phrase. For example, if given the following rules:
np --> d,n.
d --> [the].
n --> [cat].
and asked to parse 'the cat,' SPARSE II will reply that the phrase does indeed parse, and in
fact is an 'np.' That is, SPARSE II is indicating that the rules generate the phrase 'the
cat.' As well, SPARSE II provides the user with a display of the structure:
np
|
+------+
d n
| |
the cat
This, of course, is a very rudimentary example of the abilities of the program.
Language structure is far more complex, and so is SPARSE II. In more appropriate terms,
SPARSE II is a no-goal, left-corner parser written in Prolog that allows feature unification
using GULP notation (Graph Unification Logic Programming, Covington 1994a). It handles left-
recursive rules, null and optional constituents, and displays correct parses as trees and
indented structures. As well, it is compiled to a stand-alone package which requires no
knowledge of Prolog and runs from a floppy disk in the Windows environment (although the
Prolog source code can be compiled for any platform). The key feature of SPARSE II, however,
is that grammar rules are not integrated into the program: they are supplied by the user.
This means that there is no underlying theoretical model in the program; it simply parses what
it can with the rules available. The user is free to implement the rules according to any
framework provided Definite-Clause Grammar (DCG) rules and feature unification are sufficient
for the task (for readers unfamiliar with those terms, they will be explained later).
1.2 What are the Purpose and Goal of SPARSE II
SPARSE II has a twofold purpose, depending on the experience level of the user. At
the introductory level, SPARSE II is intended to be used as a pedagogical tool for helping
students grasp the complexity of natural-language grammars and to begin developing their own
models. It provides a true introduction to NLP, without requiring familiarity with Lisp or
Prolog. For the more advanced user, SPARSE II is a stable platform for testing experimental
grammars. In particular, the parse algorithm used by SPARSE II returns all possible parses,
which is an aid in debugging grammars. As well, rules are written in DCG format, so they are
easily transferred to other applications.
The overall goal in writing the program was that it be complex in function, supporting
a wide variety of grammar types and applications, but simple in operation, allowing user
proficiency in a short period.
1.3 How does SPARSE II Work?
(If you are using the Web version the ideas are the same but the procedure is slightly
different. See the instructions on the site itself for exact details.)
The general procedure for using SPARSE II is quite simple. The user first creates an
ASCII file of DCG rules (see Section 2), and then starts the program and loads the rules into
memory. At this point the user can begin entering test phrases. If SPARSE II finds that the
test phrase can be generated by the user's DCG rules, the parsed phrase is displayed first as
a tree structure and then as an indented structure with full notation and depth of the
constituents. At this point SPARSE II will ask if the user wants to look for additional
parses (for ambiguous structures). A negative response ends the current parse sequence and
begins a new one, while a positive response causes SPARSE II to backtrack and look for other
possibilities until such time that none are found.
1.4 What is New About SPARSE II?
The original SPARSE (Covington 1999) does, as one might expect, have purposes and
functions similar to those of its successor. However, there are several differences worth
noting. SPARSE II uses DCG rules rather than Prolog structures (as SPARSE uses) for writing
grammars. Thus students learn DCG notation which is very similar to the traditional
phrase-structure rules found in linguistics rather than more unfamiliar Prolog list
structures. As
well, the SPARSE top-down parser has been replaced with a left-corner parser capable of
handling left recursion and optional and/or null constituents, the SPARSE ordered feature-
structure unification has been changed to GULP notation which allows random ordering, and the
indented tree display used in SPARSE has been augmented with an ASCII-art tree display. For
portability, the Prolog code has been compiled to a stand-alone Windows application (or Web
CGI script). And
finally, because of the more complex nature of the rules allowed by SPARSE II, the random
sentence generator has been removed.
2.0 Using SPARSE II
Of course, what remains is how to operate SPARSE II. This will be discussed in two
parts: writing the DCG rules, and running the program. (Those using the Web version
need not be concerned with the latter.)
2.1.0 Writing Rules for SPARSE II
In order for SPARSE II to parse, it needs a grammar, which the user must supply. This
grammar must be saved as a plain ASCII file, so it is best to use an ASCII editor rather than
a word processor. Notepad is a good choice for this and comes standard with Windows. The
best place to save this rules file is in the same directory as the SPARSE II executable file
'sparse2.exe.' The best name to give it is 'rules.txt' as this is the first choice of SPARSE
II when loading, but the file can have any name and be in any directory. (For the Web version,
rules are written on-line or pasted to the web page, so the editor used is unimportant.)
2.1.1 General Rule Writing
The rules SPARSE II uses are called Definite-Clause Grammar (or DCG) rules, and they
have a very specific format. Violating this format prevents SPARSE II from loading rules
properly which will generally cause parses to fail. Fortunately, the format of DCG rules is
similar to the traditional way of writing phrase-structure rules in linguistics, but there are
some differences. In general, a DCG rule for a non-terminal constituent (one that does not
end with a particular word) is as follows:
np --> n,d.
This means that 'np' consists of an 'n' and a 'd.' The arrow is made with two hyphens and a
greater-than sign. Now notice these four very important facts:
- DCG rules must end with a period (.). This tells SPARSE II that the rule has ended.
- Named items are written in lower case, even proper names. Capitals are reserved for
variables. This will be discussed below.
- Constituents on the right of the arrow are separated by commas.
- There is only one constituent on the left of the arrow.
And that is how to write non-terminal rules. Here are a few more examples:
s --> np,vp.
vp --> v,pp.
pp --> p,np.
np --> np,conj,np.
The question now is how to write terminal rules (word rules). Here is the format:
d --> [the].
n --> [cat];[dog];[rat].
The key ideas to remember here are the following:
- DCG rules must end with a period (.). This is the second warning.
- Words are enclosed in square brackets.
- Words are separated by semicolons.
- A terminal constituent cannot have the same name as a non-terminal constituent.
That is, SPARSE will not allow the following:
a --> b,c.
b --> d,e.
b --> [this];[will];[not];[work].
because 'b' is both a terminal and non-terminal constituent.
Here are a few more examples of terminal rules:
p --> [with];[in];[on];[over].
v --> [eat].
v --> [sleep].
Notice that it does not matter for terminal rules if words of the same type are
listed separately (as with 'v').
2.1.2 Optional Constituents
In our earlier rule we said that 'vp' was made of a 'v' and a 'pp.' This is not always
the case. Sometimes there is only a 'v.' We can express this option in two ways. The first
is with two rules:
vp --> v.
vp --> v,pp.
The second is by square-bracketing the optional constituent.
vp -- v,[pp].
In this case, they both mean the same thing. However, sometimes it takes a combination of
these two forms to get a construction right, so remember that you can do both.
Also remember that constituents are separated by commas. Here
are some more examples of optional constituents.
vp --> [aux],[neg],v,[np],[pp].
np --> d,n,[pp].
And never do this:
**np --> [d],[n].
**np --> [n].
The first rule means that there can be an invisible 'np' anywhere (an 'np' is made of two
things that don't need to be there), and this will make the
parser loop if it is combined with a rule like 'np --> np,conj,np.' The second rule will be
read as a terminal rule and recorded as a word entry.
2.1.3 Null Constituents
With optional constituents, the structural position is not kept if the option is not
kept. However, there are instances where it would be nice to maintain a structural position
without a word entry. For example, it might be desirable to see that a determiner could be
present even when it is not, as in the following:
np
|
+------+
d n
| |
null cats
This can be done in the following manner:
np --> d,n.
d --> [].
n --> [cats].
The first rule, which is non-terminal, requires 'd,' so the structure is maintained. However,
the second rule, which is terminal, says that 'd' may not be present. Thus, 'cats' can be, and
is, parsed as and 'np.' The null constituent could also be listed in a series of words.
d --> [the];[a];[].
Here is a more complex example involving a null complimentizer as well as null determiners.
It requires the addition of this rule:
cc --> [that];[].
s
|
+------------------+
np vp
| |
+------+ +--------------+
d n v cp
| | | |
| | | +----------------+
| | | cc s
| | | | |
| | | | +--------------+
| | | | np vp
| | | | | |
| | | | +------+ +---------+
| | | | d n v np
| | | | | | | |
| | | | | | | +------+
| | | | | | | d n
| | | | | | | | |
null birds know null null cats eat null rats
2.1.4 Feature Unification
The problem with the above rules is that the determiner cannot be left off of any 'np,'
only plural ones. This brings up feature unification, which is the process of matching the
plural feature of the 'n' to the null 'd.' To do this, SPARSE II uses GUPL notation (see
Covington 1994a or 1994b for a thorough discussion of unification using GULP). Examine the
following rules:
np(num:X) --> d(num:X),n(num:X).
d(num:pl) --> [the];[].
d(num:sg) --> [a].
n(num:pl) --> [cats].
n(num:sg) --> [cat].
These rules now have features added to make them parse in the correct way. The first rule
uses a variable (anything beginning with a capital letter) to say that an 'np' with 'num' of
some value 'X' is made of a 'd' and an 'n' with the same value 'X'. This means that the null
'd' will never match with 'cat' because the null 'd' has the feature 'num:pl' not 'num:sg.'
So now SPARSE II can match rules more narrowly. Here are some key ideas to remember
about feature unification:
- A feature consists of a (name:value) pair added to a constituent. The value may be
a name or a variable, or even another (name:value) pair.
- A name is anything beginning with a lower case letter.
- A variable is anything beginning with a capital letter.
- A variable only means the same thing in multiple places if it is in the same rule.
So all 'X' in the rule above are the same, but if there were another rule with 'X,'
it would be a different 'X' altogether.
- A constituent can have as many features as it needs, or none. Multiple features
are joined within the same parentheses by putting '..' between them (two periods).
For example:
vp(num:X..tense:Y..whatever:Z) --> v(num:X..tense:Y),np(whatever:Z).
- Constituent names without features such as 'np' will unify (or match) any like-named
constituent with features such as 'np(case:X..num:Y).'
- The order of features is unimportant. SPARSE II will find and match them for you.
Here is a silly grammar that matches 'd' to 'n' by number and 'adj' to 'n' by sex (a bit
of semantics).
np(num:N..sex:S) --> d(num:N),[adj(sex:S)],n(num:N..sex:S).
n(num:singular..sex:fem) --> [sow];[mare];[hen].
n(num:singular..sex:male) --> [boar];[stallion];[rooster].
n(num:plural..sex:fem) --> [sows];[mares];[hens].
n(num:plural..sex:male) --> [boars];[stallions];[roosters].
d(num:singular) --> [each];[a];[this].
d(num:plural) --> [all];[these].
d(num:plural) --> [];[the].
adj(sex:fem) --> [motherly].
adj(sex:male) --> [fatherly].
Again, see the previously mentioned texts for a more thorough discussion of GULP.
2.2.0 Running the SPARSE II Program
(SPARSE II Web users need not read any further) Once you have a file of rules,
you will want to test them and determine how much they
can do. After all, the real goal in NLP is to do as much as you can with as little as you
can. You will find this testing easy with SPARSE II.
2.2.1 Installing SPARSE II
SPARSE II does not require installation. However, if you wish you can copy it to your
hard drive and create a link to it using standard Windows procedures. The only concern is
that the three files 'sparse2.exe,' 'plterm.dll,' and 'libpl.dll' be in the same directory.
It is also helpful if the rules file is located in the same directory so that the path will
not need to be typed when loading the file. If you downloaded the file 'sparse2.zip' then
you will have to unzip it before you can use it. You can download a free zip program for
doing this from WinZip.
2.2.2 Starting SPARSE II
SPARSE II is started by double-clicking on the 'sparse2.exe' file icon in the directory
where you have stored the SPARSE files. An error
message indicates that one of the DLL files is missing. Once the program has started, enlarge
the screen and follow the prompts until you arrive at the initial load prompt,
which looks something like this:
At this point you are ready to begin. If you have saved your rules in a file
named 'rules.txt' which is in the same directory as SPARSE II, then pressing ENTER at the next
prompt will load your rules. If your rule file has another name and/or location, type 'n' and
ENTER. You will then be prompted to enter the location of your rules file. Type the standard
DOS path to your file:
C:\temp\trash\rules.txt
and your rules will be loaded. If you receive an error message, you misdirected the
computer. You will have to try again using the 'l' command (see below).
After the initial load sequence you will arrive at the main menu, which looks
something like this:
Now you are ready to go.
2.2.3 Help
Typing 'h' and ENTER displays a short help file that reminds you of a few things. If
you have already read this paper, it will not be of much help.
2.2.4 Load
Typing 'l' and ENTER begins a new load sequence. It works the same as the initial
sequence described above. An important thing to remember is that SPARSE II will
delete all current rules and load whatever rules it
finds in your rule file, as fouled up as they might be.
2.2.5 Parse
Simply follow the prompts. Type
the phrase you would like to try, press ENTER, and see what happens. You do not have to worry
about capitals and punctuation; SPARSE II removes them all. If the parse is successful, a
tree will be displayed, such as this:
After you are finished examining the tree, press ENTER and an indented structure with much
more information will be displayed:
You will then be asked if you want to try another parse (the clause may have an ambiguous
structure according to your rules). The choice is yours. Eventually you will be told that
there are no more parses, and the parse sequence will begin again.
2.2.5 View
The view function displays the current rules being used by the parser, 15 at a time.
2.2.6 Quit
Typing 'q' and ENTER is how you get out of
program itself. Exiting the program with 'q' tells Windows that things are normal. Sometimes
exiting by using the X in the top right corner of the screen causes problems, but it will work
if need be (although it may take several tries).
2.2.7 The SWI-XPCE Environment
The window that opens when your program starts is courtesy of the SWI-XPCE GUI
program, and it has some handy features that make using SPARSE II easier. First, there is a
scroll bar on the right side for viewing what has already left the screen. As well, the up-
arrow key will return previously typed entries to the prompt (try the down-arrow key too).
Finally, XPCE likes it much better if you quit the program by using 'q' instead of the X.
3.0 References
- Covington, Michael. (1994a) GULP 3.1: An extension of Prolog for Unification-Based
grammar. Research Report AI-1994-06, ftp://ftp.ai.uga.edu/pub/ai.reports/.
- Covington, Michael. (1994b) Natural language programming for Prolog programmers.
Upper Saddle River, NJ: Prentice Hall.
- Covington, Michael. (1999) SPARSE - Student PARSing Environment.
http//:www.ai.uga.edu/~mc.
This site uses SPARSE II CGI scripts written in
SWI Prolog by
Clayton M. Darwin
and is made available courtesy of the
University of Georgia AI Center.
|