Cody Boisclair, November–December 2007
The Morphological Parser is designed to parse common inflectional forms in English into their component stems and features. Currently it handles the -s form of verbs and nouns; the -ed, -en, and -ing forms of verbs; and the -er and -est forms of adjectives and adverbs, including some common irregular spellings of all these forms.
The algorithm for the handling of regular morphological forms is based on the Prolog code provided in figure 9.5 of Natural Language Processing for Prolog Programmers (Covington, 1994).
Also embedded in the class library is a lexicon of common irregular forms, as given in The Oxford English Grammar (Greenbaum, 1996) and A Practical English Grammar (second edition, Thomson and Martinet, 1980). This lexicon is consulted during the parse, and a match in it overrides any analysis of the form as regular. Both the built-in irregular forms and all regular forms may be overridden further by the user, as explained in the description of the IrregularsFilePath property below.
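The override order described above (user-supplied irregulars first, then the built-in irregulars, then the regular rules) can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the library's code: the table contents, the toy -s rule, and the names `USER_IRREGULARS`, `BUILT_IN_IRREGULARS`, `regular_analysis`, and `analyze` are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical tables: each maps a surface form to (stem, feature).
# An empty feature string marks a form treated as uninflected.
BUILT_IN_IRREGULARS = {"feet": ("foot", "s"), "axes": ("axe", "s")}
USER_IRREGULARS = {"axes": ("axis", "s"), "news": ("news", "")}


def regular_analysis(word: str) -> Optional[Tuple[str, str]]:
    """Toy stand-in for the regular rules: strip a final -s."""
    if word.endswith("s") and len(word) > 2:
        return (word[:-1], "s")
    return None


def analyze(word: str, case_sensitive: bool = False) -> Optional[Tuple[str, str]]:
    """Return the highest-priority analysis: user table, built-in table, regular rule."""
    key = word if case_sensitive else word.lower()
    for table in (USER_IRREGULARS, BUILT_IN_IRREGULARS):
        if key in table:
            return table[key]
    return regular_analysis(key)
```

In this sketch, `analyze("axes")` takes the user entry over the built-in one, and `analyze("news")` takes the user entry over the regular -s rule. The `case_sensitive` flag only loosely mirrors the idea behind IrregularsCaseSensitive.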
The morphological parser exists in the form of a class library (DLL) known as MorphParser.dll, which consists of the following classes and enumerations, all in the NLPUtils namespace:
MorphParser: The static class which performs the actual morphological analysis.
The following public methods are provided:
ParseWord(string): Parses the morphology of the single word specified in the parameter, returning a List<List<Morph>>, in which each component List<Morph> represents one possible parse.
TokenizeAndParse(string): Uses the Tokenizer class library to tokenize the given string, then calls ParseWord on each token that is identifiable as a word (i.e., neither numeric nor punctuation). Returns a List<List<List<Morph>>>, where each top-level element represents the result of running ParseWord on a single token from the input string.
The following properties control certain aspects of the morphological analysis:
IrregularsFilePath: Optionally specifies the path of an XML file in which additional irregular forms are defined. Irregular forms defined in this external file override both regular forms processed by the parser and irregular forms defined in the built-in lexicon. The format of this file is described in this document; a further example, containing the embedded lexicon of irregular forms, can be found in NLPUtils\MorphParser\IrregularForms.xml.
IrregularsCaseSensitive: If true, irregular forms are matched from both the embedded list and the optional external list only when the capitalization matches exactly; if false, case is ignored when matching irregular forms. Default is false.
PreserveCase: If true, the stem remains capitalized as it was in the original token; if false, the stem is always lowercased. Default is true.
An additional property named Version returns a System.Version object identifying the version number of the class library.
Morph: Represents a single morph derived from the analysis. Contains the following properties:
Spelling: The spelling/surface representation of the morph. In the case of feature morphs, the most common English spelling of the feature is used, rather than the spelling originally found in the token (for instance, "feet" is represented by two Morph objects with Spellings of "foot" and "s", respectively).
Type: A MorphType identifying whether the morph is the stem, an added feature, or a non-word (e.g., number or punctuation).
Category: A SyntacticCat identifying the syntactic category into which the morph falls, or SyntacticCat.Unknown if the category is not known.
MorphType: An enumeration defining the types used in the Type property of the Morph class. Currently supported types are Stem (the stem of the word), Feature (an affix added on to the stem), and NonWord (a 'morph' that is not actually a word, e.g., numbers and punctuation).
SyntacticCat: An enumeration defining the categories used in the Category property of the Morph class. Essentially, these represent the part of speech of a stem, or the part of speech onto which a feature is attached. Currently supported categories are Noun, Verb, Adjective, Adverb, and Unknown.
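To make the shape of the results concrete, the following sketch models the Morph class and the two enumerations described above, then builds the "feet" example by hand as a list of parses, each of which is a list of morphs. It is a Python stand-in rather than the library's C# types; the field names follow the property names in the text, but the Noun category assigned to both morphs and the `describe` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class MorphType(Enum):
    STEM = "Stem"          # the stem of the word
    FEATURE = "Feature"    # an affix added on to the stem
    NON_WORD = "NonWord"   # numbers, punctuation


class SyntacticCat(Enum):
    NOUN = "Noun"
    VERB = "Verb"
    ADJECTIVE = "Adjective"
    ADVERB = "Adverb"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class Morph:
    spelling: str
    type: MorphType
    category: SyntacticCat = SyntacticCat.UNKNOWN


# Hand-built stand-in for a word's result: a list of parses,
# where each parse is a list of Morph objects.
feet_parses = [
    [
        Morph("foot", MorphType.STEM, SyntacticCat.NOUN),
        Morph("s", MorphType.FEATURE, SyntacticCat.NOUN),  # category assumed
    ],
]


def describe(parses):
    """Render each parse as its morph spellings joined by ' + '."""
    return [" + ".join(m.spelling for m in parse) for parse in parses]
```

A result for a whole string, as described for TokenizeAndParse, would simply be a list of such per-word lists, one per token.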
At this time, no lexicon is used to narrow down possible spellings of the stem for regular forms; rather, all possible spellings are generated and returned by the parser. (For instance, the word "quickest" may be analyzed as "quick"+"est", "quicke"+"est", "quic"+"est", or an uninflected "quickest".) Because the results are returned as a List<List<Morph>> object, however, the calling program can easily iterate through the parses and discard any it does not want. I am considering eventually adding routines for optionally rejecting parses during the parsing process based on a provided lexicon, but these have not been added yet.
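The overgenerate-then-filter approach described above can be sketched as follows. This is an illustrative Python model, not the parser's actual rule set or the Prolog algorithm it is based on: the spelling rules in `candidate_stems` are assumptions chosen so that "quickest" yields the same candidates as the example in the text, and `parse_est` and `filter_by_lexicon` are hypothetical helpers showing how a caller might discard parses after the fact.

```python
from typing import Iterable, List, Set


def candidate_stems(word: str, suffix: str) -> List[str]:
    """Return every stem spelling that could underlie `word`, or [] if the suffix is absent."""
    if not word.endswith(suffix) or len(word) <= len(suffix):
        return []
    base = word[: -len(suffix)]
    candidates = [base, base + "e"]          # plain stem; stem with a dropped final e restored
    if base.endswith("ck"):
        candidates.append(base[:-1])         # undo c -> ck
    if len(base) >= 2 and base[-1] == base[-2]:
        candidates.append(base[:-1])         # undo consonant doubling ("bigg" -> "big")
    if base.endswith("i"):
        candidates.append(base[:-1] + "y")   # undo y -> i ("happi" -> "happy")
    return candidates


def parse_est(word: str) -> List[List[str]]:
    """All -est parses of `word`, plus the uninflected reading, as lists of spellings."""
    parses = [[stem, "est"] for stem in candidate_stems(word, "est")]
    parses.append([word])                    # the uninflected reading is always kept
    return parses


def filter_by_lexicon(parses: Iterable[List[str]], lexicon: Set[str]) -> List[List[str]]:
    """Keep only the parses whose first element (the stem) is a known word."""
    return [parse for parse in parses if parse[0] in lexicon]
```

Here `parse_est("quickest")` produces the four readings listed in the text, and filtering them against a small lexicon containing "quick" leaves only the "quick" + "est" parse.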
I have also included a basic command-line tool, known as MorphParserTest, which uses the MorphParser library to parse either a single word or a whole string given as arguments to the command, and which displays the results with one token's analysis per line. A compiled version of the program may be found in NLPUtils\TokenizerTest\bin\Release.
Examples of how to use the program:
MorphParserTest babies
MorphParserTest -s "The quick brown fox jumps over the lazy dog."
As with the Tokenizer, the bulk of the work has gone into the class library itself, so I'll admit that this interface is quite primitive; it was mainly designed for my own testing of the class library.
This is a 1.0 beta release; as far as I can tell, it appears to work acceptably, but there may be bugs I have not yet encountered in my own testing. Use at your own discretion. The build number for the compiled assembly included in this package is 1.0.3002.644.
If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at codemanb@uga.edu.