C++ Parser Classes for Interactive Fiction

Version 1.1

This software is freeware. Permission to copy, use, modify, sell and distribute this software is granted provided this copyright notice appears in all copies. This software is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.

Enjoy.

Revision History:

Date	Version	Description
03/24/2000	1.0	Initial Release
09/25/2002	1.1	Added regular expression; removed debug log; minor bug fixes

Table of Contents

Over View... 3

Theory.. 3

Compiling.. 3

Programmer’s notes. 4

Modifiables. 4

Parse strings. 4

Parts-of-Speech Codes. 4

Operational Codes. 5

Command Codes. 5

Regular Expressions. 6

Dictionary (diction.hpp) 9

Nfa (nfa.hpp) 11

Dfa (dfa.hpp) 12

LexicalAnalyzer (lexical.hpp) 13

Sentences (sentence.cpp) 14

Parser (parser.cpp) 15

Over View

This project is an attempt to implement an English sentence parser that was adaptable, without regard for complexity. The parser’s parse behavior is modified using “parse strings” that are similar to regular expressions, but using words instead of characters.

Theory

This section explains, in general, how the parse works. Some of the points touch upon will be explain in more detail in later sections.

When the constructor for the parse is called, one of the parameters given to it is a set of parse strings. These strings tell the parse what constitute a valid set of sentences and how to build these sentences using the Sentences class.

Given a set of parse strings, the parse constructs an equivalent Non-deterministic Finite Automaton (NFA). In simplest terms, an NFA is a directed graph made up of a finite set of nodes and edges. The nodes represent valid states for any traversal of the graph. For each node, its set of edges indicates what paths to other (or even the same) nodes exist, and what conditions much exist to follow a particular path. Following an edge from one node to another is called a transition. For the parser, the traversal condition is the matching of an inputted word to a word class, or part-of-speech (POS). Once a transition is made for a given word, the word is considered valid and the parse moves on to the next word. For each word to be valid, there must be a valid path from the current node to another. The parse ends when either there is no valid path for the given word (parse fails) or there are no more words to parse and the current node is a “final” node (parse succeeds). For NFAs, there is also a “free” transition that can always be followed, but the current word of input is not advanced to the next. This transition is called a Llamda (also called a null or epsilon) transition.

The NFA is non-deterministic because, for a given node, there can be more than one valid path. Before being used, to reduce ambiguity, speed up processing, and decrease the size of the NFA, the NFA is converted to a Deterministic Finite Automaton (DFA). With a DFA, there is always only one valid path and there are no Llamda transitions. Technically, though, what is build is still an NFA because some of the features and options allowed by the parse strings cause non-determinism. Ambiguities such as these are alleviated by precedence rules, which will be addressed in a later section.

Compiling

The code is written in C++, making heavy use of templates and the Standard Template Library (STL). You will need a fairly new C++ compiler. I have successful compiled the code and the simple test program using Visual C++ 6.0 (VC6) with service pack 5 and additional fixes at: http://www.dinkumware.com/vc_fixes.html; DJGPP (GNU) version 3.1.

Included with the source, are makefiles for VC6 and GNU C.

Programmer’s notes

Since I developed the code using a PC running Windows, I’ve been very liberal with file names. On a system like Unix, you’ll have to ensure that the case matches in all instances (file’s name, #include, makefile, etc.). On an MS-DOS system, you’ll have to shorten some of the names.
I’ve noticed that when compiling a debug build using VC6, execution times will increase dramatically when using STL. For very complex parse strings, initialization can take several seconds.

Modifiables

Modifiables are non-programming (data-only) features that can be manipulated to change the behavior of the parser, or other associated classes.

Parse strings

This set of strings is declared in “parsestr.h” and defined in “parsestr.cpp”. They define what groupings of words the parse shall except as valid. They also dictate how a Sentences class object is to be built from a given input. Each parse string consists of codes, some defining parts-of-speech (POS) and others defining such things as relation and repetition. White space is ignored, but may be used for clarity. All codes are case sensitive.

Parts-of-Speech Codes

The parse accepts the following POS or word classes:

Code	POS or Classes of Words Accepted as a Match
a	Any word in dictionary
A	Any word, including gibberish (a word not in the dictionary)
g	Gibberish
r	Article
d	Adverb
j	Adjective
c	Conjunction
p	Preposition
v	Verb
n	Noun (any)
o	Pronoun
P	Proper Noun
m	Match exact word (not used directly, see <word, class> below)
x	Regular Expression (not used directly, see /regex/number/ below). Used to match numbers, codes, etc.

Note that punctuation characters are their own POS codes. Punctuation is usually ignored by the parser, except when explicitly used in the parse string. See the example for operational code p, below.

Operational Codes

These codes are used to define when and how the POS codes are acceptable. In defining how these codes are used, the letters X, Y and Z are used to represent any one of the above POS codes. Note that some codes also act upon other codes, not just POS.

Code	Description	Example(s)
p	The POS must be present	pX p!
o	The POS is optional	OX
*	The POS is matched zero or more times	*X
+	The POS is matched one or more times	+X
\|	Any one of the choices may be matched	+X \| *Y \| pZ
()	Groups multiple codes together	(pXoY) *(pX \| pY)
{}	Short-cut for *()	{pX \| oY}
[]	Short-cut for +()	[pX \| oY]
<word, class>	Matches a particular word*	<tell, verb>
/exp/number/	Matches a regular expression**	/[0-9]+/5/

*The word should be in lower case and be defined in the dictionary used by the parser. The class name is optional. If it is present, it should be in lower case, and match one of the class names in WordClassNames found in “vocab.h”. If the class name is not present, the parser will choose one appropriate for the word, giving preference to the one with the highest precedence. From this code, the parser generates a POS code of type ‘m’.

**Use two slashes to represent a slash within the expression. “number” is an integer value used to identify the expression that was matched. “number” (including the following slash) is optional, and defaults to zero (0). From this code, the parser generates a POS code of type ‘x’. See Regular Expressions.

Command Codes

Command codes are optional and follow the POS code. The two codes are separated by a colon. Command codes are used to give special instructions to the parser. Following the command code is a function number. This number instructs the Sentences class how the word is to be added. Function numbers are optional, except with the ‘C’ command where they are mandatory. Function numbers may be present without a command code, but the colon must still be present. The behavior of the Sentence class object using the function number can be changed by modifying the AddWord method in “Sentence.cpp”. Note that function numbers will be “executed” in reverse order of parsed words. This is because the parse works recursively backwards, making sure that all words that come after a given word are valid before accepting the current word as valid. For example, with the parse string “pv:1 pn:2”, an acceptable input (a verb and a noun) will cause 2 to be executed and then 1. Also note that this means that the actual words are add to the sentence object in reverse order of entry.

A POS code without a function number defaults to having a function number of zero (0).

Code	Description	Example(s)
A	Add word to parsed words*	<green>:A1 <tree>:A
C	Execute Command only**	pv:C5 pn:c67
N	No command, same as omitting command	op:N3 pd:4

*This command code can only be used with m (<word, class>) codes. It does not match the word from input, but inserts it into the sentence at the current point of the parse. For example, the parse string “pv <to, preposition>:A <the> pn” would match a verb, insert the word “to” as a preposition, match the word “the”, and lastly match any one noun. Of course, “<to, preposition>” could be shorted to “<to>” if the word “to” is not part of any word class with a higher precedence than a preposition.

**With this command, both the operational code and the POS code are ignored, except for determining precedence.

Note that some interesting things can be done without the need to add more command codes. For example, to force the parser to match a word but discard it, use a function number that does nothing. Assuming function number 99 does nothing, the parse string “<tell>:99 pP <to>:99 pv pn” would match the input “tell bob to go home” as “bob go home”. Using this technique can greatly simplify processing of the sentence by having the parser remove redundant or frivolous words and phrases.

The set of parse strings are treated as if they were one string of the form “(string 1) | (string 2) | … (string n)”, except that when ambiguities arise, precedence is given the string with the lower number.

Regular Expressions

Regular expressions are of the form: /regex/number/, where regex is the regular expression and number is an integer value used to identify the matching expression. A slash may be used with the expression by representing it as two slashes (“//”). The number and its trailing slashing are optional and the number defaults to zero (0). Regular Expressions may contain the following:

Expression	Description
$ (dollar sign)	Match end of string.
. (period)	Match any single character.
* (asterisk)	Match zero or more repetitions of the preceding expression.
+ (plus sign)	Match one or more repetitions of the preceding expression.
? (question mark)	Match zero or one occurrences of the preceding expression.
\c	Match the literal character ‘c’ (escape sequence).
\xdd	Match the character with the value of 0xdd.
{count}	Match the specified number of the preceding expression.
{min,}	Match at least the specified number of the preceding expression.
{min,max}	Match between min and max of the preceding expression.
r1\|r2	Match either expression r1 or r2. ‘\|’ has the lowest precedence of all expressions, except for ‘/’.
r1/r2	Match r1 only if it is followed by r2. r2 is not consumed. '/' has the lowest precedence, and binds right to left. For example: given r1/r2/r3, r1 will only match if r2 follows r1, and r2 will only match if r3 follows r2. Neither r2 nor r3 are consumed.
() (parentheses)	Used to group expressions together.
[] (brackets)	Match any character in brackets (or any not in brackets, if the first character is '^'). Ranges can be specified, such as: 0-9, a-z, or A-Za-z. A closing bracket (']') may be in the set if it is the first character after the opening bracket ('[') or the '^' character. A minus sign ('-') may be in the set of it is the first character after the opening bracket ('[') or the '^' character or the last character before the closing bracket. A carat ('^') may be in the set if it is not the first character.

Bracketed expressions may also contain the following character classes:

Class	Description
[:alpha:]	Match any letter, as determined by isalpha()
[:lower:]	Match any lowercase letter, as determined by islower()
[:upper:]	Match any uppercase letter, as determined by isupper()
[:alnum:]	Match any letter or digit, as determined by isalnum()
[:digit:]	Match any digit, as determined by isdigit()
[:xdigit:]	Match any hexidecimal digit, as determined by isxdigit()
[:space:]	Match any whitespace (except \n or \r)
[:cntrl:]	Match any control character, as determined by iscntrl()
[:print:]	Match any printable character, as determined by isprint()
[:graph:]	Match any printable character, as determined by isgraph()
[:punct:]	Match any punctuation, as determined by ispunct(). Note that this can accept different characters those defined in PUNCT_CHARS.

Word Lookup

When a Dictionary class is instantiated, two parameters are given to the dictionary to tell it how to search its list of known words for a match. The first is a number, indicated the number of significant letters (SL) in a word. The second is a code that tells the dictionary the type of comparison to perform. These codes are defined as follows:

Code	Description
SiNone	Only the SL letters are significant
siAllLetters	All letters are significant (SL is ignored)
siNoAmbiguity	The dictionary will accept as few letters (>= SL) that avoid ambiguity

Examples:

If SL = 3 and code = siNone, then inputting “nor” will match “north” in the dictionary. If “northeast” is also in the dictionary, it is unknown which one the dictionary will pick.
If code = siAllLetters, the dictionary will only match inputted words that exactly match words in the dictionary.
If SL = 3 and code = siNoAmbiguity, then inputting “nor” will match “north” only if no other word in the dictionary starts with “nor”. Inputting “north” is not an ambiguity when “north” is in the dictionary, even if “northeast” is also in the dictionary. Exact matches are never ambiguous.

Note that siAllLetters has a search time of O(log n), while both siNone and siNoAmbiguity have a search time of O(n).

Known Words

The words known to the dictionary, including their types and other information, are contained in an array given to the constructor. The element type is WordRec, which is defined in “vocab.h”. In the current set of source code, the known words are defined in the Vocabulary array contained in the file “vocab.cpp”.

Punctuation

The characters that the lexical analyzer (lexer) and the parser treat as punctuation are declared in “vocab.h” and defined in “vocab.cpp” as the string variable Punctuation. Punctuation is usually ignored, except when explicitly referenced in a parse string. The lexer treats punctuation characters in between letters with no intervening white space as part of the word. For example, from the input “M.R.”, the lexer will extract “m.r” as the word. This seems to be an acceptable trade off of accepting abbreviations and ignoring punctuation.

Word Classes

Word classes, or parts-of-speech, are the basic mechanism that the parse uses to classify and determine the validity of words. The set of class types and there textual names, are define in WordClass and WordClassNames, respectively. These constants are found in “vocab.h”. They must be in the same order. If these are modified, the POS codes and corresponding source code should be modified accordingly.

Ambiguities

All ambiguities that arise while the parser validates an input are handled through precedence rules. When applying these rules, the parse will proceed from highest precedence to lowest until a valid match is found. Keep in mind that the match being valid is conditional on the words following the current one also being successfully parsed. When more than one rule can resolve an ambiguity, the rule used is the one with the lowest number, as they are list here. These rules are as follows:

When two or more parse strings can validate a given input, precedence is given to the one with the lowest number. Parse strings are numbered, starting with zero, in the order they are created.
When a given word is in more than one class, precedence is given to the class that occurs first in WordClass defined in “vocab.h”.
When a given word can match more than one transition (POS code) in the parser’s DFA, the parser determines which has precedence using the CharSet (character set) object trans_set found in “parser.cpp”. The POS code that occurs first in the string TransOrder is given precedence and is the one used to match the word.
If the parse has a choice between two paths with the same transition (POS code), the parser gives precedence to the one with the lowest function number. Note that the command code is currently not used in the determination of precedence. With precedence rules for both POS codes and function numbers, this is not needed.
Identical transitions are taken care of when the parser converts the NFA to a DFA, so these are not an issue.

Understanding the precedence rules and properly defining the items they are based upon is the key to ensuring that when ambiguities arise, the parse will choose the path you expect it to.

Exceptions

All exceptions thrown by the parser classes are base on the class std::exception. Exceptions are only thrown for unusual and detrimental reasons. A well-behaved program should never thrown one of these exceptions.

Classes

This section documents the various classes used to implement the parser. The descriptions are not in depth. The documentation is limited to the class description and major public identifiers (types, variables, and functions). More information can be gleaned from the header files. For each class, the relevant header file is listed in parentheses.

Dictionary (diction.hpp)

The dictionary class is used to hold information about all the words known to the parser. The dictionary maintains such information as the word’s internal code, the word classes it belongs to, and its textual (character string) representation. Words can only be added to the dictionary during construction.

Exceptions Thrown:

DuplicateWordFound
Thrown when a caller attempts to add a word already contained in the dictionary and has a different word code. It is acceptable to add a word that already exists with the same word code. In this case, the new information (i.e., word classes) will be merged into the existing.

Public Types:

SigCode
Enumeration of the set of allowable codes that indicate how the dictionary is to do word comparisons when validating a word (as apposed to adding a word, in which case an “exact match” is used.). See Word Lookup, under Modifiables.
ElementType
The structure used to store each word and its relevant information in the dictionary.
const_iterator
A constant STL iterator pointing to a dictionary item
value_type
The type of a dereferenced iterator. Under the current implementation, the dictionary’s internal word list is an STL map that maps a word’s textual name to its ElementType structure. Thus, value_type is of type pair<const std::string, element_type>.

Public Variables:

None

Constructor(s):

Dictionary(const WordRec vocab[], int sl, SigCode co = siNone)
This constructor may throw the DuplicateWordFound exception.

vocab: A list of information on all the known words. See Known Words, under Modifiables.
sl: The number of significant letters when comparing words. See Word Lookup, under Modifiables.
co: A code used to instruct the dictionary how to compare words when deciding if an inputted word is valid. See Word Lookup, under Modifiables.

Public Methods:

const_iterator begin(void) const
Returns a constant iterator to the first word in the dictionary’s internal list of words.
const_iterator end(void) const
Returns a constant iterator to one element passed the last word in the dictionary’s internal list of words. This may seem weird to those not familiar with STL, but this is done to simplify for loops using iterators. For example:
            for (const_iterator i = begin(); i != end(); ++i)

A little known fact is that both standard C and C++ allow a similar mechanism with pointers:
            int array[100];
            ...
            for (int* p = array; p != &array[100]; ++p)
Notice that array[100] is one past the last element of the array.
const_iterator find(const std::string& word) const
Locates a word in the dictionary by its textual representation, and returns an iterator to the word. If the word is not found, the function returns end(). Note that this function can take O(1), O(log n), or O(n) time. The time taken depends on whether the dictionary is instructed to do an “exact match” search (O(log n)), or not (O(n)). The dictionary can achieve an average time of O(1) (constant time) when doing an “exact match” search if the STL map container is implemented using a hash table.
const_iterator find(WordCode code) const
Locates a word in the dictionary by its word code, and returns an iterator to the word. If the word is not found, the function returns end(). Note that this function always takes O(n) time to execute.
bool find(const std::string& word, WordCode& code, WordClasses& classes, WordInfo& info) const
This function searches for a word in the dictionary using its textual representation, returning true or false to indicate whether the word is found. If the word is found, the word’s code, classes, and information are also returned. For information on the time this function takes to execute, see the description for the method const_iterator find(const std::string& word) const.

Nfa (nfa.hpp)

The NFA template class is used to represent a Non-deterministic Finite Automaton.

Namespace:

NFA

Exceptions Thrown:

None

Public Types:

Note that some of these types, though used with NFA classes, are defined outside of the class definition.

Edge
The class used to describe an edge
Node
The class used to implement a node. The class has several methods for inserting edges, including a convenient insertion operator (<<) that behaves in a similar way to the insertion operator used with streams (node1 << edge1 << edge2 ... << edge_n). The node class contains a public variable “final” of type bool used to initial whether or not the node is a final node.

Public Variables:

nodes
The NFA object’s list of nodes.
initial
The index of the initial node of the NFA.

Constructor(s):

NFA(void)

Public Methods:

void clear(void)
Removes all existing nodes from the NFA.
int addNode()
Adds an empty node to the NFA.
void addNode(const Node& pNode)
Adds a node to the NFA. The NFA also makes available an insertion operator (<<) that behaves in a similar way to the insertion operator used with streams (NFA1 << node1 << node2 << ... << node_n).
int addEdge(int from, int to, const trans_type& trans)
Adds an edge to the NFA.
int addNullEdge(int from, int to, EdgeFlag aFlag = NO_EDGE_FLAG)
Adds a Null (Llamda) edge to the NFA.
int last() const
Returns the index of the current last node in the NFA.

Dfa (dfa.hpp)

The Dfa template class is used to represent a Deterministic Finite Automaton. It is a child class of the Nfa template class.

Exceptions Thrown:

None

Public Types:

None

Public Variables:

None

Constructor(s):

DFA(void)
explicit Dfa(const Nfa& nfa)

Public Methods:

None

LexicalAnalyzer (lexical.hpp)

It is the job of the Lexical Analyzer (lexer or scanner) to group the input into words and punctuation. See Punctuation, under Modifiables, for an explanation of what the lexer considers to be punctuation.

Exceptions Thrown:

None

Public Types:

None

Public Variables:

None

Constructor(s):

LexicalAnalyzer (Dictionary& D)

D: The dictionary object that the lexer will use in validating inputted words.

Public Methods:

void Reset(void)
Resets the lexer’s position in the current input to the start of the line.
string::const_iterator SaveState(void) const
Returns a constant string iterator that marks the lexer’s current position so it can be used to restore that position at a later time.
void RestoreState(string::const_iterator pstring)
Restores the state of the lexer using the iterator returned by SaveState.
bool GetNextLine(void)
Causes the lexer to retrieve the next line of input. The function returns true if the retrieved line is empty.
bool GetNextWord(string& word, WordCode& code, WordClasses& classes, const Dictionary::ElementType*& entry)
This function return the next word from the current input line. The function returns true if there are no more words left in the line. Note that with this function, each punctuation character is consider a word and has its own word code. Punctuation characters embedded inside words are considered part of the code. See Punctuation, under Modifiables.
bool LineIsEmpty(void) const
Returns true if the current line of input is empty.
std::string CopyLine(bool skipws = true)
Copies the current line from the current position to the end.

Sentences (sentence.cpp)

The Sentences class is used by the parser in building structured sentences from the words returned by the lexer. When using Sentences and related classes, it is important to keep certain things in mind:

All lists of words contained within the Sentences class object will be in reverse order. Thus, the first subject in a list of subjects will be last one retrieved from input. This is because the parser validates and adds words recursive backwards, and because it is more efficient to add new words to the end of the vectors used to store them. This is the trade-off required for storing the words in vectors, which are more memory efficient (with better locality of reference) than lists or deques. To process the words in the order they were inputted, just use reverse iterators.
An input containing (seemingly) more than one sentence, but using the same set of subjects, will be stored as one sentence with multiple predicates. This is how compound sentences are handled. Multiple sentences, each with their own (even equivilent) set of subjects, will be stored as individual sentences. Remember that, for the reasons given above, they will be in reverse order of input.
Conjunctions are unique in that they are stored at the same level as what they separate. For example, an input of two sentences separated with the word “and”, will be stored as three sentences: the first; the word “and” as a sentences by itself; then the second. Similarly, the phase “the red and black house”, would store three adjectives under the word “house”: “red”, “and”, and “black”.

Exceptions Thrown:

None

Public Types:

Note that several of these types, though used with Sentences classes, are defined outside of the class definition.

Word
The class used to store information on a word parsed from input. All the word lists used with the Sentences class use this class as their element type.
Words
The basic word list type.
Adverb
The word class used to store an adverb.
Adverbs
The list type used to store a set of adverbs.
Adjective
The class used to store an adjective and its associated adverbs.
Adjectives
The list type used to store a set of adjectives.
Phrase
The class used to store noun phrases, which include a noun, its preposition, article, and adjectives.
Phrases
The list type used to store a set of noun phrases.
Subject
The class used to store a subject, its article, and set of adjectives.
Subjects
The list type used to store a set of subjects.
Predicate
The class used to store predicates, which include a verb, its adverbs, and noun phrases.
Predicates
The list type used to store a set of predicates.
Sentence
The class used to store a sentence, which includes its subjects and predicates.

Public Variables:

parse_string_num
This variable will, after a successful parse, hold an integer equal to the parse string number (zero-based) that was used to build the sentences.

Constructor(s):

Sentences(void)

Public Methods:

void clear(void)
Removes all the sentences from the list.
void AddWord(const Word& W, int code)
Adds a word to the list of sentences.

W: the word to be added.
code: the function number retrieved from the parse string. This code is use to tell the class how to add the word.

Parser (parser.cpp)

This class is the actual parser that validates inputted sentences and build a structured list of these sentences.

Exceptions Thrown:

BadParseString
This exception is thrown when the parser encounters an error in a parse string while build the NFA from it.

Public Types:

ParseResult
An enumeration of all possible resulted returned by the parse after parsing an input.

parseOk
Returned after a successful parse.
parseBadGrammer
The parser returns this value after an unsuccessful parse.
parseUnresolvedPronoun
This value is returned if the parse was successful, but the parse could not determine what a pronoun was referring to.

Public Variables:

None

Constructor(s):

Parser(LexicalAnalyzer& pL, Dictionary& pD, const char* const parse_strings[])
This constructor may throw a BadParseString exception.

pL: The lexer the parse will use to retrieve words from input.
pD: The dictionary the parse will use to retrieve information on words (such as for ‘m’ transitions). This dictionary should be the same object used to construct the lexer.
parse_strings: the set of strings that the parse will use in build the DFAs.

Public Methods:

bool parse(Sentences& S)
Parses the next line of input and places it in S. The function returns true if the parse was successful.
ParseResult GetParseResult(void) const
Returns the result code from the last parse attempt.
const Word& GetUnresolvedPronoun(void) const
If there was an unresolved pronoun in that last parse attempt, this function is used to determine what it was.
WordCode GetFailedWordCode(void) const
If the last parse failed, this function will return the code of the word that caused the failure.
const string& GetFailedWord(void) const
If the last parse failed, this function will return the textual representation of the word that caused the failure.
void SetObjectPronoun(const Word& w)
void SetMalePronoun(const Word& w)
void SetFemalePronoun(const Word& w)
These functions set the current word use for pronoun resolution of the type indicated in the function name.

C++ Parser Classes for Interactive Fiction

Modifiables

Code

POS or Classes of Words Accepted as a Match

Code

Description

Example(s)

Code

Description

Regular Expressions

Expression

Description

Class

Description

Word Lookup

Code

Punctuation

Exceptions Thrown: