C++ Parser Classes for Interactive Fiction

Version 1.1

Copyright © 2002 Rich Herrick (xenos@stny.rr.com)

 

This software is freeware.  Permission to copy, use, modify, sell and distribute this software is granted provided this copyright notice appears in all copies. This software is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.

 

Enjoy.

 

Revision History:

 

Date

Version

Description

03/24/2000

1.0

Initial Release

09/25/2002

1.1

Added regular expression; removed debug log; minor bug fixes

 

 

 

 


Table of Contents

 

Over View... 3

Theory.. 3

Compiling.. 3

Programmer’s notes. 4

Modifiables. 4

Parse strings. 4

Parts-of-Speech Codes. 4

Operational Codes. 5

Command Codes. 5

Regular Expressions. 6

Word Lookup. 7

Known Words. 8

Punctuation. 8

Word Classes. 8

Ambiguities. 8

Exceptions. 9

Classes. 9

Dictionary (diction.hpp) 9

Nfa (nfa.hpp) 11

Dfa (dfa.hpp) 12

LexicalAnalyzer (lexical.hpp) 13

Sentences (sentence.cpp) 14

Parser (parser.cpp) 15


Over View

This project is an attempt to implement an English sentence parser that was adaptable, without regard for complexity.  The parser’s parse behavior is modified using “parse strings” that are similar to regular expressions, but using words instead of characters.

 

Theory

This section explains, in general, how the parse works.  Some of the points touch upon will be explain in more detail in later sections.

 

When the constructor for the parse is called, one of the parameters given to it is a set of parse strings.  These strings tell the parse what constitute a valid set of sentences and how to build these sentences using the Sentences class.

 

Given a set of parse strings, the parse constructs an equivalent Non-deterministic Finite Automaton (NFA).  In simplest terms, an NFA is a directed graph made up of a finite set of nodes and edges.  The nodes represent valid states for any traversal of the graph.  For each node, its set of edges indicates what paths to other (or even the same) nodes exist, and what conditions much exist to follow a particular path.  Following an edge from one node to another is called a transition.  For the parser, the traversal condition is the matching of an inputted word to a word class, or part-of-speech (POS).  Once a transition is made for a given word, the word is considered valid and the parse moves on to the next word.  For each word to be valid, there must be a valid path from the current node to another.  The parse ends when either there is no valid path for the given word (parse fails) or there are no more words to parse and the current node is a “final” node (parse succeeds).  For NFAs, there is also a “free” transition that can always be followed, but the current word of input is not advanced to the next.  This transition is called a Llamda (also called a null or epsilon) transition.

 

The NFA is non-deterministic because, for a given node, there can be more than one valid path.  Before being used, to reduce ambiguity, speed up processing, and decrease the size of the NFA, the NFA is converted to a Deterministic Finite Automaton (DFA).  With a DFA, there is always only one valid path and there are no Llamda transitions.  Technically, though, what is build is still an NFA because some of the features and options allowed by the parse strings cause non-determinism.  Ambiguities such as these are alleviated by precedence rules, which will be addressed in a later section.

 

Compiling

The code is written in C++, making heavy use of templates and the Standard Template Library (STL).  You will need a fairly new C++ compiler.  I have successful compiled the code and the simple test program using Visual C++ 6.0 (VC6) with service pack 5 and additional fixes at: http://www.dinkumware.com/vc_fixes.html; DJGPP (GNU) version 3.1.

 

Included with the source, are makefiles for VC6 and GNU C.

Programmer’s notes

 

Modifiables

Modifiables are non-programming (data-only) features that can be manipulated to change the behavior of the parser, or other associated classes.

 

Parse strings

This set of strings is declared in “parsestr.h” and defined in “parsestr.cpp”.  They define what groupings of words the parse shall except as valid.  They also dictate how a Sentences class object is to be built from a given input.  Each parse string consists of codes, some defining parts-of-speech (POS) and others defining such things as relation and repetition.  White space is ignored, but may be used for clarity.  All codes are case sensitive.

 

Parts-of-Speech Codes

The parse accepts the following POS or word classes:

 

Code

POS or Classes of Words Accepted as a Match

a

Any word in dictionary

A

Any word, including gibberish (a word not in the dictionary)

g

Gibberish

r

Article

d

Adverb

j

Adjective

c

Conjunction

p

Preposition

v

Verb

n

Noun (any)

o

Pronoun

P

Proper Noun

m

Match exact word (not used directly, see <word, class> below)

x

Regular Expression (not used directly, see /regex/number/ below).  Used to match numbers, codes, etc.

 

Note that punctuation characters are their own POS codes.  Punctuation is usually ignored by the parser, except when explicitly used in the parse string.  See the example for operational code p, below.

Operational Codes

These codes are used to define when and how the POS codes are acceptable.  In defining how these codes are used, the letters X, Y and Z are used to represent any one of the above POS codes.  Note that some codes also act upon other codes, not just POS.

 

Code

Description

Example(s)

p

The POS must be present

pX p!

o

The POS is optional

OX

*

The POS is matched zero or more times

*X

+

The POS is matched one or more times

+X

|

Any one of the choices may be matched

+X | *Y | pZ

()

Groups multiple codes together

(pXoY)  *(pX | pY)

{}

Short-cut for *()

{pX | oY}

[]

Short-cut for +()

[pX | oY]

<word, class>

Matches a particular word*

<tell, verb>

/exp/number/

Matches a regular expression**

/[0-9]+/5/

*The word should be in lower case and be defined in the dictionary used by the parser.  The class name is optional.  If it is present, it should be in lower case, and match one of the class names in WordClassNames found in “vocab.h”.  If the class name is not present, the parser will choose one appropriate for the word, giving preference to the one with the highest precedence.  From this code, the parser generates a POS code of type ‘m’.

 

**Use two slashes to represent a slash within the expression.  “number” is an integer value used to identify the expression that was matched.  “number” (including the following slash) is optional, and defaults to zero (0).  From this code, the parser generates a POS code of type ‘x’.  See Regular Expressions.

 

Command Codes

Command codes are optional and follow the POS code.  The two codes are separated by a colon.  Command codes are used to give special instructions to the parser.  Following the command code is a function number.  This number instructs the Sentences class how the word is to be added.  Function numbers are optional, except with the ‘C’ command where they are mandatory.  Function numbers may be present without a command code, but the colon must still be present.  The behavior of the Sentence class object using the function number can be changed by modifying the AddWord method in “Sentence.cpp”.  Note that function numbers will be “executed” in reverse order of parsed words.  This is because the parse works recursively backwards, making sure that all words that come after a given word are valid before accepting the current word as valid.  For example, with the parse string “pv:1 pn:2”, an acceptable input (a verb and a noun) will cause 2 to be executed and then 1.  Also note that this means that the actual words are add to the sentence object in reverse order of entry.

 

A POS code without a function number defaults to having a function number of zero (0).

 

Code

Description

Example(s)

A

Add word to parsed words*

<green>:A1 <tree>:A

C

Execute Command only**

pv:C5 pn:c67

N

No command, same as omitting command

op:N3 pd:4

*This command code can only be used with m (<word, class>) codes.  It does not match the word from input, but inserts it into the sentence at the current point of the parse. For example, the parse string “pv <to, preposition>:A <the> pn” would match a verb, insert the word “to” as a preposition, match the word “the”, and lastly match any one noun.  Of course, “<to, preposition>” could be shorted to “<to>” if the word “to” is not part of any word class with a higher precedence than a preposition.

 

**With this command, both the operational code and the POS code are ignored, except for determining precedence.

 

Note that some interesting things can be done without the need to add more command codes.  For example, to force the parser to match a word but discard it, use a function number that does nothing.  Assuming function number 99 does nothing, the parse string “<tell>:99 pP <to>:99 pv pn” would match the input “tell bob to go home” as “bob go home”.  Using this technique can greatly simplify processing of the sentence by having the parser remove redundant or frivolous words and phrases.

 

The set of parse strings are treated as if they were one string of the form “(string 1) | (string 2) | … (string n)”, except that when ambiguities arise, precedence is given the string with the lower number.

 

Regular Expressions

Regular expressions are of the form: /regex/number/, where regex is the regular expression and number is an integer value used to identify the matching expression.  A slash may be used with the expression by representing it as two slashes (“//”).  The number and its trailing slashing are optional and the number defaults to zero (0).  Regular Expressions may contain the following:

 

Expression

Description

$ (dollar sign)

Match end of string.

. (period)

Match any single character.

* (asterisk)

Match zero or more repetitions of the preceding expression.

+ (plus sign)

Match one or more repetitions of the preceding expression.

? (question mark)

Match zero or one occurrences of the preceding expression.

\c

Match the literal character ‘c’ (escape sequence).

\xdd

Match the character with the value of 0xdd.

{count}

Match the specified number of the preceding expression.

{min,}

Match at least the specified number of the preceding expression.

{min,max}

Match between min and max of the preceding expression.

r1|r2

Match either expression r1 or r2.  ‘|’ has the lowest precedence of all expressions, except for ‘/’.

r1/r2

Match r1 only if it is followed by r2.  r2 is not consumed.  '/' has the lowest precedence, and binds right to left. For example: given r1/r2/r3, r1 will only match if r2 follows r1, and r2 will only match if r3 follows r2.  Neither r2 nor r3 are consumed.

() (parentheses)

Used to group expressions together.

[] (brackets)

Match any character in brackets (or any not in brackets, if the first character is '^').  Ranges can be specified, such as: 0-9, a-z, or A-Za-z.  A closing bracket (']') may be in the set if it is the first character after the opening bracket ('[') or the '^' character.  A minus sign ('-') may be in the set of it is the first character after the opening bracket ('[') or the '^' character or the last character before the closing bracket.  A carat ('^') may be in the set if it is not the first character.

 

Bracketed expressions may also contain the following character classes:

Class

Description

[:alpha:]

Match any letter, as determined by isalpha()

[:lower:]

Match any lowercase letter, as determined by islower()

[:upper:]

Match any uppercase letter, as determined by isupper()

[:alnum:]

Match any letter or digit, as determined by isalnum()

[:digit:]

Match any digit, as determined by isdigit()

[:xdigit:]

Match any hexidecimal digit, as determined by isxdigit()

[:space:]

Match any whitespace (except \n or \r)

[:cntrl:]

Match any control character, as determined by iscntrl()

[:print:]

Match any printable character, as determined by isprint()

[:graph:]

Match any printable character, as determined by isgraph()

[:punct:]

Match any punctuation, as determined by ispunct(). Note that this can accept different characters those defined in PUNCT_CHARS.

 

 

Word Lookup

When a Dictionary class is instantiated, two parameters are given to the dictionary to tell it how to search its list of known words for a match.  The first is a number, indicated the number of significant letters (SL) in a word.  The second is a code that tells the dictionary the type of comparison to perform.  These codes are defined as follows:

 

Code

Description

SiNone

Only the SL letters are significant

siAllLetters

All letters are significant (SL is ignored)

siNoAmbiguity

The dictionary will accept as few letters (>= SL) that avoid ambiguity

 

Examples:

 

  1. If SL = 3 and code = siNone, then inputting “nor” will match “north” in the dictionary.  If “northeast” is also in the dictionary, it is unknown which one the dictionary will pick.
  2. If code = siAllLetters, the dictionary will only match inputted words that exactly match words in the dictionary.
  3. If SL = 3 and code = siNoAmbiguity, then inputting “nor” will match “north” only if no other word in the dictionary starts with “nor”.  Inputting “north” is not an ambiguity when “north” is in the dictionary, even if “northeast” is also in the dictionary.  Exact matches are never ambiguous.

 

Note that siAllLetters has a search time of O(log n), while both siNone and siNoAmbiguity have a search time of O(n).

 

Known Words

The words known to the dictionary, including their types and other information, are contained in an array given to the constructor.  The element type is WordRec, which is defined in “vocab.h”.  In the current set of source code, the known words are defined in the Vocabulary array contained in the file “vocab.cpp”.

 

Punctuation

The characters that the lexical analyzer (lexer) and the parser treat as punctuation are declared in “vocab.h” and defined in “vocab.cpp” as the string variable Punctuation.  Punctuation is usually ignored, except when explicitly referenced in a parse string.  The lexer treats punctuation characters in between letters with no intervening white space as part of the word.  For example, from the input “M.R.”, the lexer will extract “m.r” as the word.  This seems to be an acceptable trade off of accepting abbreviations and ignoring punctuation.

Word Classes

Word classes, or parts-of-speech, are the basic mechanism that the parse uses to classify and determine the validity of words.  The set of class types and there textual names, are define in WordClass and WordClassNames, respectively.  These constants are found in “vocab.h”.  They must be in the same order.  If these are modified, the POS codes and corresponding source code should be modified accordingly.

 

Ambiguities

All ambiguities that arise while the parser validates an input are handled through precedence rules.  When applying these rules, the parse will proceed from highest precedence to lowest until a valid match is found.  Keep in mind that the match being valid is conditional on the words following the current one also being successfully parsed.  When more than one rule can resolve an ambiguity, the rule used is the one with the lowest number, as they are list here.  These rules are as follows:

 

  1. When two or more parse strings can validate a given input, precedence is given to the one with the lowest number.  Parse strings are numbered, starting with zero, in the order they are created.
  2. When a given word is in more than one class, precedence is given to the class that occurs first in WordClass defined in “vocab.h”.
  3. When a given word can match more than one transition (POS code) in the parser’s DFA, the parser determines which has precedence using the CharSet (character set) object trans_set found in “parser.cpp”.  The POS code that occurs first in the string TransOrder is given precedence and is the one used to match the word.
  4. If the parse has a choice between two paths with the same transition (POS code), the parser gives precedence to the one with the lowest function number.  Note that the command code is currently not used in the determination of precedence.  With precedence rules for both POS codes and function numbers, this is not needed.
  5. Identical transitions are taken care of when the parser converts the NFA to a DFA, so these are not an issue.

 

Understanding the precedence rules and properly defining the items they are based upon is the key to ensuring that when ambiguities arise, the parse will choose the path you expect it to.

 

Exceptions

All exceptions thrown by the parser classes are base on the class std::exception.  Exceptions are only thrown for unusual and detrimental reasons.  A well-behaved program should never thrown one of these exceptions.

Classes

This section documents the various classes used to implement the parser.  The descriptions are not in depth.  The documentation is limited to the class description and major public identifiers (types, variables, and functions).  More information can be gleaned from the header files.  For each class, the relevant header file is listed in parentheses.

 

Dictionary (diction.hpp)

The dictionary class is used to hold information about all the words known to the parser.  The dictionary maintains such information as the word’s internal code, the word classes it belongs to, and its textual (character string) representation.  Words can only be added to the dictionary during construction.

 

Exceptions Thrown:

 

Public Types:

 

Public Variables:

            None

 

Constructor(s):

 

Public Methods:

 

Nfa (nfa.hpp)

The NFA template class is used to represent a Non-deterministic Finite Automaton.

 

Namespace:

NFA

 

Exceptions Thrown:

            None

 

Public Types:

Note that some of these types, though used with NFA classes, are defined outside of the class definition.

 

Public Variables:

 

Constructor(s):

 

Public Methods:

 

Dfa (dfa.hpp)

The Dfa template class is used to represent a Deterministic Finite Automaton.  It is a child class of the Nfa template class.

 

Exceptions Thrown:

            None

 

Public Types:

None

 

Public Variables:

None

 

Constructor(s):

 

Public Methods:

None

 

LexicalAnalyzer (lexical.hpp)

It is the job of the Lexical Analyzer (lexer or scanner) to group the input into words and punctuation.  See Punctuation, under Modifiables, for an explanation of what the lexer considers to be punctuation.

 

Exceptions Thrown:

            None

 

Public Types:

            None

 

Public Variables:

            None

 

Constructor(s):

 

Public Methods:

 

Sentences (sentence.cpp)

The Sentences class is used by the parser in building structured sentences from the words returned by the lexer.  When using Sentences and related classes, it is important to keep certain things in mind:

  1. All lists of words contained within the Sentences class object will be in reverse order.  Thus, the first subject in a list of subjects will be last one retrieved from input.  This is because the parser validates and adds words recursive backwards, and because it is more efficient to add new words to the end of the vectors used to store them.  This is the trade-off required for storing the words in vectors, which are more memory efficient (with better locality of reference) than lists or deques.  To process the words in the order they were inputted, just use reverse iterators.
  2. An input containing (seemingly) more than one sentence, but using the same set of subjects, will be stored as one sentence with multiple predicates.  This is how compound sentences are handled.  Multiple sentences, each with their own (even equivilent) set of subjects, will be stored as individual sentences.  Remember that, for the reasons given above, they will be in reverse order of input.
  3. Conjunctions are unique in that they are stored at the same level as what they separate.  For example, an input of two sentences separated with the word “and”, will be stored as three sentences: the first; the word “and” as a sentences by itself; then the second.  Similarly, the phase “the red and black house”, would store three adjectives under the word “house”: “red”, “and”, and “black”.

 

Exceptions Thrown:

            None

 

Public Types:

Note that several of these types, though used with Sentences classes, are defined outside of the class definition.

 

Public Variables:

 

Constructor(s):

 

Public Methods:

Parser (parser.cpp)

This class is the actual parser that validates inputted sentences and build a structured list of these sentences.

 

Exceptions Thrown:

 

Public Types:

 

Public Variables:

            None

 

Constructor(s):

 

Public Methods: