PC-PATR Reference Manual

Stephen McConnel

Alpha test version 0.97a9 of October 30, 1995

Contents:

Overview

This document describes PC-PATR an implementation of the PATR-II computational linguistic formalism for personal computers. It is available for MS-DOS, Microsoft Windows, Macintosh, and Unix. (The Microsoft Windows implementation uses the Microsoft C QuickWin function, and the Macintosh implementation uses the MPW C SIOW function.)

PC-PATR uses a left corner chart parser with these characteristics:

PC-PATR is still under development, and neither the program nor this document are close to being finished. We would appreciate feedback directed to the author at the following address:
mail address                         telephone   
--------------------------------------------------------------              
Stephen McConnel                     (972)708-7361 (office)   
Academic Computing Department        (972)708-7363 (fax)      
Summer Institute of Linguistics                               
7500 W. Camp Wisdom Road             electronic mail address   
Dallas, TX 75236                     -------------------------
U.S.A.                               steve@acadcomp.sil.org   
                                     or steve.mcconnel@sil.org

1 The PATR-II formalism

The PATR-II formalism can be viewed as a computer language for encoding linguistic information. It does not presuppose any particular theory of syntax. It was originally developed by Stuart M. Shieber at Stanford University in the early 1980's (Shieber 1984, Shieber 1986). A PATR-II grammar consists of a set of rules and a lexicon. Each rule consists of a context-free phrase structure rule and a set of feature constraints that is, unifications on the feature structures associated with the constituents of the phrase structure rules. The lexicon provides the items that can replace the terminal symbols of the phrase structure rules, that is, the words of the language together with their relevant features.

1.1 Phrase structure rules

Context-free phrase structure rules should be familiar to anyone who has studied either linguistic theory or computer science. They look like this:
        LHS ---> RHS_1 RHS_2 . . .
LHS (the symbol to the left of the arrow) is a nonterminal symbol for the type of phrase that is being described. To the right of the arrow is an ordered list of the constituents of the phrase. These constituents are either nonterminal symbols, appearing on the left hand side of some rule in the grammar, or terminal symbols, representing basic classes of elements from the lexicon. These basic classes usually correspond to what are commonly called parts of speech. In PATR-II the terminal and nonterminal symbols are both referred to as categories.

Figure 1: Context-free phrase structure grammar of English subset

Rule  S       -> NP VP (SubCl)
Rule  NP      -> {(Det) (AdjP) N (PrepP)} / PR
Rule  Det     -> DT / PR
Rule  VP      -> VerbalP (NP / AdjP) (AdvP)
Rule  VerbalP -> V
Rule  VerbalP -> AuxP V
Rule  AuxP    -> AUX (AuxP_1)
Rule  PrepP   -> PP NP
Rule  AdjP    -> (AV) AJ (AdjP_1)
Rule  AdvP    -> {AV / PrepP} (AdvP_1)
Rule  SubCl   -> CJ S

Consider the PC-PATR style context-free phrase structure grammar in figure 1 It has ten nonterminal symbols (S, NP, Det, VP, VerbalP, AuxP, PrepP, AdjP, AdvP, and SubCl), and nine terminal symbols (N, PR, DT, V, AUX, PP, AV, AJ, and CJ). This grammar describes a small subset of English sentences. Several aspects of this grammar are worth mentioning.

  1. Optional constituents (or sets of constituents) on the right hand side are enclosed in parentheses.

  2. Alternative constituents (or sets of constituents) on the right hand side are separated by slashes.

  3. Braces are used to group alternative sets of elements together, so that alternations are not ambiguous.

  4. Symbols should not be repeated verbatim within a rule. Repeated symbols should be distinguished from each other by adding a different index number to a symbol each time it is repeated. Index numbers are introduced by the underscore (_) character.

Figure 2: Parse of sample English sentence

(see figure 3)

Figure 3: Parse of sample English sentence (PC-PATR output)

                S
      __________|__________
     NP                  VP
   ___|____      _________|__________
  Det     N   VerbalP  NP         AdvP
   |     man     |      |           |
  DT             V     PR         PrepP
  the          sees    us      _____|______
                              PP         NP
                             with     ____|_____
                                     Det       N
                                      |    telescope
                                     DT
                                      a

A significant amount of grammar development can be done just with context-free phrase structure rules such as these. For example, parsing the sentence "the man sees us with a telescope" with this simple grammar produces a parse tree like that shown in figure 3. Parsing the similar sentence "we see the man with a telescope" produces two different parses as shown in figure 4, correctly showing the ambiguity between whether we used a telescope to see the man, or the man had a telescope when we saw him.

Figure 4: Parses of an ambiguous English sentence

             S 
   __________|__________
  NP                   VP 
   |      _____________|_____________
  PR   VerbalP        NP           AdvP 
  we      |         ___|____         |
          V       Det       N      PrepP 
         see       |       man  _____|______
                  DT           PP          NP 
                  the         with      ____|_____
                                       Det       N 
                                        |     telescope 
                                       DT 
                                        a 


          S 
   _______|________
  NP              VP 
   |      ________|________
  PR   VerbalP            NP 
  we      |      _________|__________
          V     Det       N        PrepP 
         see     |       man     _____|______
                DT               PP          NP 
                the             with     ____|_____
                                        Det       N 
                                         |    telescope 
                                        DT 
                                         a 
A fundamental problem with context-free phrase structure grammars is that they tend to grossly overgenerate. For example, the sample grammar would incorrectly recognize the sentence "*he see the man with a telescope", assigning it tree structures similar to those shown in figure 4. With only the simple categories used by context-free phrase structure rules, a very large number of rules are required to accurately handle even a small subset of a language's grammar. This is the primary motivation behind feature structures, the basic enhancement of PATR-II over context-free phrase structure grammars. (Gazdar and Mellish (1989, pages 142-147) discuss why context-free phrase structure grammars are inadequate to model some human languages. The PATR-II formalism (unification of feature structures added to the context-free phrase structure rules) is shown to be adequate for those cases.)

1.2 Feature structures

The basic data structure of the PATR-II formalism is called a feature structure. A feature structure contains one or more features. A feature consists of an attribute name and a value. Feature structures are commonly written as attribute-value matrices like this (example 1):
(1)
[ lex:  telescope 
  cat:  N ]
where lex and cat are attribute names, and telescope and N are the values for those attributes. Note that the feature structure is enclosed in brackets, with an open bracket marking the beginning of a feature structure and a matching close bracket marking its end. Each feature occurs on a separate line, with the name coming first, followed by a colon and then its value. Feature names and (simple) values are single words consisting of alphanumeric characters.

Feature structures can have either simple values, such as the example above, or complex values, such as this (example 2):

(2)
[ lex:       telescope 
  cat:       N 
  gloss:     `telescope 
  head:    [ agr:   [ 3sg:  + ] 
             number:  SG 
             pos:     N  
             proper:  - 
             verbal:  - ] 
  root_pos:  N ]
where the value of the head feature is another feature structure, that also contains an embedded feature structure. Feature structures can be arbitrarily nested in this manner.

Portions of a feature structure can be referred to using the path notation. A path is a sequence of one or more feature names enclosed in angled brackets ( <> ). For instance, example 3-5 would all be valid feature paths based on the feature structure of example 2

(3) <head> 
(4) <head number> 
(5) <head agr 3sg> 
Paths are used in feature templates and feature constraints, described below.

Different features within a feature structure can share values. This is not the same thing as two features having identical values. In example 6 below, the <head agr> and <subj head agr> features have identical values, but in example 7 they share the same value:

(6)
[  cat:   S 
   pred:  [ cat:   VP 
            head:  [ agr:   [ 3sg:  + ] 
                     finite:  + 
                     pos:     V 
                     tense:   PAST 
                     vform:   ED ] ] 
   subj:  [ cat:   NP 
            head:  [ agr:   [ 3sg:     + ] 
                     case:    NOM 
                     number:  SG 
                     pos:     N 
                     proper:  - 
                     verbal:  - ] ] ]

(7)
[  cat:   S 
   pred:  [ cat:   VP 
            head:  [ agr:     $1[ 3sg:  + ] 
                     finite:  + 
                     pos:     V 
                     tense:   PAST 
                     vform:   ED ] ] 
   subj:  [ cat:   NP 
            head:  [ agr:     $1 
                     case:    NOM 
                     number:  SG 
                     pos:     N 
                     proper:  - 
                     verbal:  - ] ] ]
(Example 8 not used, same as example 7.)

Shared values are indicated by the coindexing markers $1, $2, and so on.

Note that upper and lower case letters used in feature names and values are distinctive. For example, NUMBER is not the same as Number or number. (This is also true of the symbols used in the context-free phrase structure rules.)

1.3 Unification

Unification is the basic operation applied to feature structures in PC-PATR It consists of the merging of the information from two feature structures. Two feature structures can unify if their common features have the same values, but do not unify if any feature values conflict.

Consider the following feature structures:

(9)
[ agreement:  [ number:  singular 
                person:  first ] ]

(10)
[ agreement:  [ number:  singular ] 
  case:         nominative ] ]

(11)
[ agreement:  [ number:  singular 
                person:  third ] ]

(12)
[ agreement:  [ number:  singular 
                person:  first ] 
  case:       nominative ] ]

(13)
[ agreement:  [ number:  singular 
                person:  third ] 
  case:       nominative ] ]

Feature 10 can unify with either feature 9 (producing feature 12) or feature 11 (producing feature 13). However, feature 9 cannot unify with feature 11 due to the conflict in the values of their <agreement person> features.

1.4 Feature constraints

The feature constraints associated with phrase structure rules in PC-PATR consist of a set of unification expressions. Each expression has three parts, in this order:
  1. a feature path, the first element of which is one of the symbols from the phrase structure rule

  2. an equal sign (=)

  3. either a simple value, or another feature path that also starts with a symbol from the phrase structure rule
As an example, consider the following PC-PATR rules:
(14)
Rule S -> NP VP (SubCl)
    <NP head agr>  = <VP head agr>
    <NP head case> = NOM
    <S subj>       = <NP>
    <S head>       = <VP head>
(15)
Rule NP -> {(Det) (AJ) N (PrepP)} / PR
    <Det head number> = <N head number>
    <NP head>         = <N head>
    <NP head>         = <PR head>
Rule 14 has two feature constraints that limit the co-occurrence of NP and VP, and two feature constraints that build the feature structures for S. This highlights the dual purpose of feature constraints in PC-PATR: limiting the co-occurrence of phrase structure elements and constructing the feature structure for the element defined by a rule. The first constraint states that the NP and VP <head agr> features must unify successfully, and also modifies both of those features if they do unify. The second constraint states that NP's <head case> feature must either be equal to NOM or else be undefined. In the latter case, it is set equal to NOM. The last two constraints create a new feature structure for S from the feature structures for NP and VP.

Rule 15 illustrates another important point about feature constraints. Constraints are applied only if they involve the phrase structure constituents actually found for the rule.

Figure 5: PC-PATR grammar of English subset

Rule  S -> NP VP (SubCl)
        <NP head agr>  = <VP head agr>
        <NP head case> = NOM
        <S subj> = <NP>
        <S pred> = <VP>

Rule  NP -> {(Det) (AdjP) N (PrepP)} / PR
        <Det head number> = <N head number>
        <NP head> = <N head>
        <NP head> = <PR head>

Rule  Det -> DT / PR
        <PR head case> = GEN
        <Det head> = <DT head>
        <Det head> = <PR head>

Rule  VP -> VerbalP (NP / AdjP) (AdvP)
        <NP head case>   = ACC
        <NP head verbal> = -
        <VP head> = <VerbalP head>

Rule  VerbalP -> V
        <V head finite> = +
        <VerbalP head>  = <V head>

Rule  VerbalP -> AuxP V
        <V head finite> = -
        <VerbalP head>  = <AuxP head>

Rule  AuxP -> AUX (AuxP_1)
        <AuxP head> = <AUX head>

Rule  PrepP -> PP NP
        <NP head case> = ACC
        <PrepP head> = <PP head>

Rule  AdjP -> (AV) AJ (AdjP_1)

Rule  AdvP -> {AV / PrepP} (AdvP_1)

Rule  SubCl -> CJ S

Figure 6: PC-PATR output with feature structure

1:
                S 
      __________|__________
     NP                  VP 
   ___|____      _________|__________
  Det     N   VerbalP  NP         AdvP 
   |     man     |      |           |
  DT             V     PR         PrepP 
  the           saw    us      _____|______
                              PP         NP 
                             with     ____|_____
                                     Det       N 
                                      |    telescope 
                                     DT 
                                      a 

[ cat: S pred: [ cat: VP head: [ agr: $1[ 3sg: + ] finite:+ pos: V tense: PAST vform: ED ] ] subj: [ cat: NP head: [ agr: $1 case: NOM number:SG pos: N proper:- verbal:- ] ] ]

1 parse found

Figure 5 shows the grammar of figure 1 augmented with a number of feature constraints. With this grammar (and a suitable lexicon), the parse output shown in figure 3 would include the sentence feature structure, as shown in figure 6. Note that the <subj head agr> and <pred head agr> features share a common value as a result of the feature constraint unifications associated with the rule S --> NP VP (SubCl).

PC-PATR allows disjunctive feature constraints with its phrase structure rules. Consider rules 16 and 17 below. These two rules have the same phrase structure rule part. They can therefore be collapsed into the single rule 18, which has a disjunction in its unification constraints.

(16)
Rule CP -> NP C'        ; for wh questions with NP fronted
    <NP type wh> = +
    <C' moved A-bar> = <NP>
    <CP type wh> = <NP type wh>
    <CP type> = <C' type>
    <CP moved A-bar> = none
    <CP type root> = +          ; root clauses
    <CP type q> = +
    <CP type fin> = +
    <CP moved A> = none
    <CP moved head> = none
(17)
Rule CP -> NP C'        ; for wh questions with NP fronted
    <NP type wh> = +
    <C' moved A-bar> = <NP>
    <CP type wh> = <NP type wh>
    <CP type> = <C' type>
    <CP moved A-bar> = none
    <CP type root> = -          ; non-root clauses
(18)
Rule CP -> NP C'        ; for wh questions with NP fronted
    <NP type wh> = +
    <C' moved A-bar> = <NP>
    <CP type wh> = <NP type wh>
    <CP type> = <C' type>
    <CP moved A-bar> = none
    {
    <CP type root> = +		; root clauses
    <CP type q> = +
    <CP type fin> = +
    <CP moved A> = none
    <CP moved head> = none
        /
    <CP type root> = -		; non-root clauses
    }

Not only does PC-PATR allow disjunctive unification constraints, but it also allows disjunctive phrase structure rules. Consider rule 19; it is very similar to rule 18. These two rules can be further combined to form rule 20, which has disjunctions in both its phrase structure rule and its unification constraints.

(19)
Rule CP -> PP C'        ; for wh questions with PP fronted
    <PP type wh> = +
    <C' moved A-bar> = <PP>
    <CP type wh> = <PP type wh>
    <CP type> = <C' type>
    <CP moved A-bar> = none
    {
    <CP type root> = +		; root clauses
    <CP type q> = +
    <CP type fin> = +
    <CP moved A> = none
    <CP moved head> = none
        /
    <CP type root> = -		; non-root clauses
    }
(20)
; for wh questions with NP or PP fronted
Rule CP -> { NP / PP } C'
    <NP type wh> = +
    <C' moved A-bar> = <NP>
    <CP type wh> = <NP type wh>
    <PP type wh> = +
    <C' moved A-bar> = <PP>
    <CP type wh> = <PP type wh>
    <CP type> = <C' type>
    <CP moved A-bar> = none
    {
    <CP type root> = +		; root clauses
    <CP type q> = +
    <CP type fin> = +
    <CP moved A> = none
    <CP moved head> = none
        /
    <CP type root> = -		; non-root clauses
    }

Since the open brace ({) introduces disjunctions both in the phrase structure rule and in the unification constraints, care must be taken to avoid confusing PC-PATR when it is loading the grammar file. The end of the phrase structure rule, and the beginning of the unification constraints, is signaled either by the first constraint beginning with an open angle bracket (<) or by a colon (:). If the first constraint is part of a disjunction, then the phrase structure rule must end with a colon. Otherwise, PC-PATR will treat the unification constraint as part of the phrase structure rule, and will shortly complain about syntax errors in the grammar file.

Perhaps it should be noted that disjunctions in phrase structure rules or unifications are expanded when the grammar file is read. They serve only as a convenience for the person writing the rules.

1.5 The lexicon

The lexicon provides the basic elements (atoms) of the grammar, which are usually words. Information like that shown in feature 2 is provided for each lexicon entry. Unlike the original implementation of PATR-II, PC-PATR stores the lexicon in a separate file from the grammar rules. See section 3 below for details.

2 Syntax of the grammar file

The following specifications apply generally to the grammar file:

2.1 Rules

A PC-PATR grammar rule has these parts, in the order listed:
  1. the keyword Rule

  2. an optional rule identifier enclosed in braces ({})

  3. the nonterminal symbol to be expanded

  4. an arrow (->) or equal sign (=)

  5. zero or more terminal or nonterminal symbols, possibly marked for alternation or optionality

  6. an optional colon (:)

  7. zero or more feature constraints

  8. an optional period (.)

The optional rule identifier consists of one or more words enclosed in braces. Its current utility is only as a special form of comment describing the intent of the rule. (Eventually it may be used as a tag for interactively adding and removing rules.) The only limits on the rule identifier are that it not contain the comment character and that it all appears on the same line in the grammar file.

The terminal and nonterminal symbols in the rule have the following characteristics:

The symbols on the right hand side of a phrase structure rule may be marked or grouped in various ways:

A rule can be followed by zero or more feature constraints that refer to symbols used in the rule. A feature constraint has these parts, in the order listed:

  1. a feature path that begins with one of the symbols from the phrase structure rule

  2. an equal sign

  3. either another path or a value

A feature constraint that refers only to symbols on the right hand side of the rule constrains their co-occurrence. In the following rule and constraint, the values of the agr features for the NP and VP nodes of the parse tree must unify:

        Rule S -> NP VP
                <NP agr> = <VP agr>
If a feature constraint refers to a symbol on the right hand side of the rule, and has an atomic value on its right hand side, then the designated feature must not have a different value. In the following rule and constraint, the head case feature for the NP node of the parse tree must either be originally undefined or equal to NOM:
        Rule S -> NP VP
                <NP head case> = NOM
(After unification succeeds, the head case feature for the NP node of the parse tree will be equal to NOM.)

A feature constraint that refers to the symbol on the left hand side of the rule passes information up the parse tree. In the following rule and constraint, the value of the tense feature is passed from the VP node up to the S node:

        Rule S -> NP VP
                <S tense> = <VP tense>

2.2 Feature templates

A PC-PATR feature template has these parts, in the order listed:
  1. the keyword Let

  2. the template name

  3. the keyword be

  4. a feature definition

  5. an optional period (.)
If the template name is a terminal category (a terminal symbol in one of the phrase structure rules), the template defines the default features for that category. Otherwise the template name serves as an abbreviation for the associated feature structure.

The characters (){}[]<>=: cannot be used in template names since they are used for special purposes in the grammar file. The characters /_ can be freely used in template names. The character \ should not be used as the first character of a template name because that is how fields are marked in the lexicon file.

The abbreviations defined by templates are usually used in the feature field of entries in the lexicon file. For example, the lexical entry for the irregular plural form feet may have the abbreviation pl in its features field. The grammar file would define this abbreviation with a template like this:

     Let pl be [number: PL]
The path notation may also be used:
     Let pl be <number> = PL
More complicated feature structures may be defined in templates. For example,
     Let 3sg be [tense:  PRES
                 agr:    3SG
                 finite: +
                 vform:  S]
which is equivalent to:
    Let 3sg be [<tense>  = PRES
                <agr>    = 3SG
                <finite> = +
                <vform>  = S]
In the following example, the abbreviation irreg is defined using another abbreviation:
    Let irreg be <reg> = -
                 pl
The abbreviation pl must be defined previously in the grammar file or an error will result. A subsequent template could also use the abbreviation irreg in its definition. In this way, an inheritance hierarchy features may be constructed.

Feature templates permit disjunctive definitions. For example, the lexical entry for the word deer may specify the feature abbreviation sg-pl. The grammar file would define this as a disjunction of feature structures reflecting the fact that the word can be either singular or plural:

    Let sg/pl be {[number:SG]
                  [number:PL]}
This has the effect of creating two entries for deer, one with singular number and another with plural. Note that there is no limit to the number of disjunct structures listed between the braces. Also, there is no slash (/) between the elements of the disjunction as there is between the elements of a disjunction in the rules. A shorter version of the above template using the path notation looks like this:
    Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they have previously been defined:
    Let sg be <number> = SG
    Let pl be <number> = PL
    Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations sg and pl without square brackets they would be interpreted as simple values instead.

Feature templates can assign default atomic feature values, indicated by prefixing an exclamation point (!). A default value can be overridden by an explicit feature assignment. This template says that all members of category N have singular number as a default value:

    Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they are explicitly marked as plural. For example, regular nouns such as book do not need any feature in their lexical entries to signal that they are singular; but an irregular noun such as feet would have a feature abbreviation such as pl in its lexical entry. This would be defined in the grammar as [number: PL], and would override the default value for the feature number specified by the template above. If the N template above used SG instead of !SG, then the word feet would fail to parse, since its number feature would have an internal conflict between SG and PL.

2.3 Parameter settings

A PC-PATR parameter setting has these parts, in the order listed:
  1. the keyword Parameter

  2. an optional colon (:)

  3. one or more keywords identifying the parameter

  4. the keyword is

  5. the parameter value

  6. an optional period (.)

PC-PATR recognizes the following parameters:

2.4 Lexical rules

A PC-PATR lexical rule has these parts, in the order listed:
  1. the keyword Define

  2. the name of the lexical rule

  3. the keyword as

  4. the rule definition

  5. an optional period (.)
The rule definition consists of one or more mappings. Each mapping has three parts: an output feature path, an assignment operator, and the value assigned, either an input feature path or an atomic value. Every output path begins with the feature name out and every input path begins with the feature name in. The assignment operator is either an equal sign (=) or an equal sign followed by a "greater than" sign (=>). (These two operators are equivalent in PC-PATR since the implementation treats each lexical rule as an ordered list of assignments rather than using unification for the mappings that have an equal sign operator.)

Consider the information shown in figure 7

Figure 7: PC-PATR lexical rule example

; lexicon entry
\w stormed
\c V
\f Transitive AgentlessPassive
   <head trans pred> = storm
;
 definitions from the grammar file
Let Transitive be
        <subcat first cat> = NP
        <subcat rest first cat> = NP
        <subcat rest rest> = end
        <head trans arg1> = <subcat first head trans>
        <head trans arg2> = <subcat rest first head trans>.

Define AgentlessPassive as <out cat> = <in cat> <out subcat> = <in subcat rest> <out lex> = <in lex> ; added for PC-PATR <out head> = <in head> <out head form> => passiveparticiple.

Figure 8: Feature structure before lexical rule

[  lex:    stormed 
   cat:    V       
   head:   [ trans: [ arg1: $1 [  ] 
                            arg2: $2 [  ] 
                            pred: storm    ] ] 
   subcat: [ first: [ cat:  NP                    
                            head: [ trans: $1 ] ] 
                rest:  [ first: [ cat:  NP                    
                                        head: [ trans: $2 ] ] 
                            rest: end  ] ] ]

Figure 9: Feature structure after lexical rule

[  lex:    stormed 
   cat:    V       
   head:   [ trans: [ arg1: [  ] 
                            arg2: $1 [  ] 
                            pred: storm    ] 
                form:  passiveparticiple      ] 
   subcat: [ first: [ cat:  NP                    
                            head: [ trans: $1 ] ] 
                rest:  end  ] ]
When the lexicon entry is loaded, it is initially assigned the feature structure shown in figure 8, which is the unification of the information given in the various fields of the lexicon entry. Since one of the the labels stored in the \f (feature) field is actually the name of a lexical rule, after the complete feature structure has been built, the named lexical rule is applied. After the rule has been applied, the feature structure has been changed to the one shown in figure 9 Note that all of the information in the output feature structure is from the input feature structure, but not all of the input feature information is found in the the output feature structure. Using a lexical rule in conjunction with the PC-Kimmo morphological parser within PC-PATR is illustrated in figures 10-12

Figure 10: PC-PATR lexical rule for using PC-Kimmo

Define MapKimmoFeatures as
        <out cat>       = <in head pos>
        <out head>      = <in head>
        <out gloss>     = <in root>
        <out root_pos>  = <in root_pos>

Figure 11: Feature structure received from PC-Kimmo

[  cat:       Word   
   clitic:    []   
   drvstem:   []     
   head:      [ agr:    [ 3sg: + ] 
                   finite: +    
                   pos:    V    
                   tense:  PRES 
                   vform:  S    ] 
   root:      `sleep 
   root_pos: V ]

Figure 12: Feature structure sent to PC-PATR

[  cat:       V      
   gloss:     `sleep 
   head:      [ agr:    [ 3sg: + ] 
                   finite: +    
                   pos:    V    
                   tense:  PRES 
                   vform:  S ] 
   lex:       sleeps 
   root_pos: V ]
Figure 10 shows the lexical rule for mapping from the top-level feature structure produced by the morphological parser to the bottom-level feature structure used by the sentence parser. Note that this rule must be named MapKimmoFeatures (unorthodox capitalization and all). Figure 11 shows the feature structure created by the PC-Kimmo parser. After the lexical rule shown in figure 10 has been applied (and after some additional automatic processing), the feature structure shown in figure 10 is passed to the PC-PATR parser.

Note that the feature structure passed to the PC-PATR parser always has both a lex feature and a gloss feature, even if the MapKimmoFeatures lexical rule does not create them. The default value for the lex feature is the original word from the sentence being parsed. The default value for the gloss feature is the concatenation of the glosses of the individual morphemes in the word.

In contrast to the lex and gloss features which are provided automatically by default, the cat feature must be provided by the MapKimmoFeatures lexical rule. There is no way to provide this feature automatically, and it is required for the phrase structure rule portion of PC-PATR.

3 Syntax of the lexicon file

The lexicon file is a standard format ("standard" in SIL terms, that is) database file consisting of any number of records, each of which represents one word. These records are divided into fields, each of which begins with a standard format marker at the beginning of a line. These markers begin with the \ (backslash) character followed by one or more alphanumeric characters. Each record begins with a designated field. PC-PATR recognizes four different fields, with these default field markers: Note that the fields containing the lexical form of the word and its category must be present for each word (record) in the lexicon. The other two fields (glosses and features) are optional, as are additional fields that may be present for other purposes.

Each word loaded from the lexicon file is assigned certain features based on the fields described above.

These feature names should be treated as reserved names and not used for other purposes.

For example, consider these entries for the words fox and foxes.

        \w fox
        \c N
        \g canine
        \f <number> = singular

        \w foxes
        \c N
        \g canine+PL
        \f <number> = plural
When these entries are used by the grammar, they are represented by these feature structures:
       [ cat:    N
         gloss:  canine
         lex:    foxes
         number: singular ]

       [ cat:    N
         gloss:  canine+PL
         lex:    foxes
         number: plural ]
The lexicon entries can be simplified by defining feature templates in the grammar file. Consider the following templates:
   Let PL be <number> = plural
   Let N  be <number> = !singular
With these two templates, defining an abbreviation for "plural" and defining a default feature for category N (noun), the lexicon entries can be rewritten as follows:
        \w fox
        \c N
        \g canine
        \f

        \w foxes
        \c N
        \g canine+PL
        \f PL
Note that the feature (\f) field of the first entry could be omitted altogether since it is now empty.

4 Using the embedded morphological parsers

Normally, PC-PATR requires the linguist to develop a full-fledged lexicon of words with their features. This may be unnecessary if a morphological analysis, and a comprehensive lexicon of morphemes, has already been developed using either PC-Kimmo (version 2) or AMPLE. These morphological parsing programs are also available from SIL.

4.1 PC-Kimmo

Version 2 of PC-Kimmo supports a PC-PATR style grammar for defining word structure in terms of morphemes. This provides a straightforward way to obtain word features as a result of the morphological analysis process. For best results, the (PC-Kimmo) word grammar and the (PC-PATR sentence or phrase grammar should be developed together.

When using the PC-Kimmo morphological parser, PC-PATR requires a special lexical rule in the (sentence level) grammar file. This rule is named MapKimmoFeatures and is used automatically to map from the features produced by the word parse to the features needed by the sentence parse. For example, consider the following definition:

Define MapKimmoFeatures as
        <out cat>       = <in head pos>
        <out lex>       = <in lex>
        <out head>      = <in head>
This lexical rule uses the <head pos> feature produced by the PC-Kimmo parser as the <cat> feature for the PC-PATR parser, and passes the <lex> and <head> features from the morphological parser to the sentence parser unchanged.

4.2 AMPLE

(This has not yet been implemented. It will be similar in concept to the embedded PC-Kimmo processing.)

5 Running the program

5.1 Installation

5.1.1 MS-DOS

The MS-DOS version of PC-PATR is named pcpatr.exe. It is a 32-bit application (with an embedded DOS extender), so it requires at least a 386SX machine to run. The following line should be added to your autoexec.bat file:
     set GO32TMP=c:/temp
Replace c:/temp with a suitable directory on your system for storing temporary files. Note that the slashes in this command are forward (/) slashes, not backslashes (\).

5.1.2 Microsoft Windows

The Windows version of PC-PATR is named wpcpatr.exe. Put it wherever you want, and install it in a program group wherever you want under any name you want. It is a QuickWin application, which is a big advantage for the lazy programmer, but barely looks like a Windows application.

5.1.3 Macintosh

The Macintosh version of PC-PATR is named PC-PATR. Put it wherever you want on your disk, although there are problems if it is located anywhere but the directory where your data files live. It is an SIOW application, which is a big advantage for the lazy programmer, but barely looks like a Macintosh application.

5.1.4 Unix

Get the sources and compile them. See the file INSTALL that comes with the sources for instructions on configuring, compiling, and installing the program.

The sources have been successfully compiled on a SparcStation 1+ using both GNU gcc 2.5.7 and the standard cc compiler that comes with SunOS 4.1.3_U1.

5.2 Command line options

PC-PATR has the following command line options for MS-DOS and Unix: These options are processed in the order indicated by the list above, after the initialization file (if any) is processed, and before any commands typed interactively by the user.

5.3 Initialization file

When it starts, PC-PATR looks for a special take file to initialize itself. It searches for the following files in order, stopping with the first one it finds:
  1. pcpatr.tak in the current directory

  2. pc-patr.tak in the current directory

  3. pcpatr.tak in the directory where the program is located (MS-DOS or Windows), or in the user's home directory (Unix)

  4. pc-patr.tak in the directory where the program is located (MS-DOS or Windows), or in the user's home directory (Unix)

  5. .pcpatrrc in the user's home directory (Unix)
If no initialization file is found, then PC-PATR comes up with a standard, default set of settings. If an initialization file is found, then it is read before any command line options are processed.

5.4 PC-PATR commands

Each of the commands available in PC-PATR is described below. Each command consists of one or more keywords followed by zero or more arguments. Keywords may be abbreviated to the minimum length necessary to prevent ambiguity.
  1. cd directory
    changes the current directory to the one specified. Spaces in the directory pathname are not permitted.

    For MS-DOS or Windows, you can give a full path starting with the disk letter and a colon (for example, a:); a path starting with \ which indicates a directory at the top level of the current disk; a path starting with .. which indicates the directory above the current one; and so on. Directories are separated by the \ character.

    For the Macintosh, you can give a full path starting with the name of your hard disk, a path starting with : which means the current folder, or one starting :: which means the folder containing the current one (and so on).

    For Unix, you can give a full path starting with a / (for example, /usr/pcpatr); a path starting with .. which indicates the directory above the current one; and so on. Directories are separated by the / character.

  2. clear
    erases all existing grammar and lexicon information, allowing the user to prepare to load information for a new language. Strictly speaking, it is not needed since the load grammar command erases the previously existing grammar, and the load lexicon and load analysis commands erase any previously existing lexicon.

  3. close
    closes the current log file opened by a previous log command.

  4. directory
    lists the contents of the current directory. This command is available only for the MS-DOS and Unix implementations. It does not exist for Microsoft Windows or the Macintosh.

  5. edit filename
    attempts to edit the specified file using the program indicated by the environment variable EDITOR. If this environment variable is not defined, then edlin is used to edit the file on MS-DOS and vi is used to edit the file on Unix. (These defaults should convince you to set this variable!) This command is not available for Microsoft Windows or the Macintosh.

  6. exit
    stops PC-PATR returning control to the operating system. This is the same as quit.

  7. file disambiguate input.ana [ out.ana]
    reads sentences from the specified AMPLE analysis file and writes the corresponding parse trees and feature structures either to the screen or to the optionally specified output file. If the output file is written, ambiguous word parses are eliminated as much as possible as a result of the sentence parsing. When finished, a statistical report of successful (sentence) parses is displayed on the screen.

  8. file parse input-file [ output-file]
    reads sentences from the specified input file, one per line, and writes the corresponding parse trees and feature structures to the screen or to the optionally specified output file. The comment character is in effect while reading this file. PC-PATR currently makes no attempt to handle either capitalization or punctuation. (Probably some capability for handling punctuation will be added at some point.)

    This command behaves the same as parse except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. When finished, a statistical report of successful parses is displayed on the screen.

  9. help command
    displays a description of the specified command. If help is typed by itself, PC-PATR displays a list of commands with short descriptions of each command.

  10. load ample control file list
    not yet implemented

  11. load ample infix infix.dic
    not yet implemented

  12. load ample prefix prefix.dic
    not yet implemented

  13. load ample roots rt1.dic [ rt2.dic ...]
    not yet implemented

  14. load ample suffix suffix.dic
    not yet implemented

  15. load analysis file1.ana [file2.ana ...]
    erases any existing lexicon and reads a new lexicon from the specified AMPLE analysis file(s). Note that more than one file may be loaded with the single load analysis command: duplicate entries are not stored in the lexicon.

    The default filetype extension for load analysis is .ana, and the default filename is ample.ana. l a is a synonym for load analysis.

  16. load grammar file.grm
    erases any existing grammar and reads a new grammar from the specified file.

    The default filetype extension for load grammar is .grm, and the default filename is grammar.grm. l gis a synonym for load grammar.

  17. load kimmo grammar file.grm
    erases any existing PC-Kimmo (word) grammar and reads a new word grammar from the specified file.

    The default filetype extension for load kimmo grammar is .grm, and the default filename is grammar.grm. l k g is a synonym for load kimmo grammar.

  18. load kimmo lexicon file.lex
    erases any existing PC-Kimmo lexicon information and reads a new morpheme lexicon from the specified file. A PC-Kimmo rules file must be loaded before a PC-Kimmo lexicon file can be loaded.

    The default filetype extension for load kimmo lexicon is .lex, and the default filename is lexicon.lex. l k l is a synonym for load kimmo lexicon.

  19. load kimmo mapping file.map
    reads a file that defines the mapping from the feature structures produced by the PC-Kimmo (word) parse and the features needed by the PC-PATR (sentence) parse.

    The default filetype extension for load kimmo mapping is .map, and the default filename is feat.map. l k m is a synonym for load kimmo mapping.

  20. load kimmo rules file.rul
    erases any existing PC-Kimmo rules and reads a new set of rules from the specified file. This also erases any stored AMPLE information.

    The default filetype extension for load kimmo rules is .rul, and the default filename is rules.rul. l k r is a synonym for load kimmo rules.

  21. load lexicon file1.lex [ file2.lex ...]
    erases any existing lexicon and reads a new lexicon from the specified file(s). Note that more than one file may be loaded with a single load lexicon command.

    The default filetype extension for load lexicon is .lex, and the default filename is lexicon.lex. l l is a synonym for load lexicon.

  22. log [ file.log ]
    opens a log file. Each item processed by a parse command is stored to the log file as well as being displayed on the screen.

    If a filename is given on the same line as the log command, then that file is used for the log file. Any previously existing file with the same name will be overwritten. If no filename is provided, then the file pcpatr.log in the current directory is used for the log file.

    Use close to stop recording in a log file. If a log command is given when a log file is already open, then the earlier log file is closed before the new log file is opened.

  23. parse [ sentence or phrase]
    attempts to parse the input sentence according to the loaded grammar. If a sentence is typed on the same line as the command, then that sentence is parsed. If the parse command is given by itself, then the user is prompted repeatedly for sentences to parse. This cycle of typing and parsing is terminated by typing an empty "sentence" (that is, nothing but the Enter or Return key).

    Both the grammar and the lexicon must be loaded before using this command.

  24. quit
    stops PC-PATR returning control to the operating system. This is the same as exit.

  25. save lexicon [file.lex]
    writes the current lexicon contents to the designated file. The output lexicon file must be specified. This can be useful if you are using a morphological parser to populate the lexicon.

  26. save status [ file.tak]
    writes the current settings to the designated file in the form of PC-PATR commands. If the file is not specified, the settings are written to pcpatr.tak in the current directory.

  27. set ambiguities number
    limits the number of analyses printed to the given number. The default value is 10. Note that this does not limit the number of analyses produced, just the number printed.

  28. set check-cycles value
    enables or disables a check to prevent cycles in the parse chart. set check-cycles on turns on this check, and set check-cycles off turns it off. This check slows down the parsing of a sentence, but it makes the parser less vulnerable to hanging on perverse grammars. The default setting is on.

  29. set comment character
    sets the comment character to the indicated value. If character is missing (or equal to the current comment character), then comment handling is disabled. The default comment character is ; (semicolon).

  30. set failures value
    enables or disables grammar failure mode set failures on turns on grammar failure mode, and set failures off turns it off. When grammar failure mode is on, the partial results of forms that fail the grammar module are displayed. A form may fail the grammar either by failing the feature constraints or by failing the constituent structure rules. In the latter case, a partial tree (bush) will be returned. The default setting is off.

    Be careful with this option. Setting failures to on can cause the PC-PATR to go into an infinite loop for certain recursive grammars and certain input sentences. (We may try to do something to detect this type of behavior, at least partially.)

  31. set features value
    determines how features will be displayed. set features all turns on features display mode and displays the feature structures for all nodes of the tree. set features top turns on features display mode and displays the feature structure for the top node of the tree. This is the default setting. set features off turns off features display mode and outputs no feature structures. set features flat causes features to be displayed in a flat, linear string that uses less space on the screen. set features full causes features to be displayed in an indented form that makes the embedded structure of the feature set clear.

  32. set gloss value
    enables the display of glosses in the parse tree output if value is on, and disables the display of glosses if value is off. If any glosses exist in the lexicon file, then gloss is automatically turned on when the lexicon is loaded. If no glosses exist in the lexicon, then this flag is ignored.

  33. set marker category marker
    establishes the marker for the field containing the category (part of speech) feature. The default is \c.

  34. set marker features marker
    establishes the marker for the field containing miscellaneous features. (This field is not needed for many words.) The default is \f.

  35. set marker gloss marker
    establishes the marker for the field containing the word gloss. The default is \g.

  36. set marker record marker
    establishes the field marker that begins a new record in the lexicon file. This may or may not be the same as the word marker. The default is \w.

  37. set marker word marker
    establishes the marker for the word field. The default is \w.

  38. set timing value
    enables timing mode if value is on, and disables timing mode if value is off. If timing mode is on, then the elapsed time required to process a command is displayed when the command finishes. If timing mode is off, then the elapsed time is not shown. The default is off. (This option is useful only to satisfy idle curiosity.)

  39. set top-down-filter value
    enables or disables top-down filtering based on the categories. set top-down-filter on turns on this filtering, and set top-down-filter off turns it off. The top-down filter speeds up the parsing of a sentence, but might cause the parser to miss some valid parses. The default setting is on.

    This should not be required in the final version of PC-PATR.

  40. set tree value
    specifies how parse trees should be displayed. set tree full turns on the parse tree display, displaying the result of the parse as a full tree. This is the default setting. A short sentence would look something like this:
               Sentence
                   |
              Declarative
              _____|_____
             NP        VP
              |      ___|____
              N      V    COMP
            cows    eat     |
                           NP
                            |
                            N
                          grass
    
    set tree flat turns on the parse tree display, displaying the result of the parse as a flat tree structure in the form of a bracketed string. The same short sentence would look something like this:
     (Sentence (Declarative (NP
         (N  cows)) (VP (V  eat) (COMP
         (NP (N  grass))))))
    
    set tree indented turns on the parse tree display, displaying the result of the parse in an indented format sometimes called a northwest tree The same short sentence would look like this:
        Sentence
            Declarative
                NP
                    N  cows
                VP
                    V  eat
                    COMP
                        NP
                            N  grass
    
    set tree off disables the display of parse trees altogether.

  41. set trim-empty-features value
    disables the display of empty feature values if value is on, and enables the display of empty feature values if value is off. The default is not to display empty feature values.

  42. set unification value
    enables or disables feature unification. set unification on turns on unification mode. This is the default setting. set unification off turns off feature unification in the grammar. Only the context-free phrase structure rules are used to guide the parse; the feature contraints are ignored. This can be dangerous, as it is easy to introduce infinite cycles in recursive phrase structure rules.

  43. set verbose value
    enables or disables the screen display of parse trees in the file parse command. set verbose on enables the screen display of parse trees, and set verbose off disables such display. The default setting is off.

  44. set warnings value
    enables warning mode if value is on, and disables warning mode if value is off. If warning mode is enabled, then warning messages are displayed on the output. If warning mode is disabled, then no warning messages are displayed. The default setting is on.

  45. show lexicon
    prints the contents of the lexicon stored in memory on the standard output. This is not very useful, and may be removed.)

  46. show status
    displays the names of the current grammar, sentences, and log files, and the values of the switches established by the set command. show (by itself) and status are synonyms for show status.

  47. system [command]
    allows the user to execute an operating system command (such as checking the available space on a disk) from within PC-PATR This is available only for MS-DOS and Unix, not for Microsoft Windows or the Macintosh.

    If no system-level command is given on the line with the system command, then PC-PATR is pushed into the background and a new system command processor (shell) is started. Control is usually returned to PC-PATR in this case by typing exit as the operating system command. ! (exclamation point) is a synonym for system.

  48. take [ file.tak]
    redirects command input to the specified file.

    The default filetype extension for take is .tak, and the default filename is pcpatr.tak. take files can be nested three deep. That is, the user types take file1, file1 contains the command take file2, and file2 has the command take file3. It would be an error for file3 to contain a take command. This should not prove to be a serious limitation.

    A take file can also be specified by using the -t command line option when starting PC-PATR When started, PC-PATR looks for a take file named pcpatr.tak in the current directory to initialize itself with.

Acknowledgements

Several people have contributed to the development of PC-PATR over the past few years. Alan Buseman, Jim Skon, Bob Kasper, and Nathan Miles all contributed to an earlier program named SILPATR that contained the same basic parsing and unification functions. (SILPATR is a Microsoft Windows program, which limits its availability. It is no longer being developed or supported.) Femke Hemels helped adapt the SILPATR functions to work in the PC-Kimmo program. Evan Antworth has served as a willing guinea pig in testing the revised PC-Kimmo program, developing an English lexicon with a PATR word grammar. Markus Koetter suggested some improvements for dealing with certain pathological types of context-free phrase structure grammars. Responsibility for PC-PATR must rest with the author, however, as he has greatly revised the code produced by others.

Bibliography

Shieber, Stuart M., "The Design of a Computer Language for Linguistic Information," In Proceedings of Coling84, 10th International Conference on Computational Linguistics, Stanford University, Stanford, California, 2-7 July 1984, pages 362-366.

Shieber, Stuart M., "Using Restriction to Extend Parsing Algorithms for Complex-feature-based Formalisms," In Proceedings of the 22nd Annual Meeting of the Association for Computational Linguistics, University of Chicago, Chicago, Illinois, 8-12 July 1985, pages 145-152.

Shieber, Stuart M., An Introduction to Unification Based Approaches to Grammar, CSLI Lecture Notes Series, Number 4, Center for the Study of Language and Information, Stanford University, 1986.

Gazdar, Gerald and Chris Mellish, Natural Language Processing in LISP, Addison-Wesley Publishing Company, Reading, MA, 1989.


Last modified: 16-Nov-1995

Reply to WWW@sil.org

[SIL Home Page | PC-PATR]