Technical Report ISTL-92-2. October 1992. Xerox Palo Alto Research Center, Palo Alto, California. Copyright (c) 1992 by the Xerox Corporation. All rights reserved.

Two-Level Rule Compiler

Lauri Karttunen
Xerox Palo Alto Research Center

Kenneth R. Beesley
Microlytics

Abstract: Two-level rules are declarative constraints that describe morphological alternations, such as the y~ie alternation in the plural of some English nouns (spy~spies). This document describes a compiler, written in C, that converts two-level rules into deterministic, minimized finite-state transducers. It describes the format of two-level grammars, the rule formalism, and the user interface to the compiler. It also explains how the compiler can assist the user in the development of a two-level grammar.
  1. Introduction
  2. Running the compiler
  3. Two-level grammars
    1. General format
    2. Regular expressions
      1. Brackets and parentheses
      2. Special symbols
      3. Incomplete pairs
      4. Complex expressions, operators
      5. Some equivalences
    3. Rule operators
    4. Rule contexts
    5. Rule variables
    6. Feasible pairs
  4. Error checking
    1. Simple errors
    2. Rule conflicts
    3. Conflict resolution
  5. Rule testing
    1. Applying to lexical forms
    2. Checking lexical:surface pairs
  6. Compiler options
  7. Acknowledgements, References
  8. Appendix 1: Help, Summary of commands
  9. Appendix 2: Sample two-level grammar

1. Introduction

In his 1983 dissertation [1], Kimmo Koskenniemi proposed a declarative system of constraints for describing morphological alternations. Constraints of this type are known as two-level rules. Two-level rules are similar to the phonological rewrite-rules introduced in the 1960s by Noam Chomsky and Morris Halle [2] in that both types of rules denote regular relations [3, 4, 5]. Consequently, they can be modeled by finite-state transducers. An important difference is that rewrite-rules are applied in a serial order, whereas two-level rules are parallel constraints that have to be satisfied simultaneously. The two-level formalism is expressive enough to describe phenomena that in classical phonology require rule ordering [6].

The compiler described in this document (TwoL Compiler) converts two-level rules to deterministic, minimized finite-state transducers. The first version of the compiler was written in 1985-1987 in Interlisp by Lauri Karttunen, Kimmo Koskenniemi, and Ronald M. Kaplan. The current C version, based on Karttunen's 1989 Common Lisp implementation, was written by Lauri Karttunen, Todd Yampol, and Kenneth R. Beesley (Microlytics) in consultation with Ronald M. Kaplan at Xerox PARC in 1991-92. The compilation algorithm and the rule formalism are as described in [8]. The compiler currently works on Macintosh and UNIX operating systems. The interface to the compiler described here is common to both versions. The transducers produced with the compiler, combined with a lexicon, can be used for word recognition and generation in systems such as Koskenniemi's two-l program [1] and Evan Antworth's PC-KIMMO [7].

Section 2 of this document (Running the compiler) gives an example of how to apply the compiler to an existing set of rules. Readers who are unfamiliar with two-level grammars may wish to begin with Section 3 (Two-level grammars), which specifies the grammar format and the rule formalism in detail. Sections 4 (Error checking) and 5 (Rule testing) describe ways in which the compiler can assist the rule writer in the development of a two-level grammar. Section 6 (Compiler options) contains further details about the user interface to the compiler. Appendix 1 lists the on-line help messages; Appendix 2 is an example of a two-level grammar.

2. Running the compiler

When you start the compiler (by double-clicking the icon on a Macintosh or by typing 'twolc' in UNIX), it prints a welcoming banner and waits for the user to type a command:

(1)

         ******************************************************
         *               Two-Level Compiler 1.0               *
         *                    created by                      *
         *           Lauri Karttunen, Todd Yampol,            *
         *      Kenneth R. Beesley, and Ronald M. Kaplan.     *
         *  Copyright (c) 1991-92 by the Xerox Corporation.   *
         *                All Rights Reserved.                *
         ******************************************************


Compiler commands:      compile, install-binary, intersect,
                        list-rules, read-grammar, save-binary,
                        save-tabular, show, show-rules.
Switches:               resolve, time, trace.
Information:            ?, banner, help, storage, switches.
Command (C-Q = quit):
The last line is a prompt line where the user may type one of the available commands. The compiler will keep prompting with a new command line until the user quits by typing control-Q (UNIX) or command-Q (Macintosh).

Let us start with 'help':

(2)

Command (C-Q = quit): help
Typing "help " explains what  does.
       "help all" prints out all help messages
"?" prints a list of available commands.

Command (C-Q = quit): 
The two main commands are 'read-grammar' and 'compile'. To illustrate what they do, let us assume that the user has the following mini-grammar stored as a file named rules.txt:


(3)
! First example. Two rules that realize the lexical form "kaNpan"
! as the surface form "kamman".

Alphabet
a b c d e f g h i j k l m N:m N:n n o p q r s t u v x y w z ;

Rules

 "N realized as m"   ! Before a lexical "p." Always, and only there.
    N:m <=> _ p:        ;

  "p realized as m"   ! After a surface "m". Always and only there.
    p:m <=> :m _        ;

Note that the exclamation mark is a comment character; after seeing !, the compiler ignores the rest of the line. (More about the format of the grammar in Section 3.)

The first step towards getting the rules converted to transducers is to have the compiler read them and verify that there are no syntactic errors in the input. This is what the command 'read-grammar' does, after prompting the user for the name of the input file.

(4)

Command (C-Q = quit): read-grammar
Input file: rules.txt
reading from 'rules.txt'...
Alphabet... Rules...
  "N realized as m" "p realized as m"
Command (C-Q = quit): 
As it processes the input, the compiler prints messages about its progress and about errors it detects along the way. Here no errors occurred during `read-grammar', so we are ready for the next step, `compile':

(5)

Command (C-Q = quit): compile
Compiling "rules.txt"
Expanding... Analyzing...
Compiling rule components:
  "N realized as m" "p realized as m"
Compiling rules:
  "N realized as m"
    N:m <=>
  "p realized as m"
    p:m <=>
Done.
Command (C-Q = quit):
The 'redo' command is useful if an error is detected in the course of parsing or compiling. After correcting the problem, the user can re-execute the 'read-grammar' and 'compile' commands on the current file by just typing 'redo'.

The commands 'show-rules' and 'show' display on the screen the resulting transducers; 'show-rules' displays all the compiled rules, and 'show' prompts for a rule name and displays just the transducer for that rule.

(6)

Command (C-Q = quit): show
Rule name: "N realized as m"

   a p N:m N:n
1: 1 1  3   2
2: 1    3   2
3.   1
Equivalence classes:
((a b c d e f g h i j k l m n o q r s t u v w x y z) (p p:m) (N:m)
 (N:n))
Command (C-Q = quit):
Transducers are displayed on the screen as tables in which each row represents a state and the columns represent transition functions for a class of symbol pairs. In a transducer, all symbols are pairs but conventionally we write just one symbol when the two members of the pair are the same. Thus a and p in the table above mean a:a and p:p, respectively. The first member of a pair is a lexical symbol, the second is a surface symbol. The rows are labeled with the name of the state (an integer) followed by a colon if the state is final and a period if it is nonfinal. At the head of each column is the representative first member of a class of symbol pairs that have the transitions indicated by the column. For example, the number 3 in the first row of the N:m column in the above table means that there is a transition from state 1 (a final state) to state 3 (a nonfinal state) labeled N:m.

Only some of the transitions are shown explicitly; the rest can be inferred from the listing of equivalence classes at the bottom of the table. It shows that N:m is in a class by itself, but the p column also represents the transitions of p:m. The equivalence class headed by a contains most of the symbol pairs in the alphabet. Thus the number 1 in the first row of the a column means that there is a transition from state 1 back to itself labeled a; and because a stands for the whole equivalence class (a b c ... x y z), the transducer will react to all these symbol pairs in the same way.

The absence of a transition means that the transducer fails in that state on that input. For example, if the transducer is in state 3, a p or p:m must follow in the input string because the state is nonfinal and there are no transitions for other input pairs.

A conventional state diagram for the same transducer is shown below.

In commands that require an argument, such as a file name, it is possible to provide the argument without waiting for the prompt. To see the transducer for the second rule, one can just type

(7)

Command (C-Q = quit): show "p realized as m"
   a m p p:m
1: 1 2 1
2: 1 2    2
Equivalence classes:
((a b c d e f g h i j k l n o q r s t u v w x y z N:n) (m N:m) (p)
 (p:m))
This table, too, is equivalent to a conventional state diagram.

The command `list-rules' prints out just the rule names and, after compilation, the sizes of the corresponding transducer tables.

(8)

Command (C-Q = quit): list-rules
"N realized as m" 3 x 4
"p realized as m" 2 x 4
The command 'lex-test' prompts for a lexical string and applies the transducers to it, returning any surface strings that are generated and the mapping from the lexical to the surface string as a listing of symbol pairs.

(9)

Command (C-Q = quit): lex-test

Lexical string (q = quit): kaNpat
                           kammat
k
a
N:m
p:m
a
t
The command 'pair-test' prompts for both a lexical string and a surface string; these paired strings, which must contain an equal number of symbols, are then run through the transducers, which will either accept or reject them.

(10)

Command (C-Q = quit): pair-test

Lexical string (q = quit): kaNpat
Surface string (q = quit): kammat
k
a
N:m
p:m
a
t
ACCEPTED

Lexical string (q = quit): kaNpat
Surface string (q = quit): kampat
k
a
N:m
p
REJECTED: "p realized as m" fails in state 2.
The results of the compilation can be saved in two formats, with the commands 'save-binary' and 'save-tabular'. The command 'save-binary' produces a compact binary file from which the automata can be reloaded into the compiler for testing or intersection; 'save-tabular' makes a simple ASCII file that can be read in by Kimmo Koskenniemi's two-l program for morphological analysis. We expect that future versions of PC-KIMMO [7] will also accept the same tabular format.

Both save-commands first prompt the user for the name of the output file:

(11)

Command (C-Q = quit): save-tabular
Output file: rules.aut
Writing "rules.aut"
Command (C-Q = quit):
The resulting tabular file looks like this:


(12)
ALPHABET
  N a b c d e f g h i j k l m n o p q r s t u v w x y z
NULL 0
END

AUTOMATA
"N realized as m" 3 4
1: 1 1 3 2 
2: 1 0 3 2 
3. 0 1 0 0 

"p realized as m" 2 4
1: 1 2 1 0 
2: 1 2 0 2 

ALIGNMENT
  a a 1 1
  b b 1 1
  ...           [omitting alignments of b:b, c:c, etc.]
  z z 1 1
  N m 3 2
  N n 4 1
  p m 2 4
END

Transition tables in a tabular file are the same as those that 'show' and 'show-rules' display on the screen, except that 0 rather than a blank is used to indicate that there is no transition for a symbol. Equivalence classes are not explicitly shown. The alignment table at the end of the file indicates, for every pair in the alphabet, which column in each transducer applies to that pair. For example, the line

N m 3 2
means that the transitions for the pair N:m are in the third column of the first transducer and in the second column of the second transducer. Similarly, the line p m 2 4 records that p:m falls under the second column (the p column) of the first transducer and has a column of its own, the fourth, in the second transducer.

If the number of rules is not too large and the transducers are not big, it may be advantageous to combine them into a single transducer. This is what the command 'intersect' does:

(13)

Command (C-Q = quit): intersect
Intersect all rules? (y/n) [y]:
Name the intersection:
Composing common alphabet...Expanding...Intersecting...
4 states, 6 equivalence classes, 15 arcs.
Command (C-Q = quit):
The user has the option of selecting a set of rules to be intersected or applying the operation to all the rules. The default is to intersect all rules. Another option is to choose a name for the result of the intersection. If the user simply hits a carriage return in response to both prompts, as in the example above, the result is called "Unnamed 1".

Intersected rules can be displayed and saved with the same commands as the input automata:

(14)

Command (C-Q = quit): show-rules
"Unnamed 1"
   a m p N:m N:n p:m
1: 1 3 1  2   4
2.                3
3: 1 3    2   4   3
4: 1 3    2   4
 Equivalence classes:
((a b c d e f g h i j k l n o q r s t u v w x y z) (m) (p) (N:m) 
(N:n) (p:m))
This single transducer describes the same relation as the original rules operating in tandem, but we can no longer see which individual rule causes the rejection of an incorrect pair:

(15)

Command (C-Q = quit): pair-test

Lexical string (q = quit): kaNpat
Surface string (q = quit): kampat
k
a
N:m
p
REJECTED: Rule "Unnamed 1" fails in state 2.
The intersection of two or more rule transducers typically has more states than any of its components; in the worst case, its size is the product of the sizes of the inputs. Here the 3-state and 2-state rules yield a 4-state intersection, well under the worst-case 6 states. The processing time required to run a transducer during analysis or generation is roughly constant per input symbol and independent of the size; thus, if it is feasible to intersect some rules, the resulting system is faster.

Appendix 2 illustrates the use of the compiler on a larger and more sophisticated set of rules.

3. Two-level grammars

The compiler expects as its input a two-level grammar. A two-level grammar is a system of rules that constrain the surface realization of lexical forms, hence the name two-level rule. Two-level rules are parallel constraints; a pairing of a lexical form with a surface form is valid only if it satisfies all the rules. In principle, the grammar could consist of just a set of rules and a specification of a pair alphabet, but in practice it is useful to introduce some auxiliary notions. In this section we first describe the general format of a two-level grammar and then focus on some important details.

3.1 General format

A two-level grammar consists of sections that are headed by a keyword. Every grammar must start with an Alphabet declaration and end with Rules. In addition, there may be three optional sections headed by the keywords Diacritics, Sets, and Definitions. Comments can be freely inserted anywhere: anything that follows an exclamation mark, up to the end of the line, is treated as a comment and ignored. If ! or some other special symbol is used as an ordinary character, it must be immediately preceded by a %.
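
For instance, in any section of the grammar,

    %! %; %%

denote a literal exclamation mark, a literal semicolon, and a literal percent sign, respectively.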

The Alphabet section is a list of elements (single symbols, complete or incomplete pairs) terminated with a semicolon. The rule compiler needs to know all symbols in the lexical and surface alphabets and all possible correspondences. It can deduce much of that information directly from the rules, but if some symbol or correspondence is not explicitly mentioned elsewhere, it must be declared in the Alphabet section. For example:

(16)

Alphabet
  a b:0 0:c :d e: ;
This declaration tells the compiler that a (= a:a), b:0, and 0:c are possible pairs. It does not constrain them in any way. If b and b:0 are possible pairs and there are no rules that require b to be realized only one way in some contexts, then b can be deleted or realized as b anywhere. The incomplete pair :d tells the compiler that d is a possible surface symbol but does not indicate what its lexical counterparts are; e: stands for pairs that have an e on the lexical side. Because incomplete pairs like :d and e: in the Alphabet section are meaningful only if there are complete pairs of that type declared elsewhere, it is not necessary to mention them, although it may be a useful reminder for the rule writer.

The optional Diacritics section is a list of single lexical symbols terminated with a semicolon. The compiler converts them to a list of diacritic pairs with zero surface realization. Diacritics are typically used as features that force or block the application of particular rules. All rules that do not specifically refer to a diacritic allow that diacritic to occur freely without affecting the context that the rule specifies. For example, the declaration

(17)

Diacritics
  ' @ ;
tells the compiler that ':0 and @:0 are possible pairs but are irrelevant to any rule that doesn't mention them explicitly.
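
To illustrate with a schematic example: given the declaration above and a rule such as

    k:0 <=> Vowel _ Vowel ;

the pair ':0 may intervene anywhere in the context without disturbing the rule. A lexical k in a string like ak'a (a made-up form) still counts as occurring between vowels, because the rule does not mention ' and therefore ignores it.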

The optional Sets declaration serves the same purpose as distinctive features in generative phonology: it enables the user to define classes of symbols that are equivalent in some respect. For example, it is often useful to define a set of vowels and a set of consonants:

(18)

Sets
  Vowel = a e i o u;
  Cons = b c d f g h j k l m n p q r s t v x y w z ;
  VclessStop = k p t ;
  Liquid = l r ;
A symbol that has been defined as a set name can be used in a rule in place of the union of its members. For example,

(19)

k:0 <=> Vowel _ Vowel ;
is equivalent to

(20)

k:0 <=> {a | e | i | o | u}  _  {a | e | i | o | u} ;
Similarly, Vowel:0 means {a:0 | e:0 | i:0 | o:0 | u:0}, assuming that all five pairs occur somewhere in the grammar (more about that in Section 3.6).

The optional Definitions section allows the user to write regular expressions to create and name two-level relations. For example,

(21)

Definitions
  ClosedOffset = :Cons {:Cons | #:} ;
defines ClosedOffset as a relation consisting of a surface consonant followed either by a second surface consonant or by a word boundary on the lexical side. When ClosedOffset is found in a rule, the compiler interprets it as a reference to a relation that it needs to compute only once. From the grammar writer's point of view, definitions can be thought of and used like macros.
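
For example, once ClosedOffset has been defined, it can appear in a context like any other regular expression:

    k:0 <=> Vowel _ Vowel ClosedOffset ;

The rules in (22) below use it in just this way.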

The Rules section is the last part of a two-level grammar:

(22)

Rules

  "Consonant gradation"   ! Single k, p, and t are realized as
                          ! zero, v, and d, respectively, in the
                          ! beginning of a closed syllable.

     Cx:Cy <=>  h | Liquid | Vowel: _ Vowel ClosedOffset;
                where Cx in (k p t)
                      Cy in (0 v d)
                matched ;
   
  "Geminate gradation"   ! Geminate stops are shortened to
                         ! single stops in a closed syllable.

     Cx:0 <=>  Cx _ Vowel ClosedOffset ;
               where Cx in VclessStop ;
Each rule must have (1) a name, written within double quotes, (2) a lexical:surface correspondence like k:0 that is constrained by the rule, (3) a rule operator that defines the nature of the constraint, (4) one or more contexts, divided between a left and a right context by an underscore, and, optionally (5) a where-clause that specifies the assignment of values to variables (if any). Each context and the optional where clause must be terminated with a semicolon. Although the formalism does not require it, in practice it is useful to include a couple of comment lines (starting with !) to explain what the rule is intended to do. If a semicolon, underscore or exclamation point is used in a rule as an ordinary symbol, it must be prefixed with %, the escape character. The same goes, of course, for the escape character itself.

3.2 Regular Expressions

Like the definitions, the correspondence and context parts of rules are written as regular expressions of symbol pairs. Symbols that occur as the first member of some pair constitute the lexical alphabet; the set of second members is called the surface alphabet. The total pair alphabet of a grammar is called the set of feasible pairs. Because all symbols in two-level rules denote pairs (for convenience, x:x is written as x), the regular expressions of a two-level grammar designate equal-length relations rather than simple string languages. The correspondence part typically consists of a single pair, but it can be a more complicated expression, for example, a union of pairs. This section reviews the syntax of regular expressions recognized by the compiler and the regular relations over symbol pairs that they denote.

3.2.1 Brackets and Parentheses

Enclosing a regular expression in a pair of matching brackets yields an equivalent expression: a:b, [a:b], [[a:b]], etc. denote the relation consisting of the single pair a:b. An empty pair of brackets, [], denotes the empty relation that maps the empty string to itself.

Round parentheses indicate optionality. (A) denotes the union of the relation A with the empty relation.
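
For example, the optional pair (a:b) is equivalent to the union

    [a:b | []]

which contains both a:b and the pairing of the empty string with itself.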

Curly brackets, { }, can be used as an alternative to square brackets to form a union of regular relations. Union and other Boolean operators are discussed in section 3.2.4.

If brackets or parentheses are used as ordinary symbols, they must be prefixed with a percent sign; for example, %[ is the relation that maps the left bracket to itself.

3.2.2 Special symbols

The numeral 0 (zero) is special when it does not constitute a part of some larger symbol (as in 007). It is the conventional two-level counterpart of the epsilon symbol which stands for the empty string. But for technical reasons 0 is different from epsilon in that it is treated like any other symbol during the compilation of two-level rules to transducers. In the two-level calculus, all regular expressions must denote pairs of strings that are of the same length, thus in a:0 both the lexical side of the relation, a, and the surface side, 0, are one-element strings. This is important because only equal-length regular relations (see [5]) are closed under intersection and negation, which are essential operations in the compilation of two-level rules to transducers. The interpretation of zero as the empty string takes place only in the application of the resulting transducers to input strings. A zero that is prefixed with a percent sign, %0, is interpreted as an ordinary numeral.
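
For example, under this convention the expression

    a:0 0:b

pairs the two-symbol lexical string a0 with the two-symbol surface string 0b. Only when the resulting transducer is applied are the zeros read as empty strings, so the net effect is to relate lexical a to surface b.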

In the two-level calculus, the question mark ? stands for the union of all known and unknown feasible pairs; it is called the other symbol. When a regular expression containing ? is compiled into an automaton, the interpretation of ? is dynamically updated to reflect the changes in the pair alphabet. For example, in the course of converting the sequence

(23)

? a:b c:d
to a transducer from left to right, the first member of the concatenation starts out just as other and becomes the union of other with a:b and c:d as the interpretation proceeds. An equivalent expression for the same relation is

(24)

[ ? | a:b | c:d] a:b c:d
where the brackets and vertical bars mark the union. A question mark prefixed with a percent sign, %?, is interpreted as a question mark without a special meaning.

The current implementation of the compiler assumes that the pair alphabet is completely specified by the Alphabet declaration in conjunction with the rules. When the total pair alphabet becomes known, the compiler removes all remaining instances of ? from the automata. Consequently, the transducers fail to apply to strings that contain unknown symbols. Because this has proved inconvenient in practice, future versions of the compiler may retain ? in output tables as a place holder for new symbols.

For historical reasons, and for backward compatibility with older rule systems, the compiler also recognizes the equal sign, =, as a special any symbol. The interpretation of = as a single symbol is the union of all feasible pairs. It is like ? except that ? also potentially includes unknown pairs. Symbol pairs in which = appears on one side are interpreted in the same manner as incomplete pairs (see next section). For example, a: is equivalent to a:=.

The symbol # is commonly used to indicate a lexical word boundary. Languages typically have rules that refer to the beginning or the end of words. The compiler does not need to give any special treatment to #, but other software that uses the results of the rule compiler may have a built-in assumption that words are flanked by #.
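
For example, a rule that confines a correspondence to word-final position can place #: at the edge of the right context, as in this subrule of the Final Devoicing rule in (40) below:

    b:p <=> _ #: ;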

3.2.3 Incomplete pairs

It is possible to leave one side of a symbol pair unspecified. Such one-sided pairs are interpreted as the union of all feasible pairs which contain the specified symbol on the appropriate side. For example, if k appears on the lexical side only in the feasible pairs k, k:0, k:v and k:j, then the one-sided symbol k: is implicitly defined as the union of these four relations. Similarly, :k denotes the union of all feasible pairs that have k as the surface symbol.

The equal sign may be used as a filler in a partially specified pair: k:= and =:k are equivalent to k: and :k, respectively. It is also legal, although possibly confusing, to use a plain colon to stand for a completely unspecified pair, denoting the union of all feasible pairs, which is what a lone = also means.

3.2.4 Complex expressions

Complex relations can be defined by means of operators. The basic operations of the two-level calculus are (1) concatenation, (2) union, (3) intersection, (4) minus, (5) ignore, (6) complementation, (7) term complementation, (8) containment, and (9) iteration. (1)-(5) are binary operations, (6)-(9) apply to a single argument.

The operators are listed here in the order of decreasing priority. The unary operators (6)-(9) combine with their argument before any of the binary operators are applied. Prefix operators (6)-(8) have priority over suffix operators (9). Concatenation has priority over Boolean operators (2)-(4) but not over the ignore operator (5).

complement (~)

~A
is the relation that includes everything that is not in the relation A, taking into account the pairs that actually appear in the grammar. For example, ~a:b denotes the relation that contains everything over the total pair alphabet except the pair a:b. That includes the empty relation and pairings of any length. ~A is equivalent to ?* - A. Only equal-length regular relations are closed under complementation.

term complement (\)

\A
is the union of all feasible pairs excluding pairs that belong to the relation A. \A is equivalent to ? - A. In writing two-level rules, \ is in general a more useful operator than ~ because it does not include the empty relation or pairings longer than one. \a:b corresponds more closely than ~a:b to what "not a:b" intuitively means: any feasible pair except a:b.

containment ($)

$A
is the union of all relations that contain some member of A either as such or as a part of some larger concatenation of feasible pairs. $a:b is equivalent to ?* a:b ?*.

Kleene star (*)

A*
is zero or more iterated concatenations of A to itself. On each iteration, every member of A is concatenated with every member in the result derived so far. A* includes the empty relation. ?* denotes the universal relation that includes all relations. ~[?*] is the null relation. (Brackets are needed here because prefix operators take precedence over suffix operators; ~?* is equivalent to [~?]*.)

Kleene plus (+)

A+
is one or more iterated concatenations of A to itself. On each iteration, every member of A is concatenated with every member in the result derived so far. (A+) is equivalent to A*. ?+ includes all nonempty relations.

ignore (/)

A/B
is the relation consisting of members of A with zero or more insertions from B appearing freely. For example, if A is a:b c:d and B is x:0, then A/B is the infinite relation x:0* a:b x:0* c:d x:0*. Diacritics are implemented using ignore.

concatenation ( )

A B
is the result of forming all pairwise linkages from the end of members of A to the beginning of members of B. There is no explicit operator for concatenation, just the line-up of the arguments. Because the operation is associative, no grouping brackets are needed to form n-ary concatenations. All unary operators and the ignore operator take precedence over concatenation; thus A B/C means A [B/C] rather than [A B]/C.

union (|)

A | B
is the relation that consists of everything that is in the relation A or in the relation B. A union may be surrounded by curly brackets instead of square brackets: {A | B} and [A | B] are equivalent. Because union is an associative operation, any number of operands may be grouped together without pairwise bracketing: A | B | C is equivalent to both [A | B] | C and A | [B | C]. Union has lower priority than concatenation: A | B C is equivalent to A | [B C] rather than [A | B] C.

intersection (&)

A & B
is the relation that consists of everything that is in A as well as in B. Boolean connectives (union, intersection, and minus) have the same priority and are interpreted from left to right. Thus A | B & C is equivalent to [A | B] & C rather than A | [B & C]. Only equal-length regular relations are closed under intersection. Because 0 is not a true epsilon in the two-level calculus, DeMorgan's laws are valid: ~[A | B] is equivalent to ~A & ~B, and ~[A & B] to ~A | ~B. Concatenation takes precedence over intersection: A & B C is equivalent to A & [B C] rather than [A & B] C.

minus (-)

A - B
is the subrelation of A that is not included in B. A - B is equivalent to A & ~B. Minus is weaker than concatenation: A - B C is equivalent to A - [B C] rather than [A - B] C. In writing two-level rules, it may be useful to know that conditions that pertain to the beginning or end of a relation can be expressed by using minus. For example, "any nonempty relation not ending in A" can be written as ?+ - ?* A.
If a regular expression operator is used as an ordinary symbol, it must be prefixed with the escape character, the percent sign. For example, %+:0 might be used to indicate a lexical morpheme boundary with zero surface realization.
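
As a combined illustration of these priorities, consider the unbracketed expression

    \a:b c:d* | e:f

(an illustrative expression, not from any grammar above). It is interpreted as [[\a:b] [c:d]*] | e:f: the term complement and the iteration bind most tightly, concatenation next, and union last.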

3.2.5 Some equivalences

The following additional equivalences may be useful in learning to understand the semantics of two-level regular expressions.

[] = ([]) Because [] denotes the empty relation, ([]) is the union of the empty relation with itself.
\[] = ? The term complement of the empty relation consists of the union of all feasible pairs.
~[] = ?+ The complement of the empty relation is the set of all nonempty relations.
$? = ?+ Yet another way of describing the class of nonempty relations.
$[] = ?* The class of relations that contain the empty relation is the same as zero or more concatenations of all possible pairs.
~$[] = ~[?*] Two equivalent ways of describing the null relation, that is, a relation that does not contain anything at all, not even the pairing of an empty string with itself.
A | ~$[] = A The null relation is the identity element for union.
A & ?* = A The universal relation is the identity for intersection.
A - ~$[] = A The null relation is the right-sided identity for minus.
A ~$[] = ~$[] Concatenating anything to the null relation (on either side) yields the null relation.
A [] = A The empty relation is the identity for concatenation.
?* ?* = ?* Concatenating the universal relation to itself is vacuous.
(A) ?* = ?* Concatenating an optional element to the universal relation (on either side) is vacuous because (A) includes the empty relation.
~[?+] ?* = ?* Concatenating the complement of any nonempty relation to the universal relation (on either side) is vacuous for the same reason.

In writing two-level rules, it is important to keep the last two equivalences in mind. Optional relations and negated nonempty relations at the outer edges of a context specification are meaningless because all contexts are extended by concatenating the universal relation to the beginning of the left context and to the end of the right context. A rule written as

(25)

A <=> B _ C ;
in fact means

(26)

A <=> ?* B _ C ?* ;

3.3 Rule operators

A two-level rule is a deontic statement to the effect that a certain relation, expressed by the correspondence part of the rule, is restricted to, required in, or prohibited in a certain environment. The modal force of the rule is expressed by the rule operator. The following simple rules illustrate the four rule operators introduced by Koskenniemi [1].

Table 1

Operator   Example           Translation
  <=       a:b <=  c _ d     a is always realized as b in the context c _ d.
  =>       a:b =>  c _ d     a is realized as b only in the context c _ d.
  <=>      a:b <=> c _ d     a is realized as b in c _ d and nowhere else.
  /<=      a:b /<= c _ d     a is never realized as b in the context c _ d.

The meaning of simple two-level rules that do not involve multiple contexts or variables can be explained easily in terms of the regular expression calculus described in section 3.2.

The prohibition rule,

(27)

a:b /<= c _ d ;
disallows the appearance of a:b when preceded by c and followed by d. A regular expression stating this condition is

(28)

~[?* c a:b d ?*]
Note the extension of the context with the universal relation on both sides.

The left-arrow rule

(29)

a:b <=  c _ d ;
requires that a lexical a be realized as b when preceded by c and followed by d. It fails if a lexical a has some other surface realization in this context. A regular expression to that effect is

(30)

~[?* c [a: - a:b] d ?*]
There is an asymmetry in that the rule does not require a surface b in this environment to have a lexical a as its counterpart. This shows up in the expression [a: - a:b] that denotes the union of all pairs that have a on the lexical side but something other than b on the surface side.

The right arrow rule

(31)

a:b =>  c _ d ;

restricts the occurrence of a:b to contexts in which it is preceded by c and followed by d. A regular expression for this constraint is a bit more convoluted than in the case of /<= and <= rules. The first disjunct below describes the situation in which c does not precede a:b, the second the situation in which d does not follow a:b; the entire expression states that neither of these bad situations is allowed.

(32)

~[[~[?* c] a:b ?*] | [?* a:b ~[d ?*]]]
The double arrow rule

(33)

a:b <=> c _ d ;

is simply a combination of the left and right arrow rules. It requires the realization of a as b in this context and allows it only there. The corresponding regular expression is the conjunction of the formulas expressing the <= and => aspects of the constraint.
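
Spelled out, it is the intersection of the two preceding expressions:

    ~[?* c [a: - a:b] d ?*]  &  ~[[~[?* c] a:b ?*] | [?* a:b ~[d ?*]]]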

3.4 Contexts

A two-level rule can have any number of contexts. In the compilation of <= and /<= rules (and the <= side of double arrow rules), multiple contexts are treated conjunctively.

(34)

a:b <= c _ d ;
       e _ f ;
means that a must be realized as b in the context c _ d and in the context e _ f. The corresponding /<= rule prohibits the occurrence of a:b in both contexts.
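
Following the one-context formula in section 3.3, a sketch of the corresponding regular expression is the conjunction

    ~[?* c [a: - a:b] d ?*] & ~[?* e [a: - a:b] f ?*]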

In => rules, multiple contexts are read disjunctively.

(35)

a:b => c _ d ;
       e _ f ;
says that a:b may occur only in the context c _ d or in the context e _ f. The compilation of right-arrow rules with multiple contexts is more complicated than the formula for the one-context rule in section 3.3 suggests because the relation that is regulated may itself be involved in context specifications. For example, the rule

(36)

a:b =>    _ :b ;
       a: _    ;
allows the realization of the lexical string aa as bb:

(37)

   Lexical:  a   a
   Surface:  b   b

The first a:b correspondence is allowed because it precedes b on the surface side, as the first context stipulates; the second a:b correspondence is allowed because it follows an a on the lexical side, as required by the second context. Deriving the correct transducer from rules of this sort is quite complicated. See [8] for details.

Because two-level rules can refer to both lexical and surface contexts, they can describe phenomena that in classical rewriting systems require ordered rules. For example, the two rules in Example (3) in section 2:

(3)

  "N realized as m"   ! Before a lexical "p." Always, and only there.
    N:m <=> _ p:  ;

"p realized as m" ! After a surface "m". Always and only there. p:m <=> :m _ ;

have the effect of realizing the lexical string kaNpan as kamman when applied in parallel:

   Lexical:    k   a   N   p   a   n
   Rule (6):  1   1   1   3   1   1   1
   Rule (7):  1   1   1   2   2   1   1
   Surface:    k   a   m   m   a   n

The numbers in the middle trace the state transitions in the transducers (6) and (7) compiled from the two rules in section 2 as they process the lexical-surface matches in this example. The corresponding rewrite rules,

(38)

N -> m  /   _  p
p -> m  / m _
would have to be applied sequentially to generate the same surface forms (kaNpan --> kampan --> kamman). See [6] for more examples and discussion of this aspect of the two-level rule formalism.

3.5 Variables

A two-level rule may contain any number of variables whose values range over simple symbols or set names. Variables and the assignment of values to them are specified in a where clause that follows the last context of the rule. A where clause consists of (1) the keyword where followed by (2) one or more variable declarations, and (3) an optional keyword (matched or freely) that specifies the mode of value assignment. A variable declaration contains (1) a variable name, (2) the keyword in, and (3) a range of values. The range may be either a set name or a list enclosed in parentheses containing symbols or set names. For example,

(39)

  "Final Devoicing"

     Cx:Cy <=> _    #: ;
              where Cx in (b d c  g)
                    Cy in (p t c2 k)
           matched ;
In this rule, the two variables, Cx and Cy, each have four possible values. The keyword matched means that the assignment of values to the two variables is done in tandem so that when Cx takes its n'th value, Cy has its n'th value. The compiler interprets this type of rule as a conjunction of four independent subrules:

(40)

        b:p  <=> _ #: ;
        d:t  <=> _ #: ;
        c:c2 <=> _ #: ;
        g:k  <=> _ #: ;
Subrules are compiled individually and the resulting transducers are intersected.

If the keyword matched is not present at the end of a where clause (or the keyword freely is used instead), the assignment of values to variables is not coordinated. In this case, values are assigned freely to the two variables, creating 16 subrules with all possible pairings of values from the two sets.
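
For instance, a schematic where clause of the form

    where Cx in (b d)
          Cy in (p t)
    freely ;

would pair each value of Cx with each value of Cy, creating the four subrules for b:p, b:t, d:p, and d:t.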

Matched variables are also convenient when there is a dependence between the correspondence and context parts of the rule. A case of that sort is a rule that creates a geminate consonant at a morpheme boundary:

(41)

"Gemination"

  %+:Cx <=> Cy _ Vowel ;
            where Cx in (b k d f g k l m n p r s t v z)
                  Cy in (b c d f g k l m n p r s t v z)
            matched ;
For example, the rule would realize lexical forms such as big+er, picnic+ed, and yak+ing as the appropriate English surface forms bigger, picnicked, and yakking (but... see section 4.2).

Variables can also be used to encode dependences between left and right parts of the context. For example, if HighLabial is defined as

(42)

HighLabial = u y ;
the following rule realizes k as v in two contexts:

(43)

"Gradation of k between u/y"   ! k weakens to v between u's
                               ! and y's.
  k:v <=> Cons Vx _ Vx ClosedOffset ;
          where Vx in HighLabial ;
Here no variable occurs in the correspondence part of the rule; the compiler only needs to expand the context part:

(44)

     k:v <=> Cons u _ u ClosedOffset ;
             Cons y _ y ClosedOffset ;
Because the rule only contains one variable, the interpretation of (43) is the same regardless of whether the keyword matched is present or absent.

3.6 Inventory of feasible pairs

When a set is used as the range specification, the values are assigned from the set in the order they appear in the set declaration. A more complicated case is when a set name and an ordinary symbol appear in the same value list. Assuming the set definitions

(45)

BackVowel = a i4 o u ;
HiHarmony = 0 I ;
what is the meaning of the following harmony rule for Turkish?

(46)

"Back Harmony"

  Vx:Vy <=> :BackVowel :Cons* _   ;
            where Vx in (HiHarmony E)
                  Vy in (BackVowel a)
            matched ;

The compiler expands the "Back Harmony" rule to two subrules:

(47)

HiHarmony:BackVowel <=> :BackVowel :Cons* _       ;
        E:a         <=> :BackVowel :Cons* _       ;

The second subrule can be compiled straightforwardly but the first subrule raises the question of what interpretation should be assigned to the pair

HiHarmony:BackVowel
which has a set name on both sides. This same problem also arises in any simple rule that has a set name in the correspondence part.

One possibility would be to take the crossproduct of the two sets in question: 0:a, 0:i4, 0:o, 0:u, I:a, I:i4, I:o, I:u. However, the compiler follows a more conservative policy: every feasible pair must appear explicitly somewhere in the grammar. Pairs that do not appear in the alphabet declaration or in some rule are considered to be impossible. In the case at hand, it turns out that four of the eight pairs appear elsewhere in the grammar (0:i4, 0:u, I:i4, I:u) but the other four do not. Consequently, the compiler interprets the "Back Harmony" rule as

(48)

0:i4 | 0:u | I:i4 | I:u <=> :BackVowel :Cons* _  ;
            E:a         <=> :BackVowel :Cons* _  ;
Note that the E:a pair is assumed to be valid even if it does not appear anywhere else in the grammar because it arises directly from the interpretation of the where clause. Because the interpretation of "implicit definitions," such as E:, :a, HiHarmony:BackVowel, etc. depends on the inventory of feasible pairs, it is important to keep in mind that some pairs can arise as a result of interpreting where clauses.
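
If the full crossproduct were intended, the remedy would be to declare the missing pairs explicitly, for example by adding them to the Alphabet section (a sketch showing only the additions):

    Alphabet
      0:a 0:o I:a I:o ;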

4. Error checking

4.1 Simple errors

If there is a syntactic error in the grammar, such as a missing colon or an extra bracket, it will be detected and reported by `read-grammar' when the rules and declarations are parsed. The parser also performs some more sophisticated checks. It makes sure that sets and relation names are defined before they are used. It tries to determine that variables are used correctly. The `compile' command is not available if syntactic errors are detected in the parsing phase.

The `compile' routine makes further checks on definitions and variable assignments. For example, it reports an error if a term that has been defined as a relation appears on the upper or lower side of a symbol pair. If errors are detected in the initial stages, the compilation is aborted. When a rule has been compiled, the resulting transducer is checked to make sure that it does not completely block the appearance of some symbol pair or pairs. This type of error occurs, for example, in the following rule:

(49)

"Glottal stop insertion" ! Insert '?' before a vowel.
                         ! (This rule does not have the intended
                         ! effect.)
  0:%? <=    _ Vowel ;
The problem with this rule is that the left context is completely unspecified. This is not an error in rules that constrain the surface realization of some lexical symbol, but this rule requires the insertion of a surface element that corresponds to a zero on the lexical side. The rule requires that there be a 0:? pair before every vowel even if the vowel is already preceded by such a pair. This creates a need for an infinite regression of inserted glottal stops before every vowel. Consequently, the resulting transducer does not allow any vowels at all:

(50)

"Glottal stop insertion"
   b
1: 1
Equivalence classes:
((a e i o u) (b c d f g h j k l m n p q r s t v w x y z 0:?))
One equivalence class in the table has no transitions. This is a type of error that the compiler anticipates:

(51)

*** Error: Rule "Glottal stop insertion"
    is defective.  It disallows
    a e i o u
    Perhaps the rule requires an infinite progression of "0:?"
The correction that the user needs to make is to specify in the rule that one insertion of a glottal stop is enough:

(52)

0:%? <=   \0:%? _ Vowel ;
\0:%? denotes the union of all feasible pairs other than 0:%?.

Even if the compiler is not able to discover the reason for the error, it will complain if some class of symbol pairs becomes completely blocked by a rule.

4.2 Rule conflicts

Because two-level rules are statements about what is required, permitted, or prohibited and these statements all have equal force, it can easily happen that some rules are in conflict with one another. There are two basic types of conflicts. In a left-arrow conflict, two rules require a certain lexical symbol to be realized in two different ways in the same context. In a right-arrow conflict, different rules place contradictory constraints on contexts in which a certain correspondence is allowed.

The following pair of sandhi rules for Sanskrit is an example of a typical left-arrow conflict:

(53)

"8b. Visarga after a (change preceding a)"
  a:o <=> _ s: @ [ VoicedCons: | a: ] ;

"8c. Visarga after a (change following a)"
  a:' <=> a: s: @ _ ;
The two rules appear to have different contexts, but in fact they overlap in the case of lexical strings such as

(54)

a s @ a s @ a
According to rule 8b, the middle a should be realized as o; according to 8c, it should be realized as '. Because the rules are in conflict, lexical forms containing such strings (if they exist in the language) do not have any surface realization.

Some conflicts are more difficult to detect because they arise only under certain assignments of values to variables. On closer inspection, our Gemination rule, (41), discussed in section 3.5 turns out to harbor a right-arrow conflict.

(41)

"Gemination"

  %+:Cx <=> Cy _ Vowel ;
            where Cx in (b k d f g k l m n p r s t v z)
                  Cy in (b c d f g k l m n p r s t v z)
            matched ;

When the compiler expands the rule replacing Cx and Cy with their matching values, it creates, among others, the two subrules shown below:

(55)

  %+:k <=> c _ Vowel ;
  %+:k <=> k _ Vowel ;
The first subrule is for the inserted k in words like frolicked, the second for words like trekked. Because the morpheme boundary gets realized as k in both cases, the rules are in conflict: one allows the %+:k pair only in the context c _ Vowel, the second restricts it to k _ Vowel. The result is that frolic+ed and trek+ed do not get realized at all unless the right-arrow conflict is resolved.

One option is to remove the two conflicting cases from the general gemination rule and write a second rule especially for %+:k:

(56)

"K gemination"

  %+:k <=> c | k _ Vowel ;
It is also possible to leave the original rule as it is and let the compiler take care of the problem. This is the topic of the next section.

4.3 Conflict Resolution

The compiler can find all left- and right-arrow conflicts before a rule is compiled by comparing every one of its contexts and subrules with all other contexts and rules in the grammar. Whether the compiler does that depends on the setting of the `resolve' flag that the user can toggle (section 6). The default is to check, report and, if possible, resolve all conflicts. For example, the problem with the two sandhi rules, (53) in section 4.2, is duly noticed when the compiler takes up the first rule:

(57)

  "8b. Visarga after a (change preceding a)"
   _ s: @ [VoicedCons: | a:] ;
  a: s: @ _  ;
*** Warning: Unresolved <= conflict with respect to 'a:o' vs. 'a:''
    between "8b. Visarga after a (change preceding a)"
        and "8c. Visarga after a (change following a)"
The message identifies the conflicting contexts, the pair that is causing the problem, and the names of the rules involved.

In a case like this one, if it is necessary to eliminate the conflict, the rule writer has to decide which rule should have precedence in the disputed case and amend the context specifications in the appropriate way. There is no general principle that could be relied on to resolve the matter. But there are left-arrow conflicts that can be resolved in a principled way. A case in point is the conflict between the "Consonant gradation" rule, (22) in section 3.1, and the "Gradation of k between u/y", (43) in section 3.5:

(58)

"Consonant gradation"   ! Single k, p, and t are realized as
                        ! zero, v, and d, respectively, in the
                        ! beginning of a closed syllable.

  Cx:Cy <=>  h | Liquid | Vowel: _ Vowel ClosedOffset;
             where Cx in (k p t)
                   Cy in (0 v d)
          matched ;

"Gradation of k between u/y"  ! k weakens to v between u's and y's

  k:v <=> Cons Vx _ Vx ClosedOffset ;
          where Vx in HighLabial ;
The "Consonant gradation" rule expands to three subrules:

(59)

k:0 <=>  h | Liquid | Vowel: _ Vowel ClosedOffset ;
p:v <=>  h | Liquid | Vowel: _ Vowel ClosedOffset ;
t:d <=>  h | Liquid | Vowel: _ Vowel ClosedOffset ;
The k:v rule expands to

(60)

k:v <=> Cons u _ u ClosedOffset ;
        Cons y _ y ClosedOffset ;
The k:0 subrule of "Consonant gradation" is in obvious conflict with "Gradation of k between u/y" because lexical forms like pukun satisfy the context specification of both rules. Should k be deleted or realized as v in this context?

What makes this case different from the conflict between the Sanskrit rules is that the context of one rule is properly nested inside the context of the other. Both rules have ClosedOffset at the edge of the right context, a pair of identical vowels is a pair of vowels, and the left end of the k:v context, Cons, is a part of the universal relation that extends the left side of the k:0 rule. The relation between the two rule contexts is illustrated in Figure 4.

Viewed in this way, the k:v rule is obviously an exception to a more general rule about k which in turn is a subcase of a yet more general rule dealing with the alternation of single stops in Finnish. One expects a specific rule to take precedence over a general one, and that is the case here: in the domain of the k:v rule, k should be realized as v.

When the compiler sees a left-arrow conflict between two rules that are related in this way, it does not report it as an error but automatically gives precedence to the more specific rule. In the case at hand, it reports:

(61)

>>> Resolving a <= conflict with respect to 'k:0' vs. 'k:v'
    between "Consonant gradation"
        and "Gradation of k between u/y"
    by giving precedence to "Gradation of k between u/y"
If the more specific rule is a double-arrow rule, as is the case here, the resolution can be done by modifying the correspondence part of the general rule so that it also allows the realization required in the exceptional case. The first subrule of "Consonant gradation" is compiled as if it were

(62)

k:0 | k:v <=>  h | Liquid | Vowel: _ Vowel ClosedOffset ;
Weakening the general rule in this way when the specific rule has a => side does not lead to any problems. Because the k:v rule does not allow k to be realized as v outside its own domain, it does not matter if the general rule allows k to be realized as v or deleted in the wider context. If the specific rule is only a left-arrow rule, the conflict is resolved by subtracting its contexts from the contexts of the general rule.

Another common type of conflict that the compiler can resolve is illustrated by the two contradictory subrules of the gemination rule discussed in section 4.2:

(63)

%+:k <=> c _ Vowel ;
%+:k <=> k _ Vowel ;
By their right-arrow side, both of these rules constrain the realization of the morpheme boundary as k, but the contexts in which they allow it are completely disjoint: if %+ is preceded by c it is not preceded by k, and vice versa. When the compiler sees a conflict of this type, it takes the view that the rule writer does not intend to completely prohibit the %+:k realization but wants to allow it in both environments. In compiling the right-arrow sides of the two subrules, the compiler amends the rules so that they allow the %+:k pair in both contexts and reports:

(64)

>>> Resolving a => conflict with respect to '+:k'
    within "Gemination"
Besides the left-arrow conflict discussed above, our "Consonant gradation" rule is also involved in a right-arrow conflict with the "Geminate gradation" rule we presented in example (22), section 3.1:

(65)

"Geminate gradation"      ! Geminate stops are shortened to
                          ! single stops in a closed syllable.

  Cx:0 <=>  Cx _ Vowel ClosedOffset ;
            where Cx in VclessStop ;
One of the expansions of the Finnish geminate rule is

(66)

k:0 <=> k _ Vowel ClosedOffset ;
The => side of this rule clashes with the first subrule of the gradation rule for single stops which also constrains the deletion of k. But because the contexts are completely disjoint, the compiler eliminates the problem:

(67)

>>> Resolving a => conflict with respect to 'k:0'
    between "Consonant gradation"
        and "Geminate gradation"
In dealing with rule conflicts the compiler relies on two principles:

(1) Left-arrow conflicts are resolved if the context of one rule is completely subsumed by the context of the other. The more specific rule has precedence. Other left-arrow conflicts are reported.

(2) Right-arrow conflicts are always resolved.

Checking for conflicts and resolving them increases the compilation time, up to 30% in large rule systems, but it can save the rule writer from errors that take much longer to track down. It allows the option of writing rules that express broad generalizations even though there are specific exceptions. See Appendix 2 for more examples of conflict resolution.

5. Rule testing

The compiler provides two kinds of rule testing: 'lex-test' and 'lex-test-file' apply the rule transducers to lexical strings, generating surface strings; 'pair-test' and 'pair-test-file' apply the rule transducers to a lexical string and a surface string, either accepting or rejecting the pair. Both test modes can be valuable for debugging rules.

5.1 Applying rule transducers to lexical forms

Once rules have been read in using 'read-grammar' and compiled using 'compile', they can be tested using 'lex-test'. Previously compiled rules that have been saved to file using 'save-binary' can also be loaded for testing using 'install-binary'.

'lex-test' prompts for a lexical string, applies the rule transducers, and returns any surface strings that are generated. The following example uses the rules.txt grammar in example (3). When the user enters the lexical string kampan or kaNpan, the rule transducers generate the surface string kamman.

(68)

Command (C-Q = quit): read-grammar rules.txt
reading from 'rules.txt' ...
Alphabet... Rules...
  "N realized as m" "p realized as m"
Command (C-Q = quit): compile
--- output omitted --
Done.
Command (C-Q = quit): lex-test

Lexical string (q = quit): kaNpan
                           kamman
k
a
N:m
p:m
a
n
For grammars that refer to word-boundary characters, '#' is automatically added to each end of the lexical string. 'lex-test' finds and reports various errors, including undeclared lexical symbols and the absence of compiled transducers to test. The 'lex-test' facility is especially valuable for detecting overgeneration, when the rules generate illegal surface strings.

The list of lexical-surface pairs is helpful when the output string is related to the input in a more complicated way than in the cases we have seen so far. For example, a set of rules deriving past tense forms for irregular verbs in English might relate buy to bought in the following way:

(69)

Lexical string ('q' = quit): buy`+{Vpast}
                             bought
#:0
b
u:o
y:u
0:g
`:0
+:h
{Vpast}:t
#:0
Because 0's are suppressed in the lexical input and the surface output for convenience and legibility, only the pairwise listing shows the real mapping.

The command 'lex-test-file' works exactly like 'lex-test' except that the rule transducers are applied to lexical forms read in from a file. The user is prompted for the input file and output file; specifying the output file '-' (hyphen) causes the output to be written to the screen.

The input file for 'lex-test-file' should consist of lexical strings, one per line. Blank lines are ignored.

(70)

kaNpat
kaNmat
kampat
KaNkat

5.2 Checking lexical-surface string pairs

Compiled rules can also be tested in 'pair-test' mode, which prompts for both a lexical string and a surface string. If the rule transducers allow the pairing of the two strings, the program prints ACCEPTED; if the pairing of the strings is illegal, it prints out REJECTED, the name of the violated rule, and the state in which the rule failed. Because 'pair-test' also prints out any initial substrings up to and including the point of rejection, the exact input causing the failure is easily identified.

(71)

Command (C-Q = quit): pair-test

Lexical string (q = quit): kaNpat
Surface string (q = quit): kammat
k
a
N:m
p:m
a
t
ACCEPTED

Lexical string (q = quit): kaNpat
Surface string (q = quit): kanpat
k
a
N:n
p
REJECTED: "N realized as m" fails in state 2.

The lexical and surface strings must contain an equal number of symbols, including 0 (the two-level epsilon) where appropriate. To facilitate lining up the lexical and surface symbols, the strings can optionally be padded with spaces, which are ignored. For grammars that refer to word-boundary symbols, the pair #:0 is automatically added to both ends of the paired strings.

The 'pair-test' facility detects and reports various input errors, including the absence of compiled transducers, mismatched string lengths, undeclared alphabetic symbols, and illegal lexical-surface pairs. 'pair-test' is especially valuable for finding the cause of undergeneration, where some rule in the system is blocking an analysis or generation that is considered valid.
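
To see more concretely why every symbol, including 0, must be matched, consider the acceptance check itself: each lexical:surface pair is mapped to its equivalence class, and the transducer either has an arc for that class or rejects on the spot. The sketch below is again only an illustration, reusing the hypothetical Transducer type from the sketch in 5.1; pair_class is an assumed helper, not a compiler function, returning the equivalence class of a pair or -1 if the pair is not feasible.

/* Illustration only: acceptance check in the style of 'pair-test'
   for a single transducer, reusing the hypothetical Transducer
   type from the sketch in 5.1. */
int pair_class(int lex, int surf);   /* assumed helper: class or -1 */

int check_pair(const Transducer *t,
               const int *lex, const int *surf, int len)
{
    int state = 0;                   /* assume 0 is the initial state */
    for (int i = 0; i < len; i++) {
        int c = pair_class(lex[i], surf[i]);
        if (c < 0)
            return 0;                /* not a feasible pair: rejected */
        state = t->next[state * t->n_classes + c];
        if (state < 0)
            return 0;                /* rejected at pair i */
    }
    return t->final[state] != 0;     /* accepted iff a final state */
}

The program steps all the rule transducers in parallel; the REJECTED message in (71) names the violated rule and the state in which it failed.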

The command 'pair-test-file' works exactly like 'pair-test' except that the input is read from a file. The user is prompted for an input file and an output file. As with 'lex-test-file', specifying '-' (hyphen) as the output file causes the output to be written to the screen.

The input file for 'pair-test-file' should contain pairs of a lexical string followed on a separate line by a matching surface string, with one string per line. Blank lines are ignored. As with 'pair-test', the strings can optionally be filled out with spaces to allow vertical alignment of individual feasible pairs.

(72)

b u y 0 ` + {Vpast}
b o u g 0 h    t

b r i n g ` + {Vpast}
b r o u g 0 h    t

f i 0 g h t ` + {Vpast}
f o u g h t 0 0    0

6. Compiler options

6.1 Binary output

The results of the compilation can be saved as simple ASCII tables (`save-tabular') or in a more compact binary representation (`save-binary'). The binary encoding format is documented in [9]. If the data is saved in binary form, the transducers can later be read back in from the file with the command `install-binary', either for intersection or simply to display the tables on the screen. There is currently no `install-tabular' command.
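
Purely as an illustration of the save/install round trip, the following sketch writes out the hypothetical dense table used in the sketches of Section 5. It is not the compact encoding specified in [9], and save_table is not a compiler function; installing would read the same fields back in the same order.

/* Illustration only: the idea behind `save-binary', using the
   hypothetical Transducer type from the sketches in Section 5.
   This is NOT the binary encoding documented in [9]. */
#include <stdio.h>

int save_table(FILE *f, const Transducer *t)
{
    size_t cells = (size_t)t->n_states * t->n_classes;
    return fwrite(&t->n_states, sizeof t->n_states, 1, f) == 1
        && fwrite(&t->n_classes, sizeof t->n_classes, 1, f) == 1
        && fwrite(t->final, 1, (size_t)t->n_states, f) == (size_t)t->n_states
        && fwrite(t->next, sizeof *t->next, cells, f) == cells;
}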

6.2 Switches

The compiler has three on-off switches that the user can toggle: `resolve', `time', and `trace'. The command `switches' reports on the current status:

(73)

Command (C-Q = quit): switches
  Conflict resolution is ON.
  Timing is OFF.
  Verbose mode is OFF.
Command (C-Q = quit): 

These are the default settings. The user can change the setting of a switch by typing its name. `resolve' turns conflict resolution off if it is on, and vice versa.

6.2.1 Resolve

Turning off conflict resolution speeds up the compiler, but this is not recommended unless the rule system is very large and already known to be free of conflicts. Even with `resolve' turned off, the compiler still checks for gross errors of the type discussed in Section 4.1; it can catch errors within a single rule, but it does not compare rules against one another. Consequently, intersecting a set of rules compiled with `resolve' turned off can fail, because the intersection brings out any hidden rule conflicts.

6.2.2 Time

`time' prints out the execution time of the `read-grammar', `compile', and `intersect' commands.

6.2.3 Trace

If the `trace' flag (= verbose mode) is turned on for `read-grammar', the alphabet, rules and other sections of the grammar are displayed on the screen as they are read in. Regular expressions are printed with minimal bracketing as the compiler has interpreted them.

With the `trace' flag off, `compile' prints a status message as it moves from one compilation phase to the next, along with the name of the rule it is working on. During the last stage, when individual subrules are compiled, the correspondence part of the subrule and the operator are displayed on the screen. If the rule is a double-arrow rule, the left-arrow part is compiled first, then the right-arrow part, counting down the context numbers if there is more than one context; finally, the transducers for the two parts are intersected. For example, when the compiler starts to work on the "Gradation of k between u/y" rule in (59), which expands to one subrule with two contexts (60), it prints

k:v <=

when the left-arrow compilation starts, then `>' at the start of the right-arrow compilation, followed by 2 and 1 as the two contexts are processed. When the intersection begins, the line reads:

k:v <=> 2 1

In verbose mode, `compile' shows how rules that contain variables are expanded into subrules and subcontexts. The list of feasible pairs and the implicit definitions for underspecified symbols such as :Vowel are also shown. Before a rule is compiled, the compiler shows the partitioning of the alphabet into equivalence classes that it has inferred by inspecting the individual rule components. It may turn out during the compilation that some of the classes can be merged because their representative symbols prove to be equivalent after all. In that case, the adjusted classes are displayed at the end. If the rule is a double-arrow rule, the transducer tables for the <= and => compilation are shown separately before intersection.

6.3 Storage

The `storage' command lists the amount of memory the compiler is using for various internal data structures:

(74)

Command (C-Q = quit): storage

Memory in Use

  Strings: 1979 bytes
  RE cells: 3288 bytes
  Rules: 388 bytes
  Subrules: 476 bytes
  Contexts: 340 bytes
  Where clauses: 64 bytes
  Alphabets: 2182 bytes
  Networks: 21426 bytes

Total: 30143 bytes

The largest item on the list (Networks) is the space occupied by transducers at the end of the compilation. When a new grammar is read in, the previous data is expunged.

Acknowledgments

Thanks to Ronald M. Kaplan and Annie Zaenen for the many comments and suggestions that helped to improve the content and the structure of this paper.

References

[1] Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics. University of Helsinki. 1983.

[2] Chomsky, Noam and Morris Halle. The Sound Pattern of English. Harper and Row. New York. 1968.

[3] Johnson, C. Douglas. Formal Aspects of Phonological Description. Mouton. The Hague. 1972.

[4] Kaplan, Ronald M. and Martin Kay. Phonological rules and finite-state transducers [Abstract]. Linguistic Society of America Meeting Handbook. Fifty-sixth Annual Meeting, December 27-30, 1981. New York.

[5] Ritchie, Graeme. Languages Generated by Two-Level Morphological Rules. Computational Linguistics 18(1): 41-59. 1992.

[6] Karttunen, Lauri. Finite-State Constraints. In the Proceedings of the International Conference on Current Issues in Computational Linguistics, June 10-14, 1991. Universiti Sains Malaysia, Penang, Malaysia. 1991. To appear in The Last Phonological Rule, ed. by John Goldsmith, Chicago University Press.

[7] Antworth, Evan L. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16, Summer Institute of Linguistics, Dallas, Texas. 1990.

[8] Karttunen, Lauri, Kimmo Koskenniemi, and Ronald M. Kaplan. A Compiler for Two-level Phonological Rules. In Dalrymple, M. et al. Tools for Morphological Analysis. Center for the Study of Language and Information. Stanford University. Palo Alto. 1987.

[9] Karttunen, Lauri. Binary Encoding Format for Finite-State Networks. Technical Report SSL-90-17. Xerox Palo Alto Research Center. Palo Alto. March 1990.


APPENDIX 1

On-line help messages

"?" prints a list of available commands.
"help <command>" explains what <command> does.
"help all" prints out all help messages.

"banner"
gives information about the version and authors of the compiler.
"compile"
converts a rule file that has been parsed by a "read-grammar" command to finite-state transducers.
"install-binary"
restores transducers from a file created by a "save-binary" command.
"intersect"
combines the current set of transducers, or some selection thereof, into a single automaton.
"lex-test"
tests the compiled rules; you enter lexical strings, and the rules generate surface strings.
"lex-test-file"
tests the compiled rules; reads lexical strings from the input file, generates surface strings, and writes them to the output file.
"list-rules"
prints the names of the current rules and their sizes (# of states x # of equivalence classes) if the rules have been compiled.
"pair-test"
tests the compiled rules; you enter a lexical string and a surface string for acceptance or rejection.
"pair-test-file"
tests the compiled rules; reads pairs of lexical and surface strings from the input file for acceptance or rejection; results are written to the output file.
"read-grammar"
parses a file of two-level rules for the compiler.
"redo"
reads the most recently parsed rule file and recompiles the rules if compilation was done or attempted in the previous round.
"resolve"
is an on/off switch. When it is ON, the compiler looks for and resolves certain types of conflicts between rules.
"save-binary"
stores compiled, intersected, or installed transducers in a file in binary format.
"save-tabular"
stores compiled, intersected, or installed transducers in a file in tabular format.
"show <rule>"
displays the transducer of a compiled or installed <rule> in tabular format.
"show-rules"
displays all compiled or installed rules in tabular format.
"storage"
tallies the memory used by the compiler.
"switches"
displays the status of "resolve", "time", and "trace" switches.
"time"
is an on/off switch that provides timing information for some operations.
"trace"
is an on/off switch that provides information about intermediate stages of rule parsing and rule compilation.


APPENDIX 2

A sample grammar: Finnish consonant gradation

The purpose of this appendix is to present a two-level description for a problem that has a satisfactory solution within the two-level paradigm but enough complexity to make it a little challenging. The example chosen is Finnish consonant gradation.

Section 1 gives some linguistic background and data that the rules have to account for. Section 2 is a two-level grammar for the problem. Some of the rules were used as examples in the main part of the paper. Section 3 is a trace of a compiler run. It shows that this grammar relies heavily on the ability of the compiler to resolve rule conflicts. The intersection of the nine rules yields a single 64-state transducer. The timing statistics reflect the performance of the compiler on a Macintosh PowerBook 100.

1. Consonant gradation in Finnish

The term consonant gradation refers to a set of alternations that involve the voiceless stops k, p, and t in Finnish. Historically, their alternate "weak" forms occur in closed syllables, but in modern Finnish the conditions are in part morphological. The weakening of single and geminate consonants must originally have been a simple phonetic alternation, but the current pattern is more complex. Our grammar has seven general rules and two that deal with specific exceptional words. The rules abstract away from reality in that the alternation here is controlled only by the phonological environment. We also ignore the fact that many words that have recently entered the language are exempted from gradation.

The set of possible lexical:surface correspondences involving k, p, and t is listed below.

k   k:0   k:g   k:j   k:'
p   p:v   p:m
t   t:d   t:n   t:l   t:r

The surface g in the pair k:g represents a part of a geminate velar nasal, written as ng in standard Finnish orthography. The surface apostrophe is an orthographic marker indicating a syllable boundary.

The following table illustrates the alternations of single consonants and consonant clusters. The order of the examples corresponds to the order of the rules in the following section. The strong forms are singular partitive forms of nouns; the weak forms are singular genitives. The ik~j alternation involves just two words (aika time, poika boy); the vowel alternation in the weak grade of ruoka food is limited to just this word. This alternation is optional, hence the use of => as the operator in the final rule.

Strong grade    Weak grade    Gloss      Alternation
(stem + a)      (stem + n)

sikaa           sian          pig        k~0
papua           pavun         bean       p~v
sotaa           sodan         war        t~d

kukkaa          kukan         flower     kk~k
loppua          lopun         end        pp~p
mattoa          maton         carpet     tt~t

tiukua          tiu'un        chime      k~'
pukua           puvun         dress      k~v
kurkea          kurjen        crane      k~j

vankia          vangin        prisoner   nk~ng
kumpua          kummun        hill       mp~mm
rantaa          rannan        shore      nt~nn

iltaa           illan         evening    lt~ll
partaa          parran        beard      rt~rr

aikaa           ajan          time       ik~j      (exception)
ruokaa          ruoan/ruuan   food       uok~uo/uu (exception)

2. A two-level grammar for consonant gradation



! GRADATION.TXT -- a set of rules for Finnish consonant gradation
! Copyright (c) 1987 by Lauri Karttunen

Alphabet

   a b c d e f g h i j k l m n o p q r s t u v x y %{ %} w z ':0 :';

Sets

   VclessStop = k p t ;
   Cons = VclessStop b c d f g h j l m n ng r s v x z ' ;
   Vowel = a e i o u y %{ %} ;
   Labial =  p b m ;
   Velar = g k ;
   Liquid = l r ;
   Nasal = m n ;
   HighLabial = u y ;

Definitions

   ClosedOffset =  :Cons [Cons: | #: ] ;

Rules

  "Consonant gradation"    ! Single k, p, and t are realized as
                           ! zero, v, and d, respectively, in the
                           ! beginning of a closed syllable.

     Cx:Cy <=>  h | Liquid | Vowel: _ Vowel ClosedOffset;
                where Cx in (k p t)
                      Cy in (0 v d)
                matched ;
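
     ! With 'matched', the variables pair up elementwise, so this
     ! rule expands to three subrules (cf. the compilation trace
     ! in Section 3):
     !    k:0 <=>  h | Liquid | Vowel: _ Vowel ClosedOffset ;
     !    p:v <=>  h | Liquid | Vowel: _ Vowel ClosedOffset ;
     !    t:d <=>  h | Liquid | Vowel: _ Vowel ClosedOffset ;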

   "Geminate gradation"           ! Geminate stops are shortened to
                                  ! single stops in a closed syllable.

     Cx:0 <=>  Cx _ Vowel ClosedOffset ;
               where Cx in VclessStop ;


  "Gradation after nasals"  ! Stops assimilate to the preceding
                            ! nasal consonant in the weak grade.

     Cx:Cy <=> Cz _ Vowel ClosedOffset ;
               where Cx in (k p t)
                     Cy in (g m n)
                     Cz in (n m n)
               matched ;
 
  "Gradation of k after VV"   ! k weakens to a syllable boundary
                              ! marker between a pair of identical
                              ! vowels following another vowel

     k:' <=> Vowel Vx _ Vx ClosedOffset ; where Vx in Vowel ;

 
  "Gradation of k between u/y"   ! k weakens to v between u's
                                 ! and y's.

     k:v <=> Cons Vx _ Vx ClosedOffset ;
             where Vx in HighLabial ;

  "Gradation of k after liquids or h"   ! k weakens to j before e
                                        ! after a liquid or h.

     k:j <=> Liquid | h _ e: ClosedOffset ;

  "Gradation of t after liquids"   ! t assimilates to the preceding
                                   ! liquid in the weak grade.

     t:Cx  <=>  Cx _ Vowel ClosedOffset  ; where Cx in Liquid ;

  "Weak grade of poika, aika"   ! Exceptionally, poja.., aja..
                                ! instead of *poia.., *aia.. in the
                                ! weak grade of poika, aika.

     i:j <=> #: [p o | a ] _ k: a: ClosedOffset ;


  "Weak grade of ruoka"   ! Exceptionally, either ruoa.. or ruua..
                          ! in the weak grade of ruoka.
                          ! The operator is => rather than <=> because
                          ! o can also be realized as itself here.

     o:u => #: r u _ k: a: ClosedOffset ;

3. From rules to transducers

This log file was produced by running the compiler on a Macintosh PowerBook 100. On this low-end machine, the compilation phase takes 55 seconds; intersecting the resulting nine transducers into a single 64-state automaton takes 7 seconds. The minimum memory requirement for this job on a Macintosh is 650 kilobytes. On a SPARCstation 2, the compilation time is reduced to 3 seconds.


Command (C-Q = quit): read-grammar
Input file: gradation.txt
reading from 'gradation.txt'...
Alphabet... Sets... Definitions... Rules...
  "Consonant gradation" "Geminate gradation" "Gradation after
nasals"
  "Gradation of k after VV" "Gradation of k between u/y"
  "Gradation of k after liquids or h" "Gradation of t after
liquids"
  "Weak grade of poika, aika" "Weak grade of ruoka"
Command (C-Q = quit): time
  Timing is ON.
Command (C-Q = quit): compile
Compiling "gradation.txt"
Expanding... Analyzing...
Compiling definitions:
  "ClosedOffset"
Compiling rule components:
  "Consonant gradation" "Geminate gradation" "Gradation after
nasals"
  "Gradation of k after VV" "Gradation of k between u/y"
  "Gradation of k after liquids or h" "Gradation of t after
liquids"
  "Weak grade of poika, aika" "Weak grade of ruoka"
Compiling rules:
  3 cases of "Consonant gradation"
>>> Resolving a => conflict with respect to 'k:0'
    between "Consonant gradation"
        and "Geminate gradation"
>>> Resolving a <= conflict with respect to 'k:0' vs. 'k:''
    between "Consonant gradation"
        and "Gradation of k after VV"
    by giving precedence to "Gradation of k after VV"
>>> Resolving a <= conflict with respect to 'k:0' vs. 'k:v'
    between "Consonant gradation"
        and "Gradation of k between u/y"
    by giving precedence to "Gradation of k between u/y"
>>> Resolving a <= conflict with respect to 'k:0' vs. 'k:j'
    between "Consonant gradation"
        and "Gradation of k after liquids or h"
    by giving precedence to "Gradation of k after liquids or h"
>>> Resolving a <= conflict with respect to 't:d' vs. 't:l'
    between "Consonant gradation"
        and "Gradation of t after liquids"
    by giving precedence to "Gradation of t after liquids"
>>> Resolving a <= conflict with respect to 't:d' vs. 't:r'
    between "Consonant gradation"
        and "Gradation of t after liquids"
    by giving precedence to "Gradation of t after liquids"
    k:0 <=> 2 1
    p:v <=>
    t:d <=>
  3 cases of "Geminate gradation"
    k:0 <=> 2 1
    p:0 <=>
    t:0 <=>
  3 cases of "Gradation after nasals"
    k:g <=>
    p:m <=>
    t:n <=>
  "Gradation of k after VV"
    k:' <=> 8 7 6 5 4 3 2 1
  "Gradation of k between u/y"
    k:v <=> 2 1
  "Gradation of k after liquids or h"
    k:j <=>
  2 cases of "Gradation of t after liquids"
    t:l <=>
    t:r <=>
  "Weak grade of poika, aika"
    i:j <=>
  "Weak grade of ruoka"
    o:u =>
Done.
55 sec
Command (C-Q = quit): list-rules
"Consonant gradation" 13 x 11
"Geminate gradation" 18 x 15
"Gradation after nasals" 11 x 13
"Gradation of k after VV" 30 x 16
"Gradation of k between u/y" 19 x 9
"Gradation of k after liquids or h" 9 x 9
"Gradation of t after liquids" 11 x 11
"Weak grade of poika, aika" 12 x 11
"Weak grade of ruoka" 8 x 11
Command (C-Q = quit):
Command (C-Q = quit): show "Consonant gradation"

    a  b  h  k  p q i:j k:0 p:v p:0 #:0
 1: 2  1  2  6  1 1  2           1   1
 2: 2  1  2 13  9 1  2   3   3   9   1
 3. 4
 4.    5  8  7  5    8
 5.    1  2  6  1                1   1
 6: 2  1  2  6  1 1  2   3       1   1
 7.    1  2  6  1        3       1   1
 8.    1  2 13  9        3   3   9   1
 9: 10 1  2  6  1 1  2           1   1
10: 2 12 12 11 11 1  12  3   3   9   1
11: 10            1  2
12: 2             1  2
13: 10 1  2  6  1 1  2   3       1   1
Equivalence classes:
((a e i o u y { } o:u) (b c d f g j m n s v x z k:j k:v k:' t:l t:r)
 (h l r) (k) (p t k:g p:m t:n) (q w) (i:j) (k:0) (p:v t:d)
 (p:0 t:0) (#:0 ':0))
Command (C-Q = quit): intersect
Intersect all rules? (y/n) [y]: y
Name the intersection: Gradation
Composing common alphabet...Expanding...Intersecting...
64 states, 34 equivalence classes, 974 arcs.
7 sec
Command (C-Q = quit): list-rules
"Gradation" 64 x 34
Command (C-Q = quit):

4. Testing the result

Because the rules were intersected at the end of the previous section, 'lex-test' and 'pair-test' now operate on a single transducer:

Command (C-Q = quit): lex-test

Lexical string ('q' = quit): kurken
                             kurjen
#:0
k
u
r
k:j
e
n
#:0

Lexical string ('q' = quit): ruokan
                             ruuan
#:0
r
u
o:u
k:0
a
n
#:0

                             ruoan
#:0
r
u
o
k:0
a
n
#:0

Lexical string ('q' = quit): q

Command (C-Q = quit): pair-test

Lexical string ('q' = quit): pukun
Surface string ('q' = quit): puvun
#:0
p
u
k:v
u
n
#:0
ACCEPTED

Lexical string ('q' = quit): pukun
Surface string ('q' = quit): pu0un
#:0
p
u
k:0
u
REJECTED: "Gradation" fails in state 33.

Lexical string ('q' = quit): aputton
Surface string ('q' = quit): aput0on
#:0
a
p
u
t
t:0
REJECTED: "Gradation" fails in state 38.

Running pair tests on the result of an intersection is generally not useful because, if the output is not correct, it may be difficult to see which rule is responsible for the error. Our purpose here is just to demonstrate that the final result is correct for the Finnish data in Section 1.