Introduction to CFSM API - Stanford University

Lauri Karttunen
Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304
[email protected]
http://www.parc.xerox.com/istl/members/karttune

1 Introduction

This article documents the Application Programming Interface (API) to libcfsm, a collection of C utilities for creating and applying finite-state networks. The library is the core of several stand-alone applications such as xfst, lexc, lookup and tokenize, described in the 2003 book by Kenneth R. Beesley and Lauri Karttunen [1]. A CD containing the four applications came with the Beesley & Karttunen book. These tools are available under a license for research and teaching purposes. The libcfsm library also includes some of the facilities of fst, the “big sister” of the xfst application, that are not available in the book version.

The cfsm_api directory that this document is a part of has the following structure:

    cfsm_api/          top-level directory
    cfsm_api/doc       documentation
    cfsm_api/include   contains a single header file, cfsm_api.h, included as an appendix at the end of this document
    cfsm_api/linux32   contains a bin and a lib directory for 32-bit Linux
    cfsm_api/linux64   contains a bin and a lib directory for 64-bit Linux
    cfsm_api/macosx    contains a bin and a lib directory for MacOS X
    cfsm_api/solaris   contains a bin and a lib directory for Sun Solaris
    cfsm_api/windows   contains a bin and a lib directory for 32-bit Windows
    cfsm_api/src       contains the source code and a Makefile for six demo applications that illustrate the use of the cfsm API: hello, apply, pmatch, piglatin, tokenize and commands

The five platform directories each have an empty bin directory and a lib directory containing the static library libcfsm.a specific to the platform (libcfsm.lib on Windows). Dynamic libraries can be provided if they are needed. The Python and Ruby interfaces mentioned in Section 5 use dynamic libraries.

The 32-bit Linux library was compiled on a machine running RedHat 9; the 64-bit version was compiled under Fedora 6. The Solaris version of libcfsm.a was made on a Solaris 2.8 machine running SunOS 5.8. The Macintosh library was compiled on an Intel Macintosh running version 10.5.4 (Leopard) of MacOS X. The demo applications for the Mac compiled with the library are universal binaries. They will run on any Macintosh running version 10.3 or newer of MacOS X. The Windows version was built with Visual Studio 6.0.

2 Demo Applications

The header file, cfsm_api.h, tries to explain in comments the purpose of every data structure and function prototype it contains, but it is not an easy read. It is very long and written for the C compiler rather than for a human reader. The presentation is constrained by the compiler requirement that nothing can be referred to until it has been defined. For that reason the file begins with trivial type definitions for all kinds of constants and auxiliary data structures such as stacks and heaps that are of no interest to an application-level programmer, who mostly needs to know about the functions that come at the very end of the file.

In order to provide a gentle introduction to the cfsm library, the cfsm_api/src directory includes a subdirectory for each of six demo applications: hello, apply, pmatch, piglatin, tokenize and commands. Each subdirectory contains commented source code and a Makefile that will compile the particular application on any of the supported platforms. Some of the subdirectories also contain a test subdirectory with materials that are used for testing the application. We will discuss each of them briefly in order of increasing sophistication.

Any application using libcfsm must include the following bits of code:

    #include "cfsm_api.h"

    int main(int argc, char **argv)
    {
      FST_CNTXTptr cntxt = initialize_cfsm();

      /* Application-specific code goes here. */

      reclaim_cfsm();
      return 0;
    }

The call to initialize_cfsm() creates, initializes and returns a large structure called the cfsm context that most of the other functions in the library depend on. As some functions take a pointer to the context as one of their arguments, it is often useful to have a variable bound to it. In any case, the function get_default_context() will return the context created by initialize_cfsm(). The function reclaim_cfsm() frees all the data structures and memory allocated to the default context.

Each application source directory has a Makefile for creating and testing the particular application. All of them start off with the same incantation:

    OPSYS := $(shell uname)
    MACHINE := $(shell uname -m)
    VERSION = $(shell grep VERSION_STRING version.h |\
                sed -e 's/^[^\"]*"//' -e 's/"//')
    STATIC_LIB_EXT = a
    ifeq ($(OPSYS),Darwin)
      LOCAL_CFLAGS = -DDarwin -fno-common -no-cpp-precomp -D_M_IX86\
                     -D__M_IX86
      READLINE_LIB = -lreadline
      OS_PATH = ../../macosx
      SHARED_LIB_EXT = dylib
    else
      READLINE_LIB = -lreadline -ltermcap
      SHARED_LIB_EXT = so
      ifeq ($(MACHINE),i686)
        LOCAL_CFLAGS = -DUnix -DLinux -U__GNUC__ -D_M_IX86 -D__M_IX86
        OS_PATH = ../../linux32
      else
        ifeq ($(MACHINE),x86_64)
          LOCAL_CFLAGS = -DUnix -DLinux -U__GNUC__ -D_M_IX86_64\
                         -D__M_IX86_64
          OS_PATH = ../../linux64
        else
          LOCAL_CFLAGS = -DUnix -DSVR4 -Dsparc -D_sparc
          OS_PATH = ../../solaris
        endif
      endif
    endif

The purpose of this block is to determine the operating system and the architecture of the machine so that the rest of the Makefile can call the C compiler with the appropriate compiler flags and link to the correct version of libcfsm.a. Each Makefile contains the following targets and dependencies:

    clean:
    display:
    all:
    test: all
    install: all

The command make clean deletes all files generated by the C compiler. The command make display shows the values of the variables set by the initial block of the Makefile. For example, on an Intel Macintosh make display produces the following output:

    MACHINE = i386
    OPSYS = Darwin
    SHARED_LIB_EXT = dylib
    STATIC_LIB_EXT = a
    OS_PATH = ../../macosx
    VERSION = 1.0.0

where OS_PATH shows the path to the appropriate version of libcfsm.a and VERSION is the version number provided by the application’s version.h file. All of the demo applications list it as version 1.0.0. The command make all compiles the application. The command make test compiles the application if that has not already been done and runs the actions specified under test. The command make install installs the application into the bin directory of the specified platform.


Let us now introduce each of the demo applications, starting with the most trivial one, “Hello World” as a finite-state transducer. We recommend viewing the explanations below side-by-side with the actual code. In the following, we ignore for the most part print statements that just provide tracing information to the user about what the application is doing.

2.1 Hello

This application compiles a regular expression that associates the string “Hello World!” with its French counterpart “Bonjour le Monde !” (the space before the ! is required in French). The regular expression formalism used here and elsewhere in this document is explained in Chapters 2 and 3 of the Beesley & Karttunen book.

    net = read_regex("{Hello World!}:{Bonjour le Monde !}");

The transducer returned by the read_regex() function is the same one that the command read regex would produce in the xfst application. The English string is on the upper side of the transducer, the French string on the lower side. The command

    page = new_page();

creates a page object for formatted output. A page has a notion of indentation and a right margin. A default page object created by new_page() has zero indentation and the right margin at 72. The size of a page is incremented as needed. The command

    words_to_page(net, UPPER, DONT_ESCAPE, page);

outputs the words on the upper side of the net onto the page without looking for characters that might need special treatment (DONT_ESCAPE). The command

    print_page(page, stdout);

displays the contents of the page on stdout: Hello World! The sequence of statements

    reset_page(page);
    words_to_page(net, LOWER, DONT_ESCAPE, page);
    print_page(page, stdout);

display the lower-side language of the transducer: Bonjour le Monde !. The next command creates an apply context structure for applying the transducer net with its upper-side strings as input and its lower-side strings as output, that is, from English to French.

    applyer = init_apply(net, UPPER, cntxt);

Having created an applyer, we can now call apply_to_string() to apply the network to an input to obtain an output string, if any:

    output = apply_to_string("Hello World!", applyer);
    if (output)
      puts(output);

The puts() command prints Bonjour le Monde !. Switching the input side from upper to lower with

    switch_input_side(applyer);

lets us translate Bonjour le Monde ! into Hello World!. The statements

    output = apply_to_string("Bonjour le Monde !", applyer);
    if (output)
      puts(output);

print Hello World! again. Finally, the commands

    reset_page(page);
    network_to_page(net, page);
    print_page(page, stdout);

show the structure of the network:

    Sigma: B H M W d e j l n o r u " " !
    Size: 14.
    Flags: deterministic, pruned, minimized, epsilon_free, loop_free
    Arity: 2
    s0: H:B -> s1.
    s1: e:o -> s2.
    s2: l:n -> s3.
    s3: l:j -> s4.
    s4: o -> s5.
    s5: " ":u -> s6.
    s6: W:r -> s7.
    s7: o:" " -> s8.
    s8: r:l -> s9.
    s9: l:e -> s10.
    s10: d:" " -> s11.
    s11: !:M -> s12.
    s12: 0:o -> s13.
    s13: 0:n -> s14.
    s14: 0:d -> s15.
    s15: 0:e -> s16.
    s16: 0:" " -> s17.
    s17: 0:! -> fs18.
    fs18: (no arcs)

The Sigma of the network is its symbol alphabet, all the single or multicharacter symbols that may appear as an arc label or as a component of a pair label such as H:B. The Flags line lists some of the properties of the network. Arity: 2 means that the network is a transducer, that is, it encodes a string-to-string relation rather than a simple language. The states are numbered in a topological order, that is, s0 is the initial state of the network and the numbers of the other states reflect their distance from the start state. State-to-state transitions, called arcs or edges, are listed in the state where they originate. Each arc has a label and a destination state. For example, H:B -> s1 is an arc with H:B as the label and s1 as its destination. Final states such as fs18 have names starting with f.

2.2 Apply

In the apply directory, the command make test compiles the source file apply.c and calls the resulting executable with three arguments:

    ./apply test/FrenchEnglish.fst test/english.txt test/french.txt

The three files are in the subdirectory test. The first one, FrenchEnglish.fst, is a transducer produced in the test directory by the stand-alone fst application with the command fst -f english-french.script. This script creates a transducer that translates between English and French numerals from one to nine hundred ninety-nine. French numerals are on the upper side of the transducer, English numerals on the lower side. The two other files in the test directory, english.txt and french.txt, contain a few numerals from each of the two languages. The apply program loads the transducer specified by the first argument

    net = load_net(argv[1]);

and opens the first text file as an input stream. It creates an apply context with the statement

    applyer = new_applyer(net, NULL, input_stream, LOWER, cntxt);

The new_applyer() function creates an apply context and initializes it either for an input string (second argument) or for an input stream (third argument). One of the two must be specified and the other one must be NULL. The working core of the apply program is the while loop in the auxiliary routine process_input:

    static void process_input(APPLYptr applyer, FST_CNTXTptr cntxt)
    {
      STRING_BUFFERptr input, output;

      input = FST_string_buffer(cntxt);

      while ((output = next_apply_output(applyer)))
        {
          if (APPLY_in_stream(applyer) != stdin)
            {
              print_string_buffer(input, stdout);
              putchar('\n');
            }
          if (output->pos > 0)
            {
              print_string_buffer(output, stdout);
              putchar('\n');
            }
          else if (!applyer->end_of_input)
            puts("???\n\n");
        }
    }

The while loop processes the input line by line and keeps applying the transducer until the input stream ends. The operation uses two string buffer objects, one for input, the other for output. A string buffer is a container for a string whose length is adjusted automatically. One of the buffers, input, is obtained from FST_string_buffer(cntxt), where the function that parses the input line saves a copy of it. The first if-statement checks whether the input is coming from stdin. If not, the input string is displayed on stdout. If the user is typing the input directly on a console there is no reason to echo it back. It makes sense to show the input string only if it is coming from a file. The second string buffer, output, is returned by the next_apply_output() function. The second if-statement checks whether any output was produced, that is, whether the output position of the string buffer is non-zero. In that case, the output string is displayed. If not, unless we are at the end of the input, three question marks are printed to indicate that the input was not recognized. Here is an example of the kind of output that is produced:

    seventeen
    dix-sept
    sixty-nine
    soixante-et-neuf
    soixante-neuf

After the first input stream is exhausted, the apply routine opens the second text file, switches the direction of application and runs the process again, now mapping French numerals into English. A simple exercise in the use of the cfsm API is to translate the fst script in apply/test into a program based on cfsm_api.h and libcfsm.a.

2.3 PMatch

One of the recent additions to PARC’s finite-state technology is pattern matching. Section 3.2 discusses the pattern matching algorithm in more detail. This demo illustrates the basic idea. A pattern may be defined as a list of words and phrases or by a regular expression. The example illustrates the use of both techniques. The test directory contains a pattern file patterns.fst, produced by the script patterns.script that processes Actor.txt, Dictator.txt and Movie.txt. When these three list files are compiled by patterns.script, every name on the list of actors, for example, Grace Kelly, becomes a path in patterns.fst that leads to the final transition </Actor>:0 where 0 stands for epsilon. The pattern matching transducer also contains paths for social security numbers such as 123-45-6789 that lead to the final transition </SocSecNum>:0, and paths for phone numbers that terminate with </PhoneNum>:0.

When the pattern-matching apply algorithm finds a match for an input such as Grace Kelly or 123-45-6789, that is, when it comes to a closing XML tag on the output side of the last lower-side epsilon transition, it can record this fact in several ways. The default output behavior is to wrap the matching string inside matching XML brackets: <Actor>Grace Kelly</Actor>. Another option is standoff markup. That is, the program records the byte position of the beginning of the match, the length of the match and the actual string matched. This is the option chosen in the demo. The command make test loads test/patterns.fst and opens an input stream for test/input.txt. It then initializes a pattern matcher with the statement

    applyer = new_pattern_applyer(net, NULL, input_stream, NULL, LOWER, cntxt);

Here the first NULL argument indicates that there is no input string; the input comes from input_stream. The second NULL argument indicates that the output is written to stdout rather than to a file. The statement

    output = next_pattern_output(applyer);

processes the entire input file:

    My social security number is 123-45-6789.
    You can call me at (650) 812-4567.
    Grace Kelly and Gary Grant starred in "To Catch a Thief".
    Charlie chaplin played Hitler in The Great Dictator.

producing the following output:

    29|11|123-45-6789|
    62|14|(650) 812-4567|
    78|11|Grace Kelly|
    94|10|Gary Grant|
    117|16|To Catch a Thief|
    136|15|Charlie chaplin|
    159|6|Hitler|
    169|18|The Great Dictator|

    Pattern        # of matches
    ---------------------------
    <Actor>        3
    <Dictator>     1
    <Movie>        2
    <PhoneNum>     1
    <SocSecNum>    1
    ---------------------------
    Total:         8

The first line of the output indicates that, starting at byte position 29, there is a sequence of bytes of length 11, namely 123-45-6789, that matches the pattern. Section 3.2 has more about pattern matching.

2.4 Tokenize

A tokenizer is a program that chops a text into words, punctuation symbols, numbers and other kinds of sequences that should be recognized as a unit. A typical tokenizer maps an input file into an output file where each line contains a token. Alternatively, a special token boundary symbol such as TB might be introduced instead of a newline. Either way, standard tokenizers are deterministic. But there are many instances where it is difficult to decide where to put the token boundary. For example, the final period in Dr. might be the end of a sentence and not part of the abbreviation, or it might be a conflation of two periods. In some cases a final exclamation point might be a part of the token rather than a sentence terminator (Yahoo!). Sophisticated tokenizers need to encode ambiguity in their output. The tokenize application returns tokens not as strings but as small networks. If there are alternative tokenizations, the network will contain more than one path. For example, if the string Dr. could be processed either as Dr. TB or as Dr TB . TB, the tokenizer returns the network in Figure 1.

[Figure: a network with states 0 to 5 and arcs labeled D, r, ., ., TB and TB]

Fig. 1. Two tokenizations for Dr.

Individual token networks can be concatenated to encode a large number of alternative tokenizations for a sentence in a very compact form. The tokenize application creates a tokenizer with the statement:

    tok = new_tokenizer(net, NULL, input_stream, single_to_id("TB"),
                        single_to_id("FAIL\nTOKEN"));

The core of the tokenize application is the while loop

    while ((net = next_token_net(tok)))
      {
        words_to_page(net, UPPER, DONT_ESCAPE, page);
        print_page(page, stdout);
        reset_page(page);
        free_network(net);
      }

The call to next_token_net(tok) keeps producing new token networks until the input has been exhausted. The token network used in this demo is created by the script token.script in the test directory. The crucial statement that introduces the ambiguity of final periods and exclamation points comes at the very end of the script:

    read regex [
      ~$[TB]                                      # No TB on the input side!
      .o.
      Token @-> ... TB                            # Insert TB after a maximal token.
      .o.
      SP -> 0 || TB _                             # Kill spaces at token boundaries.
      .o.
      [%.|%!] (->) TB ... || Char _ [.#. | TB]    # Optional TB
    ] ;
    invert net

In the composition of the replace rules above, the upper side is presumed to be the input side. The invert net command flips the final transducer around because in morphological analyzers the lower side is traditionally the input side.

2.5 PigLatin

The PigLatin application illustrates the use of functions in PARC’s regular expression language. A function is defined as a regular expression with any number of parameters. For example, a function for reduplicating a string with a period inserted in the middle is defined in fst with the statement:

    define Redup(X) [X %. X ];

Given this definition of Redup(X), the fst regular expression compiler interprets Redup({pig}) as pig.pig. Another type of definition in the fst application is that of lists. For example, the statements

    list C [b c d f g h j k l m n p q r s t v w x y z];
    list V [a e i o u];

define C as a set of consonants and V as a set of vowels. Using lists and other tricks in PARC’s regular expression language, it is possible to define a complex function such as PigLatin(X) that translates any word into PigLatin, mapping, for example, pig into igpay. The definition of lists and functions is available in the cfsm API. The purpose of this demo application is to show how to make use of them with PigLatin as the example. In the C interface, the reduplication function above would be defined as follows:

    define_function("Redup(X)", "[X \".\" X]");

The application defines PigLatin(X) with the help of five auxiliary functions, applies the result to pig and enters a loop asking the user to input more words to translate into PigLatin. Here is the output from the make test command:

    Defined list C
    Defined list V
    defined Redup(X).
    defined AddW(X).
    defined DelCons(X).
    defined TailToAy(X).
    defined DelMiddle(X).
    defined PigLatin(X).

    The word for 'pig' in PigLatin is: igpay
    Enter English words line by line.
    I will translate them to Pig Latin.
    Exit with ^D (^Z on Windows).
    cow
    owcay
    ^D
    ebyay

where ebyay is bye in PigLatin.

2.6 Commands

Unlike the other demo applications, this application has no particular theme. It is a test bed for various features of the cfsm API. Among other things, it illustrates how to compile regular expressions and how to make simple networks using the functions make_empty_net(), add_state_to_net() and add_arc_to_state(). It shows some calculus operations such as repeat_net(), negate_net() and minus_net(), and similar operations on alphabets such as minus_alph(). It illustrates optimization functions such as optimize_arcs() and share_arcs() that can reduce the size of a network. The comments in cfsm_api.h try to explain the technical details and the purpose of all the functions included in the interface.

3 New Features

This section describes some previously undocumented network operations that are available in the cfsm API and in fst, the more powerful “big sister” of xfst documented in the Beesley & Karttunen book [1]. The first subsection describes additions to the fst regular expression language: lists, functions and new types of special symbols. The second subsection documents a pattern matching facility that applies any number of user-defined patterns in parallel to an input text. The third subsection describes a number of optimization techniques that change the physical representation of the network either to make it smaller in memory or to make the application of the network faster at the cost of making it larger. We describe these features as they are used in the fst application and point to the corresponding functions in the cfsm API.

3.1 Lists, List and Insert Flags, Functions

It has always been possible in xfst to define networks. A statement such as

    define Vowel a|e|i|o|u ;

compiles the regular expression a|e|i|o|u and binds the symbol Vowel to the resulting network. Given this definition, the regular expression compiler will substitute for the symbol Vowel the network it has been defined as. For example, the expression Vowel^2 compiles into the same network as [a|e|i|o|u]^2. The network encodes all two-vowel strings. fst and the current version of xfst have two new types of definitions: lists (symbol sets) and functions. There are also two new types of special symbols: list flags and insert flags.

Lists

A list definition binds a name to an alphabet. For example, in fst the command

    list Vowel a|e|i|o|u ;

compiles the regular expression a|e|i|o|u and binds the symbol Vowel to the sigma alphabet of the resulting network. The corresponding function in libcfsm is define_regex_list(). Because the list is obtained from the sigma of the network, replacing a|e|i|o|u by [a e i o u] in the above list definition would have no effect. The networks are different but the sigma alphabets are the same. It is often convenient to define lists using Perl-like range expressions. For example,

    list Letter "A-Za-z" ;

defines Letter as the set of 52 ASCII letters. A range expression such as "A-Za-z" must be in double quotes. The regular expression compiler interprets "A-Za-z" as the union of the symbols in the range. Range expressions may be of the form "\uHHHH-\uHHHH" where each H is a hexadecimal digit. For example, the range "\u0391-\u03A1\u03A3-\u03A9" includes all uppercase Greek letters, equivalently, in UTF-8 mode, "Α-ΡΣ-Ω" (see footnote 1). Table 1 below enumerates fifteen symbol lists that are predefined as part of the initialization of cfsm. They include symbols in Latin-1 and its extension in Microsoft’s CP 1252 (“Windows Latin-1”).

    Name      Description
    all       the union of the other lists
    alnum     alphanumeric symbols
    alpha     alphabetic symbols
    ascii     ASCII symbols
    cntrl     control characters
    digit     0-9
    graph     non-whitespace symbols
    lower     lowercase alphabetic symbols
    punct     punctuation symbols
    upper     uppercase alphabetic symbols
    space     whitespace symbols
    windows   the 27 non-Latin-1 symbols in CP 1252
    labr      left angle bracket
    rabr      right angle bracket

    Table 1. Predefined Lists

Any predefined list can be used in a regular expression to designate the union of the symbols on the list. For example, the multicharacter symbol digit is equivalent to "0"|1|2|3|4|5|6|7|8|9 and to "0-9".

Footnote 1: The range of uppercase Greek letters must be defined as two subranges because the lowercase letter σ has an alternate word-final form that has no uppercase counterpart. There is no "\u03A2".


List Flags

In addition to being interpreted as predefined unions, list symbols have another use that takes advantage of the fact that list symbols denote symbol alphabets rather than networks. If Vowel is defined as the list [a|e|i|o|u], any of the listed vowels matches the special symbol @L.Vowel@ in the apply routines in fst and in the current xfst. The interpretation given to @L.Vowel@ is “any member of the list Vowel.” The opposite symbol, @X.Vowel@, means “excluding all members of the list Vowel.” The @L.Name@ and @X.Name@ constructions are similar to [ ] and [^ ] in Perl regular expressions except that lists may contain multicharacter symbols (see footnote 2). The advantage of “list flags” such as @L.Vowel@ is that they reduce the size of networks. For example, a network that encodes any sequence of two vowels can be defined concisely as "@L.Vowel@"^2. This regular expression compiles into a network consisting of two arcs and three states:

[Figure: a network with states 0, 1 and 2 and two arcs, each labeled @L.Vowel@]

Fig. 2. The language of vowel pairs encoded by "@L.Vowel@"^2

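An equivalent network defined as Vowel^2 in Figure 3 has 10 arcs:

[Figure: a network with states 0, 1 and 2 and ten arcs, five labeled a, e, i, o and u from state 0 to state 1 and five more with the same labels from state 1 to state 2]

Fig. 3. The language of vowel pairs encoded as Vowel^2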

List flags are interpreted properly by all the apply routines but they are treated as ordinary labels in network operations. Some operations such as concatenation and union work fine on networks that contain list flags but other operations such as composition (.o.), intersection (&) and minus (-) do not produce the expected results. For example, "@L.Vowel@" - a does not yield a network that accepts any vowel except for a. One way to construct such a network in fst is to first eliminate the list flag from the first operand with the command eliminate flag Vowel. Eliminating the @L.Vowel@ label converts the network in Figure 2 to the one in Figure 3. Another option is to define a new a-less list of vowels:

    list Vowel2 Vowel - a;

Given the list definitions of Vowel and Vowel2, the label @L.Vowel2@ matches any vowel except a.

Footnote 2: List flags are similar in appearance to the flag diacritics discussed in Chapter 7 of the Beesley & Karttunen book but the similarity stops there. Flag diacritics are a special type of epsilon symbol that checks the value of an attribute. List flags are not treated as epsilon transitions; they either match or don’t match a given input symbol.

Insert flags

Insert flags are special symbols of the form @I.D@ where D must be a symbol that has been bound to a network with the define command in fst or with the define_net() function in libcfsm. For example, if D is defined as follows:

    define D {the}|{a}|{an}|{this}|{that}|{those}|{these};

the symbol @I.D@ matches any of these English articles and determiners. Insert flags are particularly useful in pattern definitions that contain several instances of the same large component. For example, because the set of first names in English is the same as the set of middle names, a definition such as

    define PersonName FirstName (" " FirstName) " " LastName;

includes two copies of the FirstName network. Because the middle name component is optional, PersonName includes strings such as John Adams and John Henry Adams. A more concise definition,

    define PersonName "@I.FirstName@" (" " "@I.FirstName@") " " LastName;

does not physically incorporate the network of first names at all. When the apply routine encounters an arc labeled with an insert flag, it “pushes” into the defined network and, on success, “pops up” to continue the match on the same level as before. The definitions of terms referred to by an insert flag may themselves contain inserts, as the following example shows.

    define D {the}|{a}|{an}|{this}|{that}|{those}|{these};
    define N {city}|{girl}|{side}|{ocean};
    define P {with}|{of}|{from}|{on};
    define NP "@I.D@" " " "@I.N@" (" " "@I.PP@");
    define PP "@I.P@" " " "@I.NP@";

Here the definition of NP refers to the definitions of D, N and PP by way of insert flags, and the definition of PP refers back to the definition of NP. The PP pattern includes strings such as with the girl and on this side that contain a simple noun phrase, and it also includes strings such as with the girl from the city, on this side of the ocean and with the girl from the city on the ocean where the NP contains one or more embedded PPs. This is an example of an RTN, a recursive transition network. Like list flags, insert flags are properly handled by just a few network operations such as concatenation and union. Their utility is in enabling the apply routines to operate on smaller networks. The fst command eliminate flag also works on insert flags. The corresponding libcfsm function is eliminate_flag(). For example, eliminate flag D, eliminate flag N and eliminate flag P remove the @I.D@, @I.N@ and @I.P@ arcs from the definitions of NP and PP by “splicing in” the network in question. Because the definitions of NP and PP refer to each other, in this case neither one can be eliminated without resurrecting the other.

Functions

Functions are regular expression macros that construct a particular regular expression from the arguments, evaluate it, and return the resulting network. For example, as we saw in Section 2.5, in fst the command

    define Redup(X) [X %. X ];

defines a function that reduplicates its argument and marks the middle of the string with a period. The corresponding libcfsm call is

    define_function("Redup(X)","X %. X");

Given this definition of Redup(X), the regular expression compiler interprets Redup(p i g) as p i g . p i g (see footnote 3). A function can have any number of parameters, separated by commas. For example,

    define Wrap(X, Y) Y X Y;

yields a function that maps Wrap({abc}, %") into "abc". Every parameter must occur at least once in the regular expression defining the function. Section 2.5 shows how function definitions can use other defined functions to produce complex behavior. There is a large number of predefined functions for case conversion and symbol manipulation. They are listed in Table 2. The optional side argument is either U for upper, L for lower or B for both sides. The side argument makes a difference only for transducers. The default is B. Here are some examples of built-in case conversions and symbol manipulations. With the exception of the last example, the output consists of strings of single-character symbols.

    UpCase({new york})    ==>  NEW YORK
    UpCase(a:b)           ==>  A:B
    UpCase(a:b, L)        ==>  a:B
    OptUpCase(a:b)        ==>  A:B, a:b
    Cap({new york})       ==>  New York
    OptCap({new york})    ==>  New York, New york, new York, new york
    Explode(NP)           ==>  N P (two single-character symbols)
    Implode(N,P)          ==>  NP (a multicharacter symbol)

The case conversion functions work on all alphabets that make a distinction between upper and lowercase letters, including Greek, Cyrillic, etc. The symbol manipulation functions can be useful in manufacturing labels. For example, the EndTag(X) function

    define EndTag(X) Implode("</", X, ">"):0 ;

converts EndTag(Person) into the label </Person>:0 with a final XML tag on the upper side and an epsilon on the lower side. Labels of this type are used in pattern matching.

Footnote 3: The function definition is bound to a multicharacter symbol consisting of the function name and the opening left paren, Redup( in this case. There must not be any whitespace between the name and the left paren as they are parsed as a single symbol.

UpCase(X [,side])  Returns a network that recognizes the all-uppercase versions of X. If side U or L is specified, then only the indicated side of the transducer is uppercased.
OptUpCase(X [,side])  Same as for UpCase(X [,side]) but optionally uppercases each symbol in X.
DownCase(X [,side])  Returns a network that recognizes the all-lowercase versions of X. If side U or L is specified, then only the indicated side of the transducer is lowercased.
OptDownCase(X [,side])  Same as for DownCase(X [,side]) but optionally lowercases each symbol in X.
Cap(X [,side])  Same as UpCase(X [,side]) but with initial uppercasing only, with all other characters lowercased.
OptCap(X [,side])  Same as Cap(X [,side]), but optionally.
AnyCase(X [,side])  Returns a network that contains X with all case variations.
Explode(X)  Where X would normally be parsed as a multicharacter symbol foo, returns the exploded [f o o], which is a concatenation of three separate symbols.
Implode(x[,y [,z ...]])  Takes multiple symbol or string arguments and implodes them into a single multicharacter symbol.

Table 2. Predefined Functions

3.2

Pattern Matching

The left-to-right, longest match replace operator, =>, can be used to mark instances of a regular language in the text. For example, all occurrences of a particular name such as Sara Lee can be marked with a transducer compiled from the regular expression

{Sara Lee} @=> ... || Lim _ Lim ;

where Lim is any whitespace or punctuation symbol, or the beginning or the end of text. Karttunen et al. [2] give many examples where the technique works well in recognizing and marking simple types of named entities such as phone numbers, Social Security numbers and dates. The reason for the success is that the pattern languages in these cases were small. The left-to-right, longest match principle becomes computationally very expensive when the size of the pattern components increases because the => operator encodes the left-to-right, longest match constraint using the state and arc space of the network. If we try to define a name recognizer by an expression such as

FirstName " " LastName @=> ... || Lim _ Lim ;

where FirstName and LastName contain thousands of names, this approach quickly becomes impractical because of the size of the resulting network.

This section presents an algorithm that achieves the same result as marking transducers but scales up much better. Let us call it the pmatch algorithm. It is based on three ideas.

1. The left-to-right longest-match principle is implemented in the apply algorithm.
2. Only the end of a pattern needs to be marked.
3. The same string can be a match for more than one pattern.

To illustrate these ideas, let us consider a very simple case. We know that Sara Lee is a name of a company but there are also many people named Sara Lee. We would like to recognize all instances of Sara Lee and tag them both as a person and as a company. The definitions are given in (4).

(4) define CTag "</Company>":0;
    define PTag "</Person>":0;
    define Persons {Sara Lee};
    define Companies {Sara Lee};
    define Patterns Companies CTag | Persons PTag;

The resulting Patterns network consists of a linear path for the string Sara Lee leading to a penultimate state that has two arcs: one has the label </Company>:0, the other is labeled </Person>:0 (see footnote 4). Both of them lead to a final empty state.

Footnote 4. The 0 on the lower (right) side of the double label represents epsilon, the empty string.

The basic pmatch routine

Let us assume that the pmatch algorithm is applying the Patterns network to the input He works for Sara Lee. The algorithm starts at the beginning of the input and at the start state of the Patterns network and tries to match the first symbol, H, of the input against the arcs of the current pattern state. If it finds a match, it advances to the arc’s destination state and to the next symbol in the input string. If it fails to find a match, as is the case here, it writes the H into the output buffer and advances to the next symbol, e. Before starting the matching process, pmatch checks the left context at its current position. The default requirement is that a pattern should start from the beginning of a string or after a non-alphanumeric symbol. Because the e and the following space do not meet the starting condition, they are appended to the output buffer without any attempt to match them. In the case at hand, the matching attempts fail until the process reaches the letter S. From there on, the input matches the path leading to the penultimate state. At that point we are at the end of the input string but both of the tag arcs yield a match because they have an epsilon on the input side. Having reached a final state over a tag arc with no input left to check, pmatch now takes note of the fact that the next input symbol, the period, satisfies the default ending condition. It reads off the tag on the output side of the label, creates the corresponding initial xml tag on the fly and inserts it into the output buffer. Depending on the order of the tag arcs in the penultimate state, the start tag is now either <Company> or <Person>. Let us assume the former. In that case pmatch first inserts the initial <Company> tag into the output buffer and copies the string that it has matched into the output buffer, terminating with the closing </Company> tag. At that point the output buffer contains the string He works for <Company>Sara Lee</Company>. When pmatch processes the second closing tag, </Person>, it takes note of the fact that it already has one analysis for the string and wraps the second pair of initial and final tags around the first pair. The final output of the process is He works for <Person><Company>Sara Lee</Company></Person>.

Assume now that we add two new names, Sara and Lee, to the Persons list in (4). Given the same input as before, we now get a successful match at the point where the output buffer contains the string He works for <Person>Sara</Person>. But in this case the final state of the pattern network has an outgoing arc for space. Because the pmatch algorithm always looks for the longest match, it ignores this preliminary result and tries for a longer match. At the point where it comes to the </Company> and </Person> tags, the preliminary output gets overwritten and the final output is what we just saw. Because the pmatch algorithm never starts a search in the middle of another search, it will not try to find a match for Lee in this case. Consequently, strings that are substrings of some successfully matched longer string are always passed over.

Observations on the pmatch algorithm

Compared to the marking approach described in the beginning of this section, the pmatch algorithm has many advantages. Enforcing the longest match regimen in the apply routine is much more efficient than hardwiring the constraint into the transducer when the pattern is made up of large subcomponents. Using the last arc of a pattern to encode the type of entity allows different patterns to share structure.
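The control flow of the basic routine can be sketched without any networks. The code below is an illustration under stated assumptions, not the libcfsm implementation: the Pattern struct, pmatch_mark() and its helpers are hypothetical names, patterns are plain strings in a table rather than paths in a transducer, and the tags are assumed to be xml-style names such as Person and Company. It keeps the three properties described above: a match may start only at the beginning of the input or after a non-alphanumeric symbol, the longest match wins, and when several patterns accept the same string their tags are nested, with later patterns wrapped around earlier ones.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pattern entry: a literal string plus the name of its tag. */
typedef struct {
    const char *text;
    const char *tag;
} Pattern;

/* Appends n bytes of src to out; returns -1 if the buffer is too small. */
static int put(char *out, size_t size, size_t *used, const char *src, size_t n)
{
    if (*used + n + 1 > size)
        return -1;
    memcpy(out + *used, src, n);
    *used += n;
    out[*used] = '\0';
    return 0;
}

/* Appends an opening ("<") or closing ("</") tag with the given name. */
static int put_tag(char *out, size_t size, size_t *used,
                   const char *bracket, const char *tag)
{
    if (put(out, size, used, bracket, strlen(bracket)) != 0 ||
        put(out, size, used, tag, strlen(tag)) != 0)
        return -1;
    return put(out, size, used, ">", 1);
}

/* True if p matches the input at pos and is followed by a
   non-alphanumeric symbol or by the end of the input. */
static int matches_at(const char *input, size_t len, size_t pos,
                      const Pattern *p)
{
    size_t n = strlen(p->text);
    return n > 0 && n <= len - pos &&
           strncmp(input + pos, p->text, n) == 0 &&
           !isalnum((unsigned char)input[pos + n]);
}

/* Copies input to out, wrapping tags around every longest match.
   Returns the number of matched strings, or -1 if out is too small. */
int pmatch_mark(const char *input, const Pattern *pats, size_t npats,
                char *out, size_t size)
{
    size_t len = strlen(input), used = 0, pos = 0;
    int count = 0;

    assert(size > 0);
    out[0] = '\0';
    while (pos < len) {
        size_t best = 0;
        /* Default start condition: beginning of input or after a
           non-alphanumeric symbol. */
        if (pos == 0 || !isalnum((unsigned char)input[pos - 1])) {
            for (size_t i = 0; i < npats; i++) {
                size_t n = strlen(pats[i].text);
                if (n > best && matches_at(input, len, pos, &pats[i]))
                    best = n;
            }
        }
        if (best == 0) {            /* no match: copy one symbol */
            if (put(out, size, &used, input + pos, 1) != 0)
                return -1;
            pos++;
            continue;
        }
        /* Opening tags in reverse pattern order, so that the first
           pattern ends up innermost. */
        for (size_t i = npats; i-- > 0; ) {
            if (strlen(pats[i].text) == best &&
                matches_at(input, len, pos, &pats[i]) &&
                put_tag(out, size, &used, "<", pats[i].tag) != 0)
                return -1;
        }
        if (put(out, size, &used, input + pos, best) != 0)
            return -1;
        for (size_t i = 0; i < npats; i++) {
            if (strlen(pats[i].text) == best &&
                matches_at(input, len, pos, &pats[i]) &&
                put_tag(out, size, &used, "</", pats[i].tag) != 0)
                return -1;
        }
        pos += best;    /* never restart inside a completed match */
        count++;
    }
    return count;
}
```

With the table {Sara Lee, Company}, {Sara Lee, Person}, {Sara, Person}, {Lee, Person} and the input He works for Sara Lee., the sketch reports one match and wraps the Person tags around the Company tags, and Lee is never tried on its own because the search resumes after the longest match. The table scan is quadratic in the worst case; the point of the network-based routine is that shared prefixes are traversed only once.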
Structure sharing is visible in the Sara Lee example: the <Person> and <Company> interpretations of Sara Lee follow the same path in the network up to the final tag arc.

Output Options

The default output mode in fst for pattern matching is to wrap xml tags around each match of the pattern. In this section we cover briefly the other output options. The output mode is controlled by the following interface variables, defaults in parentheses: mark-patterns (on), locate-patterns (off), delete-patterns (off), extract-patterns (off).

default  Wrap xml tags around the strings that match a pattern, for example, Grace Kelly.
stand-off markup  Leave the original text unmodified. Produce an output file that indicates for each match its beginning byte position in the file, for example, 78|11|Grace Kelly|. Set locate-patterns on.
extraction  Extract from the file all the strings that match some pattern. Output them with their tags. For example, Grace Kelly. Ignore all the rest. Set locate-patterns off, set extract-patterns on.
redaction  Ignore strings that match some pattern. Print the rest. Set extract-patterns off, set delete-patterns on.

CFSM API Interface to Pattern Matching

The header file cfsm_api.h contains the following functions for pattern matching:

new_pattern_applyer()  Creates and initializes a pattern applyer for a string or for an input stream and an output stream.
make_pattern_applyer()  Creates and initializes a pattern applyer for a string or for an input file and an output file. If the output file is NULL, the output goes into an internal string buffer.
next_pattern_output()  Calls apply_patterns() and returns the pointer to the output buffer.
apply_patterns()  Applies a pattern applyer created by new_pattern_applyer().
pattern_match_counts_to_page()  Writes to a page the number of matches for each pattern obtained by the last application of a pattern applyer.

See the function prototypes and the comments in cfsm_api.h for more information about these functions.

In the cfsm api the output options are controlled by six macro settings:

iy count patterns  Keep a count of how many matches each pattern gets. Default is 1.
iy delete patterns  Delete the matching strings, keep everything else. Default is 0.
iy extract patterns  Extract the matches, suppress everything else. Default is 0.
iy locate patterns  Produce standoff markup. Default is 0.
iy mark patterns  Wrap xml tags around the match. Default is 1.
iy need separators  A separator character is required at the start and at the end of a match. Default is 1.

3.3

Optimizations

The internal representation of states and arcs in fst can be modified in several ways to reduce the size or to improve the speed of applications. The effect of any of the techniques described in this section is highly variable. Networks that are very dense in terms of the arcs/state ratio can sometimes be dramatically reduced in size by encoding arc space in a different way even if the state count goes up. Sparse networks tend to receive less benefit, if any at all. This section describes briefly some of these optimizations. All the operations discussed here are lossless and reversible. The optimized networks can be used for apply and display operations but most calculus operations are implemented only for standard networks.


Optimizations for size reduction

label set reduction  Compute equivalence classes of arc labels and represent each class by one label only. Two labels belong to the same class if they always occur together and their arcs always have the same destination state.
arc optimization  A heuristic method for reducing the number of arcs, typically at the cost of adding more states.
arc sharing  Let some arcs be shared by more than one state. Arc sharing presupposes arc optimization.
compaction  Minimize the size of state and arc structures. Achieves the best compression but slows down the application speed significantly.

Optimizations for speed

vectorization  In the standard representation, a state consists of a list of arcs. An arc has a label, a pointer to a destination state and a pointer to the next arc. Most fst algorithms work only on the standard representation. A vectorized state has no arcs. Instead it has a vector of destination pointers that is accessed directly by symbol labels. In typical applications, dense states (states with long arc lists) get vectorized while sparse states remain in the standard representation.

Size vs. speed

Table 3 illustrates the effect of the manipulations discussed above on the size of a network and the speed of the application. The network in question is a “sentence breaking” transducer that introduces sentence boundaries. Its large size is due to the long lists of abbreviations, numerical expressions and other expressions that are exceptions to general punctuation rules. The application speed was measured on a 710K text file.

Representation              States/Arcs    Size     Speed
Standard, no optimization   5302/596017    7.2 Mb   23 sec
Label set reduction         5302/28033     3.5 Mb   11 sec
Arc optimization            8364/83301     1.1 Mb   12 sec
Arc sharing                 5302/75679     994 Kb   11 sec
Compaction                  5302/596017    518 Kb   3 min 3 sec
Vectorization               5302/5312      6.9 Mb   2 sec

Table 3. Comparison of the Effect of Various Optimizations on Size and Speed
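The equivalence test behind label set reduction can be illustrated on a flat list of arcs. This is a minimal sketch under stated assumptions rather than the method used by reduce_labelset(): the Arc record and the function label_classes() are hypothetical, labels are taken to be small integers, and the comparison is a naive scan over all arcs. Two labels fall into the same class exactly when every arc carrying one of them has a twin arc carrying the other between the same pair of states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat arc record: source state, label ID, destination. */
typedef struct {
    int src;
    int label;
    int dst;
} Arc;

/* Returns 1 if every arc labeled a has a twin arc labeled b with the
   same source and the same destination state. */
static int covered_by(const Arc *arcs, size_t n, int a, int b)
{
    for (size_t i = 0; i < n; i++) {
        int found = 0;
        if (arcs[i].label != a)
            continue;
        for (size_t j = 0; j < n && !found; j++) {
            found = arcs[j].label == b &&
                    arcs[j].src == arcs[i].src &&
                    arcs[j].dst == arcs[i].dst;
        }
        if (!found)
            return 0;
    }
    return 1;
}

/* Stores in rep[a] the smallest label equivalent to label a, for labels
   0 .. num_labels - 1, and returns the number of equivalence classes.
   Labels that occur on no arc at all end up in one common class. */
int label_classes(const Arc *arcs, size_t n, int num_labels, int *rep)
{
    int classes = 0;

    assert(rep != NULL && num_labels >= 0);
    for (int a = 0; a < num_labels; a++) {
        rep[a] = a;
        /* Comparing against class representatives is enough because
           the relation is transitive. */
        for (int b = 0; b < a; b++) {
            if (rep[b] == b &&
                covered_by(arcs, n, a, b) && covered_by(arcs, n, b, a)) {
                rep[a] = b;
                break;
            }
        }
        if (rep[a] == a)
            classes++;
    }
    return classes;
}
```

After the classes are known, only the arcs of each representative need to be stored, and the other members of a class are mapped to their representative at apply time. That is how a dense network can lose most of its arcs while keeping its state count.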

For the network in Table 3, arc optimization with arc sharing is the best option if size is important. Vectorization is the best option if speed is what matters most. Whether or not a state gets vectorized depends on the min_num_arcs parameter set in the function call. In this experiment the setting was 50, that is, states with 50 or more outgoing arcs were vectorized and the others remained in the standard linked arc-list representation. With this setting, the size of the partially vectorized network was even a little smaller than the same network in standard format. If the min_num_arcs parameter is lowered to 10, the speed increases but the size grows beyond the size of the standard network representation.

The cfsm api contains the following ten functions related to optimization:

vectorize_states()  Vectorizes all states that have at least min_num_arcs arcs.
unvectorize_net()  Restores vectorized states to normal form.
optimize_arcs()  Heuristically tries to reduce the number of arcs even at the cost of adding states.
unoptimize_arcs()  Undoes the work of optimize_arcs().
share_arcs()  Makes an arc-optimized network even smaller by letting arcs be shared by two or more states.
unshare_arcs()  Undoes the work done by arc optimization and arc sharing.
reduce_labelset()  Partitions the label alphabet into equivalence classes and represents each class by a single label.
unreduce_labelset()  Undoes the work of reduce_labelset().
compact_net()  Compacts a network into a form that has no state or arc structures.
uncompact_net()  Undoes the work of compact_net().

See the function prototypes and the comments in cfsm_api.h for more information about these functions.
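The trade-off between arc lists and vectorized states can be made concrete with a small sketch. The types and functions below (SketchState, SketchArc, sketch_step, sketch_vectorize_state) are hypothetical and are not the libcfsm data structures. The sketch assumes integer labels in the range 0 to num_labels - 1 and at most one arc per label in a state. Only the threshold parameter is modeled on the min_num_arcs argument of vectorize_states().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct SketchState SketchState;

/* Standard representation: label, destination, pointer to the next arc. */
typedef struct SketchArc {
    int label;
    SketchState *dest;
    struct SketchArc *next;
} SketchArc;

struct SketchState {
    SketchArc *arcs;        /* arc list, NULL once the state is vectorized */
    SketchState **vector;   /* destinations indexed by label, or NULL */
};

/* Returns the destination reached from state over label, or NULL. */
SketchState *sketch_step(const SketchState *state, int label, int num_labels)
{
    assert(state != NULL);
    if (label < 0 || label >= num_labels)
        return NULL;
    if (state->vector != NULL)
        return state->vector[label];             /* direct access */
    for (const SketchArc *a = state->arcs; a != NULL; a = a->next) {
        if (a->label == label)                   /* linear search */
            return a->dest;
    }
    return NULL;
}

/* Vectorizes the state if it has at least min_num_arcs arcs. Returns 1
   if the state was vectorized, 0 if it was left alone, -1 if memory
   allocation failed. The arc list is detached but not freed because in
   this sketch the caller owns the arc storage. */
int sketch_vectorize_state(SketchState *state, int num_labels,
                           int min_num_arcs)
{
    int count = 0;
    SketchState **v;

    assert(state != NULL && num_labels > 0);
    for (const SketchArc *a = state->arcs; a != NULL; a = a->next)
        count++;
    if (state->vector != NULL || count < min_num_arcs)
        return 0;
    v = calloc((size_t)num_labels, sizeof *v);
    if (v == NULL)
        return -1;
    for (const SketchArc *a = state->arcs; a != NULL; a = a->next) {
        if (a->label >= 0 && a->label < num_labels)
            v[a->label] = a->dest;
    }
    state->vector = v;
    state->arcs = NULL;
    return 1;
}
```

A vectorized state costs one pointer per label in the alphabet whether or not the label is used, which is why lowering the threshold makes lookups faster while the network grows, and why sparse states are better left as arc lists.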

4

A Bit of History

The current cfsm code is the third generation of finite-state implementations created at parc and at xrce, Xerox Research Centre Europe. The beginnings of the system are in the Lisp code written by Ronald M. Kaplan in the early 1980s. Martin Kay and Kimmo Koskenniemi contributed ideas for morphological analysis with finite-state transducers. Lauri Karttunen ported Kaplan’s Interlisp code into Common Lisp around 1987 and extended it with new operations. The current c version was started around 1990 by Karttunen, Todd Yampol and Kenneth R. Beesley (then working for Microlytics). Many people have contributed to it over the years. At xrce in the 1990s, the main contributors were Pasi Tapanainen, Hervé Poirier, André Kempe, Tamás Gaál, Caroline Privault and Jean-Marc Coursimault. The replace operators were introduced into the regular expression calculus at xrce by Karttunen and Kempe in the mid 1990s. The facilities for utf-8 encoded Unicode input/output were engineered by Karttunen and Coursimault around 2002; the internal string representations were in Unicode from the very beginning. At parc, where cfsm is currently maintained and developed, Hadar Shemtov, John Maxwell and Robert D. Cheslow have helped in the effort. The newest layer of functionality in cfsm is pattern matching.

The first commercial user of parc’s finite-state software was Microlytics in the 1980s, followed by Inxight (now part of sap) in the 1990s. During that period, parc and Inxight had a close relationship with Lingsoft, a Finnish company cofounded by Kimmo Koskenniemi. Traces of this alliance still show up in several places in the credits of the latest release of Microsoft Word (2008), for example, German speller: Copyright(c) Lingsoft, 2004. All rights reserved. Two-Level Compiler: Copyright (c) Xerox Corporation 1994. The Two-Level Compiler, still in use to produce the German spell checker for Word, was originally developed by Kaplan, Koskenniemi and Karttunen in Lisp around 1987 and reimplemented in c with help from Beesley in the early 1990s. Because the Two-Level Compiler was co-developed by Koskenniemi, Lingsoft has a perpetual right to market networks produced with the Two-Level Compiler without any royalty to xerox or parc. The latest cfsm licensee is Powerset, a San Francisco search company recently acquired by Microsoft.

5

Other Interfaces

At parc, Timothy Maxwell has produced a Python interface to cfsmlib called python fsm; at Powerset, Stuart Robinson has made an interface for Ruby called rubygem fsm. These interfaces use cfsm api.h with a dynamic version of cfsmlib.

References

1. Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Publications, Palo Alto, CA (2003)
2. Karttunen, L., Chanod, J.P., Grefenstette, G., Schiller, A.: Regular expressions for language engineering. Journal of Natural Language Engineering 2(4) (1996) 305–328


A

Alphabetic List of Functions

ARCptr add_arc_to_state(NETptr net, STATEptr start, id_type id, STATEptr dest, void *user_pointer, int big_arc_p);
STATEptr add_state_to_net(NETptr net, int final_p);
int add_string_property(NETptr net, char *attribute, char *value);
ALPHABETptr alph_add_to(ALPHABETptr alph, id_type new_id, int keep_p);
ALPHABETptr alph_remove_from(ALPHABETptr alph, id_type id, int keep_p);
NETptr alphabet_net(ALPHABETptr alph);
int alphabet_to_page(ALPHABETptr alph, PAGEptr page);
int append_fat_str_to_buffer(FAT_STR fs, STRING_BUFFERptr str_buf);
int append_label_to_buffer(id_type id, int escape_p, STRING_BUFFERptr str_buf);
int append_string_to_buffer(char *str, STRING_BUFFERptr str_buf);
int append_to_lab_vector(id_type lab, LAB_VECTORptr lab_vect);
int append_to_vector(void *object, VECTORptr vector);
int apply_patterns(APPLYptr pattern_applyer);
char *apply_to_string(char *input, APPLYptr applyer);
void assure_buffer_space(int length, STRING_BUFFERptr str_buf);
ALPHABETptr binary_to_label(ALPHABETptr alph);
char *cfsmlib_build(void);
char *cfsmlib_version(void);
void char_to_page(char c, PAGEptr page);
void compact_net(NETptr net);
NETptr compose_net(NETptr upper, NETptr lower, int keep_upper_p, int keep_lower_p);
NETptr concat_net(NETptr net1, NETptr net2, int keep_net1_p, int keep_net2_p);
NETptr contains_net(NETptr net, int keep_p);
ALPHABETptr copy_alphabet(ALPHABETptr alph);
FAT_STR copy_fat_string(FAT_STR fs);
NETptr copy_net(NETptr net);
NETptr crossproduct_net(NETptr upper, NETptr lower, int keep_upper_p, int keep_lower_p);
int decrement_lab_vector(LAB_VECTORptr lab_vect);
int define_function(char *fn_call, char *regex);
int define_net(char *name, NETptr net, int keep_p);
int define_regex_list(char *name, char *regex);
int define_regex_net(char *name, char *regex);
int define_symbol_list(char *name, ALPHABETptr alph, int keep_p);
NETptr eliminate_flag(NETptr net, char *name, int keep_p);
NETptr epsilon_net(void);
FAT_STR fat_strcat(FAT_STR s1, FAT_STR s2);
FAT_STR fat_strcpy(FAT_STR s1, FAT_STR s2);
void fat_string_to_page(FAT_STR fs, PAGEptr page);
void fat_string_to_page_esc(FAT_STR fs, char *esc, PAGEptr page);
char *fat_to_thin_str(FAT_STR fat_str, char *thin_str, int with_esc);
PAGEptr file_info_to_page(PAGEptr page);
PAGEptr flags_to_page(NETptr net, PAGEptr page);
void float_to_page(float f, int watch_rm, PAGEptr page);
int fprint_fat_string(FILE *outfile, FAT_STR fs);

void free_alph(ALPHABETptr alph);
void free_alph_iterator(ALPH_ITptr alph_it);
void free_applyer(APPLYptr applyer);
void free_arc(ARCptr arc);
void free_arc_iterator(ARCITptr arc_it);
void free_network(NETptr net);
void free_nv_and_nets(NVptr nv);
void free_nv_only(NVptr nv);
void free_page(PAGEptr page);
void free_state(STATEptr state);
void free_string_buffer(STRING_BUFFERptr str_buf);
void free_tokenizer(TOKptr tok);
void free_vector(VECTORptr vector);
FST_CNTXTptr get_default_context(void);
NETptr get_net(char *name, int keep_p);
char *get_string_property(NETptr net, char *attribute);
ALPHABETptr get_symbol_list(char *name, int keep_p);
id_type id_pair_to_id(id_type upper_id, id_type lower_id);
LABELptr id_to_label(id_type id);
NETptr ignore_net(NETptr target, NETptr noise, int keep_target_p, int keep_noise_p);
void increment_lab_vector(LAB_VECTORptr lab_vect);
APPLYptr init_apply(NETptr net, int side, FST_CNTXTptr cfsm_cntxt);
void init_apply_to_stream(FILE *stream, APPLYptr apply_context);
void init_apply_to_string(char *input, APPLYptr apply_context);
ARCITptr init_arc_iterator(NETptr net, ARCITptr arc_it);
FST_CNTXTptr initialize_cfsm(void);
void initialize_string_buffer(STRING_BUFFERptr str_buf);
IntParPtr int_parameters(void);
int int_print_length(long i);
void int_to_page(long i, int watch_rm, PAGEptr page);
ALPHABETptr intersect_alph(ALPHABETptr alph1, ALPHABETptr alph2, int keep_alph1_p, int keep_alph2_p);
NETptr intersect_net(NETptr net1, NETptr net2, int reclaim_net1_p, int reclaim_net_p);
NETptr invert_net(NETptr net, int keep_p);
int keep_net2_p);
NETptr kleene_plus_net(void);
NETptr kleene_star_net(void);
int lab_vector_element_at(id_type *lab, int pos, LAB_VECTORptr lab_vect);
int label_length(LABELptr label, int escape_p);
NETptr label_net(id_type id);
ALPHABETptr label_to_binary(ALPHABETptr alph);
void label_to_page(id_type id, int escape_p, int watch_rm, PAGEptr page);
PAGEptr label_vector_to_page(LAB_VECTORptr lab_vect, PAGEptr page, int escape_p, char *sep);
PAGEptr labels_to_page(NETptr net, PAGEptr page);
NETptr lenient_compose_net(NETptr upper, NETptr lower, int keep_upper_p, int keep_lower_p);

int load_defined_nets(char *filename);
NETptr load_defined_net(char *name, char *filename);
NETptr load_net(char *filename);
NVptr load_nets(char *filename);
int longest_string_to_page(NETptr net, int side, PAGEptr page);
id_type lower_id(id_type id);
NETptr lower_side_net(NETptr net, int keep_p);
ALPHABETptr make_alph(int len, int type);
APPLYptr make_applyer(char *fst_file, char *in_string, char *in_file, int input_side, FST_CNTXTptr cfsm_cntxt);
NETptr make_empty_net(void);
STRING_BUFFERptr make_fat_str_buffer(int length);
FAT_STR make_fat_string(int length);
LAB_VECTORptr make_lab_vector(int length);
NVptr make_nv(int len);
PAGEptr make_page(int size, int indent, int rm);
APPLYptr make_pattern_applyer(char *fst_file, char *in_string, char *in_file, char *out_file, int input_side, FST_CNTXTptr cfsm_cntxt);
STRING_BUFFERptr make_string_buffer(int length);
TOKptr make_tokenizer(char *fst_file, char *in_string, char *in_file, char *token_bound, char *fail_token);
VECTORptr make_vector(int length);
int minimize_net(NETptr net);
ALPHABETptr minus_alph(ALPHABETptr alph1, ALPHABETptr alph2, int keep_alph1_p, int keep_alph2_p);
NETptr minus_net(NETptr net1, NETptr net2, int keep_net1_p, int keep_net2_p);
NETptr negate_net(NETptr net, int keep_p);
NETptr net(char *name);
NVptr net2nv(NETptr net);
ALPHABETptr net_labels(NETptr net);
ALPHABETptr net_sigma(NETptr net);
PAGEptr net_size_to_page(NETptr net, PAGEptr page);
PAGEptr network_to_page(NETptr net, PAGEptr page);
APPLYptr new_applyer(NETptr net, char *string, FILE *stream, int input_side, FST_CNTXTptr cfsm_cntxt);
PAGEptr new_page(void);
void new_page_line(PAGEptr page);
APPLYptr new_pattern_applyer(NETptr net, char *in_string, FILE *in_stream, FILE *out_stream, int input_side, FST_CNTXTptr cfsm_cntxt);
TOKptr new_tokenizer(NETptr token_fst, char *in_string, FILE *in_stream, id_type token_boundary, id_type fail_token);
id_type next_alph_id(ALPH_ITptr alph_it);
STRING_BUFFERptr next_apply_output(APPLYptr applyer);
ARCptr next_iterator_arc(ARCITptr arc_it, void ** next, int *last_p);
STRING_BUFFERptr next_pattern_output(APPLYptr pattern_applyer);
NETptr next_token_net(TOKptr tok);

NETptr null_net(void);
void nv_add(NETptr net, NVptr nv);
NETptr nv_get(NVptr nv, int pos);
void nv_push(NETptr net, NVptr nv);
NETptr one_plus_net(NETptr net, int keep_p);
void optimize_arcs(NETptr net);
NETptr optional_net(NETptr net, int keep_p);
NETptr other_than_net(NETptr net, int keep_p);
NETptr pair_net(char *upper, char *lower);
id_type pair_to_id(char *upper, char *lower);
PAGEptr pattern_match_counts_to_page(APPLYptr pattern_applyer, PAGEptr page);
void print_alph(ALPHABETptr alph, FILE *stream);
int print_label(id_type id, FILE *stream, int escape_p);
void print_net(NETptr net, FILE *stream);
void print_page(PAGEptr page, FILE * stream);
int print_string_buffer(STRING_BUFFERptr str_buf, FILE *stream);
NETptr priority_union_net(NETptr net1, NETptr net2, int side, int keep_net1_p, int keep_net2_p);
PAGEptr properties_to_page(NETptr net, PAGEptr page);
NETptr read_lexc(char *filename);
int read_net_properties(NETptr net, char *file);
NETptr read_prolog(char *filename);
NETptr read_regex(char *regex_str);
NETptr read_spaced_text(char *filename);
NETptr read_text(char *filename);
void reclaim_cfsm(void);
void reclaim_lab_vector(LAB_VECTORptr lab_vect);
void reduce_labelset(NETptr net);
int remove_string_property(NETptr net, char *attribute);
NETptr repeat_net(NETptr net, int min, int max, int keep_p);
void reset_alph_iterator(ALPH_ITptr alph_it);
void reset_lab_vector(LAB_VECTORptr lab_vect);
void reset_page(PAGEptr page);
void reset_vector(VECTORptr vector);
int restart_tokenizer(TOKptr tok, char *string, FILE *stream);
NETptr reverse_net(NETptr net, int keep_p);
int save_defined_nets(char *filename);
int save_net(NETptr net, char *filename);
int save_nets(NVptr nv, char *filename);
int set_char_encoding(FST_CNTXTptr cntxt, int code);
void set_error_function(void (*fn)(char *message, char *function_name, int code));
void set_warning_function(void (*fn)(char *message, char *function_name, int code));
int share_arcs(NETptr net);
int shortest_string_to_page(NETptr net, int side, PAGEptr page);
PAGEptr sigma_to_page(NETptr net, PAGEptr page);
id_type single_to_id(char *name);
void spaces_to_page(int n, int watch_rm, PAGEptr page);

ALPH_ITptr start_alph_iterator(ALPH_ITptr alph_it, ALPHABETptr alph);
void start_arc_iterator(ARCITptr arc_it, void *state, void** next, int *last_p);
PAGEptr storage_info_to_page(PAGEptr page);
NETptr string_to_net(char *str, int byte_pos_p);
void string_to_page(char *str, int watch_rm, PAGEptr page);
NETptr substitute_label(id_type id, ALPHABETptr labels, NETptr net, int keep_p);
NETptr substitute_net(id_type id, NETptr insert, NETptr target, int keep_insert_p, int keep_target_p);
NETptr substitute_symbol(id_type id, ALPHABETptr list, NETptr net, int keep_p);
NETptr substring_net(NETptr net, int keep_p);
void switch_input_side(APPLYptr cntxt);
ALPHABETptr symbol_list(char *name);
PAGEptr symbol_list_to_page(char *name, PAGEptr page);
NETptr symbol_net(char *sym);
void symbol_to_page(FAT_STR name, PAGEptr page);
int test_alph_member(ALPHABETptr alph, id_type id);
int test_equal_alphs(ALPHABETptr alph1, ALPHABETptr alph2);
int test_equivalent(NETptr net1, NETptr net2);
int test_intersect(NETptr net1, NETptr net2);
int test_lower_bounded(NETptr net);
int test_lower_universal(NETptr net);
int test_non_null(NETptr net);
int test_sublanguage(NETptr net1, NETptr net2);
int test_upper_bounded(NETptr net);
int test_upper_universal(NETptr net);
PAGEptr time_to_page(long start, long end, PAGEptr page);
void uncompact_net(NETptr net);
int undefine_net(char *name);
int undefine_symbol_list(char *name);
ALPHABETptr union_alph(ALPHABETptr alph1, ALPHABETptr alph2, int keep_alph1_p, int keep_alph2_p);
NETptr union_net(NETptr net1, NETptr net2, int keep_net1_p, int keep_net2_p);
void unoptimize_arcs(NETptr net);
void unreduce_labelset(NETptr net);
NETptr unshare_arcs(NETptr net, int reclaim_p);
int unvectorize_net(NETptr net);
void update_net_labels_and_sigma(NETptr net);
id_type upper_id(id_type id);
NETptr upper_side_net(NETptr net, int keep_p);
int vector_element_at(void **element, int pos, VECTORptr *vector);
int vectorize_states(NETptr net, int min_num_arcs);
int watch_margin(PAGEptr page, int next_size);
PAGEptr words_to_page(NETptr net, int side, int escape_p, PAGEptr page);
int write_net_properties(NETptr net, char *file);
int write_prolog(NETptr net, char *filename);
int write_spaced_text(NETptr net, char *filename);

int write_text(NETptr net, char *filename);
NETptr zero_plus_net(NETptr net, int keep_p);

/* $Id: cfsm_api.h $ */
/* Copyright (c) 2008 by the Palo Alto Research Center. All rights reserved */
/*********************************************************************
**
** CFSM_API.H
** Lauri Karttunen
** Palo Alto Research Center
** June 2008
**
*********************************************************************/

/* This header file documents the data structures, constants and
   function prototypes that are made available in libcfsm. This is the
   only header file needed for the library. The order of presentation
   of structure and function definitions is fixed by the needs of a C
   compiler that must have a definition for every constant and data
   type before it is used in another definition. A human reader might
   find it useful to skip the beginning and jump right into the section
   on FUNCTION PROTOTYPES. */

#ifndef C_FSM_API
#define C_FSM_API

#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */

#include <stdio.h>

/*****************************************************
* DATA STRUCTURES and DEFINITIONS
*****************************************************/

#ifndef BIT_DEFINED
#define BIT_DEFINED
typedef unsigned short bit;
#endif

#ifndef BYTE_DEFINED
#define BYTE_DEFINED
typedef unsigned char byte;
#endif

#ifndef TRUE
#define TRUE 1
#endif

#ifndef FALSE
#define FALSE 0
#endif

#ifndef EPSILON
#define EPSILON 0 /* The symbol ID of the epsilon symbol. */
#endif

#ifndef OTHER
#define OTHER 1 /* The symbol ID for an unknown symbol. */
#endif

/********************
* CONSTANTS
*******************/

enum symbol_pair {UPPER=0, LOWER=1, BOTH_SIDES=2};
enum visit_marks {NOT_VISITED=0, IN_PROCESS=1, DONE=2};
enum escape_p {DONT_ESCAPE=0, ESCAPE=1};
enum obey_flags_p {DONT_OBEY=0, OBEY=1};
enum keep_p {DONT_KEEP=0, KEEP=1};
enum char_encoding {CHAR_ENC_UNKNOWN=0, CHAR_ENC_UTF_8=1, CHAR_ENC_ISO_8859_1=2};
enum alph_types {BINARY_VECTOR=0, LABEL_VECTOR=1};
enum watch_rm {DONT_WATCH_RM=0, WATCH_RM=1};
enum record_byte_pos {DONT_RECORD, RECORD};
enum flag_action {NO_ACTION=0, CLEAR_SETTING=1, POSITIVE_SETTING=2,
                  NEGATIVE_SETTING=3, UNIFY_TEST=4, DISALLOW_TEST=5,
                  REQUIRE_TEST=6, FAIL_ACTION=7, INSERT_SUBNET=8,
                  SET_TO_ATTR=9, EQUAL_ATTR_TEST=10, LIST_MEMBER=11,
                  APPLY_TRANSDUCER=12, APPLY_FUNCTION=13, EXCLUDE_LIST=14};
enum data_types {Unknown=0, Network=1, Alphabet=2, Integer=3, Other=4};

/**************
* SIZES
**************/

#define int16 short
#define uint16 unsigned short
#define int32 int
#define uint32 unsigned int

typedef long LONG;
typedef unsigned long ULONG;
typedef unsigned int UTF32;
typedef uint32 id_type;

#define MAX_LV 24 /* -- maximum number of bits in a label ID */

#define ID_EOS ((unsigned) (1 length (X)->pos (X)->lines (X)->string


STRING_BUFFERptr make_string_buffer(int length);
/* Makes a new thin string buffer */

STRING_BUFFERptr make_fat_str_buffer(int length);
/* Makes a new fat string buffer. */

void free_string_buffer(STRING_BUFFERptr str_buf);
/* Frees a string buffer of either type. */

void initialize_string_buffer(STRING_BUFFERptr str_buf);
/* Initializes a string buffer of either type. */

int append_string_to_buffer(char *str, STRING_BUFFERptr str_buf);
/* Appends a thin string to a thin string buffer. Returns the length
   of the string in the buffer. */

int append_label_to_buffer(id_type id, int escape_p, STRING_BUFFERptr str_buf);
/* Appends the label of the id to a thin string buffer. Returns the
   length of the string in the buffer. If escape_p is ESCAPE, special
   symbols such as newline symbols and tabs are escaped using standard
   Unix conventions. */

int append_fat_str_to_buffer(FAT_STR fs, STRING_BUFFERptr str_buf);
/* Appends a fat string to a fat string buffer. Returns the length of
   the fat string in the buffer. */

void assure_buffer_space(int length, STRING_BUFFERptr str_buf);
/* Makes sure that the string buffer has length amount of space left. */

int print_string_buffer(STRING_BUFFERptr str_buf, FILE *stream);
/* Prints the string of the string buffer into the stream. If the
   string is a fat string, it will be printed as a UTF-8 or Latin-1
   encoded C string depending on the character encoding mode. Returns
   the length of the printed string. */

/******************
* LAB_VECTOR
*****************/

typedef struct LAB_VECTOR {
  int length;
  int pos;
  id_type *array;
} LAB_VECTORtype, *LAB_VECTORptr;

#define LAB_VECTOR_length(X) (X)->length
#define LAB_VECTOR_pos(X) (X)->pos
#define LAB_VECTOR_array(X) (X)->array


/* Label vectors are used to store sequences of label IDs. */

LAB_VECTORptr make_lab_vector(int length);
/* Creates a label vector of the given length. */

void reclaim_lab_vector(LAB_VECTORptr lab_vect);

int append_to_lab_vector(id_type lab, LAB_VECTORptr lab_vect);
/* Appends a label to the next position in the label vector and
   increments the position counter. The length of the vector is
   adjusted if needed. Returns the new value of the position counter. */

int lab_vector_element_at(id_type *lab, int pos, LAB_VECTORptr lab_vect);
/* Sets *lab to the label in the given position. Returns 0 on success,
   1 if the position is outside the filled part of the vector. */

void increment_lab_vector(LAB_VECTORptr lab_vect);
/* Increments the position counter of the label vector. */

int decrement_lab_vector(LAB_VECTORptr lab_vect);
/* Decrements the position counter of the label vector. Returns 0 on
   success, 1 if the new position would be less than 0. */

void reset_lab_vector(LAB_VECTORptr lab_vect);
/* Resets the position counter of lab_vect to 0. */

typedef struct LAB_VECTOR_TABLE LAB_VECTOR_TABLEtype, *LAB_VECTOR_TABLEptr;

/******************
 * VECTOR *
 ******************/

typedef struct VECTOR {
  int length;
  int pos;
  void **array;
} VECTORtype, *VECTORptr;

/* Vectors are for storing sequences of objects of any kind. */

#define VECTOR_length(X) (X)->length
#define VECTOR_pos(X)    (X)->pos
#define VECTOR_array(X)  (X)->array
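The append and bounds-checking behaviour described for label vectors can be illustrated with a small stand-alone sketch. It is not the libcfsm implementation: the demo_ names, the doubling growth policy and the -1 return value on allocation failure are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int demo_id;          /* hypothetical stand-in for id_type */

typedef struct {
    int length;                        /* allocated slots */
    int pos;                           /* number of filled slots */
    demo_id *array;
} demo_lab_vector;

/* Allocates a vector with room for `length` labels (at least one). */
demo_lab_vector *demo_make(int length)
{
    demo_lab_vector *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    if (length < 1) length = 1;
    v->array = malloc((size_t)length * sizeof *v->array);
    if (v->array == NULL) { free(v); return NULL; }
    v->length = length;
    v->pos = 0;
    return v;
}

/* Appends a label, doubling the storage when it is full (assumed policy).
   Returns the new position counter, or -1 if memory runs out. */
int demo_append(demo_id lab, demo_lab_vector *v)
{
    assert(v != NULL);
    if (v->pos == v->length) {
        demo_id *grown = realloc(v->array,
                                 (size_t)v->length * 2 * sizeof *grown);
        if (grown == NULL) return -1;
        v->array = grown;
        v->length *= 2;
    }
    v->array[v->pos++] = lab;
    return v->pos;
}

/* Stores the label at `pos` in *lab. Returns 0 on success,
   1 if `pos` lies outside the filled part of the vector. */
int demo_element_at(demo_id *lab, int pos, const demo_lab_vector *v)
{
    assert(v != NULL && lab != NULL);
    if (pos < 0 || pos >= v->pos) return 1;
    *lab = v->array[pos];
    return 0;
}

void demo_free(demo_lab_vector *v)
{
    if (v != NULL) { free(v->array); free(v); }
}
```

The point of the sketch is the distinction between the allocated length and the position counter: only positions below the counter are considered filled.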

typedef struct VECTOR_ENUMERATOR VECT_ENUMtype, *VECT_ENUMptr;
typedef struct VECTOR_TABLE VECTOR_TABLEtype, *VECTOR_TABLEptr;

VECTORptr make_vector(int length);
/* Makes a vector of the given length. */

void free_vector(VECTORptr vector);

int append_to_vector(void *object, VECTORptr vector);
/* Appends object to the next position in the vector and increments
   the position counter. Allocates more vector space if needed.
   Returns the new value of the position counter. */

void reset_vector(VECTORptr vector);
/* Sets the position counter of vector to 0. */

int vector_element_at(void **element, int pos, VECTORptr *vector);
/* Sets *element to the object in the given position. Returns 0 on
   success, 1 if pos is greater or equal to the value of the vector’s
   position counter. */

typedef struct LAB_RING LAB_RINGtype, *LAB_RINGptr;

/*************************
 * LABEL
 ***************************/

typedef struct TUPLE {
  id_type *labels;  /* a pair of label IDs */
  id_type inverse;  /* it is equal to ID_NO_SYMBOL for labels of arity 1
                       and for tuples whose inverse has not been
                       computed; otherwise it contains the unique ID of
                       the inverse tuple. */
} TUPLEtype, *TUPLEptr;

#define TUPLE_labels(X) (X)->labels

typedef struct FLAG_DIACRITIC {
  int action;
  id_type attribute;
  id_type value;
} FLAG_DIACRtype, *FLAG_DIACRptr;

#define FLAG_DIACR_action(X)    (X)->action
#define FLAG_DIACR_attribute(X) (X)->attribute
#define FLAG_DIACR_value(X)     (X)->value

typedef struct LABEL {
  id_type id;         /* Unique ID for a symbol or symbol pair. */
  id_type other_id;   /* Place to store some related label ID. */
  short arity;        /* 1 for atomic label, 2 for pairs */
  void *data;         /* a cache for storing some information about
                         the label such as type or print name */
  FLAG_DIACRptr flag; /* NULL for labels that are not flag diacritics. */
  int convertable;    /* TRUE if case conversion makes sense. */
  int expands_other;  /* TRUE if covered by OTHER */
  int data_type;      /* The type of data stored in the data field:
                         0 = Unknown, 1 = Network, 2 = Alphabet,
                         3 = Integer, 4 = Other */
  union {
    FAT_STR name;     /* Name of an atomic label. */
    TUPLEptr tuple;   /* The tuple of an fstpair. */
  } content;
} LABELtype, *LABELptr;

/* There are two types of labels. Atomic labels have arity 1 and tuple
   labels have arity 2. The content field of an atomic label is its
   name represented as a fat string. The content field of a tuple
   label consists of a pair of label IDs for its atomic components. */

#define LABEL_id(X)            (X)->id
#define LABEL_other_id(X)      (X)->other_id
#define LABEL_arity(X)         (X)->arity
#define LABEL_data(X)          (X)->data
#define LABEL_flag(X)          (X)->flag
#define LABEL_name(X)          (X)->content.name
#define LABEL_tuple(X)         (X)->content.tuple
#define LABEL_convertable(X)   (X)->convertable
#define LABEL_expands_other(X) (X)->expands_other
#define LABEL_data_type(X)     (X)->data_type

int print_label(id_type id, FILE *stream, int escape_p);
/* Prints the label corresponding to the ID to a stream. If escape_p is
   DONT_ESCAPE, the label is printed literally. If escape_p is ESCAPE,
   special symbols such as newline symbols and tabs are printed in
   double quotes. In UTF8 mode (default), non-ASCII symbols are
   printed as UTF8 strings; in Latin-1 mode, symbols outside the
   Latin-1 region are printed in the format "\uXXXX" where XXXX is the
   hex value of the Unicode code point. */

#define fstpair_upper(X) TUPLE_labels(LABEL_tuple(X))[UPPER]
#define fstpair_lower(X) TUPLE_labels(LABEL_tuple(X))[LOWER]

/*************************
 * LABEL ID MAP
 ***************************/

typedef struct LABEL_ID_MAP LABEL_ID_MAPtype, *LABEL_ID_MAPptr;

/* A data structure for associating label names (fat strings) and the
   corresponding integer IDs. It contains a hash table that maps label
   names to their IDs and an array of labels. The label for an ID, for
   example 321, is located at the position 321 in the label array. The
   maximum number of label IDs used to be 65535; it is now 16777214
   (2^24 - 2).

   When the default label map is initialized, all the printable ASCII
   characters get a label and an ID that is the same as the integer
   value of the character. For example, the symbol ’A’ has the ID 65.
   Labels for other symbols and symbol pairs are created on demand.
   Special symbols that have fixed label IDs include EPSILON (ID 0)
   and OTHER (ID 1), the unknown symbol.

   Because symbol names are recorded as fat strings, they do not
   depend on the character encoding mode (utf-8 or iso-8859-1). */
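The numbers quoted in the comment above follow directly from the 24-bit label width (MAX_LV). The following sketch works through the arithmetic; the demo_ names are hypothetical and are not part of cfsm api.h.

```c
#include <assert.h>

#define DEMO_MAX_LV 24   /* bits in a label ID, mirroring MAX_LV */

/* Largest value that fits in DEMO_MAX_LV bits: 2^24 - 1 = 16777215.
   The header text gives this value for ID_NO_SYMBOL. */
unsigned long demo_no_symbol_id(void)
{
    return (1UL << DEMO_MAX_LV) - 1UL;
}

/* Maximum number of label IDs quoted in the comment: 2^24 - 2. */
unsigned long demo_max_label_ids(void)
{
    return (1UL << DEMO_MAX_LV) - 2UL;
}

/* In the default label map a printable ASCII character gets an ID
   equal to its integer value, so this is just a checked conversion. */
unsigned long demo_ascii_id(char c)
{
    assert(c >= 32 && c <= 126);   /* printable ASCII range */
    return (unsigned long)c;
}
```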

id_type single_to_id(char *name);
/* Converts the name string into a fat string and returns the
   corresponding symbol ID for the atomic label. Creates a new label
   and a new label ID if the label does not already exist. In UTF-8
   mode, it is assumed that the name is a UTF8-coded string; a warning
   message is generated if the string is not a valid UTF8 string. In
   Latin-1 mode the name is processed as a Latin-1 string. Any Unicode
   character may be represented in the format "\uXXXX" where X is a
   hex character. For example, single_to_id("\u20AC") returns an ID
   for the Euro symbol. */

id_type pair_to_id(char *upper, char *lower);
/* Returns the ID of the tuple label with upper and lower as the two
   components. The names are processed as either UTF8 strings or
   Latin-1 strings depending on the mode. If the names are equal
   strings, the result is a single label because A:A is treated as
   equivalent to A. */

id_type id_pair_to_id(id_type upper_id, id_type lower_id);
/* Returns the ID of the tuple label with upper_id and lower_id as the
   two components. The names are processed as either UTF8 strings or
   Latin-1 strings depending on the mode. If upper_id and lower_id are
   identical, the result is identical to them as well. */

LABELptr id_to_label(id_type id);
/* Returns the label corresponding to the id. */

id_type upper_id(id_type id);
/* Returns the upper id of a tuple label or the id itself if id refers
   to an atomic label. */

id_type lower_id(id_type id);
/* Returns the lower id of a tuple label or the id itself if id is an
   atomic label. */

/*******************
 * RANGE
 *******************/

typedef struct RANGE_RECORD RANGEtype, *RANGEptr;

/*******************
 * MATCH_TABLE
 *******************/

typedef struct MATCH_TABLE MATCH_TABLEtype, *MATCH_TABLEptr;

/**********************
 * ALPHABET
 *********************/

typedef struct ALPHABET {
  int len;         /* # of ALPH_items positions in use */
  int max;         /* actual size of ALPH_items */
  id_type *items;  /* Label IDs (LABEL_VECTOR), 0s and 1s (BINARY_VECTOR) */
  bit type:8;      /* 0 = BINARY_VECTOR, 1 = LABEL_VECTOR */
  bit in_use:8;    /* 0 = not in use, 1 = in use */
} ALPHABETtype, *ALPHABETptr;

/* An alphabet is a list of label IDs. The list can be represented in
   two ways, as a binary vector or as a list of label IDs. A binary
   vector is a list of zeros and ones where ones indicate that the ID
   corresponding to the position in the vector is a member of the
   alphabet. For example, if the position 65 in the vector contains 1,
   then the symbol with the ID 65 (the letter ’A’) is a member of the
   alphabet. The sigma alphabet of a network is kept in binary format
   for quick membership checking. The label alphabet of a network is
   maintained as a label vector. */

#define ALPH_type(X)   (X)->type
#define ALPH_items(X)  (X)->items
#define ALPH_item(X,Y) (X)->items[(Y)]
#define ALPH_len(X)    (X)->len
#define ALPH_max(X)    (X)->max
#define ALPH_in_use(X) (X)->in_use
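The difference between the two alphabet representations is easiest to see in a membership test. The sketch below is a simplified model, not library code: demo_alph and the demo_ functions are hypothetical, and the conversion routine is only one plausible way to build a binary vector from a label list.

```c
#include <assert.h>
#include <stdlib.h>

enum { DEMO_BINARY_VECTOR = 0, DEMO_LABEL_VECTOR = 1 };

typedef struct {
    int len;               /* positions in use */
    int type;              /* DEMO_BINARY_VECTOR or DEMO_LABEL_VECTOR */
    unsigned int *items;   /* IDs (label form) or 0/1 flags (binary form) */
} demo_alph;

/* Membership test: constant time for the binary form,
   linear search for the label form. */
int demo_alph_member(const demo_alph *a, unsigned int id)
{
    assert(a != NULL);
    if (a->type == DEMO_BINARY_VECTOR)
        return id < (unsigned int)a->len && a->items[id] == 1;
    for (int i = 0; i < a->len; i++)
        if (a->items[i] == id) return 1;
    return 0;
}

/* Builds a binary vector just long enough to hold the largest ID of
   the label list. Returns NULL if memory runs out. */
demo_alph *demo_label_to_binary(const demo_alph *labels)
{
    assert(labels != NULL && labels->type == DEMO_LABEL_VECTOR);
    unsigned int max_id = 0;
    for (int i = 0; i < labels->len; i++)
        if (labels->items[i] > max_id) max_id = labels->items[i];

    demo_alph *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->len = (int)max_id + 1;
    b->type = DEMO_BINARY_VECTOR;
    b->items = calloc((size_t)b->len, sizeof *b->items);
    if (b->items == NULL) { free(b); return NULL; }
    for (int i = 0; i < labels->len; i++)
        b->items[labels->items[i]] = 1;
    return b;
}

void demo_free_alph(demo_alph *a)
{
    if (a != NULL) { free(a->items); free(a); }
}
```

The binary form trades space (one slot per possible ID up to the largest member) for a membership check that needs no search, which is consistent with the header's remark that sigma alphabets are kept in binary format.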

ALPHABETptr make_alph(int len, int type);
/* Returns an alphabet of the specified length and type (either
   LABEL_VECTOR or BINARY_VECTOR). */

ALPHABETptr copy_alphabet(ALPHABETptr alph);
/* Returns a copy of the alphabet. */

void free_alph(ALPHABETptr alph);
/* Reclaims the alphabet. */

void print_alph(ALPHABETptr alph, FILE *stream);
/* Prints the alphabet into the stream. */

/**********************
 * ALPHABET ITERATOR
 *********************/

typedef struct ALPH_ITERATOR {
  int pos;
  int len;
  int type;
  id_type *items;
} ALPH_ITtype, *ALPH_ITptr;

/* An alphabet iterator returns the members of an alphabet one after
   another. The alphabet may be in BINARY or LABEL format. */

#define ALPH_IT_pos(X)   (X)->pos
#define ALPH_IT_type(X)  (X)->type
#define ALPH_IT_items(X) (X)->items
#define ALPH_IT_len(X)   (X)->len
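An iterator of this shape can serve both alphabet formats with a single position counter. The sketch below shows one way this could work, under the assumption that exhaustion is signalled by the value 16777215 that the header gives for ID_NO_SYMBOL; the demo_ names are hypothetical and the code is not taken from libcfsm.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NO_SYMBOL 16777215u   /* value the header gives for ID_NO_SYMBOL */

typedef struct {
    int pos;                       /* next position to inspect */
    int len;                       /* number of positions in items */
    int type;                      /* 0 = binary vector, 1 = label vector */
    const unsigned int *items;
} demo_alph_it;

/* Points the iterator at the first position of the given item array. */
void demo_start(demo_alph_it *it, const unsigned int *items, int len, int type)
{
    assert(it != NULL && items != NULL);
    it->pos = 0;
    it->len = len;
    it->type = type;
    it->items = items;
}

/* Returns the next member ID, or DEMO_NO_SYMBOL when none is left.
   In label format the stored value is the ID; in binary format the
   ID is the index of the next position that holds a 1. */
unsigned int demo_next(demo_alph_it *it)
{
    assert(it != NULL);
    if (it->type == 1) {
        if (it->pos < it->len) return it->items[it->pos++];
        return DEMO_NO_SYMBOL;
    }
    while (it->pos < it->len) {
        int id = it->pos++;
        if (it->items[id] == 1) return (unsigned int)id;
    }
    return DEMO_NO_SYMBOL;
}

/* Rewinds the iterator to the first position. */
void demo_reset(demo_alph_it *it)
{
    assert(it != NULL);
    it->pos = 0;
}
```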

ALPH_ITptr start_alph_iterator(ALPH_ITptr alph_it, ALPHABETptr alph);
/* Returns an initialized alphabet iterator for the given alphabet. If
   the first argument is NULL, a new alphabet iterator is created. */

id_type next_alph_id(ALPH_ITptr alph_it);
/* Returns the next member of the alphabet. If there are no more
   unseen IDs in the iterator, the return value is ID_NO_SYMBOL
   (16777215). */

void reset_alph_iterator(ALPH_ITptr alph_it);
/* Resets the alphabet iterator to the first symbol ID. */

void free_alph_iterator(ALPH_ITptr alph_it);
/* Frees the alphabet iterator only, not the alphabet that it iterates
   on. You need not call this function if the iterator has been
   statically allocated. */

/***********************
 * ARC_VECTOR
 ***********************/

typedef struct ARC_VECTOR AVtype, *AVptr;

/*********************
 * CH_NODE
 *********************/

typedef struct CH_NODE CH_NODEtype, *CH_NODEptr;

/*********************
 * PARSE_TABLE object *
 *********************/

typedef struct PARSE_TABLE PARSE_TBLtype, *PARSE_TBL;

/*******************
 * PROPERTY LIST *
 *******************/

/* The property list of a network is a list of attribute-value pairs.
   The attributes are represented as fat strings; the values may be
   different types of objects including strings, integers and lists.
   The property list is used to store the regular expression that was
   used to define the network. It also stores the defined list symbols
   that the network refers to. For example, the following sequence of
   commands

     define_regex_list("Vowel", "a e i o u");
     define_regex_list("VoicelessStop", "k p t");
     define_regex_net("Test","\"@L.VoicelessStop@\" \"@L.Vowel@\"");
     save(net("Test"), "test.net")

   causes the test net to be saved with the following property list:

     DEFINITION: "%"@L.VoicelessStop@%" %"@L.Vowel@%""
     DEFINED_LISTS: ( ( "VoicelessStop" "k" "p" "t" )
                      ( "Vowel" "a" "e" "i" "o" "u" ) )

   When the saved net is loaded from a file, the list definitions are
   restored from the property list. */

typedef struct IO_SYMBOL IO_SYMBOLtype, *IO_SYMBOLptr;
typedef struct IO_SYMBOL_PACKAGE IO_SYMBOL_PACKAGEtype, *IO_SYMBOL_PACKAGEptr;
typedef struct BYTE_BLOCK BYTE_BLOCKtype, *BYTE_BLOCKptr;
typedef struct SEQUENCE SEQUENCEtype, *SEQUENCEptr;
typedef struct OBJECT OBJECTtype, *OBJECTptr;

typedef struct PROP {
  FAT_STR attribute;
  OBJECTptr value;
  struct PROP *next;
} PROPtype, *PROPptr;

#define PROP_attr(X) (X)->attribute
#define PROP_val(X)  (X)->value
#define next_prop(X) (X)->next
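A property list of this kind is a singly linked list of attribute-value cells. The following sketch models it with plain C strings instead of fat strings and OBJECT values, which is a deliberate simplification. The demo_ names are hypothetical; the replace-on-add behaviour follows the description of add_string_property in this header, while everything else is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct demo_prop {
    char *attribute;
    char *value;
    struct demo_prop *next;
} demo_prop;

/* Returns a heap copy of s, or NULL if memory runs out. */
static char *demo_copy(const char *s)
{
    size_t n = strlen(s) + 1;
    char *c = malloc(n);
    if (c != NULL) memcpy(c, s, n);
    return c;
}

/* Adds attribute:value to the list, replacing any previous value of
   the attribute. Both strings are copied. Returns 0 on success, 1 on
   error. */
int demo_add_property(demo_prop **list, const char *attr, const char *value)
{
    assert(list != NULL && attr != NULL && value != NULL);
    char *v = demo_copy(value);
    if (v == NULL) return 1;
    for (demo_prop *p = *list; p != NULL; p = p->next) {
        if (strcmp(p->attribute, attr) == 0) {
            free(p->value);            /* old value is released */
            p->value = v;
            return 0;
        }
    }
    demo_prop *cell = malloc(sizeof *cell);
    char *a = demo_copy(attr);
    if (cell == NULL || a == NULL) { free(cell); free(a); free(v); return 1; }
    cell->attribute = a;
    cell->value = v;
    cell->next = *list;                /* new cell goes to the front */
    *list = cell;
    return 0;
}

/* Returns the stored value (not a copy), or NULL if not found. */
const char *demo_get_property(const demo_prop *list, const char *attr)
{
    for (; list != NULL; list = list->next)
        if (strcmp(list->attribute, attr) == 0) return list->value;
    return NULL;
}

void demo_free_properties(demo_prop *list)
{
    while (list != NULL) {
        demo_prop *next = list->next;
        free(list->attribute);
        free(list->value);
        free(list);
        list = next;
    }
}
```

Unlike get_string_property, which is documented as returning a freshly allocated string, the lookup in this sketch returns a pointer into the list, so the caller must not free it.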

/*****************
 * ARC
 *****************/

typedef struct ARC {
  struct STATE *destination;  /* Destination state */
  bit type_bit : 1;
  bit userflag1 : 1;
  bit visit_mark : 2;
  bit big_arc_flag : 1;
  bit userflag2 : 2;
  bit in_use: 1;
  id_type label : MAX_LV;     /* Label ID */
  struct ARC *next;           /* Pointer to the next arc */
} ARCtype, *ARCptr;

/* An arc consists of a pointer to a destination state, a label ID, a
   pointer to the next arc and various bit flags. Arcs are allocated
   from a global arc heap. When an arc is freed, it is put on the
   freelist of the heap. The in_use bit is 1 when the arc is in use,
   0 when it has been freed. */

#define ARC_type_bit(X)     (X)->type_bit
#define ARC_userflag1(X)    (X)->userflag1
#define ARC_visit_mark(X)   (X)->visit_mark
#define ARC_big_arc_flag(X) (X)->big_arc_flag
#define ARC_userflag2(X)    (X)->userflag2
#define ARC_in_use(X)       (X)->in_use
#define ARC_label(X)        (X)->label
#define ARC_destination(X)  (X)->destination
#define ARC_next(X)         (X)->next
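The arc layout above is a singly linked list hanging off a state, with the label stored in a 24-bit field. The sketch below models that layout with hypothetical demo_ types and shows an insertion routine that follows two rules stated for add_arc_to_state in this header: no duplicate arcs and no looping EPSILON arc. It uses malloc rather than a global arc heap, which is a simplification.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_LV 24      /* bits in a label ID, mirroring MAX_LV */
#define DEMO_EPSILON 0u     /* EPSILON has the fixed ID 0 */

struct demo_state;

typedef struct demo_arc {
    struct demo_state *destination;
    unsigned int in_use : 1;
    unsigned int label : DEMO_MAX_LV;
    struct demo_arc *next;
} demo_arc;

typedef struct demo_state {
    demo_arc *arcs;          /* first arc, NULL if the state has none */
    unsigned int final : 1;
} demo_state;

/* Adds an arc from start to dest unless it would duplicate an existing
   arc or be a looping EPSILON arc. Returns the arc, or NULL if nothing
   was added. */
demo_arc *demo_add_arc(demo_state *start, unsigned int id, demo_state *dest)
{
    assert(start != NULL && dest != NULL);
    assert(id < (1u << DEMO_MAX_LV));          /* must fit the bit field */
    if (id == DEMO_EPSILON && start == dest) return NULL;
    for (demo_arc *a = start->arcs; a != NULL; a = a->next)
        if ((unsigned int)a->label == id && a->destination == dest)
            return NULL;
    demo_arc *arc = malloc(sizeof *arc);
    if (arc == NULL) return NULL;
    arc->destination = dest;
    arc->in_use = 1;
    arc->label = id;
    arc->next = start->arcs;                   /* push on the arc list */
    start->arcs = arc;
    return arc;
}

/* Walks the arc list of a state and counts its arcs. */
int demo_count_arcs(const demo_state *s)
{
    int n = 0;
    for (const demo_arc *a = s->arcs; a != NULL; a = a->next) n++;
    return n;
}

void demo_free_arcs(demo_state *s)
{
    while (s->arcs != NULL) {
        demo_arc *next = s->arcs->next;
        free(s->arcs);
        s->arcs = next;
    }
}
```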

typedef struct BIG_ARC {
  struct STATE *destination;
  bit type_bit : 1;
  bit userflag1 : 1;
  bit visit_mark : 2;
  bit big_arc_flag : 1;
  bit userflag2: 2;
  bit in_use: 1;
  id_type label: MAX_LV;
  struct BIG_ARC *next;
  void *user_pointer;
} BIG_ARCtype, *BIG_ARCptr;

/* A "big arc" is an arc with an extra user_pointer field. If the
   big_arc_flag is 1 instead of 0, the arc has that extra field. A
   common use for big arcs is to record for each symbol in a network
   its starting byte position in a text file. */

#define ARC_user_pointer(X) (X)->user_pointer

void free_arc(ARCptr arc);
/* Returns the arc to the global arc or big-arc heap. */

/******************
 * STATE
 ******************/

typedef struct STATE {
  union {
    struct ARC *set;            /* First arc of the state */
    struct ARC_VECTOR *vector;  /* Vector of destination states */
  } arc;
  bit type_bit : 1;
  bit final : 1;
  bit deterministic : 1;
  bit vector_p : 1;             /* 0 = normal state, 1 = vectorized state */
  bit visit_mark : 8;
  bit userflag2 : 2;
  bit is_virtual: 1;
  bit in_use : 1;
  struct STATE *next;           /* Pointer to the next state */
  void *client_cell;
} STATEtype, *STATEptr;

/* A state contains a set of arcs leading to other states, a pointer
   to the next state, a set of bit flags, and a client_cell pointer
   that is used by various algorithms to store temporary information.
   The arc set of a state may be in one of two formats. The standard
   format that most cfsm algorithms expect is a list of arcs. In this
   case, state->arc points to the first arc of the state and each arc
   points to its successor, or to NULL in the case of the last arc.
   Alternatively, the arc set is represented by a structure that
   contains a vector of destination states. The vector format takes up
   more space than the standard format but it is faster to process
   because it provides random access to destination states by label
   IDs. States are allocated from a global state heap. The in_use flag
   is 1 when a state is in use and 0 when a state is returned to the
   heap. */

void free_state(STATEptr state);
/* Returns the state to the global state heap. */

#define STATE_type_bit(X)      (X)->type_bit
#define STATE_visit_mark(X)    (X)->visit_mark
#define STATE_final(X)         (X)->final
#define STATE_deterministic(X) (X)->deterministic
#define STATE_vector_p(X)      (X)->vector_p
#define STATE_arc_set(X)       (X)->arc.set
#define STATE_arc_vector(X)    (X)->arc.vector
#define STATE_userflag2(X)     (X)->userflag2
#define STATE_is_virtual(X)    (X)->is_virtual
#define STATE_in_use(X)        (X)->in_use

/*********************
 * NETWORK
 *********************/

typedef struct NETWORK {
  ALPHABETptr labels;  /* Label alphabet as LABEL_VECTOR */
  ALPHABETptr sigma;   /* Sigma alphabet as BINARY_VECTOR */
  struct {
    bit deterministic:1;
    bit pruned:1;
    bit completed:1;
    bit minimized:1;
    bit epsilon_free:1;
    bit sorted_states:1;
    bit loop_free:1;
    bit twol_net:1;
    bit visit_marks_dirty:1;
    bit names_matter:1;
    bit shared_arc_lists:1;
    bit has_arc_user_pointer:1;
    bit closed_sigma:1;
    bit start_state_final:1;
    bit obsolete1:1;
    bit compacted:1;
    bit obsolete2:1;
    bit obsolete3:1;
    bit mark:1;
    bit u_flag_diacr:1;
    bit l_flag_diacr:1;
    bit obsolete4:1;
    bit obsolete5:1;
    bit sorted_arcs:1;
    bit reduced_labelset:1;
    bit obsolete6:1;
    bit is_virtual:1;
    bit is_arc_optimized:1;
    bit in_use:1;
    bit has_arc_vectors:1;
    bit linear_bounded_upper:1;
    bit linear_bounded_lower:1;
  } flags;
  int16 arc_label_arity;
  id_type defined_as;
  id_type range_len;
  LABEL_ID_MAPptr label_map;
  RANGEptr uprange_map;
  RANGEptr downrange_map;
  MATCH_TABLEptr upmatch_table;
  MATCH_TABLEptr downmatch_table;
  ALPHABETptr recode_key;
  ALPHABETptr decode_key;
  ALPHABETptr unreduce_key;
  union {
    STATEptr state;
    unsigned char *loc;
  } start;
  union {
    STATEptr states;   /* First state of a standard network */
    void *block;       /* Arc block of a compacted network */
  } body;
  HEAPptr arc_vector_heap;
  int arc_vector_len;
  PROPptr networkprops;  /* First network property */
  PARSE_TBL upper_parse_table;
  PARSE_TBL lower_parse_table;
  ULONG num_states;      /* Number of states */
  ULONG num_arcs;        /* Number of arcs */
  ULONG block_size;
  ALPHABETptr flag_register;
  void *client_cell;
  void *mmap_handle;
  size_t mmap_size;
} NETtype, *NETptr;

/* A network is a large data structure. Many of the fields such as
   parse and match tables are used to cache data that is used by the
   apply routines. The body of a standard network is a pointer to the
   first state of a state list. Each state points to its successor on
   the list. The body of a compacted network is a pointer to the block
   of memory encoding the arcs and states in a space-efficient way.
   Most cfsm algorithms work only on standard networks. A network can
   be optimized in several ways either to save space or to increase
   the application speed. If the has_arc_vectors flag is 1, some of
   the states of the network are in the vectorized format. Other
   optimization flags are is_arc_optimized, reduced_labelset, and
   shared_arc_lists. A network has a unique start state. The start
   state does not have to be the first state of the list pointed to by
   body.states. Networks are allocated from a global heap. The in_use
   flag indicates whether a network is in use or whether it has been
   freed. The num_arcs field contains the number of arcs; the
   num_states field keeps track of the number of states in the
   network. */

#define NET_deterministic(X)        (X)->flags.deterministic
#define NET_pruned(X)               (X)->flags.pruned
#define NET_completed(X)            (X)->flags.completed
#define NET_minimized(X)            (X)->flags.minimized
#define NET_epsilon_free(X)         (X)->flags.epsilon_free
#define NET_sorted_states(X)        (X)->flags.sorted_states
#define NET_loop_free(X)            (X)->flags.loop_free
#define NET_visit_marks_dirty(X)    (X)->flags.visit_marks_dirty
#define NET_names_matter(X)         (X)->flags.names_matter
#define NET_shared_arc_lists(X)     (X)->flags.shared_arc_lists
#define NET_has_arc_user_pointer(X) (X)->flags.has_arc_user_pointer
#define NET_closed_sigma(X)         (X)->flags.closed_sigma
#define NET_start_state_final(X)    (X)->flags.start_state_final
#define NET_twol_net(X)             (X)->flags.twol_net
#define NET_compacted(X)            (X)->flags.compacted
#define NET_mark(X)                 (X)->flags.mark
#define NET_u_flag_diacr(X)         (X)->flags.u_flag_diacr
#define NET_l_flag_diacr(X)         (X)->flags.l_flag_diacr
#define NET_sorted_arcs(X)          (X)->flags.sorted_arcs
#define NET_reduced_labelset(X)     (X)->flags.reduced_labelset
#define NET_Kaplan_compressed(X)    (X)->flags.Kaplan_compressed
#define NET_is_virtual(X)           (X)->flags.is_virtual
#define NET_optimized(X)            (X)->flags.is_arc_optimized
#define NET_in_use(X)               (X)->flags.in_use
#define NET_has_arc_vectors(X)      (X)->flags.has_arc_vectors
#define NET_linear_bounded_upper(X) (X)->flags.linear_bounded_upper
#define NET_linear_bounded_lower(X) (X)->flags.linear_bounded_lower
#define NET_arc_label_arity(X)      (X)->arc_label_arity
#define NET_num_arcs(X)             (X)->num_arcs
#define NET_num_states(X)           (X)->num_states
#define NET_labels(X)               (X)->labels
#define NET_sigma(X)                (X)->sigma
#define NET_recode_key(X)           (X)->recode_key
#define NET_decode_key(X)           (X)->decode_key
#define NET_unreduce_key(X)         (X)->unreduce_key
#define NET_start_state(X)          (X)->start.state
#define NET_start_loc(X)            (X)->start.loc
#define NET_states(X)               (X)->body.states
#define NET_arc_block(X)            (X)->body.block
#define NET_arc_vector_heap(X)      (X)->arc_vector_heap
#define NET_arc_vector_len(X)       (X)->arc_vector_len
#define NET_range_len(X)            (X)->range_len
#define NET_label_map(X)            (X)->label_map
#define NET_properties(X)           (X)->networkprops
#define NET_upper_parse_table(X)    (X)->upper_parse_table
#define NET_lower_parse_table(X)    (X)->lower_parse_table
#define NET_uprange_map(X)          (X)->uprange_map
#define NET_downrange_map(X)        (X)->downrange_map
#define NET_upmatch_table(X)        (X)->upmatch_table
#define NET_downmatch_table(X)      (X)->downmatch_table
#define NET_mmap(X)                 (X)->mmap_handle
#define NET_mmap_size(X)            (X)->mmap_size

NETptr make_empty_net(void);
/* Returns a skeleton network structure with empty sigma and label
   alphabets but without an initial state. Use either null_net() or
   epsilon_net() to create a minimal network with an initial state. */

NETptr copy_net(NETptr net);
/* Returns a copy of the network. */

int minimize_net(NETptr net);
/* Destructively minimizes the network using Hopcroft’s algorithm.
   Returns 0 on success and 1 on error. As a preliminary step to
   minimization, the network is first pruned, epsilons are removed and
   the network is determinized. Minimization can only be done on
   standard networks, not on networks that have been compacted or
   vectorized. */

void free_network(NETptr net);
/* Returns the network to the global network heap. */

void print_net(NETptr net, FILE *stream);
/* Prints the states and arcs of the network into the stream. */

STATEptr add_state_to_net(NETptr net, int final_p);
/* Adds a new state to the network. If final_p is non-zero, the state
   is final. Returns the new state on success, NULL on failure. */

ARCptr add_arc_to_state(NETptr net, STATEptr start, id_type id,
                        STATEptr dest, void *user_pointer, int big_arc_p);
/* Creates a new arc from start to dest with the label id unless it
   would duplicate an existing arc. Does not add a looping EPSILON
   arc. The start and dest states must already exist in the network.
   The network must be a standard network, not vectorized or
   optimized. Updates the sigma and label alphabets of the network. If
   big_arc_p is non-zero, the new arc will have a user_pointer field.
   Returns the arc on success, NULL on failure. */

int read_net_properties(NETptr net, char *file);
/* Reads a list of attribute-value pairs from the file and adds them
   to the network’s property list. For example,

     NETWORKNAME: "Number-to-numeral converter"
     LARGEST_NUMBER: 99999

   If file is NULL, the input is obtained from stdin. */

int write_net_properties(NETptr net, char *file);
/* Writes the network’s property list into a file or to stdout if file
   is NULL. */

int add_string_property(NETptr net, char *attribute, char *value);
/* Adds the attribute:value pair to the network’s property list. Any
   previous value for the attribute is freed and replaced by the new
   value. Returns 0 on success, 1 on error. Both the attribute and the
   value are copied and converted to fat strings; they can be freed by
   the calling function if they have been malloc’d. */

char *get_string_property(NETptr net, char *attribute);
/* Returns the value of the attribute on the property list of the net,
   or NULL if it is not found. The value is a freshly allocated C
   string. It should be freed by the calling function when it is not
   needed anymore. */

int remove_string_property(NETptr net, char *attribute);
/* Removes the attribute and its value from the property list of the
   network. Returns 0 on success, 1 on error. */

/*****************
 * NET VECTOR *
 *****************/

typedef struct NET_VECTOR {
  int len;
  NETptr *nets;
} NVtype, *NVptr;

/* A data structure for storing one or more networks. The positions in
   a net vector are counted starting from 0. Thus NV_net(nv, 0) refers
   to the first network in the net vector nv. */

#define NV_len(X)   (X)->len
#define NV_nets(X)  (X)->nets
#define NV_net(X,Y) (X)->nets[(Y)]

NVptr make_nv(int len);
/* Returns a net vector of the specified length. */

NVptr net2nv(NETptr net);
/* Wraps the net inside a net vector of length 1 and returns the
   vector. */

NETptr nv_get(NVptr nv, int pos);
/* Returns the net in the given position in the net vector nv. Safer
   than the NV_net(nv,pos) macro because it makes sure that
   0 <= pos < NV_len(nv). Returns NULL on error. */

void nv_push(NETptr net, NVptr nv);
/* Pushes the net into the beginning of the net vector, increasing its
   length by 1. */

void nv_add(NETptr net, NVptr nv);
/* Appends the net to the end of the net vector, increasing its length
   by 1. */

void free_nv_only(NVptr nv);
/* Frees the net vector but not any of the nets it contains. */

void free_nv_and_nets(NVptr nv);
/* Frees both the networks contained in the vector and the net vector
   itself. */

typedef struct LOCATION_IN_NET LOCATIONtype, *LOCATIONptr;
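The difference between a bounds-checked accessor and a raw indexing macro, and between pushing to the front and adding to the end, can be sketched as follows. The demo_ names are hypothetical, the elements are untyped pointers standing in for networks, and the int return codes are an assumption (nv_push and nv_add are declared void in the header).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int len;
    void **nets;          /* stand-in for an array of NETptr */
} demo_nv;

/* Bounds-checked access: NULL unless 0 <= pos < len. */
void *demo_nv_get(const demo_nv *nv, int pos)
{
    assert(nv != NULL);
    if (pos < 0 || pos >= nv->len) return NULL;
    return nv->nets[pos];
}

/* Grows the vector by one slot. Returns 0 on success, 1 on failure. */
static int demo_nv_grow(demo_nv *nv)
{
    void **grown = realloc(nv->nets, (size_t)(nv->len + 1) * sizeof *grown);
    if (grown == NULL) return 1;
    nv->nets = grown;
    nv->len++;
    return 0;
}

/* Appends net to the end of the vector. */
int demo_nv_add(void *net, demo_nv *nv)
{
    assert(nv != NULL);
    if (demo_nv_grow(nv)) return 1;
    nv->nets[nv->len - 1] = net;
    return 0;
}

/* Inserts net at position 0, shifting the existing entries up by one. */
int demo_nv_push(void *net, demo_nv *nv)
{
    assert(nv != NULL);
    if (demo_nv_grow(nv)) return 1;
    memmove(nv->nets + 1, nv->nets, (size_t)(nv->len - 1) * sizeof *nv->nets);
    nv->nets[0] = net;
    return 0;
}
```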

typedef struct IO_SEQUENCE IO_SEQtype, *IO_SEQptr;
typedef struct IO_SEQUENCE_TABLE IO_SEQ_TABLEtype, *IO_SEQ_TABLEptr;

/**************************
 * STANDARD FILE HEADER *
 **************************/

/* Binary files created with save_net() start with a file header that
   records the creation date and other information in encrypted form.
   Some information is recorded as clear text: file date and a
   copyright string. */

typedef struct STANDARD_HEADER STANDARD_HEADER, *STANDARD_HEADERptr;

/*****************
 * FST_CONTEXT
 *****************/

typedef struct INTERFACE_PARAMETERS {
  struct regex {
    int lex_errors;
    int lex_max_errors;
  } regex;
  struct command_line {
    int quiet;
    int obey_ctrl_c;
    int stop;
    int want_deps;
  } command_line;
  struct alphabet {
    int print_pairs;
    int print_left;
    int read_left;
    int unicode;
    int recode_cp1252;
  } alphabet;
  struct general {
    int sort_arcs;
    int verbose;
    int completion;
    int stack;
    int name_nets;
    int minimal;
    int quit_on_fail;
    int assert;
    int show_escape;
    int sq_final_arcs;
    int sq_intern_arcs;
    int recursive_define;
    int recursive_apply;
    int compose_flag_as_special;
    int need_separators;
    int max_context_length;
    int vectorize_n;
    int fail_safe_composition;
  } general;
  struct optimization {
    int in_order;
  } optimization;
  struct io {
    int print_sigma;
    int print_space;
    int obey_flags;
    int mark_version;
    int retokenize;
    int show_flags;
    int max_state_visits;
    int count_patterns;
    int delete_patterns;
    int extract_patterns;
    int locate_patterns;
    int mark_patterns;
    int license_type;
    int char_encoding;
    int use_memory_map;
    int use_timer;
  } io;
  struct parameters {
    int interactive;
  } parameters;
  struct sequentialization {
    int final_strings_arcs;
    int intern_strings_arcs;
    int string_one;
  } seq;
} IntParType, *IntParPtr;

/* Interface parameters control many aspects of the cfsm application.
   For example, setting general.verbose to 0 suppresses the printing
   of all messages except for those due to an error. Interface
   parameters are part of the larger C_FSM_CONTEXT data structure. */

typedef struct ERROR_STREAM {
#ifdef Linux /* Darwin and Solaris don’t support open_memstream() */
  size_t buffer_size;
  char *buffer;
  FILE *memstream;
#else
  size_t dummy; /* To keep size of struct same in both systems */
  FILE *tempfile;
  char *buffer;
#endif
} ERROR_STREAM, *ERROR_STREAMptr;

/* The error stream is used to print messages when a runtime error
   occurs. */

/*****************
 * PAGE
 *****************/

typedef struct PAGE_OBJECT {
  int line_pos;
  int cur_pos;
  int line_no;
  int indent;
  int rm;
  int size;
  char *string;
  char *eol_string;
  char indent_char;
} PAGEtype, *PAGEptr;

/* Page objects are for storing formatted output. They are conceived
   as a sequence of lines with indentation and a right margin. The
   page writing routines keep track of the line position and insert an
   eol_string before the right margin is exceeded. The default
   eol_string is the default eol_string of the CFSM_CONTEXT, "\n". The
   size of the page grows as needed. */

#define PAGE_line_pos(X) (X)->line_pos
#define PAGE_cur_pos(X)  (X)->cur_pos
#define PAGE_line_no(X)  (X)->line_no
#define PAGE_indent(X)   (X)->indent
/* PAGE_rm is the width of the page in columns. If PAGE_rm is -1, the
   page has no right margin. */
#define PAGE_rm(X)          (X)->rm
#define PAGE_size(X)        (X)->size
#define PAGE_string(X)      (X)->string
#define PAGE_eol_string(X)  (X)->eol_string
#define PAGE_indent_char(X) (X)->indent_char


int watch_margin(PAGEptr page, int next_size);
/* Inserts an eol string if adding next_size to the line position
   exceeds the right margin of the page. */

int int_print_length(long i);
/* Returns the number of digits in the print representation of an
   integer. */

int label_length(LABELptr label, int escape_p);
/* Returns the number of bytes in the print representation of the
   label. If escape_p is ESCAPE, certain symbols such as newline
   characters are measured with surrounding double quotes. */

PAGEptr new_page(void);
/* Returns a new page with IY_INDENT indentation and IY_RIGHT_MARGIN
   as the right margin. The values of these macros are determined by
   the CFSM_CONTEXT. See below. */

PAGEptr make_page(int size, int indent, int rm);
/* Returns a new page with specified settings. */

void free_page(PAGEptr page);
/* Reclaims the page. */

void reset_page(PAGEptr page);
/* Resets the position and line counters to zero. */

void print_page(PAGEptr page, FILE * stream);
/* Prints the page into the stream. */

void new_page_line(PAGEptr page);
/* Inserts the eol_string at the current position on the page. */

void char_to_page(char c, PAGEptr page);
/* Appends the character to the page. */

void int_to_page(long i, int watch_rm, PAGEptr page);
/* Writes the integer to the page. If the second argument is WATCH_RM,
   an eol_string is inserted first if needed to avoid exceeding the
   right margin. If the second argument is DONT_WATCH_RM, the digits
   are written without watching the margin. */

void float_to_page(float f, int watch_rm, PAGEptr page);
/* Writes a floating point number to the page. */

void spaces_to_page(int n, int watch_rm, PAGEptr page);
/* Writes n spaces to the page. */

void string_to_page(char *str, int watch_rm, PAGEptr page);
/* Writes a C string to the page. */
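The interplay between the line position, the right margin and the eol string can be modelled in a few lines. The sketch below is a simplified stand-in, not the library's page implementation: it uses a fixed-size buffer instead of a growing one, always breaks with "\n", and the return value of demo_watch_margin (1 if a break was inserted) is an assumption, since the header does not document what watch_margin returns.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE_CAP 256

typedef struct {
    int line_pos;                 /* column of the next character */
    int cur_pos;                  /* index of the next character in string */
    int line_no;                  /* number of line breaks written so far */
    int rm;                       /* right margin in columns, -1 for none */
    char string[DEMO_PAGE_CAP];
} demo_page;

void demo_init_page(demo_page *p, int rm)
{
    assert(p != NULL);
    p->line_pos = p->cur_pos = p->line_no = 0;
    p->rm = rm;
    p->string[0] = '\0';
}

/* Appends one character and updates the position counters. */
void demo_char_to_page(char c, demo_page *p)
{
    assert(p != NULL && p->cur_pos < DEMO_PAGE_CAP - 1);
    p->string[p->cur_pos++] = c;
    p->string[p->cur_pos] = '\0';
    if (c == '\n') { p->line_pos = 0; p->line_no++; }
    else p->line_pos++;
}

/* Breaks the line if next_size more columns would exceed the margin.
   Returns 1 if a break was inserted, 0 otherwise. */
int demo_watch_margin(demo_page *p, int next_size)
{
    assert(p != NULL);
    if (p->rm >= 0 && p->line_pos > 0 && p->line_pos + next_size > p->rm) {
        demo_char_to_page('\n', p);
        return 1;
    }
    return 0;
}

/* Writes a C string, optionally checking the margin first. */
void demo_string_to_page(const char *s, int watch_rm, demo_page *p)
{
    assert(s != NULL);
    if (watch_rm) demo_watch_margin(p, (int)strlen(s));
    while (*s != '\0') demo_char_to_page(*s++, p);
}
```

With a right margin of 10, writing "hello", " " and "world" in turn keeps the first two on one line and moves "world" to the next, because 6 + 5 columns would exceed the margin.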


void fat_string_to_page(FAT_STR fs, PAGEptr page);
/* Writes a fat string to the page. */

void fat_string_to_page_esc(FAT_STR fs, char *esc, PAGEptr page);
/* Writes a fat string to the page. Characters on the esc list are
   printed with escapes. */

void symbol_to_page(FAT_STR name, PAGEptr page);
/* Writes a symbol name to the page with escapes for the following
   characters: '0' (literal zero), '?' (literal question mark),
   '%' (literal percent sign), ' ', '\t', '\n'. */

void label_to_page(id_type id, int escape_p, int watch_rm, PAGEptr page);
/* Writes a label to the page, either a single symbol or a pair of
   symbols separated by a colon. */

PAGEptr labels_to_page(NETptr net, PAGEptr page);
/* Writes the label alphabet of the network to the page and returns
   the page. If the page argument is NULL, a new page is created. */

PAGEptr sigma_to_page(NETptr net, PAGEptr page);
/* Writes the sigma alphabet of the network to the page and returns
   the page. If the page argument is NULL, a new page is created. */

PAGEptr network_to_page(NETptr net, PAGEptr page);
/* Writes the states and arcs of the network to the page and returns
   the page. If the page argument is NULL, a new page is created. */

PAGEptr words_to_page(NETptr net, int side, int escape_p, PAGEptr page);
/* Writes paths of the network to the page and returns the page. If
   the page argument is NULL, a new page is created. The side argument
   can be UPPER, LOWER, or BOTH. If the network is circular, a loop is
   traversed just once and the site of the loop is marked with three
   dots. The output format is controlled by the macros IY_OBEY_FLAGS,
   IY_SHOW_FLAGS, IY_PRINT_SPACE, and IY_PRINT_PAIRS. See the
   explanations below. */

int alphabet_to_page(ALPHABETptr alph, PAGEptr page);
/* Writes the alphabet to the page. Returns the number of items in the
   alphabet. */

PAGEptr flags_to_page(NETptr net, PAGEptr page);
/* Writes the network status flags to the page and returns the page.
   For example,
      Flags: deterministic, pruned, minimized, epsilon_free, loop_free.
   If the page argument is NULL, a new page is created. */

PAGEptr properties_to_page(NETptr net, PAGEptr page);
/* Writes the network property list to the page and returns the page.
   If the page argument is NULL, a new page is created. */

PAGEptr net_size_to_page(NETptr net, PAGEptr page);
/* Writes the size of the network to the page and returns the page.
   For example,
      660 bytes. 4 states, 3 arcs, 1 path. Label Map: Default.
   If the page argument is NULL, a new page is created. */

PAGEptr time_to_page(long start, long end, PAGEptr page);
/* Prints the difference between the end and start times to the page
   in terms of seconds, minutes, and hours, and returns the page. If
   the page argument is NULL, a new page is created. */

PAGEptr label_vector_to_page(LAB_VECTORptr lab_vect, PAGEptr page,
                             int escape_p, char *sep);
/* Prints the labels corresponding to the label IDs in the label
   vector to the page and returns the page. If the page argument is
   NULL, a new page is created. */

PAGEptr symbol_list_to_page(char *name, PAGEptr page);
/* Writes the members of the list defined as name to the page and
   returns the page as the value, or NULL if an error occurs. If the
   page argument is NULL, a new page is created. */

PAGEptr file_info_to_page(PAGEptr page);
/* Writes information to the page about the last network file that was
   either loaded or saved and returns the page. If the page argument
   is NULL, a new page is created. */

PAGEptr storage_info_to_page(PAGEptr page);
/* Writes information to the page about the storage used for states,
   arcs, and other managed data structures. If the page argument is
   NULL, a new page is created. */

int longest_string_to_page(NETptr net, int side, PAGEptr page);
/* Writes to the page the longest string in the network, that is, the
   string on the longest non-looping path from the start state to a
   final state. The side must be UPPER or LOWER. Epsilons are ignored.
   Returns the length of the string on success, -1 on error.
   Not implemented for vectorized or compacted networks.
*/

int shortest_string_to_page(NETptr net, int side, PAGEptr page);
/* Writes to the page the shortest string in the network, that is, the
   string on the shortest path from a start state to a final state.
   The side must be UPPER or LOWER. Epsilons are ignored. Returns the
   length of the string on success, -1 on error. Not implemented for
   vectorized or compacted networks. */

typedef struct CFSM_CONTEXT {
  int mode;
  int reclaimable;
  char *copyright_string;            /* COPYRIGHT_OWNER */
  int compose_strategy;
  int execution_error;
  int in_character_encoding;         /* CHAR_ENC_UTF_8 or CHAR_ENC_ISO_8859_1 */
  int out_character_encoding;        /* CHAR_ENC_UTF_8 or CHAR_ENC_ISO_8859_1 */
  ERROR_STREAM errorstream;
  IntParPtr interface;
  struct temporary_buffers {
    STRING_BUFFERptr string_buffer;
    STRING_BUFFERptr fat_str_buffer;
    PAGEptr page_buffer;
    LAB_VECTORptr lab_vector;
  } temp_bufs;
  struct flag_parameters {
    int keep_p;
    int determinize_p;
    int minimize_p;
    int prune_p;
    int reclaim_p;
    int embedded_command_p;
  } flags;
  struct pretty_print_parameters {
    int cur_pos;                     /* cur_pos */
    int indent;
    int line_pos;
    char *output_buffer;             /* output_buffer */
    int output_buffer_size;          /* OUTPUT_BUFFER_SIZE */
    int right_margin;
    char *eol_string;
  } pretty_print;
  struct path_index_data {
    int max_path_index_pos;          /* MAX_PATH_INDEX_POS */
    int path_index_incr;             /* PATH_INDEX_INCR */
    int path_index_pos;              /* PATH_INDEX_POS */
    long int *path_index_vector;     /* PATH_INDEX_VECTOR */
  } index;
  struct parse_parameters_and_data {
    int ignore_white_space_p;
    int zero_to_epsilon_p;
    int input_seq_size;              /* WORD_STRING_SIZE */
    id_type *input_seq;              /* INPUT_SEQ */
    id_type *lower_match;            /* LOWER_MATCH */
    id_type *match_table;            /* MATCH_TABLE */
    id_type *upper_match;            /* UPPER_MATCH */
    int obsolete_parse_tables;       /* PTBL_OBSOLETE */
  } parse;
  struct bin_io_parameters_and_data {
    int altchain_p;                  /* ALTCHAIN_P */
    int status_bar_p;                /* DISPLAY_READ_STATUS_BAR */
    int32 status_bar_increment;      /* STATUS_BAR_INCREMENT */
    uint32 arc_count;                /* ARC_COUNT */
    byte cur_byte;                   /* CUR_BYTE */
    STANDARD_HEADERptr last_header;  /* LAST_HEADER */
    STANDARD_HEADERptr next_header;  /* NEXT_HEADER */
    STATEptr cur_state;              /* CUR_STATE */
    STATEptr *state_stack;           /* STATE_STACK */
    char **attributes;               /* STANDARD_ATTRIBUTES */
    int attribute_count;             /* STANDARD_ATTRIBUTE_COUNT */
  } bin_io;
  struct define_data {
    HASH_TABLEptr net_table;         /* DEF_TABLE */
    HASH_TABLEptr set_table;
  } define;
} FST_CNTXT, *FST_CNTXTptr;

/* CFSM_CONTEXT is a large data structure that holds the interface
   parameters and many other types of data required by a cfsm
   application. An application based on this API should first call
   initialize_cfsm() to allocate a context structure. The structure is
   freed by reclaim_cfsm(). Many of the fields in the context data
   structure can be accessed using macros such as IY_PRINT_PAIRS and
   IY_EOL_STRING. See below for a complete list and explanations. In
   the fst application, many of these parameters can be reset from the
   command line with the 'set' command. */

#define FST_mode(X)                    (X)->mode
#define FST_reclaimable(X)             (X)->reclaimable
#define FST_copyright_string(X)        (X)->copyright_string
#define FST_execution_error(X)         (X)->execution_error

#define FST_compose_strategy(X)        (X)->compose_strategy

#define FST_string_buffer(X)           (X)->temp_bufs.string_buffer
#define FST_fat_str_buffer(X)          (X)->temp_bufs.fat_str_buffer
#define FST_page_buffer(X)             (X)->temp_bufs.page_buffer
#define FST_lab_vector(X)              (X)->temp_bufs.lab_vector

#define FST_keep_p(X)                  (X)->flags.keep_p
#define FST_determinize_p(X)           (X)->flags.determinize_p
#define FST_interactive_p(X)           (X)->flags.interactive_p
#define FST_last_errors_p(X)           (X)->flags.last_errors_p
#define FST_lex_errors_p(X)            (X)->flags.lex_errors_p
#define FST_minimize_p(X)              (X)->flags.minimize_p
#define FST_name_nets_p(X)             (X)->flags.name_nets_p
#define FST_obey_flags_p(X)            ((X)->interface).obey_flags_p
#define FST_prune_p(X)                 (X)->flags.prune_p
#define FST_sq_final_strings_arcs(X)   (X)->flags.sq_final_strings_arcs
#define FST_sq_intern_strings_arcs(X)  (X)->flags.sq_intern_strings_arcs
#define FST_sq_string_onelong(X)       (X)->flags.sq_string_onelong
#define FST_reclaim_p(X)               (X)->flags.reclaim_p
#define FST_recode_cp1252(X)           (X)->flags.recode_cp1252
#define FST_unicode_p(X)               (X)->flags.unicode_p
#define FST_verbose_p(X)               (X)->flags.verbose_p
#define FST_embedded_command_p(X)      (X)->flags.embedded_command_p

#define FST_cur_pos(X)                 (X)->pretty_print.cur_pos
#define FST_indent(X)                  (X)->pretty_print.indent
#define FST_line_pos(X)                (X)->pretty_print.line_pos
#define FST_output_buffer(X)           (X)->pretty_print.output_buffer
#define FST_output_buffer_size(X)      (X)->pretty_print.output_buffer_size
#define FST_right_margin(X)            (X)->pretty_print.right_margin
#define FST_eol_string(X)              (X)->pretty_print.eol_string

#define FST_max_path_index_pos(X)      (X)->index.max_path_index_pos
#define FST_path_index_pos(X)          (X)->index.path_index_pos
#define FST_path_index_vector(X)       (X)->index.path_index_vector

#define FST_ignore_white_space_p(X)    (X)->parse.ignore_white_space_p
#define FST_zero_to_epsilon_p(X)       (X)->parse.zero_to_epsilon_p
#define FST_input_seq_size(X)          (X)->parse.input_seq_size
#define FST_input_seq(X)               (X)->parse.input_seq
#define FST_lower_match(X)             (X)->parse.lower_match
#define FST_match_table(X)             (X)->parse.match_table
#define FST_upper_match(X)             (X)->parse.upper_match
#define FST_pars_tbl_obsolete(X)       (X)->parse.obsolete_parse_tables

#define FST_altchain_p(X)              (X)->bin_io.altchain_p
#define FST_status_bar_p(X)            (X)->bin_io.status_bar_p
#define FST_status_bar_increment(X)    (X)->bin_io.status_bar_increment
#define FST_arc_count(X)               (X)->bin_io.arc_count
#define FST_cur_byte(X)                (X)->bin_io.cur_byte
#define FST_last_header(X)             (X)->bin_io.last_header
#define FST_next_header(X)             (X)->bin_io.next_header
#define FST_cur_state(X)               (X)->bin_io.cur_state
#define FST_state_stack(X)             (X)->bin_io.state_stack
#define FST_attributes(X)              (X)->bin_io.attributes
#define FST_attribute_count(X)         (X)->bin_io.attribute_count

#define FST_defined_nets(X)            (X)->define.net_table
#define FST_defined_sets(X)            (X)->define.set_table
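The FST_ accessor macros above all follow one pattern: each expands to a member expression on the context pointer, so the result can be read or used as an assignment target. The sketch below illustrates that pattern on a deliberately reduced, hypothetical structure. MINI_CNTXT, the MINI_ macros and narrow_page() are illustration names only and are not part of the cfsm api.

```c
/* Reduced, hypothetical stand-in for CFSM_CONTEXT: only two of the
   pretty-print fields are modeled here. */
typedef struct MINI_CONTEXT {
    struct {
        int indent;
        int right_margin;
    } pretty_print;
} MINI_CNTXT;

/* Same shape as the FST_ macros: each expands to an lvalue. */
#define MINI_indent(X)        (X)->pretty_print.indent
#define MINI_right_margin(X)  (X)->pretty_print.right_margin

/* Narrows the usable width by `by` columns on each side and returns
   the resulting width. */
int narrow_page(MINI_CNTXT *cntxt, int by)
{
    MINI_indent(cntxt) += by;         /* macro used as assignment target */
    MINI_right_margin(cntxt) -= by;
    return MINI_right_margin(cntxt) - MINI_indent(cntxt);
}
```

Because the macros hide the nesting of the sub-structures, client code written this way does not have to change if a field moves from one sub-structure to another; only the macro definition does.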


FST_CNTXTptr get_default_context(void);
/* Returns a pointer to the structure allocated and initialized by
   initialize_cfsm(). */

int set_char_encoding(FST_CNTXTptr cntxt, int code);
/* Sets the character encoding mode of the cntxt. The code must be
   either CHAR_ENC_UTF_8 or CHAR_ENC_ISO_8859_1. Returns 0 on success,
   1 on error. */

#define IY_PRINT_PAIRS (int_parameters())->alphabet.print_pairs
/* If the value is 1, the apply routines display both the input and
   the output side of the labels matching the input. By default only
   the output side is shown. Default is 0. */

#define IY_PRINT_SIGMA (int_parameters())->io.print_sigma
/* If the value is 1, the sigma of a network is printed when the
   network is printed. Default is 1. */

#define IY_PRINT_SPACE (int_parameters())->io.print_space
/* If the value is 1, a space is printed to separate the symbols in
   the output of many display commands. Default is 0. */

#define IY_MAX_STATE_VISITS (int_parameters())->io.max_state_visits
/* The setting IY_MAX_STATE_VISITS determines the number of times the
   same state can be visited along a path. If the value is 1, loops
   are ignored. If the value is 2, a looping path is traversed just
   one time. Default is 1. */

#define IY_MINIMIZE_P (int_parameters())->general.minimal
/* The setting of IY_MINIMIZE_P is used by many network operations to
   decide whether the result should be minimized. If the value is 1,
   the function minimize_net() is called. If the value is 0, the
   result is not minimized. Default is 1. */

#define IY_RECURSIVE_DEFINE (int_parameters())->general.recursive_define
/* If the value is 1, a definition such as
   define_regex_net("A", "a | A b") yields the left-recursive language
   a b* instead of the union of a and A b. Default is 0. */

#define IY_VERBOSE (int_parameters())->general.verbose
/* If the value is 1, cfsm prints reports about its activities such as
   the opening and closing of files. If the value is 0, no messages
   other than error messages are printed. Default is 1. */

#define IY_OBEY_FLAGS (int_parameters())->io.obey_flags
/* If the value is 1, flag diacritic constraints are enforced. If the
   value is 0, flag diacritic symbols are treated as epsilons.
   Default is 1. */
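The effect of a visit bound such as IY_MAX_STATE_VISITS can be illustrated without the library. The sketch below counts the paths of a tiny hard-coded graph (0 -> 1, a loop on state 1, 1 -> 2, with state 2 final) while allowing each state to occur at most max_visits times on a path. The graph, the Arc type and count_paths() are assumptions made up for this illustration; they say nothing about how libcfsm itself enumerates paths.

```c
#include <stddef.h>

#define NUM_STATES 3

typedef struct { int from; int to; } Arc;

/* Toy network: 0 -> 1, a self-loop on 1, and 1 -> 2. */
static const Arc ARCS[] = { {0, 1}, {1, 1}, {1, 2} };
static const size_t NUM_ARCS = sizeof ARCS / sizeof ARCS[0];
static const int FINAL_STATE = 2;

/* Depth-first count of paths from `state` to the final state, where
   visits[s] records how often s already occurs on the current path. */
static int count_from(int state, int max_visits, int visits[NUM_STATES])
{
    if (visits[state] >= max_visits)
        return 0;                      /* visit bound reached: give up */
    visits[state]++;
    int count = (state == FINAL_STATE) ? 1 : 0;
    for (size_t i = 0; i < NUM_ARCS; i++)
        if (ARCS[i].from == state)
            count += count_from(ARCS[i].to, max_visits, visits);
    visits[state]--;                   /* backtrack */
    return count;
}

/* Number of paths from state 0 to the final state under the bound. */
int count_paths(int max_visits)
{
    int visits[NUM_STATES] = {0};
    return count_from(0, max_visits, visits);
}
```

With a bound of 1 the loop on state 1 is never taken and only the path 0-1-2 is counted. With a bound of 2 the loop is taken once, so two paths are found, which matches the description of the values 1 and 2 above.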

#define IY_SHOW_FLAGS (int_parameters())->io.show_flags
/* If the value is 1, flag diacritic symbols are displayed in the
   output. Default is 0. */

#define IY_COUNT_PATTERNS (int_parameters())->io.count_patterns
#define IY_DELETE_PATTERNS (int_parameters())->io.delete_patterns
#define IY_EXTRACT_PATTERNS (int_parameters())->io.extract_patterns
#define IY_LOCATE_PATTERNS (int_parameters())->io.locate_patterns
#define IY_MARK_PATTERNS (int_parameters())->io.mark_patterns
#define IY_NEED_SEPARATORS (int_parameters())->general.need_separators
/* The setting of these six parameters controls the output of pattern
   matching.
   If IY_COUNT_PATTERNS is 1, the number of instances of each matching
   pattern is recorded. Default is 1.
   If IY_DELETE_PATTERNS is 1, matching strings are omitted from the
   output. Default is 0.
   If IY_EXTRACT_PATTERNS is 1, only the matching strings are shown in
   the output. Default is 0.
   If IY_LOCATE_PATTERNS is 1, pattern matching produces standoff
   markup. (See the pmatch demo.) Default is 0.
   If IY_MARK_PATTERNS is 1, matching strings are enclosed between XML
   tags in the output, for example, John Smith. Default is 1.
   If IY_NEED_SEPARATORS is 1, pattern matching starts and ends at a
   separator character such as a space or a punctuation symbol.
   Default is 1. */

#define IY_MAX_CONTEXT_LENGTH (int_parameters())->general.max_context_length
/* Specifies the maximum length of a left context in pattern matching.
   Default is 64. */

#define IY_QUIT_ON_FAIL (int_parameters())->general.quit_on_fail
/* If the value is 1, the cfsm application quits on error unless it is
   in an interactive mode. */

#define IY_VECTORIZE_N (int_parameters())->general.vectorize_n
/* This variable specifies the minimum number of arcs a state must
   have in order to be vectorized by vectorize_net(). Default is 50. */

#define IY_USE_TIMER (int_parameters())->io.use_timer
/* If the value is 1, a timer is started for operations that might
   take a while to complete. */

#define IY_CHAR_ENCODING (int_parameters())->io.char_encoding
/* Determines the character encoding. The value must be either
   CHAR_ENC_UTF_8 or CHAR_ENC_ISO_8859_1. In UTF-8 mode, the string
   input functions read and the string output functions write strings
   in the UTF-8 format. In ISO-8859-1 mode, the input functions assume
   that the input strings are Latin-1 strings, except that the
   non-ISO-8859-1 symbols in Microsoft's CP1252 ("Windows Latin-1")
   are tolerated and quietly mapped to the proper Unicode symbols. */

#define IY_EOL_STRING (get_default_context())->pretty_print.eol_string
/* The end-of-line string for some printing applications.
   Default is "\n". */


#define IY_INDENT (get_default_context())->pretty_print.indent
/* Indentation for new_page(). Default is 0. */

#define IY_RIGHT_MARGIN (get_default_context())->pretty_print.right_margin
/* Right margin for new_page(). Default is 72. */

/*********************************
 * COMPACT NETWORK CONFIGURATION *
 *********************************/

typedef struct COMPACT_CONFIG COMP_CONFtype, *COMP_CONFptr;

/*************************
 * ARC_ITERATOR object   *
 *************************/

typedef struct ARC_ITERATOR ARCITtype, *ARCITptr;

ARCITptr init_arc_iterator(NETptr net, ARCITptr arc_it);
/* Initializes an arc iterator suitable for the particular type of
   net. If the second argument is NULL, a new arc iterator is
   created. */

void start_arc_iterator(ARCITptr arc_it, void *state, void **next,
                        int *last_p);
/* Initializes an arc iterator for a particular state. The last two
   arguments are needed to allow a single iterator to be used in a
   recursive descent through states. */

ARCptr next_iterator_arc(ARCITptr arc_it, void **next, int *last_p);
/* Returns a standard arc even from a state that has been vectorized
   and thus has no standard arcs. */

void free_arc_iterator(ARCITptr arc_it);
/* Reclaims the memory allocated to the arc iterator. */

/*****************
 * APPLY CONTEXT *
 *****************/

typedef struct APPLY_CONTEXT {
  int reclaimable;
  NETptr net1;             /* Net to be applied */
  NETptr net2;             /* Second network for bimachines -- not used now */
  NVptr net_vector;
  int side;                /* Input side: LOWER or UPPER */
  int out_side;            /* Output side: UPPER or LOWER */
  int obey_flags_p;        /* 1 = obey flag diacritics, 0 = don't obey */
  int print_space_p;       /* Separate output symbols by a space */
  int show_flags_p;        /* Show flag diacritics in the output */
  int flags_p;             /* 1 = Network has flag diacritics */
  int recursive_p;
  int eol_is_eof_p;
  int next_input_line_p;
  int need_separators_p;   /* separators required in apply_patterns(). */
  int count_patterns_p;    /* count pattern matches */
  int delete_patterns_p;   /* delete material that matches */
  int extract_patterns_p;  /* delete material that does not match */
  int locate_patterns_p;   /* locate begin end and tag */
  int one_tag_per_line_p;  /* print pattern locations on separate lines */
  int mark_patterns_p;     /* mark patterns with tags */
  int max_context_length;  /* maximal length of left and right context */
  int in_pos;              /* input position */
  int end_pos;             /* end of pattern match */
  int nv_pos;
  int level;
  int depth;
  int num_inputs;          /* number of processed inputs */
  char *eol_string;        /* defaults to "\n" */
  int end_of_input;        /* end of input file or string */
  int max_recursion_depth; /* apply_patterns is implemented as a recursion */
                           /* The role of this variable is to bound this */
                           /* recursion to avoid stack overflow (mainly */
                           /* for very general patterns that can go */
                           /* arbitrarily long). */
                           /* If set to -1 it won't be bounded */
  PARSE_TBL parse_table;   /* Maps input symbol to a symbol ID */
  int (*next_symbol_fn)(id_type *, void *);  /* fetches the next input ID */
  void (*write_buffer_fn)(void *);  /* Function to write into out_buffer */
  id_type (*in_fn)(id_type);        /* lower_id() or upper_id() */
  id_type (*out_fn)(id_type);       /* upper_id() or lower_id() */
  MATCH_TABLEptr match_table;
  unsigned char *input;             /* current input string */
  unsigned char *remainder;         /* remaining part of the input string */
  FILE *in_stream;                  /* input stream */
  FILE *out_stream;                 /* output stream */
  void *in_data;
  void *out_data;
  int out_count;                    /* output counter */
  void (*output_fn)(void *cntxt);   /* output function */
  LAB_VECTORptr in_vector;          /* vector for storing input IDs */
  LAB_VECTORptr mid_vector;
  LAB_VECTORptr out_vector;         /* vector for storing output IDs */
  LAB_VECTOR_TABLEptr in_table;
  LAB_VECTOR_TABLEptr out_table;
  ALPHABETptr sigma;
  ALPHABETptr prev_sigma;
  VECTORptr host_net_vector;
  ALPHABETptr flag_register;
  LAB_VECTORptr flag_vector;
  LAB_VECTORptr tag_vector;
  VECTORptr arc_vector;
  VECTORptr state_vector;
  VECTORptr destination_vector;
  VECTORptr start_vector;
  VECTORptr task_vector;
  VECTOR_TABLEptr pos_table;
  STRING_BUFFERptr out_buffer;
  STRING_BUFFERptr save_buffer;
  void *hyper_unit;
  unsigned long file_pos;
  LAB_VECTORptr other_than_vector;
  IO_SEQptr in_seq;
  IO_SEQptr out_seq;
  IO_SEQ_TABLEptr input_table;
  IO_SEQ_TABLEptr output_table;
  /* Net traversal call-back functions: */
  void* (*start_state_fn)(NETptr, void**, int*);
  id_type (*label_from_arc_fn)(NETptr, void**, int*, int*);
  void (*next_arc_fn)(NETptr, void**, int);
  void* (*destination_fn)(NETptr, void**);
  STATEptr solution_tree;  /* For storing the final result in a tree
                              instead of the table. */
  LAB_RINGptr input_ring;
  LOCATIONptr location_heap;
  /* Working space for apply_vectorized_network */
  HEAPptr task_heap;       /* Heap for iterative_apply_patterns() */
  STACKptr task_stack;     /* Stack for iterative apply_patterns() */
  STATEptr state;
  ARCITptr arc_it;
} APPLYtype, *APPLYptr;

/* An apply context is a very large data structure that is initialized
   for various types of apply operations such as apply_network(),
   apply_patterns(), and compose_apply(). The input to an apply
   operation is a string or a stream or a table of label vectors. The
   output is collected into a string buffer, into an array of label
   vectors or compiled into a network. The applied network may be of
   different types: a standard network, an optimized network, a
   network with a reduced labelset, a network containing vectorized
   states, a compacted network, a bimachine. The APPLY_CONTEXT data
   structure contains data fields for all the different flavors of
   apply. Any particular apply operation will only have use for some
   of them. */

#define APPLY_reclaimable(X)          (X)->reclaimable
#define APPLY_net1(X)                 (X)->net1
#define APPLY_net2(X)                 (X)->net2
#define APPLY_net_vector(X)           (X)->net_vector
#define APPLY_end_of_input(X)         (X)->end_of_input
#define APPLY_side(X)                 (X)->side
#define APPLY_out_side(X)             (X)->out_side
#define APPLY_obey_flags_p(X)         (X)->obey_flags_p
#define APPLY_print_space_p(X)        (X)->print_space_p
#define APPLY_show_flags_p(X)         (X)->show_flags_p
#define APPLY_flags_p(X)              (X)->flags_p
#define APPLY_recursive_p(X)          (X)->recursive_p
#define APPLY_eol_is_eof_p(X)         (X)->eol_is_eof_p
#define APPLY_eol_string(X)           (X)->eol_string
#define APPLY_next_input_line_p(X)    (X)->next_input_line_p
#define APPLY_in_pos(X)               (X)->in_pos
#define APPLY_end_pos(X)              (X)->end_pos
#define APPLY_nv_pos(X)               (X)->nv_pos
#define APPLY_parse_table(X)          (X)->parse_table
#define APPLY_next_symbol_fn(X)       (X)->next_symbol_fn
#define APPLY_write_buffer_fn(X)      (X)->write_buffer_fn
#define APPLY_in_fn(X)                (X)->in_fn
#define APPLY_out_fn(X)               (X)->out_fn
#define APPLY_in_stream(X)            (X)->in_stream
#define APPLY_out_stream(X)           (X)->out_stream
#define APPLY_in_data(X)              (X)->in_data
#define APPLY_out_data(X)             (X)->out_data
#define APPLY_out_count(X)            (X)->out_count
#define APPLY_output_fn(X)            (X)->output_fn
#define APPLY_match_table(X)          (X)->match_table
#define APPLY_input(X)                (X)->input
#define APPLY_remainder(X)            (X)->remainder
#define APPLY_in_vector(X)            (X)->in_vector
#define APPLY_sigma(X)                (X)->sigma
#define APPLY_host_net_vector(X)      (X)->host_net_vector
#define APPLY_prev_sigma(X)           (X)->prev_sigma
#define APPLY_mid_vector(X)           (X)->mid_vector
#define APPLY_out_vector(X)           (X)->out_vector
#define APPLY_in_table(X)             (X)->in_table
#define APPLY_out_table(X)            (X)->out_table
#define APPLY_flag_register(X)        (X)->flag_register
#define APPLY_flag_vector(X)          (X)->flag_vector
#define APPLY_tag_vector(X)           (X)->tag_vector
#define APPLY_arc_vector(X)           (X)->arc_vector
#define APPLY_state_vector(X)         (X)->state_vector
#define APPLY_dest_vector(X)          (X)->destination_vector
#define APPLY_task_vector(X)          (X)->task_vector
#define APPLY_out_buffer(X)           (X)->out_buffer
#define APPLY_save_buffer(X)          (X)->save_buffer
#define APPLY_hyper_unit(X)           (X)->hyper_unit
#define APPLY_other_than_vector(X)    (X)->other_than_vector
#define APPLY_in_seq(X)               (X)->in_seq
#define APPLY_out_seq(X)              (X)->out_seq
#define APPLY_input_table(X)          (X)->input_table
#define APPLY_output_table(X)         (X)->output_table
#define APPLY_start_state_fn(X)       (X)->start_state_fn
#define APPLY_label_from_arc_fn(X)    (X)->label_from_arc_fn
#define APPLY_next_arc_fn(X)          (X)->next_arc_fn
#define APPLY_destination_fn(X)       (X)->destination_fn
#define APPLY_solution_tree(X)        (X)->solution_tree
#define APPLY_need_separators_p(X)    (X)->need_separators_p
#define APPLY_max_context_length(X)   (X)->max_context_length
#define APPLY_input_ring(X)           (X)->input_ring
#define APPLY_location_heap(X)        (X)->location_heap
#define APPLY_count_patterns_p(X)     (X)->count_patterns_p
#define APPLY_delete_patterns_p(X)    (X)->delete_patterns_p
#define APPLY_extract_patterns_p(X)   (X)->extract_patterns_p
#define APPLY_locate_patterns_p(X)    (X)->locate_patterns_p
#define APPLY_one_tag_per_line_p(X)   (X)->one_tag_per_line_p
#define APPLY_mark_patterns_p(X)      (X)->mark_patterns_p
#define APPLY_file_pos(X)             (X)->file_pos
#define APPLY_level(X)                (X)->level
#define APPLY_depth(X)                (X)->depth
#define APPLY_num_inputs(X)           (X)->num_inputs
#define APPLY_max_recursion_depth(X)  (X)->max_recursion_depth
#define APPLY_start_vector(X)         (X)->start_vector
#define APPLY_pos_table(X)            (X)->pos_table
#define APPLY_task_heap(X)            (X)->task_heap
#define APPLY_task_stack(X)           (X)->task_stack
#define APPLY_state(X)                (X)->state
#define APPLY_out_pos(X)              (X)->out_vector->pos
#define APPLY_arc_it(X)               (X)->arc_it

/*****************
 * TOKENIZER     *
 *****************/

/* A tokenizer is an object that applies a tokenizing transducer to a
   string, returning a network containing one or more possible
   tokenizations of a section of the input string. For example, "Dr."
   could be a single token, an abbreviation of a title as in "Dr. No."
   It could also be another kind of single token, an abbreviation of
   "drive" as in "Mulholland Dr." A sentence-final abbreviation adds
   to the ambiguity because it loses the final period in front of a
   sentence-terminating period, as in "We met at Mulholland Dr."
   A tokenizer applies the tokenizer network to a string or a stream
   in breadth-first mode, pursuing all alternatives in parallel. At
   pinch points where all the alternative paths come together into a
   single state, it returns a network representing all the possible
   tokenizations of the input string up to that point. */

typedef struct TOKENIZER TOKtype, *TOKptr;

/* The data structure for a tokenizer. */

TOKptr new_tokenizer(NETptr token_fst, char *in_string, FILE *in_stream,
                     id_type token_boundary, id_type fail_token);
/* Returns a new tokenizer using token_fst either for the input in
   in_string or the file in_stream. One of the two arguments,
   in_string or in_stream, must be NULL, the other one must be
   specified. The tokens returned by the function are terminated by a
   token boundary, or by fail_token in case the token_fst fails to
   accept some section of the input. This can only happen if the
   lower-side language of the tokenizer fst is not the universal
   sigma-star language. The token boundary symbol must appear only on
   the upper side of the tokenizing transducer unless it is "\n".
   Returns a new tokenizer or NULL if an error occurs. */

TOKptr make_tokenizer(char *fst_file, char *in_string, char *in_file,
                      char *token_bound, char *fail_token);
/* Returns a new tokenizer obtained by calling new_tokenizer() with
   the expected arguments. One of the two arguments, in_string or
   in_file, must be NULL, the other one must be a string. The
   token_bound string is a token separator, such as "\n" or a special
   symbol such as "TB". The fail_token is a string such as
   "FAILED_TOKEN" to be used in extremis when all the alternative
   paths the tokenizer has pursued have failed. This can only happen
   if the lower-side language of the tokenizer fst is not the
   universal sigma-star language. Returns a new tokenizer or NULL if
   an error occurs. */

NETptr next_token_net(TOKptr tok);
/* Returns the next token network or NULL when the input has been
   exhausted. */

int restart_tokenizer(TOKptr tok, char *string, FILE *stream);
/* Restarts the tokenizer tok on a new input string. */

void free_tokenizer(TOKptr tok);
/* Frees the tokenizer tok and all its contents. */

/*****************************************************
 * FUNCTION PROTOTYPES
 *****************************************************/

/* Initialization and reclamation of CFSM context */

FST_CNTXTptr initialize_cfsm(void);
/* Allocates and initializes the CFSM_CONTEXT structure. Any
   application using this API needs to call this function before
   calling any cfsm functions or macros. */

void reclaim_cfsm(void);
/* Releases all the memory allocated to the CFSM_CONTEXT and all other
   data structures declared within the context such as stacks, heaps,
   networks, states, arcs and alphabets. */

FST_CNTXTptr get_default_context(void);
/* Returns a pointer to the structure allocated by initialize_cfsm().
   This function is called implicitly by macros such as
   IY_EOL_STRING. */

IntParPtr int_parameters(void);
/* Returns a pointer to the interface parameters component of the
   cfsm_context. This function is called implicitly by macros such as
   IY_CHAR_ENCODING. */

/* Binary output functions */

int save_net(NETptr net, char *filename);
/* Saves a single net in the Xerox proprietary binary format. Returns
   0 on success and an error code if something goes wrong. */

int save_nets(NVptr nv, char *filename);
/* Saves any number of nets packaged into a net vector. Returns 0 on
   success and an error code if something goes wrong. A net vector is
   created by calling make_nv(n) where n is the number of slots in the
   vector. The statement
      NV_net(nv, 0) = net;
   puts net into the first slot of the nv. */

int save_defined_nets(char *filename);
/* Saves all the networks that have been bound to a name using either
   the define_net() or define_regex_net() function. Returns 0 on
   success and an error code on failure. When the nets are loaded with
   the load_defined_nets() command, the definitions are restored. The
   definitions are not restored if the networks are loaded with the
   load_nets() command instead. */

/* Binary input functions */

NETptr load_net(char *filename);
/* Loads a single network from the file. If the file contains more
   than one network, only the first one is loaded. Returns the network
   on success and NULL on error. */

NVptr load_nets(char *filename);
/* Loads any number of networks from the file. Returns a net vector on
   success and NULL on error. */

int load_defined_nets(char *filename);
/* Loads any number of networks saved by save_defined_nets(). Each
   network has on its property list the name it was defined as. The
   definitions are restored. Returns 0 on success, 1 on error. Prints
   a warning message if some of the networks have not been defined. */


NETptr load_defined_net(char *name, char *filename);
/* Loads any number of networks saved by save_defined_nets() and
   restores the definitions. Returns a copy of the network with the
   given name if it is one of the networks, or NULL in case the file
   does not contain a defined network with that name. */

/* Text input functions */

NETptr string_to_net(char *str, int byte_pos_p);
/* Compiles a network from a string. If byte_pos_p is non-zero, the
   arcs of the network will be big arcs that have in their
   user_pointer field the byte position of the first symbol in their
   label. If byte_pos_p is zero, byte positions are not recorded and
   the arcs are normal arcs. */

NETptr read_text(char *filename);
/* Reads a text file line by line and converts each line to a path in
   a network. Returns the assembled network. For example, if the file
   consists of the two lines
      San Francisco
      London
   the resulting network is the same as the one returned by
      read_regex("{San Francisco} | {London}");
   If the first line of the file is
      # -*- coding: iso-8859-1 -*-
   the file is processed as a Latin-1 file even in UTF-8 mode.
   Similarly, if the file begins with
      # -*- coding: utf-8 -*-
   it is processed as a UTF-8 file regardless of the prevailing mode.
   The exclamation point, !, may be used instead of #. Any other lines
   starting with # or ! are ignored. */

NETptr read_spaced_text(char *filename);
/* Reads a transducer or a net with multicharacter symbols or both
   from the file line by line. Adjacent lines are read as a single
   path with the first one processed as the upper side of the path and
   the second one as the lower side of the path. For example, the
   following pair of lines
      l e a v e +Verb +Past
      l e f t
   will be compiled by read_spaced_text() into the same path as
   produced by
      read_regex("[l e a v e %+Verb %+Past]:{left}");
   Pairs of paths and single paths must be separated by an empty line:
      l e a v e +Verb +Past
      l e f t

      S a n %  F r a n c i s c o

      c i t y +Noun +Pl
      c i t i e s
   White space characters are interpreted as separators between
   non-white space symbols except when preceded by %. Thus the fourth
   line above gives the same result as
      read_regex("{San Francisco}");
   The first line of the input file is checked for a possible
   character encoding declaration. See the comment on read_text()
   above. */

NETptr read_regex(char *regex_str);
/* Compiles the regular expression string and returns the resulting
   network or a null fsm if an error occurs. See chapters 2 and 3 of
   the book Finite State Morphology by Kenneth R. Beesley and Lauri
   Karttunen for a description of the Xerox regular expression
   formalism. */

NETptr read_lexc(char *filename);
/* Compiles a file written in the lexc formalism. See Chapter 4 of the
   Beesley & Karttunen book about the lexc language. Returns the
   network or a null fsm on error. */

NETptr read_prolog(char *filename);
/* Compiles a network expressed as Prolog-style clauses. For example,
      # -*- coding: utf-8 -*-
      network(net_1b10d0).
      symbol(net_1b10d0, "a").
      arc(net_1b10d0, 0, 1, "?").
      arc(net_1b10d0, 0, 1, "b").
      arc(net_1b10d0, 0, 1, "c").
      arc(net_1b10d0, 0, 1, "d").
      arc(net_1b10d0, 1, 2, "b":"c").
      arc(net_1b10d0, 2, 3, "d":"0").
      final(net_1b10d0, 3).
   is the Prolog-style representation of the network compiled with
      read_regex("\a b:c d:0");
   The network(net_1b10d0) clause gives the network an arbitrary name
   that is the first component of every subsequent clause. The arc
   clauses are of the form arc(, , , ). A symbol such as "a" here that
   is part of the sigma alphabet of the network but does not appear as
   an arc label is declared explicitly as a symbol. State 3 is the
   only final state of the network. "?" denotes the unknown symbol,
   "0" stands for an epsilon. "%?" is the literal question mark, "%0"
   the digit zero. */

/* Text output functions */

int write_text(NETptr net, char *filename);
/* Outputs a simple network in the text format expected by the
   read_text() function. (See above.) If the network is circular, the
   function aborts with an error message.
   If the network is a transducer or if it contains multicharacter
   symbols, the function aborts with an error message referring the
   user to the read_spaced_text() function. The return value is 0 on
   success, 1 on error. If the second argument is NULL, the output
   goes to stdout. */

int write_spaced_text(NETptr net, char *filename);
/* Outputs a simple network or a transducer in the text format
   expected by the read_spaced_text() function. (See above.) If the
   network is circular, the function aborts with an error message.
   Returns 0 on success, 1 on failure. If the second argument is NULL,
   the output goes to stdout. */

int write_prolog(NETptr net, char *filename);
/* Outputs a network as Prolog-style clauses in the format expected by
   read_prolog(). (See the comment above.) Returns 0 on success, 1 on
   failure. */

/* Optimizations */

int vectorize_states(NETptr net, int min_num_arcs);
/* Destructively modifies the network for speed at the cost of
   increasing the size. Every state with min_num_arcs or more arcs has
   its arc list replaced by a vector that provides random access to
   the destination states of the original arcs. Vectorized networks
   can only be used for apply and pattern matching operations, not for
   calculus operations such as union and intersection. A vectorized
   network cannot be saved without first unvectorizing it. Returns the
   number of vectorized states. */

int unvectorize_net(NETptr net);
/* Destructively modifies the network by restoring the original arc
   lists of all vectorized states. The network will become a standard
   network again. Returns the number of modified states. */

void optimize_arcs(NETptr net);
/* Destructively modifies the network by a heuristic algorithm that
   tries to reduce the number of arcs while possibly increasing the
   number of states. An arc-optimized network cannot be used for
   calculus operations other than composition. It can be saved in the
   optimized format. */

void unoptimize_arcs(NETptr net);
/* Destructively modifies an arc-optimized network, turning it back to
   the standard format. */

int share_arcs(NETptr net);
/* Destructively modifies an arc-optimized network by letting chains
   of arcs be physically shared by several states. It reduces the size
   and improves the application speed of an arc-optimized network. The
   arc-sharing operation can only be done with a network processed by
   optimize_arcs(). A network with shared arcs cannot be saved in that
   format. Returns 0 on success, 1 on failure. */


NETptr unshare_arcs(NETptr net, int keep_p);
/* Undoes the work of share_arcs() by making a copy of the network.
   The returned network is a standard network, not an arc-optimized
   one. Hence the network can be saved into a file. The second
   argument should be DONT_KEEP if the input network can be reclaimed.
   If it should be kept, the second argument should be KEEP. */

void reduce_labelset(NETptr net);
/* Destructively modifies the network for the purpose of reducing the
   number of arcs. It partitions the label alphabet of the network
   into equivalence classes and eliminates all arcs that are not
   labeled by the first symbol of some equivalence class. For example,
   in the network compiled with read_regex("[a|b|c] x [a|b|c]"); the b
   and c arcs can be eliminated and represented by the a arcs. A
   network with a reduced labelset can only be used for the apply
   operation. It can be saved into a file and loaded back preserving
   the equivalence class information. A network with a reduced
   labelset can be further optimized with optimize_arcs(). */

void unreduce_labelset(NETptr net);
/* Destructively modifies the network by undoing the work of
   reduce_labelset(). */

void compact_net(NETptr net);
/* Destructively modifies the network by compacting it into a form
   that has no state or arc structures. This operation is sometimes
   referred to as "Karttunen compaction" to distinguish it from
   another compaction scheme known as "Kaplan compression" that is no
   longer available in a C implementation. Compaction reduces the size
   of the network in memory but it significantly slows the speed of
   application. The only operation possible on compacted networks is
   apply. Compacted networks can be saved into and loaded from a
   file. */

void uncompact_net(NETptr net);
/* Destructively modifies the network by undoing the effects of
   compact_net(). Converts the network into the standard format with
   standard arcs and states. */

/* Applying transducers to strings and streams */

APPLYptr init_apply(NETptr net, int side, FST_CNTXTptr cfsm_cntxt);
/* Returns a pointer to an apply context initialized for the net and
   the given input side, or NULL if an error occurs. This is the
   simplest of several functions that return an "apply context"
   object. The context it returns is not initialized for any input. To
   use it on a string, call apply_to_string(). */


char *apply_to_string(char *input, APPLYptr applyer);
/* Returns the result of calling the applyer on the given string
   input, NULL on error. The applyer must have been initialized to
   work on a given side of a given network. A non-NULL result is
   terminated with the applyer’s end-of-line string, "\n" by default.
   If the application results in an empty string, the return value
   consists of the end-of-line string. If the input string is not
   recognized, the return value is an empty string without the
   end-of-line marker. The returned string is volatile memory. It will
   be overwritten by the next call. */

void switch_input_side(APPLYptr cntxt);
/* Switches the input side of an applyer object from UPPER to LOWER
   and from LOWER to UPPER. */

APPLYptr new_applyer(NETptr net, char *string, FILE *stream,
                     int input_side, FST_CNTXTptr cfsm_cntxt);
/* Returns a pointer to an apply context that is initialized for
   applying the net to either a given string or to a given stream. If
   the input is from a string, the stream argument must be NULL, and
   vice versa. One of the two arguments must be non-NULL. The input
   side must be UPPER or LOWER. If the last argument is NULL, the
   default context, the one created by initialize_cfsm(), is chosen.
   Returns a pointer to the apply context, or NULL if an error
   occurs. */

APPLYptr make_applyer(char *fst_file, char *in_string, char *in_file,
                      int input_side, FST_CNTXTptr cfsm_cntxt);
/* Loads a network from the fst_file, opens the in_file if it is
   given, calls new_applyer() with these parameters and returns the
   apply context, or NULL if an error occurs. */

STRING_BUFFERptr next_apply_output(APPLYptr applyer);
/* Returns a string buffer containing the output strings of the next
   application in the given apply context. If the input is from a
   string, the entire string is consumed. If the input is from a
   stream, the input consists of the next line ignoring the final
   eol_string. Returns NULL if the input has been exhausted. Otherwise
   the return value is a pointer to the output buffer of the applyer
   containing any number of output strings, possibly none, for the
   last input, separated by the eol_string of the applyer (default =
   "\n"). The output can be displayed with print_string_buffer(). The
   next call to next_apply_output() will overwrite the previous
   output, so the calling function should either print the result
   immediately or keep it, for example, by calling
   string_to_page(STRING_BUFFER_string(next_apply_output(applyer)),
   page). */

APPLYptr new_pattern_applyer(NETptr net, char *in_string, FILE *in_stream,
                             FILE *out_stream, int input_side,
                             FST_CNTXTptr cfsm_cntxt);
/* Constructs a pattern matching applyer that is initialized for
   applying the pattern net to either an input string or to an input
   stream. If the input is from a string, the stream argument must be
   NULL and vice versa. One of the two, in_string and in_stream, must
   be non-NULL. The output may be written into a stream or into an
   internal buffer. The function next_pattern_output() applies the
   patterns and returns the output buffer. In either case, the entire
   input is consumed by a single call to apply_patterns(). The input
   side must be UPPER or LOWER. If the last argument is NULL, the
   default context, the one created by initialize_cfsm(), is chosen.
   The output mode is controlled by six macros: IY_COUNT_PATTERNS,
   IY_DELETE_PATTERNS, IY_EXTRACT_PATTERNS, IY_LOCATE_PATTERNS,
   IY_MARK_PATTERNS, IY_NEED_SEPARATORS. (See the description above.)
   The pattern network should contain paths that end with a symbol
   pair of the type "":0 or 0:"" where the epsilon (0) is on the input
   side of the network. Returns a pointer to the pattern applyer, or
   NULL if an error occurs. */

APPLYptr make_pattern_applyer(char *fst_file, char *in_string,
                              char *in_file, char *out_file,
                              int input_side, FST_CNTXTptr cfsm_cntxt);
/* Of the three file names, fst_file, in_file, out_file, only fst_file
   is obligatory. If in_file is NULL, in_string must not be NULL, and
   vice versa. If the out_file argument is NULL, the output is written
   into an internal buffer. If the specified file or files are
   successfully opened, makes a call to new_pattern_applyer(). Returns
   either a pointer to a pattern applyer or NULL if an error
   occurs. */

STRING_BUFFERptr next_pattern_output(APPLYptr pattern_applyer);
/* Calls apply_patterns() and returns a pointer to the
   pattern_applyer->save_buffer where the output of the pattern
   application is stored if an output file is not specified. */

int apply_patterns(APPLYptr pattern_applyer);
/* Applies a pattern applyer created by new_pattern_applyer().
   Returns 0. */

void init_apply_to_string(char *input, APPLYptr apply_context);
/* Initializes an applyer object created by new_applyer() or
   new_pattern_applyer() for a new input string. */

void init_apply_to_stream(FILE *stream, APPLYptr apply_context);
/* Initializes an applyer object created by new_applyer() or
   new_pattern_applyer() for a new input stream. */

PAGEptr pattern_match_counts_to_page(APPLYptr pattern_applyer,
                                     PAGEptr page);
/* Prints to the page the number of matches for each pattern
   obtained by the last application of the pattern applyer and returns
   the page. If the page argument is NULL, a new page is created. */

void free_applyer(APPLYptr applyer);
/* Reclaims the apply context created by new_applyer() or
   new_pattern_applyer(). Does not reclaim the network it contains. */

/* Unary Tests */

int test_lower_bounded(NETptr net);
/* Returns 1 if the lower side of the network has no epsilon loops,
   otherwise 0. */

int test_upper_bounded(NETptr net);
/* Returns 1 if the upper side of the network has no epsilon loops,
   otherwise 0. */

int test_non_null(NETptr net);
/* Returns 1 if the network is not a null fsm, that is, a network that
   has no reachable final state, otherwise 0. */

int test_upper_universal(NETptr net);
/* Returns 1 if the upper side of the network is the universal
   (sigma-star) language that contains any string of any length,
   including the empty string, otherwise 0. */

int test_lower_universal(NETptr net);
/* Returns 1 if the lower side of the network is the universal
   (sigma-star) language that contains any string of any length,
   including the empty string, otherwise 0. */

/* Binary network tests */

int test_equivalent(NETptr net1, NETptr net2);
/* Returns 1 if net1 and net2 are structurally equivalent, otherwise
   0. Two networks are structurally equivalent just in case they have
   the same arity, the same sigma and label alphabet, the same number
   of arcs and states and equivalent paths. If the arity is 1 and the
   networks are structurally equivalent, they encode the same
   language. If net1 and net2 are structurally equivalent transducers,
   they encode the same relation. If two transducers are not
   structurally equivalent, they may nevertheless encode the same
   relation by having epsilons in different places. The equivalence of
   transducers is not decidable in the general case. */

int test_sublanguage(NETptr net1, NETptr net2);
/* Returns 1 if the language or relation of net1 is a subset of the
   language or relation of net2. The test is correct when net1 and
   net2 encode simple languages but it is not generally correct for
   transducers for the reason explained above. */

int test_intersect(NETptr net1, NETptr net2);
/* Returns 1 if the languages or relations encoded by net1 and net2
   have strings or pairs of strings in common. The test is correct for
   simple networks but not generally correct for transducers for the
   reason explained above. */

/* Definitions */

int define_net(char *name, NETptr net, int keep_p);
/* Binds the name to the network. If the keep_p flag is KEEP, the name
   is bound to a copy of the network. The name can be used in a
   regular expression to splice in a copy of the defined network.
   Returns 0 on success, 1 on error. */

int define_regex_net(char *name, char *regex);
/* Compiles the regular expression and binds the name to it by calling
   define_net(). Returns 0 on success, 1 on error. */

int undefine_net(char *name);
/* Frees the network the name is bound to and unbinds the name.
   Returns 0 on success, 1 on error. */

NETptr get_net(char *name, int keep_p);
/* Returns the network the name is bound to, or its copy if keep_p is
   KEEP. Returns NULL if the name is undefined. */

NETptr net(char *name);
/* Returns get_net(name, DONT_KEEP). */

int define_regex_list(char *name, char *regex);
/* Compiles the regular expression and binds the name to the sigma
   alphabet of the resulting network by calling define_symbol_list().
   The rest of the network structure is reclaimed. Returns 0 on
   success, 1 on error. A list name can be used in a regular
   expression to refer to the union of the symbols it contains. For
   example, define_regex_list("Vowel", "a e i o u") binds Vowel to the
   alphabet containing the five vowels. Given this definition,
   read_regex("Vowel") is equivalent to read_regex("a|e|i|o|u"). Names
   that are bound to a list may also be used in so-called "list
   flags", special symbols of the form @L.name@ and @X.name@ where L
   means 'member of the list' and X means 'excluding members of the
   list'. The apply operations recognize an arc labeled "@L.Vowel@" as
   standing for any member of the list Vowel. In contrast, calculus
   operations do not currently assign any special interpretation to
   list flags. */

int define_symbol_list(char *name, ALPHABETptr alph, int keep_p);
/* Binds the name to the alphabet, or to its copy if keep_p is KEEP.
   Returns 0 on success, 1 on error. */

ALPHABETptr get_symbol_list(char *name, int keep_p);
/* Returns the alphabet the name is bound to, or its copy if keep_p is
   KEEP. Returns NULL if the name is not bound to an alphabet. */

ALPHABETptr symbol_list(char *name);
/* Returns get_symbol_list(name, DONT_KEEP). */

int undefine_symbol_list(char *name);
/* Unbinds the name and reclaims the alphabet it was bound to. Returns
   0 on success and 1 on error. */

int define_function(char *fn_call, char *regex);
/* Compiles the regular expression and binds it to the function call.
   For example, define_function("Double(X)", "X X"); creates a simple
   function that concatenates the argument to itself. Functions are
   used in regular expressions. Given the definition of "Double(X)",
   read_regex("Double(a)"); is equivalent to read_regex("a a"). See
   the piglatin application for examples of more interesting function
   definitions. */

/* Primitive network constructors */

NETptr null_net(void);
/* Returns a network consisting of a single non-final state. It
   encodes the null language, a language that contains nothing, not
   even the empty string. Equivalent to read_regex("\?"); */

NETptr epsilon_net(void);
/* Returns a network consisting of a single final state. It encodes
   the language consisting of the empty string. Equivalent to
   read_regex("[]"); */

NETptr kleene_star_net(void);
/* Returns a network consisting of a single final state with a looping
   arc for the unknown symbol. It encodes the universal ("sigma star")
   language. Equivalent to read_regex("?*"); */

NETptr kleene_plus_net(void);
/* Returns a network that encodes the universal language minus the
   empty string. Equivalent to read_regex("?+"); */

NETptr label_net(id_type id);
/* Returns a network that encodes the string or a pair of strings
   represented by the id. */


NETptr symbol_net(char *sym);
/* Returns label_net(single_to_id(sym)); */

NETptr pair_net(char *upper, char *lower);
/* Returns label_net(pair_to_id(upper, lower)); */

NETptr alphabet_net(ALPHABETptr alph);
/* Returns the network that encodes the union of the singleton
   languages or relations represented by the labels in the
   alphabet. */

/* Sigma and Label alphabets */

/* The return values of net_sigma() and net_labels() are the actual
   alphabets of the network. They must not be modified directly, and
   they will not be up-to-date if the network is modified. */

ALPHABETptr net_sigma(NETptr net);
/* Returns the network’s sigma alphabet. */

ALPHABETptr net_labels(NETptr net);
/* Returns the network’s label alphabet. */

void update_net_labels_and_sigma(NETptr net);
/* Updates the label and the sigma alphabet of the network and the
   network arity (1 or 2). After the update, the label alphabet
   contains all and only labels that appear on some arc of the
   network. Any missing symbols are added to the sigma alphabet. */

/* Substitutions */

/* If the keep_p argument is KEEP, the argument networks are preserved
   unchanged. If the keep_p argument is DONT_KEEP, the argument
   networks are reclaimed or destructively modified. The alphabet
   arguments are preserved unchanged. */

NETptr substitute_symbol(id_type id, ALPHABETptr list, NETptr net,
                         int keep_p);
/* Replaces every arc that has id in its label by a set of arcs
   labeled by symbols created by replacing id by a member of the list.
   All the new arcs have the same destination as the original arc. If
   id is itself a member of the list, the original arc is
   reconstituted in the process. If the list is NULL or has no
   members, then all arcs that have id in their label are eliminated.
   The label and sigma alphabets are updated. If keep_p is KEEP, the
   operation is performed on a copy of the original network. Returns
   the modified network. */

NETptr substitute_label(id_type id, ALPHABETptr labels, NETptr net,
                        int keep_p);
/* Like substitute_symbol() except that id is treated as a label and
   not as a label component. For example, if id represents "a", then
   only arcs with "a" as the label are affected but arcs such as "a:b"
   do not get changed. If keep_p is KEEP, the operation is performed
   on a copy of the original network. Returns the modified network. */

NETptr substitute_net(id_type id, NETptr insert, NETptr target,
                      int keep_insert_p, int keep_target_p);
/* Replaces the arcs labeled with id in the target by splicing a copy
   of the insert network between the start state of the arc and its
   destination. If keep_insert_p is KEEP, the insert network is not
   affected by the operation. If keep_insert_p is DONT_KEEP, the
   insert network is reclaimed. The target network is destructively
   modified if keep_target_p is DONT_KEEP. If keep_target_p is KEEP,
   the operation is performed on a copy of the target network. Returns
   the resulting network. */

NETptr eliminate_flag(NETptr net, char *name, int keep_p);
/* Eliminates all arcs that have name as an attribute of a flag
   diacritic such as @U.Case.Acc@ or as a list symbol in a list flag
   such as @L.Vowel@ or as a defined network in an insert flag such as
   @I.FirstName@. In the case of a flag diacritic such as
   @U.Case.Acc@, the function constructs a constraint network and
   composes it with net (or a copy of it) to enforce the constraint.
   In the case of a list or an insert flag, the function eliminates
   the arcs in question by splicing in a network. Returns the modified
   network or the copy of it if keep_p is KEEP. */

/* Alphabet operations */

/* If the keep_p argument is KEEP, the argument alphabets are
   preserved unchanged. If the keep_p argument is DONT_KEEP, the
   argument alphabets are reclaimed or destructively modified. If
   there is no keep_p flag, the operation is non-destructive. The
   alphabets may be of either of the two types, binary vectors or
   label alphabets. */

ALPHABETptr alph_add_to(ALPHABETptr alph, id_type new_id, int keep_p);
/* Adds new_id to the alphabet. If keep_p is KEEP, the operation is
   made on a copy of alph. Returns the modified alphabet. */

ALPHABETptr alph_remove_from(ALPHABETptr alph, id_type id, int keep_p);
/* Removes id from the alphabet or from its copy if keep_p is KEEP.
   Returns the modified alphabet. */

ALPHABETptr union_alph(ALPHABETptr alph1, ALPHABETptr alph2,
                       int keep_alph1_p, int keep_alph2_p);
/* Returns the union of the two alphabets. If the keep_p arguments are
   DONT_KEEP, the input alphabets are reclaimed. */

ALPHABETptr intersect_alph(ALPHABETptr alph1, ALPHABETptr alph2,
                           int keep_alph1_p, int keep_alph2_p);
/* Returns the intersection of the two alphabets. If the keep_p
   arguments are DONT_KEEP, the originals are reclaimed. */

ALPHABETptr minus_alph(ALPHABETptr alph1, ALPHABETptr alph2,
                       int keep_alph1_p, int keep_alph2_p);
/* Returns a new binary alphabet containing all the IDs in alph1 that
   are not in alph2. The input alphabets are reclaimed unless the
   keep_p flags are KEEP. */

ALPHABETptr binary_to_label(ALPHABETptr alph);
/* Converts the alphabet from binary to label format, if it is not in
   the label format already. */

ALPHABETptr label_to_binary(ALPHABETptr alph);
/* Converts the alphabet from label to binary format, if it is not in
   the binary format already. */

int test_equal_alphs(ALPHABETptr alph1, ALPHABETptr alph2);
/* Returns 1 if alph1 and alph2 contain the same IDs, otherwise 0. */

int test_alph_member(ALPHABETptr alph, id_type id);
/* Returns 1 if id is a member of the alphabet, otherwise 0. */

/* Network operations */

/* If the keep_X_p argument is KEEP, the corresponding network is
   preserved unchanged. If the keep_X_p argument is DONT_KEEP, the
   network X is reclaimed or destructively modified. Most network
   operations presuppose that the arguments are standard networks that
   have not been compacted, vectorized, or optimized. */

/* Unary operations. */

NETptr lower_side_net(NETptr net, int keep_p);
/* Extracts the lower-side projection of the net. That is, every arc
   with a pair label is relabeled with the lower side id of the pair.
   Returns the modified network. The corresponding regular expression
   operator is the suffix .l. */


NETptr upper_side_net(NETptr net, int keep_p);
/* Extracts the upper-side projection of the net. That is, every arc
   with a pair label is relabeled with the upper side id of the pair.
   Returns the modified network. The corresponding regular expression
   operator is the suffix .u. */

NETptr invert_net(NETptr net, int keep_p);
/* Relabels every arc with a pair label by the inverted pair. For
   example, an x:y arc becomes a y:x arc. Returns the modified
   network. The corresponding regular expression operator is the
   suffix .i. */

NETptr reverse_net(NETptr net, int keep_p);
/* Returns a network that contains the mirror image of the language or
   relation encoded by the net. The corresponding regular expression
   operator is the suffix .r. */

NETptr contains_net(NETptr net, int keep_p);
/* Returns a network of all paths that include at least one path from
   the input net. The corresponding regular expression operator is
   $. */

NETptr optional_net(NETptr net, int keep_p);
/* Makes the start state of the net final, thus adding the empty
   string to the language of the network if it is not already there.
   Returns the modified network. The corresponding regular expression
   operator is ( ), round parentheses around the expression. */

NETptr zero_plus_net(NETptr net, int keep_p);
/* Concatenates the net with itself any number of times. The resulting
   network accepts the empty string. Returns the modified network. The
   corresponding regular expression operator is the suffix *. */

NETptr one_plus_net(NETptr net, int keep_p);
/* Like zero_plus_net except that the result does not accept the empty
   string unless the original net does. Returns the modified network.
   The corresponding regular expression operator is the suffix +. */

NETptr negate_net(NETptr net, int keep_p);
/* The negate operation is defined only for networks that encode a
   language, that is, for networks with arity 1. The corresponding
   regular expression operator is the prefix ~. */

NETptr other_than_net(NETptr net, int keep_p);
/* Returns the network that contains all the single symbol strings
   except the ones in the net. The corresponding
   regular expression operator is the prefix \. */

NETptr shuffle_net(NETptr net1, NETptr net2, int keep_net1_p,
                   int keep_net2_p);
/* Returns a network that accepts every string formed by shuffling
   together (interdigitating) one string from each of the input
   languages. For example, if net1 accepts the string "ab" and net2
   accepts the string "xy", the shuffle net accepts "abxy", "axby",
   "axyb", "xaby", "xayb", "xyab". If the keep_p argument for a
   network is KEEP, that network is not affected. If it is DONT_KEEP,
   that network is reclaimed. */

NETptr substring_net(NETptr net, int keep_p);
/* Returns a network that accepts every substring of the strings in
   the input network. For example, if net contains "cat", the
   substring net contains "cat", "ca", "c", "at", "a", "t" and the
   empty string "". If keep_p is DONT_KEEP, the input network is
   destructively modified; if keep_p is KEEP, the operation is done on
   a copy of the input net. */

NETptr repeat_net(NETptr net, int min, int max, int keep_p);
/* Returns a network that accepts strings that consist of at least min
   and at most max concatenations of strings in the language of net.
   If max is less than zero, there is no upper limit. If keep_p is
   DONT_KEEP, the input network is destructively modified; if keep_p
   is KEEP, the operation is done on a copy of the input net. */

/* Binary network operations. */

NETptr concat_net(NETptr net1, NETptr net2, int keep_net1_p,
                  int keep_net2_p);
/* Returns the concatenation of net1 and net2, that is, a network in
   which every path in net1 is continued with every path in net2. If
   keep_net1_p is DONT_KEEP, net1 will be destructively modified and
   returned as the result. If keep_net2_p is DONT_KEEP, net2 will be
   used up and reclaimed. In regular expressions, concatenation is
   expressed by empty space between symbols. */

NETptr union_net(NETptr net1, NETptr net2, int keep_net1_p,
                 int keep_net2_p);
/* Returns the union of the two networks, that is, a network
   containing all the paths of the two networks. If keep_net1_p is
   DONT_KEEP, net1 will be destructively modified and returned as the
   result. If keep_net2_p is DONT_KEEP, net2 will be used up and
   reclaimed. The corresponding regular expression operator is |. */

NETptr intersect_net(NETptr net1, NETptr net2, int reclaim_net1_p,
                     int reclaim_net2_p);
/* Returns a new network containing the paths that are both in net1
   and net2. Intersection is not well-defined for transducers that
   contain epsilon symbols in symbol pairs such as a:0. If
   reclaim_net1_p or reclaim_net2_p is DONT_KEEP, the corresponding
   network will be reclaimed, otherwise it will remain. The
   corresponding regular expression operator is &. */

NETptr minus_net(NETptr net1, NETptr net2, int keep_net1_p,
                 int keep_net2_p);
/* Returns a network that contains all the paths in net1 that are not
   in net2. The minus operation is not well-defined for transducers
   that contain epsilon symbols. The minus operation can be used to
   produce a complement of a simple relation. For example,
   minus_net(read_regex("?:?"), read_regex("a:b"), DONT_KEEP,
   DONT_KEEP) returns a network that maps any symbol to itself and to
   any other symbol except that the pair a:b is missing. The
   corresponding regular expression operator is -. */

NETptr compose_net(NETptr upper, NETptr lower, int keep_upper_p,
                   int keep_lower_p);
/* Returns the composition of the two networks. The corresponding
   regular expression operator is .o. */

NETptr crossproduct_net(NETptr upper, NETptr lower, int keep_upper_p,
                        int keep_lower_p);
/* Returns a network that pairs all the strings in the languages of
   the two networks with each other. If the keep_p argument for a
   network is DONT_KEEP, that network is reclaimed. The corresponding
   regular expression operators are .x. (low binding preference) and :
   (high binding preference). */

NETptr ignore_net(NETptr target, NETptr noise, int keep_target_p,
                  int keep_noise_p);
/* Returns a network that is like the target except that every state
   of the network contains a loop that contains all the paths of the
   noise network. For example, ignore_net(read_regex("a b c"),
   symbol_net("x"), DONT_KEEP, DONT_KEEP); returns a network for the
   language that contains the string "abc" and an infinite number of
   strings such as "axbcxx" that contain bursts of noise. The
   corresponding regular expression operator is /. */

NETptr priority_union_net(NETptr net1, NETptr net2, int side,
                          int keep_net1_p, int keep_net2_p);
/* Returns a network that represents the union of net1 and net2 that
   gives net1 preference over net2 on the given side (UPPER or LOWER).
   For example, if the side is UPPER and net1 consists of the pair a:b
   and net2 consists of the pairs a:c and d:e, the priority union of
   the two consists of the pairs a:b and d:e. The a:c pair from net2
   is discarded because net1 has another pair with a on the upper
   side. The d:e pair from net2 is included because net1 has no
   competing mapping for the upper side d. The
   corresponding regular expression operators are .p. for priority
   union on the LOWER side and .P. for UPPER priority union. */

NETptr lenient_compose_net(NETptr upper, NETptr lower, int keep_upper_p,
                           int keep_lower_p);
/* A function for experimenting with optimality theory (OT). The
   lenient composition of upper and lower is defined as follows:

       upper .O. lower = [[upper .o. lower] .P. upper]

   where .O. is the lenient compose operator, .o. is ordinary
   composition and .P. is priority union. To make sense of this, think
   of upper as a transducer that maps each of the strings of the input
   language into all of its possible realizations. In other words,
   upper is the composition of the input language with GEN. The lower
   network represents a constraint language that rules out some, maybe
   all of the outputs. The result of the lenient composition is a
   network that maps each input string to the outputs that meet the
   constraint if there are any, eliminating the outputs that violate
   the constraint. However, if none of the outputs of a given input
   meet the constraint, all of them remain. That is, lenient
   composition guarantees that every input has outputs. A set of
   ranked OT constraints can be implemented as a cascade of lenient
   compositions with the most highly ranked constraint on the top of
   the cascade. */

/* Error handling */

void set_error_function(void (*fn)(char *message, char *function_name,
                                   int code));
void set_warning_function(void (*fn)(char *message, char *function_name,
                                     int code));

#ifdef __cplusplus
}
#endif /* __cplusplus */

#endif /* C_FSM_API */