Logic and Finite Automata

Anuj Dawar
Department of Computer Science
University of Wales Swansea
Swansea, SA2 8PP, U.K.
e-mail: [email protected]

1 Introduction

One of the main focuses of research in finite model theory is the close connection between logical definability and computation. A large number of results are of the form:

a class of structures is decidable in machine model $M$ if, and only if, it is definable in logic $L$.

Very often, the machine model is a resource-bounded Turing machine of some kind, and then the result is called a logical characterisation of a complexity class. Much of the material in the present lecture course is centred around such results. Historically, however, the first such result was the theorem of Büchi, which relates definability in monadic second order logic to finite automata. This is the topic of the present lecture.

The case of finite automata is historically interesting from another point of view. If, in the displayed statement above, we take a sufficiently broad interpretation of the word "logic", then we could view Kleene's equivalence theorem relating finite automata and regular expressions as an instance of this kind of result. The point here is that we have two ways of defining a class, one in terms of a computational procedure and the other purely descriptive, which turn out to have exactly the same expressive power. Under a narrower view, where we regard regular expressions as an algebraic, rather than a logical, characterisation of the languages accepted by finite automata, Büchi's theorem tells us that this class of languages still corresponds to a very natural level of definability in logic.

In Section 2 we review the basic notions of regular expressions, automata and logics that we need. We also examine how strings over an alphabet can be thought of as relational structures interpreting formulas of predicate logic. Section 3 develops the tools we need to analyse the expressive power of monadic second order logic, and establishes that every language that can be defined in this logic is regular. We look at the converse of this last result in Section 4, and find that the existential fragment of monadic second order logic suffices for expressing all regular languages. First order definable languages are examined in Section 5. Finally, some applications and examples of these results are found in Section 6.

Notes for a lecture presented at the Ninth European Summer School on Logic, Language and Information, Aix-en-Provence, August 1997.

2 Background and Definitions

We begin this section with a review of definitions concerning regular expressions and finite automata, which will serve to fix notation for the rest of this paper. We then examine logics and ask how they too can be used to define languages. This will motivate the question of which languages are definable in our logics.

2.1 Regular Languages and Automata

Fix a finite alphabet $A$; that is, $A$ is a finite collection of symbols. As usual, $A^*$ denotes the set of all strings over $A$, i.e. the set of all finite sequences of elements of $A$. The empty string is denoted $\varepsilon$. A language over $A$ is some set $L \subseteq A^*$. Given two languages $L_1, L_2 \subseteq A^*$, we define the language $L_1L_2$ by:

$$L_1L_2 = \{xy \in A^* \mid x \in L_1 \text{ and } y \in L_2\},$$

where $xy$ denotes the concatenation of strings $x$ and $y$. For any language $L$, $L^0$ is the language $\{\varepsilon\}$ and $L^{k+1}$ is the language $L^kL$, which is the same as $LL^k$. The language $L^*$ is defined by

$$L^* = \bigcup_{i \in \mathbb{N}} L^i.$$
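As a small illustration of these operations, the following Python sketch represents finite languages as sets of strings. The function names `concat`, `power` and `star_up_to` are hypothetical and chosen only for this example. Since $L^*$ is in general infinite, the last function enumerates only those of its members whose length does not exceed a given bound.

```python
from itertools import product


def concat(l1, l2):
    """Concatenation L1L2 = {xy | x in L1 and y in L2} of two finite languages."""
    return {x + y for x, y in product(l1, l2)}


def power(lang, k):
    """L^k, following the definition L^0 = {""} and L^(k+1) = L^k L."""
    result = {""}
    for _ in range(k):
        result = concat(result, lang)
    return result


def star_up_to(lang, max_len):
    """The strings of L* of length at most max_len."""
    result = {""}      # L^0 contributes the empty string
    frontier = {""}    # strings found in the previous round
    while frontier:
        # Extend by one more factor from L, discarding strings that are too long
        # (they can never become shorter again) and strings already found.
        frontier = {w for w in concat(frontier, lang) if len(w) <= max_len} - result
        result |= frontier
    return result
```

For instance, `power({"a", "b"}, 2)` yields all four strings of length two over the alphabet `{a, b}`.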

The collection of regular expressions over $A$ is defined inductively by the following rules:

the strings $\emptyset$ and $\varepsilon$ are regular expressions;
for any element $a \in A$, the string $a$ is a regular expression;
if $r$ and $s$ are regular expressions, then so are $(rs)$, $(r + s)$ and $(r^*)$.

Each regular expression denotes a language over $A$. The semantics of the expressions are given by a function $D$, which maps every regular expression to its denotation. $D$ is defined inductively by the following rules:

$D(\emptyset) = \emptyset$;
$D(\varepsilon) = \{\varepsilon\}$;
$D(a) = \{a\}$, for every $a \in A$;
$D(rs) = D(r)D(s)$, $D(r + s) = D(r) \cup D(s)$ and $D(r^*) = D(r)^*$.

Definition 1 A language is regular if, and only if, it is the denotation of some regular expression.
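The inductive definition of $D$ translates directly into a recursive function. The sketch below is only an illustration: the class names (`Empty`, `Eps`, `Sym`, `Cat`, `Alt`, `Star`) and the function `denote` are assumptions made for this example, symbols are assumed to be single characters, and because denotations may be infinite the function computes only the strings of $D(e)$ of length at most `n`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:          # the expression denoting the empty language
    pass


@dataclass(frozen=True)
class Eps:            # the expression denoting {empty string}
    pass


@dataclass(frozen=True)
class Sym:            # a single alphabet symbol (assumed to be one character)
    a: str


@dataclass(frozen=True)
class Cat:            # (rs)
    r: object
    s: object


@dataclass(frozen=True)
class Alt:            # (r + s)
    r: object
    s: object


@dataclass(frozen=True)
class Star:           # (r*)
    r: object


def denote(e, n):
    """Return the strings of D(e) that have length at most n."""
    if isinstance(e, Empty):
        return set()
    if isinstance(e, Eps):
        return {""}
    if isinstance(e, Sym):
        return {e.a} if n >= 1 else set()
    if isinstance(e, Alt):
        return denote(e.r, n) | denote(e.s, n)
    if isinstance(e, Cat):
        left, right = denote(e.r, n), denote(e.s, n)
        return {x + y for x in left for y in right if len(x + y) <= n}
    if isinstance(e, Star):
        base = denote(e.r, n)
        result, frontier = {""}, {""}
        while frontier:
            # Append one more factor from D(r); stop when nothing new appears.
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {e!r}")
```

As a usage example, `denote(Cat(Sym("a"), Star(Sym("b"))), 3)` lists the strings of length at most three denoted by $(a(b^*))$.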

Another way of defining the regular languages is as the languages that are accepted by (non-deterministic) finite automata (NFA). An NFA over an alphabet $A$ is a 4-tuple $(Q, \delta, q_0, F)$, where

$Q$ is a non-empty, finite set of states;
$\delta \subseteq Q \times A \times Q$ is the transition relation;
$q_0 \in Q$ is the start state; and
$F \subseteq Q$ is the set of final states.

In case $\delta$ is functional, i.e. for each $(q, a) \in Q \times A$ there is a unique $q'$ such that $(q, a, q') \in \delta$, we say that the automaton is a deterministic finite automaton (DFA). In order to define the language accepted by an NFA $M$, we extend the relation $\delta$ to a relation $\delta^* \subseteq Q \times A^* \times Q$ defined inductively as:

$(q, \varepsilon, q') \in \delta^*$ if, and only if, $q' = q$;
$(q, wa, q') \in \delta^*$ if, and only if, there is some $q''$ such that $(q, w, q'') \in \delta^*$ and $(q'', a, q') \in \delta$.

Now, the language $L(M)$ accepted by the NFA $M$ is defined by:

$$L(M) = \{x \mid (q_0, x, f) \in \delta^* \text{ for some } f \in F\}.$$

The following result relates this to regular expressions:

Theorem 2 The following are equivalent for any language $L$:
1. $L$ is accepted by a DFA.
2. $L$ is accepted by an NFA.
3. $L$ is regular.
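The inductive extension of the transition relation to strings can be computed one symbol at a time by tracking the set of states reachable so far. The following Python sketch does this under the assumption that the transition relation is given as a set of triples `(q, a, q2)`; the function names and the example automaton (accepting the strings over `{a, b}` that end in `ab`) are hypothetical and serve only as an illustration.

```python
def reachable(delta, q, word):
    """All states q2 such that (q, word, q2) lies in the extended relation."""
    current = {q}                      # base case: the empty string relates q to q
    for a in word:                     # inductive case: read one more symbol
        current = {p2 for (p1, b, p2) in delta if p1 in current and b == a}
    return current


def accepts(nfa, word):
    """True if some final state is reachable from the start state on word."""
    _states, delta, q0, final = nfa
    return bool(reachable(delta, q0, word) & final)


def is_deterministic(states, alphabet, delta):
    """True if every (state, symbol) pair has exactly one successor."""
    return all(
        sum(1 for (p, b, _) in delta if (p, b) == (q, a)) == 1
        for q in states
        for a in alphabet
    )


# Example NFA over {a, b}: guess when the final "ab" begins.
STATES = {0, 1, 2}
DELTA = {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)}
ENDS_IN_AB = (STATES, DELTA, 0, {2})
```

Here `accepts(ENDS_IN_AB, "aab")` holds while `accepts(ENDS_IN_AB, "aba")` does not, and `is_deterministic(STATES, {"a", "b"}, DELTA)` is false because state 0 has two successors on `a`.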


2.2 Logic

While the formalisms considered in the previous section define sets of strings, in the framework of logic we usually consider arbitrary structures. This arises historically from the fact that mathematical logic was developed for the purpose of formalising parts of mathematics, and the structures (or interpretations) considered as models of logical sentences were intended to capture the notion of a general (algebraic) structure. We consider a somewhat restricted version of such structures, defined below.

A signature (or vocabulary) $\sigma$ is a finite sequence of relation and constant symbols:

$$\sigma = (R_1, \ldots, R_m, c_1, \ldots, c_n),$$

where associated with each relation symbol $R_i$ is an arity $a_i$. A structure $\mathcal{A}$ over the signature $\sigma$ is a tuple:

$$\mathcal{A} = (A, R_1^{\mathcal{A}}, \ldots, R_m^{\mathcal{A}}, c_1^{\mathcal{A}}, \ldots, c_n^{\mathcal{A}}),$$

where $A$ is a set, the universe of the structure $\mathcal{A}$; each $R_i^{\mathcal{A}}$ is a relation over this set of arity $a_i$, i.e. $R_i^{\mathcal{A}} \subseteq A^{a_i}$; and each $c_i^{\mathcal{A}}$ is an element of $A$. Note, in particular, that we do not have function symbols in our vocabularies. Where it causes no confusion, we drop the superscript $\mathcal{A}$ from the interpretation of the relation and constant symbols.

The formulas of first order logic over a given signature $\sigma$ are built up from the symbols in $\sigma$, an infinite set of variables $\{v_1, v_2, \ldots\}$ and various logical symbols by the following rules:

1. Any variable $x$ and any constant $c$ is a term.
2. If $R$ is a relation symbol of arity $a$, and $t_1, \ldots, t_a$ are terms, then $R(t_1, \ldots, t_a)$ is a formula.
3. If $\varphi$ and $\psi$ are formulas, then so are $\varphi \wedge \psi$ and $\neg\varphi$.
4. If $\varphi$ is a formula and $x$ is a variable, then $\exists x\, \varphi$ is a formula.

The semantics of first order formulas are given by the satisfaction relation, with which I assume familiarity and which I will not define here.

The expressive power of first order logic can be enhanced by allowing second order quantifiers. To construct formulas of second order logic we have, in addition to the symbols above, an infinite collection of relational variables $\{V_1, V_2, \ldots\}$, each with an associated arity. Now, the collection of formulas of second order logic is given by adding the following two rules to the ones given above:

5. If $X$ is a relation variable of arity $a$, and $t_1, \ldots, t_a$ are terms, then $X(t_1, \ldots, t_a)$ is a formula.
6. If $X$ is a relation variable and $\varphi$ is a formula, then $\exists X\, \varphi$ is a formula.

The satisfaction relation between structures and formulas is easily extended to cover these cases.

Finally, we define the formulas of monadic second order logic (or MSO, for short) to be those second order formulas in which all relational variables are unary. In other words, in MSO we can quantify over sets of elements, but not over arbitrary relations. As usual, a sentence is a formula without free variables. Note that by introducing universal quantifiers, any MSO formula can be put into prenex normal form with all second order quantifiers preceding all first order quantifiers. We say that a formula of MSO is existential just in case it is equivalent to a formula in prenex normal form involving only existential second order quantifiers.

For any sentence $\varphi$, we write $\mathrm{Mod}(\varphi)$ for the collection of structures $\mathcal{A}$ such that $\mathcal{A} \models \varphi$.
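On a finite structure, the satisfaction relation for MSO can be decided by brute force: first order quantifiers range over the elements of the universe and monadic second order quantifiers over its subsets. The Python sketch below illustrates this under several assumptions that are not part of the text: formulas are encoded as nested tuples, terms are restricted to variables (constants are omitted), and the names `holds`, `subsets` and `disj` are hypothetical. The example sentence, stating that a graph with edge relation `E` is 2-colourable, is an illustration of an existential MSO sentence chosen for this sketch.

```python
from itertools import combinations


def subsets(universe):
    """Enumerate all subsets of a finite universe."""
    elems = list(universe)
    for k in range(len(elems) + 1):
        for combo in combinations(elems, k):
            yield frozenset(combo)


def holds(structure, phi, env=None):
    """Decide whether the structure satisfies phi under the assignment env."""
    universe, relations = structure
    env = env or {}
    op = phi[0]
    if op == "rel":                    # ("rel", R, x1, ..., xk): R(x1, ..., xk)
        _, name, *args = phi
        return tuple(env[v] for v in args) in relations[name]
    if op == "in":                     # ("in", X, x): X(x) for a set variable X
        _, setvar, var = phi
        return env[var] in env[setvar]
    if op == "not":
        return not holds(structure, phi[1], env)
    if op == "and":
        return holds(structure, phi[1], env) and holds(structure, phi[2], env)
    if op == "exists":                 # first order quantifier
        _, var, body = phi
        return any(holds(structure, body, {**env, var: d}) for d in universe)
    if op == "exists_set":             # monadic second order quantifier
        _, setvar, body = phi
        return any(holds(structure, body, {**env, setvar: s})
                   for s in subsets(universe))
    raise ValueError(f"unknown connective: {op!r}")


def disj(p, q):
    """Disjunction, defined from negation and conjunction."""
    return ("not", ("and", ("not", p), ("not", q)))


# "There is a set X such that no edge joins two elements on the same side of X."
SAME_SIDE = disj(("and", ("in", "X", "x"), ("in", "X", "y")),
                 ("and", ("not", ("in", "X", "x")), ("not", ("in", "X", "y"))))
TWO_COLOURABLE = ("exists_set", "X",
                  ("not", ("exists", "x", ("exists", "y",
                           ("and", ("rel", "E", "x", "y"), SAME_SIDE)))))

PATH = ({0, 1, 2}, {"E": {(0, 1), (1, 0), (1, 2), (2, 1)}})
TRIANGLE = ({0, 1, 2}, {"E": {(x, y) for x in range(3) for y in range(3) if x != y}})
```

The enumeration of subsets makes this procedure exponential in the size of the universe, so it is meant only to make the semantics concrete, not as a practical algorithm.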

2.3 Strings as Relational Structures

In the next section we will ask what languages over an alphabet $A$ can be expressed by a formula of MSO. In order to make sense of this question, we need to see in what sense a formula can determine a language. For this, we look at strings over $A$ as relational structures in an appropriate vocabulary. For an alphabet $A = \{a_1, \ldots, a_s\}$, let $\sigma_A$ be the signature (
