Journal of Computer and System Sciences 11, 86–117 (1975)

Context-Free Grammar Forms*,†

ARMIN CREMERS AND SEYMOUR GINSBURG

University of Southern California, Los Angeles, California 90007

In an attempt to provide a unified theory of grammars, a model is introduced which has two components. The first is a "grammar form," which provides the general structure of the productions in the grammars to be defined. The second is an "interpretation," which yields a specific grammar. By considering all interpretations, a family of grammars, intimately related to that of the grammar form, is obtained. Many of the well-known families of grammars occur as special instances. Attention is focused on the situation when the productions in the grammar form are context free. Necessary and sufficient conditions on a context-free grammar form are given in order for it to yield, respectively, exactly the finite languages, the regular sets, the linear context-free languages, and all the context-free languages. Each context-free grammar form can be replaced by another, yielding the same family of languages, in which the underlying grammar is sequential. Of special interest to language theory is the fact that the family of languages obtained from each context-free grammar form is a full principal semi-AFL.

INTRODUCTION

In [3] an attempt was made to formalize a notion of "family of grammars" which would yield, as special instances, many of the well-known families of phrase structure grammars. The model introduced there consisted of a phrase structure grammar schema, essentially an underlying phrase structure grammar, together with a set of interpretations. Each interpretation gave rise to a phrase structure grammar related in a structural manner to the underlying phrase structure grammar. The collection of phrase structure grammars defined by the interpretations constituted the "family of grammars." While the model described in [3] is a start toward obtaining a unified theory of grammars, it suffers from two drawbacks: First, it is usually awkward to determine, for a given grammar schema and a given phrase structure grammar, whether or not the grammar is obtained from one of the interpretations. And second, there are

* Presented at the Second Colloquium on Automata, Languages and Programming, held at the University of Saarbrücken, July 29–August 2, 1974.
† This research was supported in part by the German Academic Exchange Service under Computer Science Grant No. 430/402/809/4 and by the National Science Foundation under Grant No. GJ-28787.

Copyright © 1975 by Academic Press, Inc. All rights of reproduction in any form reserved.


important well-known families of grammars which cannot be "closely" defined by grammar schemata, e.g., the family of context-free grammars in binary normal form, i.e., grammars whose productions are of the form ξ → w and ξ → αβ, w a terminal word and α, β variables.

The purpose of the present paper is to introduce a model for "families of grammars" which overcomes the above two difficulties and which is simple enough to permit mathematical analysis.¹ The model presented here is a variant of that in [3], but is much simpler in nature. It involves two concepts, a "grammar form" and an "interpretation." A grammar form, the analog to a grammar schema of [3], is a phrase structure grammar (𝒱, 𝒮, 𝒫, σ) together with two sets V and Σ, where Σ and V − Σ are infinite, Σ ⊆ V, (𝒱 − 𝒮) ⊆ (V − Σ), and 𝒮 ⊆ Σ. Informally, the productions in 𝒫 are the "skeleton productions" and determine the form of the productions in the grammars to be defined, while Σ and V − Σ consist of the terminal letters and variables, respectively, of the grammars to be defined. An interpretation I, the analog to an interpretation in [3], consists of (i) a substitution μ on 𝒱* such that μ(a) is a finite subset of Σ* for each element a in 𝒮, μ(ξ) is a finite subset of V − Σ for each element ξ in 𝒱 − 𝒮, and μ(ξ) ∩ μ(η) = ∅ for all ξ and η, ξ ≠ η, in 𝒱 − 𝒮; and (ii) a phrase structure grammar G_I = (V_I, Σ_I, P_I, S_I), where P_I is a subset of μ(𝒫) = ∪_{π in 𝒫} μ(π), μ(α → β) being the set {u → v / u in μ(α), v in μ(β)}, V_I and Σ_I are the sets of all symbols in V and Σ, respectively, occurring in productions of P_I, and S_I is in μ(σ). Thus an interpretation consists essentially of a finite substitution μ and a phrase structure grammar whose set of productions is a subset of the productions in 𝒫 under the substitution μ and whose start variable is in μ(σ). Hence a production α → β in 𝒫 gives rise in an interpretation to a (possibly empty) set of productions, with each variable in αβ serving as a placemarker for variables in V − Σ and each terminal in αβ serving as a placemarker for words in Σ*. The condition μ(ξ) ∩ μ(η) = ∅ for ξ ≠ η means that different variables in 𝒱 − 𝒮 represent different variables in P_I. The net effect is that each production in P_I closely resembles some production in 𝒫. Thus the grammars in the family of grammars defined by the interpretations are bound together by their structural closeness with the productions in 𝒫. By selecting different sets of interpretations, one obtains different sets of grammars, each closely related to the original grammar. There are many possible "reasonable" sets of interpretations, as for example, the set of all interpretations or the set of all ε-free interpretations, i.e., those interpretations in which the substitution μ has the property that μ(a) is ε-free for each a in 𝒮.² The emphasis here will be exclusively on the former, i.e., the set of all interpretations.

¹ The model as explicitly given here is for phrase structure grammars. The basic ideas of the model, however, are also suitable for many other kinds of "families of grammars," e.g., grammars based on a "tag system" of generation or grammars based on parallel generation, i.e., L-systems with nonterminals.

² The set of all ε-free interpretations is a reasonable set since it permits us to define (i) the context-sensitive grammars in "binary normal form," i.e., grammars in which each production is of the form ξ → α, ξν → αβ, or ξ → w, where w is a non-ε terminal word and ξ, ν, α, β are variables; and (ii) the context-free grammars in Greibach binary normal form, i.e., those grammars in which each production is of the form ξ → w, ξ → wα, or ξ → wαβ, where w is a non-ε terminal word and ξ, α, β are variables.


This set appears to be the most important one, yielding among its instances the family of right-linear context-free grammars, the family of linear context-free grammars, the family of binary normal form context-free grammars, and the family of phrase structure grammars in binary normal form, i.e., phrase structure grammars each of whose productions is of the form ξ → w, ξ → αβ, ξν → αβ, or ξν → α, where w is a terminal word and ξ, ν, α, β are variables.

The paper itself is divided into five sections. Section 1 provides a rigorous definition of the model, i.e., the concepts of grammar form and interpretation. A decision procedure is then presented for determining whether or not, given an arbitrary grammar and an arbitrary grammar form, there exists at least one interpretation of the grammar form which defines the given grammar. The remaining sections concern only context-free grammar forms, i.e., the underlying grammar is context free. Section 2 provides necessary and sufficient conditions on a context-free grammar form in order that it yield precisely the finite languages, the regular sets, the linear context-free languages, and the family of all context-free languages, respectively. Results from this section are used to show that the counter languages cannot be obtained from a context-free grammar form. In Section 3, it is shown that each context-free grammar form can be replaced by an equivalent one (i.e., yielding the same family of languages) in which the underlying grammar is sequential. Section 4 is essentially concerned with establishing that the family of languages defined by each nontrivial context-free grammar form is a full semi-AFL. As a consequence, the deterministic context-free languages cannot be obtained from a context-free grammar form. Section 5 strengthens the result in Section 4. By a long, complicated argument it is shown that each nontrivial family defined by a context-free grammar form is a full principal semi-AFL. Moreover, the proof is effective and exhibits a full generator. This last result suggests that the families of languages defined by context-free grammar forms may have many interesting mathematical properties.

1. PHRASE STRUCTURE GRAMMAR FORMS

The model discussed in the Introduction is now formally presented and several well-known families of grammars obtained as special instances. It is then shown (Theorem 1.1) that the problem of determining whether an arbitrary grammar comes from an arbitrary grammar form is solvable.

As mentioned earlier, there are two basic notions needed to describe a family of grammars, namely a grammar form and an interpretation.


The precise statement of a grammar form is:

DEFINITION. A (phrase structure) grammar form is a 6-tuple F = (V, Σ, 𝒱, 𝒮, 𝒫, σ), where

(i) V is an infinite set of abstract symbols,

(ii) Σ is an infinite subset of V such that V − Σ is infinite, and

(iii) G_F = (𝒱, 𝒮, 𝒫, σ), called the form grammar (of F), is a phrase structure grammar³ with 𝒮 ⊆ Σ and (𝒱 − 𝒮) ⊆ (V − Σ).

We shall assume throughout that V and Σ are fixed infinite sets satisfying conditions (i) and (ii) above. All grammar forms discussed will always be with respect to this V and Σ. For convenience, degenerate grammar forms, i.e., grammar forms (V, Σ, 𝒱, 𝒮, 𝒫, σ), are permitted.

³ A phrase structure grammar is a 4-tuple G = (V₁, Σ₁, P, S), where V₁ is a finite set of symbols, Σ₁ ⊆ V₁, P is a finite (possibly empty) set of ordered pairs (u, v), written u → v, with u in (V₁ − Σ₁)⁺ and v in V₁*, and S is in V₁ − Σ₁. Elements of Σ₁ and V₁ − Σ₁ are called terminals and variables, respectively, u → v is called a production, and S is called the start variable. Let ⇒_G be the relation on V₁* defined by w ⇒_G z if there exist u, v, x, y such that w = xuy, z = xvy, and u → v is in P. Let ⇒*_G be the reflexive transitive closure of ⇒_G. The set L(G) = {w in Σ₁* / S ⇒*_G w} is called the language generated by G. If G is understood, then ⇒_G and ⇒*_G are written as ⇒ and ⇒*.
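The 6-tuple above translates directly into a small data structure. The following Python sketch is ours, not the paper's: it stores only the finite part (𝒱, 𝒮, 𝒫, σ) of a grammar form and adopts the (hypothetical) convention that the infinite sets V and Σ are represented implicitly, a symbol counting as a terminal exactly when it is a lowercase string.

```python
from dataclasses import dataclass

# A production is a pair (left side, right side), each side a tuple of symbols.
Production = tuple[tuple[str, ...], tuple[str, ...]]

@dataclass(frozen=True)
class GrammarForm:
    """Finite part of a grammar form F = (V, Sigma, scriptV, scriptS, scriptP, sigma).

    V and Sigma are left implicit: by our convention a symbol is a terminal
    (an element of Sigma) iff it is a lowercase string, and a variable otherwise.
    """
    symbols: frozenset[str]             # scriptV, the symbols of the form grammar
    terminals: frozenset[str]           # scriptS, a subset of Sigma
    productions: frozenset[Production]  # scriptP
    start: str                          # sigma, a variable of the form grammar

    def variables(self) -> frozenset[str]:
        return self.symbols - self.terminals

# The form with skeleton productions sigma -> a sigma and sigma -> a
# (this is Example 1.1(a) below).
F_right_linear = GrammarForm(
    symbols=frozenset({"S", "a"}),
    terminals=frozenset({"a"}),
    productions=frozenset({(("S",), ("a", "S")), (("S",), ("a",))}),
    start="S",
)
```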

Turning to a precise formulation of an interpretation we have:

DEFINITION. An interpretation of a grammar form F = (V, Σ, 𝒱, 𝒮, 𝒫, σ) is a 5-tuple I = (μ, V_I, Σ_I, P_I, S_I), where

(1) μ is a substitution⁴ on 𝒱* such that μ(a) is a finite subset of Σ* for each element a in 𝒮, μ(ξ) is a finite subset of V − Σ for each ξ in 𝒱 − 𝒮, and μ(ξ) ∩ μ(η) = ∅ for each ξ and η, ξ ≠ η, in 𝒱 − 𝒮,

(2) P_I is a subset of μ(𝒫) = ∪_{π in 𝒫} μ(π), where μ(α → β) = {u → v / u in μ(α), v in μ(β)},

(3) S_I is in μ(σ),⁵ and

(4) Σ_I (V_I) contains the set of all symbols in Σ (V) which occur in P_I (together with S_I).

⁴ Let Σ₁ be a finite set of elements. For each element a in Σ₁ let Σ_a be a finite set of elements and μ(a) a subset of Σ_a*. Let μ(ε) = {ε} and μ(x₁ ⋯ x_r) = μ(x₁) ⋯ μ(x_r) for each word x₁ ⋯ x_r, r ≥ 1, and each x_i in Σ₁. Then μ is called a substitution (on Σ₁*). For each set L ⊆ Σ₁* let μ(L) = ∪_{w in L} μ(w).

⁵ This implies that μ(σ) cannot be the empty set.

The phrase structure grammar G_I = (V_I, Σ_I, P_I, S_I) is called the grammar of I. Clearly there is no loss of generality in assuming that each symbol in each word of μ(𝒱) is in V_I. We usually exhibit an interpretation by defining S_I, P_I, and (implicitly or explicitly) μ. The sets V_I and Σ_I are customarily not stated explicitly.

Each production in G_I may be viewed as "structurally close" or "structurally related" to a production in G_F, and the grammar G_I may be viewed as "structurally close" or "structurally related" to G_F. Each grammar form F has an infinite number of interpretations, each interpretation giving rise to a grammar structurally related to the grammar G_F. Our contention is that the family of all such grammars is a useful formalization for the intuitive notion of "family of phrase structure grammars." Indeed, we shall see that by taking different instances we obtain well-known families of grammars. The theory developed in the remainder of the paper suggests that this idea is a mathematically sound one for providing a unified treatment of grammars.

DEFINITION. For each grammar form F, 𝒢(F) = {G_I / I an interpretation of F} is called the family of grammars of F and ℒ(F) = {L(G_I) / G_I in 𝒢(F)} the grammatical family of F. A set ℒ of languages is called a grammatical family if ℒ is the grammatical family of some grammar form.

In general, our interest is in families of grammars. As in the study of the well-known kinds, so here the most prominent (but not the only) property associated with a family of grammars is its grammatical family. Indeed, much of the sequel is devoted to an examination of grammatical families.
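A minimal sketch (ours) of the mechanics just defined: μ is stored as a map from form symbols to finite sets of words, a skeleton production α → β expands to μ(α → β) = {u → v : u in μ(α), v in μ(β)}, and a candidate interpretation is checked against conditions (1)–(3). It relies on the GrammarForm sketch given after the definition of a grammar form; condition (4) is implicit in this representation.

```python
def mu_of_word(mu: dict, word: tuple) -> set:
    """Extend the finite substitution mu from symbols to words:
    mu(x1 ... xr) = mu(x1) ... mu(xr); mu of the empty word is {empty word}."""
    result = {()}
    for sym in word:
        result = {prefix + piece for prefix in result for piece in mu[sym]}
    return result

def expand_production(mu: dict, prod) -> set:
    """mu(alpha -> beta) = {u -> v : u in mu(alpha), v in mu(beta)}."""
    left, right = prod
    return {(u, v) for u in mu_of_word(mu, left) for v in mu_of_word(mu, right)}

def is_interpretation(form, mu, chosen_productions, start) -> bool:
    """Conditions (1)-(3) of the definition, for a candidate interpretation of `form`."""
    variables = form.variables()
    # (1) images of distinct form variables are disjoint finite sets of new variables
    for x in variables:
        for y in variables:
            if x != y and mu[x] & mu[y]:
                return False
    # (2) every chosen production lies in mu(scriptP)
    allowed = set()
    for p in form.productions:
        allowed |= expand_production(mu, p)
    if not set(chosen_productions) <= allowed:
        return False
    # (3) the start variable of the interpreted grammar lies in mu(sigma)
    return (start,) in mu[form.start]
```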

Numerous instances of the above model for families of grammars will appear throughout the paper. Many of the resulting families, possibly with trivial variation, are already in the literature. The variety obtained lends evidence to our thesis that grammar forms provide an excellent model for a unified treatment of families of phrase structure grammars. We now present some illustrations of grammar forms and their families of grammars.

EXAMPLE 1.1. Let F = (V, Σ, {σ, a}, {a}, 𝒫, σ).

(a) For 𝒫 = {σ → aσ, σ → a}, 𝒢(F) is the family of all right-linear (context-free) grammars and ℒ(F) = ℛ, where ℛ denotes the family of regular sets.


(b) For 𝒫 = {σ → aσa, σ → a}, 𝒢(F) is the family of all linear (context-free) grammars and ℒ(F) = ℒ_lin, where ℒ_lin denotes the family of all linear (context-free) languages.

(c) For 𝒫 = {σ → σσ, σ → a}, 𝒢(F) is the set of all context-free grammars in binary normal form, i.e., grammars for which each production is of the form ξ → αβ or ξ → w, where w is a terminal word and ξ, α, β are variables. Here ℒ(F) = ℒ_CF, ℒ_CF being the family of all context-free languages.

Note that L(G_F) = a⁺ in (a) and (c) of Example 1.1, although the two ℒ(F) are different. This illustrates the fact that, in general, L(G_F) does not determine ℒ(F), i.e., the language generated by the form grammar does not determine the grammatical family of F.

EXAMPLE 1.2. Let F = (V, Σ, {σ, a}, {a}, 𝒫, σ), where 𝒫 = {σσ → σσ, σ → σσ, σσ → σ, σ → a}. Then 𝒢(F) is the family of all phrase structure grammars in binary normal form and ℒ(F) = ℒ_RE, where ℒ_RE denotes the family of all recursively enumerable languages.

EXAMPLE 1.3. Let k be a positive integer and F = (V, Σ, {σ, ξ, a}, {a}, 𝒫, σ), where 𝒫 = {σ → ξᵏ, ξ → aξa, ξ → a}. Then 𝒢(F) is the family of all "k-linear" context-free grammars and ℒ(F) is the family of all finite unions of products of k linear context-free languages.
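As a concrete instance of Example 1.1(a), the following continues the Python sketches above (all identifiers are ours): the substitution sends the form terminal a to the finite set {ab, c} and the form variable σ to a single new variable X, and the chosen productions give the right-linear grammar X → abX | ab | c.

```python
mu = {
    "a": {("a", "b"), ("c",)},   # mu(a): a finite subset of Sigma*
    "S": {("X",)},               # mu(sigma): a finite set of new variables
}

# A subset of mu(scriptP): the right-linear grammar X -> abX | ab | c with start X.
P_I = {
    (("X",), ("a", "b", "X")),
    (("X",), ("a", "b")),
    (("X",), ("c",)),
}

assert is_interpretation(F_right_linear, mu, P_I, "X")
```

The interpreted grammar generates (ab)⁺ ∪ (ab)*c, a regular set, as Example 1.1(a) predicts; dropping the production X → abX from P_I would give another, smaller interpretation of the same form.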

We now turn to the problem of determining whether or not a given phrase structure grammar is a grammar G_I of an interpretation I of a given grammar form F. We shall see that there is a reasonably simple procedure for effecting such a determination.

THEOREM 1.1 (Recognition). It is solvable to determine whether or not, given an arbitrary phrase structure grammar G and an arbitrary grammar form F, there is an interpretation I of F such that G = G_I.

Proof. Let G₁ = (V₁, Σ₁, P₁, S₁) be a phrase structure grammar and F = (V, Σ, 𝒱, 𝒮, 𝒫, σ) a grammar form. Assume that Σ₁ and V₁, respectively, contain the symbols of Σ, respectively V, which are in P₁, respectively P₁ and {S₁}. [Otherwise condition (4) in the definition of interpretation is violated, i.e., G₁ cannot arise from an interpretation of F.] We now exhibit a straightforward recognition algorithm for G₁. Let r be the length of a largest word occurring on either the left or right side of a production in P₁. Call a substitution μ on 𝒱* into subsets of V₁* reasonable if (a) it satisfies condition (1) in the definition of interpretation, (b) S₁ is in μ(σ), and (c) μ(a) = {w in Σ₁* / |w| ≤ […]

[…] there exists a derivation w₁' ⇒ w₂' ⇒ ⋯ ⇒ wₙ' in G_F

such that wᵢ is in μ(wᵢ') and πᵢ is in μ(πᵢ') for each i, 1 ≤ […]

[…] xξy, ξ → w}. By Proposition 2.1, ℒ(F) = ℒ(F'). Since u and v are in 𝒮⁺, it is easily seen that each linear context-free language is in ℒ(F') = ℒ(F) = ℛ, a contradiction. Hence F is not self-embedding.

Now suppose that L(G_F) is infinite and F is not self-embedding. By Lemma 2.1, there is no interpretation I of F such that G_I is a self-embedding grammar. Hence, by a well-known result [1], each language L in ℒ(F) is regular, i.e., ℒ(F) ⊆ ℛ. Since L(G_F) is infinite, there exists a variable ξ in 𝒱 − 𝒮 such that either ξ ⇒* uξ or ξ ⇒* ξu in G_F, say the former, for some u in 𝒮⁺. Since F is reduced, σ ⇒* xξy and ξ ⇒* w in G_F for some x, y, and w in 𝒮*. Let F' = (V, Σ, 𝒱, 𝒮, 𝒫', σ), where 𝒫' = 𝒫 ∪ {σ → xξy, ξ → uξ, ξ → w}. By Proposition 2.1, ℒ(F') = ℒ(F). Clearly each regular set is in ℒ(F'). Thus ℛ ⊆ ℒ(F') = ℒ(F), so that ℒ(F) = ℛ.

Our final characterization result concerns when a grammar form yields exactly the family ℒ_lin of linear context-free languages.

THEOREM 2.4. Let F = (V, Σ, 𝒱, 𝒮, 𝒫, σ) be a reduced¹⁴ grammar form.¹⁵ Then ℒ(F) = ℒ_lin if and only if (i) F is self-embedding, and (ii) if σ ⇒* u₁ξu₂ηu₃ in G_F, with

u₁, u₂, u₃ in 𝒱* and ξ, η in 𝒱 − 𝒮, then ξ and η are not both self-embedding variables.

¹⁴ A context-free grammar G = (V₁, Σ₁, P, σ) is said to be reduced if, for each variable ξ ≠ σ in V₁ − Σ₁, (i) there exist u, v in Σ₁* such that σ ⇒* uξv, and (ii) there exists z in Σ₁* such that ξ ⇒* z. A grammar form is said to be reduced if its form grammar is reduced.

¹⁵ The assumption in Theorem 2.3, as well as in Theorem 2.4, that F is a reduced form is no real loss of generality. It is shown in Lemma 3.1 that F can be effectively replaced by an equivalent reduced form.


Proof. Let ℒ(F) = ℒ_lin. By Theorem 2.1, L(G_F) is infinite. By Theorem 2.3, F is self-embedding, i.e., (i) holds. Assume (ii) is false. Then there exist self-embedding variables ξ and η such that σ ⇒* u₁ξu₂ηu₃ in G_F for some u₁, u₂, u₃ in 𝒱*. As is readily seen, this implies that L₁L₂ is in ℒ(F) for all linear context-free languages L₁ and L₂. This is a contradiction since ℒ_lin is not closed under concatenation. Hence (ii) holds.

Now assume (i) and (ii) hold. It readily follows from (i) that ℒ_lin ⊆ ℒ(F). Consider the converse inclusion. Let L be in ℒ(F). Then there exists an interpretation I = (μ, V_I, Σ_I, P_I, S_I) of F such that L = L(G_I). Without loss of generality, we may assume that G_I is reduced. We can construct a linear grammar G' = (V', Σ', P', S') and a substitution τ by regular sets such that L(G_I) = τ(L(G')). (Note: A substitution on Σ'* is a substitution by regular sets if τ(a) is regular for each element a in Σ'.) Since ℒ_lin is closed under substitution by regular sets, τ(L(G')) is in ℒ_lin, whence ℒ(F) ⊆ ℒ_lin. Intuitively, each variable ν which is not self-embedding and only leads to variables which are not self-embedding generates a regular set L_ν. We replace each such variable ν by a new symbol a_ν, thereby obtaining the linear grammar G'. The substitution τ is the one which substitutes L_ν for each a_ν, leaving the remaining terminal symbols fixed. The details are left to the reader.

Remark. Given a positive integer m, Theorem 2.4 can easily be extended to a characterization of when a reduced grammar form yields exactly the family of finite unions of m products of linear context-free languages. The conditions are that (i) there exists a derivation σ ⇒* u₁ξ₁ ⋯ u_mξ_mu_{m+1}, where ξ₁,...,ξ_m are self-embedding variables and u₁,...,u_{m+1} are words containing no self-embedding variables; and (ii) there is no derivation σ ⇒* u₁ξ₁ ⋯ u_{m+r}ξ_{m+r}u_{m+r+1}, where r ≥ 1 and ξ₁,...,ξ_{m+r} are self-embedding variables.
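Conditions (i) and (ii) of Theorem 2.4 refer only to the form grammar, so they can be tested mechanically. The sketch below (our code, not the paper's) computes the self-embedding variables, i.e., those ξ with ξ ⇒⁺ uξv where both contexts are nonempty; for simplicity it measures the contexts on sentential forms, which agrees with the intended notion whenever no symbol of the form grammar derives only the empty word.

```python
def self_embedding_variables(variables, productions):
    """Variables xi with xi =>+ u xi v, u and v nonempty (sketch; contexts measured
    on sentential forms).  productions: pairs (A, rhs) with rhs a tuple of symbols."""
    # reach holds tuples (A, B, left, right): A =>+ u B v with u (resp. v) nonempty
    reach = set()
    for A, rhs in productions:
        for i, x in enumerate(rhs):
            if x in variables:
                reach.add((A, x, i > 0, i < len(rhs) - 1))
    changed = True
    while changed:                      # close under composition of derivations
        changed = False
        for (A, B, l1, r1) in list(reach):
            for (B2, C, l2, r2) in list(reach):
                if B2 == B and (A, C, l1 or l2, r1 or r2) not in reach:
                    reach.add((A, C, l1 or l2, r1 or r2))
                    changed = True
    return {A for (A, B, l, r) in reach if A == B and l and r}

# sigma -> a sigma a | a is self-embedding; sigma -> a sigma | a is not.
assert self_embedding_variables({"S"}, [("S", ("a", "S", "a")), ("S", ("a",))]) == {"S"}
assert self_embedding_variables({"S"}, [("S", ("a", "S")), ("S", ("a",))]) == set()
```

Condition (ii) then amounts to asking whether some sentential form derivable from σ contains two occurrences of self-embedding variables, which can be decided by a similar fixpoint that counts (capped at two) how many such occurrences each variable can produce.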

3. A SPECIAL FORM

In this section we show that a grammar form (V, Σ, 𝒱, 𝒮, 𝒫, σ) may always be replaced by an equivalent one in which the form grammar is of a special type, namely (i) is reduced, (ii) is sequential, (iii) has no production of the form ξ → β, ξ and β variables, and (iv) for each variable ξ ≠ σ, has a production ξ → xξy for some xy in 𝒮⁺. This result is of interest in its own right, as well as plays an important role in Sections 4 and 5.

We need four lemmas to establish our result. The first asserts that each grammar form is equivalent to a reduced grammar form. The succeeding lemmas produce equivalent forms with additional properties.

LEMMA 3.1. Each grammar form has an equivalent reduced grammar form.

The proof is straightforward from Lemma 2.1.
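For context-free form grammars the reduction behind Lemma 3.1 is the familiar one: keep only variables that generate some terminal word and are reachable from σ. A sketch of that step (ours):

```python
def reduce_productions(variables, terminals, productions, start):
    """Standard reduction of a context-free grammar: drop productions involving a
    variable that generates no terminal word or is unreachable from the start symbol.
    productions: set of pairs (A, rhs) with A a variable and rhs a tuple of symbols."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for A, rhs in productions:
            if A not in generating and all(s in terminals or s in generating for s in rhs):
                generating.add(A)
                changed = True
    prods = {(A, rhs) for A, rhs in productions
             if A in generating and all(s in terminals or s in generating for s in rhs)}
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for A, rhs in prods:
            if A in reachable:
                for s in rhs:
                    if s not in reachable:
                        reachable.add(s)
                        changed = True
    return {(A, rhs) for A, rhs in prods if A in reachable}
```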


The second lemma asserts that a grammar form has an equivalent reduced one with no "cycles."

DEFINITION. Let F = (V, Σ, 𝒱, 𝒮, 𝒫, σ) be a grammar form. A cycle of F is a sequence ξ₁,..., ξ_k of elements of 𝒱 − 𝒮, where k ≥ 2, ξ₁ = ξ_k, and ξᵢ → ξᵢ₊₁ is in 𝒫 for each i, 1 ≤ i ≤ k − 1. F is said to be noncyclic if it has no cycles.

LEMMA 3.2. Each grammar form has an equivalent, reduced, noncyclic grammar form.

Proof. Let F = (V, Σ, 𝒱, 𝒮, 𝒫, σ) be a grammar form. Call a set X ⊆ 𝒱 − 𝒮 a cycle set (of F) if there is a cycle ξ₁,...,ξ_k such that X = {ξ₁,...,ξ_k}. Call a set X ⊆ 𝒱 − 𝒮 a maximal cycle set (of F) if X is a cycle set and there is no cycle set properly containing X. From the definition of cycle set there immediately follows:

(1) If X and Y are cycle sets and X ∩ Y ≠ ∅, then X ∪ Y is a cycle set.

Two simple consequences of (1) are:

(2) If X and X' are maximal cycle sets and X ∩ X' ≠ ∅, then X = X'; and

(3) For each cycle ξ₁,...,ξ_k there exists a maximal cycle set X such that {ξ₁,...,ξ_k} ⊆ X.

Turning to the lemma, by Lemma 3.1 we may assume that F is reduced. If F is noncyclic there is nothing to prove. Suppose that F has a cycle. By (2) and (3), F has exactly t maximal cycle sets for some t ≠ 0, and these are pairwise disjoint. To establish the lemma it suffices to exhibit a procedure yielding an equivalent reduced grammar form F' which has exactly t − 1 maximal cycle sets. [For then the procedure can be repeated, ultimately resulting in an equivalent reduced grammar form with no maximal cycle sets and thus noncyclic.] Let X be a specific maximal cycle set of F and let ξ₁,...,ξ_k be a cycle such that {ξ₁,...,ξ_k} = X. Let Q' = {w / ξ → w in 𝒫 for some ξ in X} − X, 𝒫₁' = {ξ → w / ξ in X, w in Q'}, and 𝒫' = (𝒫 − {ξ → η / […]

[…] variables which are not partially self-embedding. In view of case […], it suffices to exhibit a procedure producing an equivalent grammar form which is reduced, contains no production of the kind ξ → η, ξ and η variables, and has exactly k − 1 variables not partially self-embedding. Since k > 0, there exists a variable ν in 𝒱 − 𝒮 which is not partially self-embedding. Let 𝒱₁ = 𝒱 − {ν} and Q₁ = {u / ν → u in 𝒫, u does not contain ν}. Let 𝒫₁ be the set of all productions ξ → w', ξ in 𝒱₁, where ξ → w is in 𝒫 and w' is obtained from w by replacing each occurrence of ν in w by a word in Q₁. Let F₁ be the grammar form (V, Σ, 𝒱₁, 𝒮, 𝒫₁, σ). Clearly F₁ is reduced and contains no production of the form ξ → η, ξ and η variables. From the method of construction, the only variables in F₁ which are not partially self-embedding are those variables ≠ ν in F which are not partially self-embedding. Thus F₁ has exactly k − 1 variables not partially self-embedding. Using Lemmas 2.1 and 2.2, we readily see that

ℒ(F₁) = ℒ(F).
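Reading a cycle as a chain of unit productions between variables (as the construction in the proof of Lemma 3.2 suggests), the maximal cycle sets are exactly the nontrivial strongly connected components of the variable-to-variable unit-production graph. A sketch (ours, under that reading):

```python
def maximal_cycle_sets(variables, productions):
    """Nontrivial strongly connected components of the unit-production graph: these
    are the maximal cycle sets collapsed in Lemma 3.2 (our reading of 'cycle')."""
    edges = {(A, rhs[0]) for A, rhs in productions
             if len(rhs) == 1 and rhs[0] in variables}
    succ = {v: {b for a, b in edges if a == v} for v in variables}

    def reachable(v):
        seen, stack = {v}, [v]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    components, done = [], set()
    for v in variables:
        if v in done:
            continue
        comp = {w for w in reachable(v) if v in reachable(w)}
        if len(comp) > 1 or (v, v) in edges:    # the component actually contains a cycle
            components.append(comp)
        done |= comp
    return components
```

Lemma 3.2's procedure then removes one such component X at a time, replacing the unit productions among the variables of X by productions ξ → w for the right sides w outside X, as in the sets Q' and 𝒫₁' above.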

We are now ready for the special form mentioned in the beginning of the section.

THEOREM 3.1. Each grammar form has an equivalent, completely reduced, sequential grammar form.¹⁶

Proof. Let F = (V, Σ, 𝒱, 𝒮, 𝒫, σ) be a grammar form. By Lemma 3.4, we may assume that F is completely reduced. Clearly, we may also assume that each symbol of 𝒮 is in some production of 𝒫. Let ≈ be the relation on 𝒱 − 𝒮 defined by ξ ≈ η, ξ and η in 𝒱 − 𝒮, if ξ ⇒* x₁ηy₁ and η ⇒* x₂ξy₂ in G_F for some x₁, x₂, y₁, y₂ in 𝒱*. Clearly ≈ is an equivalence relation and there are only a finite number of equivalence classes. For each variable ν let [ν] be the equivalence class containing ν. For each equivalence class C let ξ_C be a distinguished element of C, with ξ_{[σ]} = σ, and let 𝒲 = {ξ_C / C an equivalence class}.

¹⁶ A context-free grammar G = (V₁, Σ₁, P, S) is said to be sequential if the variables in V₁ − Σ₁ can be ordered S = ξ₁,...,ξ_k in such a way that if ξᵢ → xξⱼy is a production in P then j ≥ i. A grammar form is said to be sequential if its form grammar is.


Let 𝒫' be the set of those productions obtained by replacing each occurrence of each variable ν in each production of 𝒫 by ξ_{[ν]}. Let F' = (V, Σ, 𝒱', 𝒮, 𝒫', σ), where 𝒱' = 𝒲 ∪ 𝒮. Clearly F' is a completely reduced, sequential grammar form. We shall show that ℒ(F') = ℒ(F). By Lemma 1.1, ℒ(F) ⊆ ℒ(F'). Consider the reverse inclusion. We now define a grammar form F₁ = (V, Σ, 𝒱, 𝒮, 𝒫₁, σ) such that ℒ(F') ⊆ ℒ(F₁) ⊆ ℒ(F). Let 𝒫₁ = 𝒫 ∪ {η → ν / η ≈ ν, η ≠ ν}. Then γ ⇒* w in G_{F₁} for each production γ → w in 𝒫'. By Lemma 2.2, ℒ(F') ⊆ ℒ(F₁). It remains to show that ℒ(F₁) ⊆ ℒ(F). Since F is reduced, for all variables α and β in 𝒱 − 𝒮, with α ≈ β and α ≠ β, there exist words u_{α,β} and v_{α,β} in 𝒱* such that α ⇒⁺ u_{α,β}βv_{α,β} in G_F. Let F₂ = (V, Σ, 𝒱, 𝒮, 𝒫₂, σ), where 𝒫₂ = 𝒫 ∪ {α → u_{α,β}βv_{α,β} / all α, β}. By Proposition 2.1, ℒ(F₂) = ℒ(F). Now let μ be the substitution on 𝒱* defined by μ(α) = {α} for each variable α in 𝒱 − 𝒮 and μ(a) = {a, ε} for each element a in 𝒮. Clearly I = (μ, 𝒱, 𝒮, 𝒫₁, σ) is an interpretation of F₂, with G_{F₁} = G_I. (Each production α → β in 𝒫₁ − 𝒫 is in μ(α → u_{α,β}βv_{α,β}).) By Lemma 1.1, ℒ(F₁) ⊆ ℒ(F₂) = ℒ(F), whence the theorem.

EXAMPLE 3.1. Consider the grammar form F = (V, Σ, {σ, ξ, η, ν, a}, {a}, 𝒫, σ), where 𝒫 = {σ → aσaξa, σ → aνν, ξ → aση, ξ → a, η → σaν, ν → aνa, ν → a}. Using the procedure in the proof of Theorem 3.1, the equivalence classes of 𝒱 − 𝒮 are {σ, ξ, η} and {ν}, and the grammar form F' = (V, Σ, {σ, ν, a}, {a}, 𝒫', σ), where 𝒫' = {σ → aσaσa, σ → aνν, σ → aσσ, σ → a, σ → σaν, ν → aνa, ν → a}, is a completely reduced sequential form equivalent to F.
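The collapsing step in the proof of Theorem 3.1 is easy to mechanize for context-free form grammars: compute the relation ≈ of mutual occurrence-reachability between variables, pick a representative of each class (σ for its own class), and rewrite every production through the representatives. A sketch (ours) that reproduces the passage from F to F' in Example 3.1:

```python
def sequentialize(variables, productions, start):
    """Collapse the equivalence classes of the relation ≈ used in Theorem 3.1 and
    rewrite the productions through class representatives (context-free sketch)."""
    occurs = {v: {s for A, rhs in productions if A == v for s in rhs if s in variables}
              for v in variables}
    reach = {v: {v} for v in variables}          # reflexive-transitive occurrence closure
    changed = True
    while changed:
        changed = False
        for v in variables:
            new = set(reach[v])
            for w in list(new):
                new |= occurs[w] | reach[w]
            if new != reach[v]:
                reach[v], changed = new, True
    rep = {}
    for v in variables:
        cls = {w for w in reach[v] if v in reach[w]}     # the ≈-class of v
        rep[v] = start if start in cls else min(cls)     # distinguished element
    rewrite = lambda word: tuple(rep.get(s, s) for s in word)
    return {(rep[A], rewrite(rhs)) for A, rhs in productions}
```

Applied to Example 3.1 (variables σ, ξ, η, ν), the classes come out as {σ, ξ, η} and {ν}, and the rewritten production set is the 𝒫' displayed above.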

4. CLOSURE PROPERTIES

In this section we consider various closure properties pertaining either to each grammatical family or to the class of all grammatical families. The main result is that each nontrivial grammatical family is a full semi-AFL. (Note: A full semi-AFL is an ordered pair (Σ, ℒ), or ℒ when Σ is understood, where Σ is an infinite alphabet and ℒ is a set of subsets of Σ* such that (i) L ≠ ∅ for some L in ℒ, (ii) for each L in ℒ there is a finite subset Σ_L of Σ for which L ⊆ Σ_L*, and (iii) ℒ is closed under union, homomorphism, inverse homomorphism, and intersection with regular sets.)

We start with a preliminary result, namely that each grammatical family is closed under union, homomorphism, and intersection with regular sets.

LEMMA 4.1. For each grammar form F, ℒ(F) is closed under union, homomorphism, and intersection with regular sets.


Proof. Consider union. Let L₁ and L₂ be languages in ℒ(F). Obviously there exist interpretations I₁ = (μ₁, V₁, Σ₁, P₁, S₁) and I₂ = (μ₂, V₂, Σ₂, P₂, S₂) of F, with (V₁ − Σ₁) ∩ (V₂ − Σ₂) = ∅, such that L(G_{I₁}) = L₁ and L(G_{I₂}) = L₂. Let I = (μ, V₃, Σ₃, P₃, S) be an interpretation of F, where S is a new symbol of V − Σ, μ(σ) = {S} ∪ μ₁(σ) ∪ μ₂(σ), μ(x) = μ₁(x) ∪ μ₂(x) for each x in 𝒱 − {σ}, and P₃ = P₁ ∪ P₂ ∪ {S → w / S₁ → w in P₁} ∪ {S → w / S₂ → w in P₂}. Obviously L(G_I) = L₁ ∪ L₂.

Consider homomorphism. Let L be a language in ℒ(F) and I₁ = (μ₁, V₁, Σ₁, P₁, S₁) an interpretation of F such that L = L(G_{I₁}). Consider a homomorphism h from Σ₁* into Σ₂*. Extend h to V₁* by defining h(ξ) = ξ for each ξ in V₁ − Σ₁. Let I = (μ, V₂, Σ₂, P, S₁) be an interpretation, where μ(α) = μ₁(α) for each α in 𝒱 − 𝒮, μ(a) = {h(w) / w in μ₁(a)} for each a in 𝒮, and P = {ξ → h(w) / ξ → w in P₁}. Clearly L(G_I) = h(L).

Finally consider intersection with regular sets. Let L be in ℒ(F) and I₁ = (μ₁, V₁, Σ₁, P₁, S₁) an interpretation of F such that L = L(G_{I₁}). Let R ⊆ Σ₁* be a regular set. Then R = T(A) for some finite state acceptor A = (K, Σ₁, δ, p₀, Q). Now

R = ∪_{q in Q} T(A_q),

where A_q is the finite state acceptor (K, Σ₁, δ, p₀, {q}), and

L ∩ R = ∪_{q in Q} (L ∩ T(A_q)).

Since ℒ(F) is closed under union, we may assume that Q contains exactly one element, say q₀, i.e., A = (K, Σ₁, δ, p₀, {q₀}). We shall define an interpretation I = (μ, V₂, Σ₂, P, S) and a homomorphism h such that L ∩ R = h(L(G_I)). It will then follow that L ∩ R is in ℒ(F).

For each element γ in 𝒱 − 𝒮, let μ(γ) = K × μ₁(γ) × K. For each element a in 𝒮, let

μ(a) = {(p₁, a₁, p₂) ⋯ (p_l, a_l, p_{l+1}) / p₁ in K, a₁ ⋯ a_l in μ₁(a), l ≥ 1, each a_i in Σ₁, each p_{i+1} is δ(p_i, a_i)} ∪ {ε / ε in μ₁(a)}.


Let P be the set of productions

{(p, ξ, q) → (p₁, x₁, p₂) ⋯ (pₙ, xₙ, p_{n+1}) / p₁ = p, p_{n+1} = q, ξ → x₁ ⋯ xₙ in P₁, n ≥ 1, x_i in V₁, p_{i+1} is δ(p_i, x_i) if x_i is in Σ₁, 1 ≤ i ≤ n, p and q in K} ∪ {(p, ξ, q) → ε / ξ → ε in P₁, p and q in K}.

Let S = (p₀, S₁, q₀) and let h be the homomorphism on (K × Σ₁ × K)* defined by h((p, a, q)) = a for each (p, a, q) in K × Σ₁ × K. Clearly L ∩ R = h(L(G_I)), where I is the interpretation (μ, V₂, Σ₂, P, S).

We are now ready for the main result of the section, namely that except in the trivial case a grammatical family is a full semi-AFL.

DEFINITION. A grammar form F is said to be nontrivial if L(G_F) is an infinite language. A grammatical family is nontrivial if it contains at least one infinite language.

It follows from Theorem 2.1 that a grammatical family ℒ is nontrivial if and only if F is nontrivial for every (equivalently, at least one) grammar form F such that ℒ = ℒ(F).
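The intersection argument in the proof of Lemma 4.1 is, at bottom, the classical product of a grammar with a finite automaton: every symbol is wrapped into state-annotated triples (p, x, q), terminal triples must follow the transition function, and a homomorphism erasing the states recovers L ∩ R. Below is a sketch of that product for a context-free grammar (our code; the proof above carries the same idea out on interpretations and handles ε-productions separately).

```python
from itertools import product

def triple_construction(variables, terminals, productions, start, states, delta, q0, qf):
    """Product of a context-free grammar with a deterministic finite automaton.

    delta: dict mapping (state, terminal) -> state.  Variables of the product grammar
    are triples (p, A, q); a terminal triple (p, a, q) is emitted only when
    delta[(p, a)] == q.  Erasing the state components of the terminal triples then
    yields the intersection language, as with the homomorphism h in the proof above."""
    new_productions = set()
    for A, rhs in productions:
        n = len(rhs)
        for seq in product(states, repeat=n + 1):
            ok = all(rhs[i] in variables or delta.get((seq[i], rhs[i])) == seq[i + 1]
                     for i in range(n))
            if ok:
                left = (seq[0], A, seq[n])
                right = tuple((seq[i], rhs[i], seq[i + 1]) for i in range(n))
                new_productions.add((left, right))
    return new_productions, (q0, start, qf)
```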

THEOREM 4.1. For each nontrivial grammar form F, ℒ(F) is a full semi-AFL. Equivalently, each nontrivial grammatical family is a full semi-AFL.

Proof. By Theorem 2.1, it suffices to prove the first statement. By Theorem 3.1 we may assume that F = (V, Σ, 𝒱, 𝒮, 𝒫, σ) is a completely reduced, sequential grammar form. Clearly ℒ(F) contains a nonempty language, and for each L in ℒ(F) there exists a finite subset Σ_L of Σ such that L ⊆ Σ_L*. The proof of the closure operations will be by induction on the number of elements in 𝒱 − 𝒮. To establish the closure operations it suffices to show that ℒ(F) is closed under substitution by regular sets. For suppose ℒ(F) is closed under substitution by regular sets. By Lemma 4.1, ℒ(F) is closed under union, homomorphism, and intersection with regular sets. Also, ℒ(F) contains each regular set R. [For let R ⊆ Σ_R* be regular. Since F is nontrivial, ℒ(F) contains an infinite language L ⊆ Σ_L*. Let τ be the substitution on Σ_L* defined by τ(a) = Σ_R* for each element a in Σ_L. Then τ(L) = Σ_R*, whence Σ_R* is in ℒ(F). Thus Σ_R* ∩ R = R is in ℒ(F).] It then follows from [10, Theorem 4] that ℒ(F) is closed under inverse homomorphism, so that ℒ(F) is a full semi-AFL.

Suppose 𝒱 − 𝒮 has exactly k elements and assume the theorem true when 𝒱 − 𝒮 has at most k − 1 elements. (The case k = 1 will be given simultaneously.) Since F is sequential, we may assume that 𝒱 − 𝒮 = {ξ₁,...,ξ_k}, where ξ₁ = σ and ξᵢ → xξⱼy in 𝒫 implies j ≥ i. Let L be a language in ℒ(F). Then L = L(G₁), where G₁ is the grammar of some interpretation I₁ = (μ₁, V₁, Σ₁, P₁, S₁) of F. Without loss of


generality we may suppose that G₁ is reduced. Let τ be a substitution on Σ₁* such that τ(a) is a regular set for each a in Σ₁. We shall exhibit an interpretation I_τ = (μ_τ, V_τ, Σ_τ, P_τ, S₁) of F such that L(G_τ) = τ(L), where G_τ is the grammar of I_τ.

Let v → w be an arbitrary production of P₁, with v in μ₁(σ). Then v → w may be written as either (1) v → t, with t in Σ₁*, or (2) v → w₁ ⋯ w_m, where m ≥ 1 and w_i = x_iα_iy_i for each i, with x_i, y_i in Σ₁* and α_i in μ₁(ξ_j) for some j, 1 ≤ j ≤ k. [The selection of the x_i, y_i in a type 2 production is not necessarily unique. A particular selection is always assumed in the sequel.] For each α_i not in μ₁(σ), let

L_{α_i} = {z in Σ₁* / α_i ⇒* z in G₁}.

(Note: For k = 1, there is no such α_i.) We shall construct μ_τ and P_τ in such a way that for each type 1 production v → t there is a derivation v ⇒* y in G_τ for each y in τ(t), and for each type 2 production v → x₁α₁y₁ ⋯ x_mα_my_m there is a derivation v ⇒* z₁ ⋯ z_m in G_τ for each z_i in τ(x_i) α_i τ(y_i) if α_i is in μ₁(σ) and each z_i in τ(x_i) τ(L_{α_i}) τ(y_i) if α_i is not in μ₁(σ). This will imply that τ(L) ⊆ L(G_τ). The reverse containment will be apparent from the construction.

Consider a type 1 production, i.e., v → t with v in μ₁(σ) and t in Σ₁*. Intuitively, the construction is as follows. Since v → t is type 1, τ(t) is regular. Thus τ(t) is generated by some right-linear or left-linear grammar Ḡ. By Proposition 2.1 and the fact that F is nontrivial and completely reduced, we may assume that 𝒫 contains either a production σ → uσv, uv in 𝒮⁺, or a production ξ → uξv, uv in 𝒮⁺. If u ≠ ε, let Ḡ be a right-linear grammar. Otherwise, let Ḡ be left-linear. Then, in the construction of I_τ, the production σ → uσv, respectively ξ → uξv, can be used to obtain the productions of Ḡ, thereby permitting G_τ to simulate Ḡ. We omit the straightforward details.

The construction is more complicated for type 2 productions, i.e., v → w₁ ⋯ w_m, where v is in μ₁(σ), m ≥ 1, and w_i = x_iα_iy_i for each i, with x_i, y_i in Σ₁* and α_i in μ₁(ξ_j) for some j, 1 ≤ j ≤ k. There are two cases, in both of which G_τ is to contain a production v → α₁' ⋯ α_m', where the α_i' are new variables.

Suppose α_i is in μ₁(ξ_j) for some ξ_j ≠ σ. Then by induction, τ(L_{α_i}) is in ℒ(F_j), where F_j is the grammar form with start variable ξ_j and consisting of all productions ξ_l → y in 𝒫, j ≤ l ≤ k. (Note that since F is sequential, F_j has at most k − 1 variables.) Also by induction, ℒ(F_j) is a full semi-AFL and therefore contains τ(x_i)τ(L_{α_i})τ(y_i). In this case the interpretation I_{α_i} of F_j, with start variable α_i', that generates the language


τ(x_i) τ(L_{α_i}) τ(y_i) can be embedded in the construction of I_τ, i.e., I_τ contains all the productions of I_{α_i}. The formal details are omitted.

Suppose α_i is in μ₁(σ). Then there exists a production σ → u'uσvv' in 𝒫 such that w₁ ⋯ w_{i−1} is in μ₁(u'), w_{i+1} ⋯ w_m is in μ₁(v'), x_i is in μ₁(u), and y_i is in μ₁(v). Since u' and v' can derive terminal words in G_F, we may assume that G_F contains the production σ → uσv. (Note that u is in 𝒮⁺ if x_i ≠ ε and v is in 𝒮⁺ if y_i ≠ ε.) Using this production, the regular sets τ(x_i), respectively τ(y_i), can be generated from α_i' in G_τ as left, respectively right, contexts of the variable α_i. Indeed, the construction of I_τ in this case is quite similar to that for type 1 productions because it is also a simulation of a right-linear, respectively left-linear, grammar. More specifically, let G_{x_i} be a reduced, right-linear grammar, with start variable S_{x_i} and variables new symbols in V − Σ, such that L(G_{x_i}) = τ(x_i). Let G'_{y_i} be a reduced left-linear grammar, with start variable S'_{y_i} and variables new symbols in V − Σ, such that L(G'_{y_i}) = τ(y_i). Let α_i' → S_{x_i} be in μ_τ(σ → uσv) ∩ P_τ. For each production γ → xδ in G_{x_i}, γ and δ variables and x a terminal word, let γ → xδ be in μ_τ(σ → uσv) ∩ P_τ. For each production γ → x in G_{x_i} with x a terminal word, let γ → xS'_{y_i} be in μ_τ(σ → uσv) ∩ P_τ. For each production γ' → δ'y in G'_{y_i}, γ' and δ' variables and y a terminal word, let γ' → δ'y be in μ_τ(σ → uσv) ∩ P_τ. For each production γ' → y in G'_{y_i} with y a terminal word, let γ' → α_iy be in μ_τ(σ → uσv) ∩ P_τ. The productions here yield α_i' ⇒* z_iα_iz_i' for all words z_i in τ(x_i) and z_i' in τ(y_i). Hence the induction is extended and the proof is complete.

Theorem 4.1 can be used to show that certain families of languages are not grammatical families. For example, the family of deterministic context-free languages is not closed under union. Hence this family is not a grammatical family.

We conclude the section with a brief discussion on substitution of grammatical families. A result of [8] is that Sûb(ℒ₁, ℒ₂) is a grammatical family for all grammatical families ℒ₁ and ℒ₂, with ℒ₁ ≠ ℒ_fin. (Note: For all sets ℒ₁ and ℒ₂ of languages, let Sûb(ℒ₁, ℒ₂) be the set of all languages τ(L), where L ⊆ Σ_L* is in ℒ₁ and τ is a substitution on Σ_L* such that τ(a) is in ℒ₂ for each element a in Σ_L.) The question arises: What transpires if ℒ₁ = ℒ_fin? This is now answered.

DEFINITION. For families ℒ₁ and ℒ₂ of languages let

ℒ₁ℒ₂ = {L_{1,1}L_{2,1} ∪ ⋯ ∪ L_{1,n}L_{2,n} / n ≥ 1, each L_{1,i} in ℒ₁, each L_{2,i} in ℒ₂}.

If ℒℒ = ℒ then ℒ is said to be idempotent.

For all grammatical families ℒ₁ and ℒ₂, ℒ₁ℒ₂ is a grammatical family. To see this, let ℒ₁ = ℒ(F₁) and ℒ₂ = ℒ(F₂), where F₁ = (V, Σ, 𝒱₁, 𝒮₁, 𝒫₁, σ₁) and F₂ = (V, Σ, 𝒱₂, 𝒮₂, 𝒫₂, σ₂) are grammar forms. Without loss of generality, we may assume that (𝒱₁ − 𝒮₁) ∩ (𝒱₂ − 𝒮₂) = ∅. Let σ₃ be a new symbol in V − Σ and F₃ = (V, Σ, 𝒱₃, 𝒮₃, 𝒫₃, σ₃), where 𝒱₃ = 𝒱₁ ∪ 𝒱₂ ∪ {σ₃}, 𝒮₃ = 𝒮₁ ∪ 𝒮₂, and 𝒫₃ = 𝒫₁ ∪ 𝒫₂ ∪ {σ₃ → σ₁σ₂}. Clearly ℒ(F₃) = ℒ₁ℒ₂.

THEOREM 4.2. Let ℒ be a grammatical family. Then Sûb(ℒ_fin, ℒ) is a grammatical family if and only if ℒ is idempotent, in which case Sûb(ℒ_fin, ℒ) = ℒ.

Proof. Suppose that ℒ is idempotent. Obviously ℒ ⊆ Sûb(ℒ_fin, ℒ). Consider the reverse containment. Let L ⊆ Σ_L* be a finite language and τ a substitution such that τ(a) is in ℒ for each element a in Σ_L. Since L is finite, τ(L) is a finite union of finite products τ(w) of languages, w in L. Each τ(w) is in ℒ because ℒ is idempotent. By Theorem 4.1, ℒ is closed under union. Therefore τ(L) is in ℒ, i.e., Sûb(ℒ_fin, ℒ) ⊆ ℒ. Thus Sûb(ℒ_fin, ℒ) = ℒ, and so is a grammatical family.

Now suppose that ℒ is not idempotent, i.e., ℒℒ ≠ ℒ. Let ℒ¹ = ℒ and, by induction, let ℒⁿ = ℒⁿ⁻¹ℒ for each n ≥ 2. Clearly Sûb(ℒ_fin, ℒ) = ∪_{n≥1} ℒⁿ. Since ℒℒ ≠ ℒ, by [9, Corollary 1 of Lemma 4.4], ∪_{n≥1} ℒⁿ is not a full principal semi-AFL. However, it is shown in Theorem 5.1 of the next section that each grammatical family is a full principal semi-AFL. Hence Sûb(ℒ_fin, ℒ) is not a grammatical family.
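The construction of F₃ used above is mechanical; here is a sketch (ours, reusing the GrammarForm class from Section 1) under the assumption that the two forms share no variables and that the new start symbol is fresh.

```python
def concatenation_form(F1, F2, new_start="S3"):
    """Build F3 with productions P1 ∪ P2 ∪ {sigma3 -> sigma1 sigma2}, so that the
    grammatical family of F3 is the product family L(F1)L(F2) discussed above.
    Assumes the variable sets of F1 and F2 are disjoint and new_start is fresh."""
    assert new_start not in F1.symbols | F2.symbols
    return GrammarForm(
        symbols=F1.symbols | F2.symbols | {new_start},
        terminals=F1.terminals | F2.terminals,
        productions=F1.productions | F2.productions
                    | {((new_start,), (F1.start, F2.start))},
        start=new_start,
    )
```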

5. FULL GENERATORS

In the previous section we proved that each nontrivial grammatical family is a full semi-AFL. Here we extend that result by showing that each nontrivial grammatical family is a full principal semi-AFL. (Note: A full principal semi-AFL is a full semi-AFL ℒ in which there is a language L, called a full generator, such that ℒ is the smallest full semi-AFL containing L.) [Indeed, our argument actually exhibits a full generator for the grammatical family.] In order to demonstrate the extension, we need two auxiliary results. The first, Lemma 5.1, characterizes families obtained from grammar forms with just one variable. The second, Lemma 5.2, establishes a special form for completely reduced sequential grammar forms with at least two variables. This special form provides for the coding symbols that are needed in constructing a full generator for a given grammar form.

LEMMA 5.1. Let F be a nontrivial, reduced grammar form with one variable. Then ℒ(F) is either ℛ, ℒ_lin, or ℒ_CF.

Proof. If F is an expansive grammar form, then ℒ(F) = ℒ_CF by Theorem 2.2. Suppose that F is nonexpansive. If F is not self-embedding, then ℒ(F) = ℛ


by Theorem 2.3. Now suppose that F is nonexpansive and self-embedding. Since F has only one variable, condition (ii) in the statement of Theorem 2.4 is trivially satisfied. Hence ℒ(F) = ℒ_lin.
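Lemma 5.1 is effective: for a nontrivial, reduced form with a single variable one only has to test expansiveness and self-embedding. In the sketch below (ours), "expansive" is read in the usual sense, i.e., the variable derives a sentential form containing two occurrences of itself, which for a single variable happens exactly when some production's right side contains it twice; the self-embedding test is the one sketched after the proof of Theorem 2.4.

```python
def classify_one_variable_form(productions, sigma):
    """Trichotomy of Lemma 5.1 for a nontrivial, reduced, one-variable form (sketch)."""
    expansive = any(sum(1 for s in rhs if s == sigma) >= 2 for _, rhs in productions)
    if expansive:
        return "all context-free languages"        # Theorem 2.2
    if sigma in self_embedding_variables({sigma}, productions):
        return "linear context-free languages"     # nonexpansive and self-embedding
    return "regular sets"                          # nonexpansive, not self-embedding

# sigma -> sigma sigma | a, sigma -> a sigma a | a, and sigma -> a sigma | a give,
# respectively, the context-free languages, the linear languages, and the regular sets.
```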

L~.'VrMA 5.2. Let F .... (V, X, 3e-, 5 v, ~ , ~r) be a nontrivial, nonexpansive, completely reduced, sequential grammar f o r m with at least two variables, There exists an equivalent, nonexpansive, completely reduced, sequential grammar form F ' = (V, 27, 3r .5~', ~ ' , ~) and finite, disjoint subsets 27., 27~ , and X~ of X such that : (a)

5r'=X,,uXbuZ'cu5

(b)

~,-

(c)

Each production (r --+ t in ~ ' , with t in 5 a'*, has t in Z'e u {~}.

se = ~:'--

,~

Se'.

(d) Each symbol in X~ to X b t3 X~ occurs in one and only one production in ~ ' , and there only once. (e)

I n each production p: cr --~ uov in ~ ' , (i) either u =-- e, u == a~,l , or u = "'" a.,m i. For each i, 2 ~ i ~< k, letF~ = (V, Z', X~ ~ 5 ~ 50, ~ , , sei), where X , = {se~/j >/i} and ~ = {st --+ w in ~/st in X~}. Since F is sequential, no right-hand side of a production in ~ contains a variable set with l < i. Thus, for each i, 2